Real-Time SQL On Event Streams

Mark Payne -
@dataflowmark

Apache NiFi has grown tremendously over the past 2 and a half years since it was
open sourced. The community is continuously thinking of, implementing, and contributing
amazing new features. The newly released version 1.2.0 of NiFi is no exception. One of the most
exciting features of this new release is the QueryRecord Processor and the Record Reader and
Record Writer components that go along with it. If you aren't familiar with those components,
there's a blog post
that explains how those work.

This new Processor, powered by Apache Calcite,
allows users to write SQL SELECT statements to run over their data
as it streams through the system. Each FlowFile in NiFi can be treated as if it were
a database table named FLOWFILE. These SQL queries can be used to filter specific columns
or fields from your data, rename those columns/fields, filter rows, perform calculations
and aggregations on the data, route the data, or whatever else you may want to use SQL
for. All from the comfy confines of the most widely known and used Domain Specific Language.

This is a big deal! Of course, there are already other platforms that allow you to run SQL over arbitrary
data, though. So let's touch on how NiFi differs from other platforms that run SQL over arbitrary data
outside of an RDBMS:

Queries are run locally. There is no need to push the data to some external service,
such as S3, in order to run queries over your data. There's also no need to pay for
that cloud storage or the bandwidth, or to create temporary "staging tables" in a database.
Queries are run inline. Since your data is already streaming through NiFi, it is very
convenient to add a new QueryRecord Processor to your canvas. You are already
streaming your data through NiFi, aren't you? If not, you may find
Our Docs Page helpful to get started learning
more about NiFi.
Query data in any format. Write results in any data format. One of the goals of this effort was
to allow data to be queried regardless of the format. In order to accomplish this, the Processor
was designed to be configured with a "Record Reader" Controller Service and a "Record Writer"
Controller Service. Out of the box, there are readers for CSV, JSON, Avro, and even log data. The results
of the query can be written out in CSV, JSON, Avro, or free-form text (for example, a log format)
using the NiFi Expression Language.
If your data is in another format, you are free to write your own implementation of the Record Reader
and/or Record Writer Controller Service. A simple implementation can wrap an existing library to return
Record objects from an InputStream. That's all that is needed to run SQL over your own custom data
format! Writing the results in your own custom data format is similarly easy.
It's fast - very fast! Just how fast largely depends on the performance of your disks and the number of disks that you have
available. The data must be read from disk, queried, and then the results written to disk. In most scenarios, reading of the data
from disk is actually avoided due to operating system disk caching, though. We've tested the performance on a
server with 32 cores and 64 GB RAM. We ran a continuous stream of JSON-formatted Provenance data through the Processor.
To do this, we used the
NiFi Provenance Reporting Task
to send the Provenance data back to the NiFi instance. Because we wanted to stress the Processor, we then
used a DuplicateFlowFile processor to create 200 copies of the data (this really just creates 200 pointers to the same piece of data;
it doesn't copy the data itself). We used a SQL query to pull out any Provenance "SEND" event (a small percentage
of the data, in this case). Using 12 concurrent tasks, we were able to see the query running at a consistent rate of about 1.2 GB per second
on a single node - using less than half of the available CPU!
Data Provenance keeps a detailed trail of what happened. One of the biggest differentiators between
NiFi and other dataflow platforms is the detailed Data Provenance that NiFi records. If the data that
you end up with is not what you expect, the Data Provenance feature makes it easy to see exactly what
the data looked like at each point in the flow and pinpoint exactly what when wrong - as well as understand
where the data come from and where the data went. When the flow has
been updated, the data can be replayed with the click of a button and the new results can be verified.
If the results are still not right, update your flow and replay again.

How It Works

In order to get started, we need to add a QueryRecord Processor to the graph. Once we've added the Processor
to the graph, we need to configure three things: the Controller Service to use for reading data, the service
to use for writing the results, and one or more SQL queries. The first time that you set this all up, it may
seem like there's a lot going on. But once you've done it a couple of times, it becomes pretty trivial. The
rest of this post will be dedicated to setting everything up. First, we will configure the Record Reader Controller
Service. We'll choose to create a new service:

We are given a handful of different types of services to choose from. For this example, we will use CSV data, so
we will use a CSVReader:

We will come back and configure our new CSVReader service in a moment. For now, we will click the "Create" button and then choose to create a new
Controller Service to write the results. We will write the results in JSON format:

We can again click the "Create" button to create the service. Now that we have created our Controller Services, we will click the "Go To" button on our reader service to jump
to the Controller Service configuration page:

Configuring the Reader

We can click the pencil in the right-hand column to configure our CSV Reader:

The CSV Reader gives us plenty of options to customize the reader to our format, as can be seen in the above image. For this example, we will
leave most of the defaults, but we will change the "Skip Header Line" Property from the default value of "false" to "true" because our data will contain a header
line that we don't want to process as an actual record.

In order to run SQL over our data and make sense of the columns in our data, we need
to also configure the reader with a schema for the data. We have several options for doing
this. We can use an attribute on the FlowFile that includes an Avro-formatted schema. Or we
can use a Schema Registry to store our schema and access it by name or by identifier and version.
Let's go ahead and use a local Schema Registry and add our schema to that Registry. To do so,
we will change the "Schema Access Strategy" to "Use 'Schema Name' Property." This means that
we want to look up the Schema to use by name. The name of the schema is specified by the "Schema Name"
Property. The default value for that property is "${schema.name}", which means that we will just use
the "schema.name" attribute to identify which schema we want to use. Instead of using an attribute,
we could just type the name of the schema here. Doing so would mean that we would need a different
CSV Reader for each schema that we want to read, though. By using an attribute, we can have a single
CSV Reader that works for all schemas.

Next, we need to specify the Schema
Registry to use. We click the value of the "Schema Registry" property and choose to "Create new service..."
We will use the AvroSchemaRegistry. (Note that our incoming data is in CSV format and the output will
be in JSON. That's okay, Avro in this sense only refers to the format of the schema provided. So we will
provide a schema in the same way that we would if we were using Avro.) We will click the "Create" button
and then click the "Go To" arrow that appears in the right-hand column in order to jump to the
Schema Registry service (and click 'Yes' to save changes to our CSV Reader service).

This will take us back to our Controller Services configuration screen. It is important
to note that from this screen, each Controller Service has a "Usage" icon on the left-hand
side (it looks like a little book). That icon will take you to the documentation on how to
usage that specific Controller Service. The documentation is fairly extensive. Under the
"Description" heading, each of the Record Readers and Writers has an "Additional Details..."
link that provides much more detailed information about how to use service and provides
examples.

We will click the Edit ("pencil") icon next to the newly created AvroSchemaRegistry and go to
the Properties tab. Notice that the service has no properties, so we click the New Property ("+") icon
in the top-right corner. The name of the property is the name that we will use to refer to the schema.
Let's call it "hello-world". For the value, we can just type or paste in the schema that we want to use
using the Avro Schema syntax.
For this example, we will use the following schema:


{
  "name": "helloWorld",
  "namespace": "org.apache.nifi.blogs",
  "type": "record",
  "fields": [
      { "name": "purchase_no", "type": "long" },
      { "name": "customer_id", "type": "long" },
      { "name": "item_id", "type": ["null", "long"] },
      { "name": "item_name", "type": ["null", "string"] },
      { "name": "price", "type": ["null", "double"] },
      { "name": "quantity", "type": ["null", "int"] },
      { "name": "total_price", "type": ["null", "double"] }
  ]
}

Now we can click "OK" and apply our changes. Clicking Enable (the lightning bolt icon) enables the service.
We can now also enable our CSV Reader.

Configuring the Writer

Similarly, we need to configure our writer with a schema so that NiFi knows how
we expect our data to look. If we click the Pencil icon next to our JSONRecordSetWriter,
in the Properties tab, we can configure whether we want our JSON data to be pretty-printed
or not and how we want to write out date and time fields. We also need to specify how to access
the schema and how to convey the schema to down-stream processing. For the "Schema Write Strategy,"
since we are using JSON, we will just set the "schema.name" attribute, so we will leave the default value.
If we were writing in Avro, for example, we would probably want to include the schema in the data itself.
For the "Schema Access Strategy," we will use the "Schema Name" property, and set the "Schema Name" property
to "${schema.name}" just as we did with the CSV Reader. We then select the same AvroSchemaRegistry service for
the "Schema Registry" property.

Again, we click "Apply" and then click the Lightning icon to enable our Controller Service and click the Enable button. We can
then click the "X" to close out this dialog.

Write the SQL

Now comes the fun part! We can go back to configure our QueryRecord Processor.
In the Properties tab, we can start writing our queries. For this example, let's
take the following CSV data:


purchase_no, customer_id, item_id, item_name, price, quantity
10280, 40070, 1028, Box of pencils, 6.99, 2
10280, 40070, 4402, Stapler, 12.99, 1
12440, 28302, 1029, Box of ink pens, 8.99, 1
28340, 41028, 1028, Box of pencils, 6.99, 18
28340, 41028, 1029, Box of ink pens, 8.99, 18
28340, 41028, 2038, Printer paper, 14.99, 10
28340, 41028, 4018, Clear tape, 2.99, 10
28340, 41028, 3329, Tape dispenser, 14.99, 10
28340, 41028, 5192, Envelopes, 4.99, 45
28340, 41028, 3203, Laptop Computer, 978.88, 2
28340, 41028, 2937, 24\" Monitor, 329.98, 2
49102, 47208, 3204, Powerful Laptop Computer, 1680.99, 1

In our Properties tab, we can click the "Add Property" button to add a new property.
Because we can add multiple SQL queries in a single Processor, we need a way to distinguish the results of each
query and route the data appropriately. As such, the name of the property is the name of the Relationship that
data matching the query should be routed to. We will create two queries. The first will be named "over.1000" and
will include the purchase_no and customer_id fields of any purchase that cost more than $1,000.00 and will also
include a new field named total_price that is the dollar amount for the entire purchase. Note that when entering
a value for a property in NiFi, you can use Shift + Enter to insert a newline in your value:


SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price
FROM FLOWFILE
GROUP BY purchase_no, customer_id
HAVING SUM(price * quantity) > 1000

The second property will be named "largest.order" and will contain the purchase_no, customer_id, and total price
of the most expensive single purchase (as defined by price times quantity) in the data:


SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price
FROM FLOWFILE
GROUP BY purchase_no, customer_id
ORDER BY total_price DESC
LIMIT 1

Now we will wire our QueryRecord processor up in our graph so that we can use it.
For this demo, we will simply use a GenerateFlowFile to feed it data. We will set the
"Custom Text" property to the CSV that we have shown above. In the "Scheduling" tab,
I'll configure the processor to run once every 10 seconds so that I don't flood the system with
data. We need to add a "schema.name"
attribute, so we will route the "success" relationship of GenerateFlowFile to an UpdateAttribute
processor. To this processor, we will add a new Property named "schema.name" with a value of
"hello-world" (to match the name of the schema that we added to the AvroSchemaRegistry service).
We will route the "success" relationship to QueryRecord.

Next, we will create two UpdateAttribute Processors and connect the "over.1000" relationship
to the first and the "largest.order" relationship to the other. This just gives us a simple
place to hold the data so that we can view it. I will loop the "failure" relationship back
to the QueryRecord processor so that if there is a problem evaluating the SQL, the data
will remain in my flow. I'll also auto-terminate the "original" relationship because once
I have evaluated the SQL, I don't care about the original data anymore. I'll also auto-terminate
the "success" relationship of each terminal UpdateAttribute processor. When we start the
QueryRecord, GenerateFlowFile, and the first UpdateAttribute processors, we end up with
a FlowFile queued up before each UpdateAttribute processor:

If we right-click on the "over.1000" connection and click "List queue," we are able to see
the FlowFiles in the queue:

Clicking the "information" icon on the left-hand-side of the FlowFile gives us the ability
to view the FlowFile's attributes and download or view its contents. We can click the "View"
button to view the content. Changing the "View as" drop-down at the top from "original" to
"formatted" gives us a pretty-printed version of the JSON, and we quickly see the results that we want:

Note that we have a null value for the columns that are defined in our schema that were not part of our results.
If we wanted to, We could certainly update our schema in order to avoid having these fields show up at all.

Viewing the FlowFiles of the "largest.order" connection shows us a single FlowFile also, with
the content that we expect there as well:

Of course, if we have already run the data through the flow and want to go back and inspect
that data and what happened to it, we can easily do that through NiFi's powerful
Data Provenance
feature.

Conclusion

Here, I have given but a small glimpse into the capabilities of the new QueryRecord Processor for NiFi.
The possibilities that it opens up are huge. I'd love to hear thoughts and ideas on how to make it even
better or questions that you may have! Feel free to reach out to the NiFi mailing list at users@nifi.apache.org
or dev@nifi.apache.org.