Say Good-Bye to Canned Data

Mark Payne - markap14@hotmail.com


We've all been there. After months of development and exhaustive testing, our killer new web service (or app or analytic or what have you) is ready for production!
We've tested it with all of the random data that we've mocked up, including all of the samples we've concocted to ensure that it handles every bad
input we can imagine. It has handled all of it well, so it's time to deploy to production. So we do. And all is great!

Until an hour later, when our logs start filling with errors because, despite all of our due diligence in testing, we never could have
envisioned getting that as input. And now we're in a mad frenzy to fix the problem, because we're responsible for all of the
errors that are happening in production.

If there's one thing that I've learned in my years as a software developer, it's that no matter how diligent we are in testing our code,
we get data in production that we just haven't accounted for.

So what can we do about it? Test with live production data!

Now I'm not suggesting that we skip the testing phase altogether and go straight to production - quite the opposite, really. I'm just suggesting that we test
"smarter, not harder." One of the benefits of Apache NiFi (incubating) is that it allows us to have real-time command and control of our data. It allows us to change our
dataflow in just a few minutes to send multiple
copies of our data to anywhere we want, while providing different reliability guarantees and qualities of service to different parts of our flow. So let's look at how we might
accomplish this.

Let's assume that we typically get our feed of data to our web service from NiFi in a flow that looks something like this:

Original Flow

Now let's say that we want to send a duplicate copy of this data feed to our "staging" environment, which is running the new version of our web service - or a new web service altogether. We can simply copy and paste the
InvokeHTTP processor that we're using to send data to our production instance (select the processor and press Ctrl-C, then Ctrl-V) and then right-click on the new copy and choose "Configure..." In the Properties tab, we change the URL
to point to our staging instance. Now all we have to do is draw a second connection from the preceding processor to our newly created InvokeHTTP, giving it the same relationship that feeds the production instance - "splits."
And now any data that goes to our production instance will also be sent to our staging environment:

Flow with second InvokeHTTP
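As an aside, if you'd rather script a change like this than make it through the UI, NiFi's user interface is itself built on a REST API, so the same property change can be made programmatically. Here is a rough Python sketch of that idea; the base URL, the processor id, and the "Remote URL" property name are placeholders for illustration, and the exact request shape can vary between NiFi versions:

```python
import requests

# Placeholders for illustration: use your own NiFi address and the id of the
# copied InvokeHTTP processor. The request shape shown here follows later NiFi
# releases and may differ in older versions.
NIFI_API = "http://localhost:8080/nifi-api"
PROCESSOR_ID = "id-of-the-copied-invokehttp"

# Fetch the processor first so that we can send back its current revision.
entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()

update = {
    "revision": entity["revision"],
    "component": {
        "id": PROCESSOR_ID,
        "config": {
            # Point the copied processor at staging instead of production.
            "properties": {"Remote URL": "http://staging.example.com/myService"},
        },
    },
}

requests.put(f"{NIFI_API}/processors/{PROCESSOR_ID}", json=update).raise_for_status()
```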

Of course, since we know this is a staging environment, it may not be as powerful as the production instance, and it may not be able to handle the entire stream of live data.
What we'd really like to do is send only about 20% of our data to our staging
environment. So how can we accomplish that? We can easily insert a DistributeLoad Processor just ahead of our new InvokeHTTP processor, like so:

Flow with DistributeLoad added in

We can now configure the DistributeLoad Processor to have two relationships: we want 80% of the load to go to relationship "1" and 20% to go to relationship "2." We can accomplish this by adding
two user-defined properties. We right-click on DistributeLoad and choose "Configure..." In the Properties tab, we click the icon to add a new property,
give it the name "1" and a value of "80," and click OK. Then we add another property, this time with
the name "2" and a value of "20":

Configure DistributeLoad
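To make the weighting concrete, here is a small Python sketch of the behavior we just configured - a weighted round-robin in which, for every 100 FlowFiles, 80 are routed to relationship "1" and 20 to relationship "2." This is only an illustration of the idea, not NiFi's actual DistributeLoad implementation:

```python
import itertools

# The weights we configured on DistributeLoad: relationship "1" gets 80 out of
# every 100 FlowFiles, relationship "2" gets the other 20.
weights = {"1": 80, "2": 20}

# One full "cycle" contains each relationship name repeated in proportion to
# its weight; cycling through it yields a weighted round-robin.
cycle = itertools.cycle(
    [name for name, weight in weights.items() for _ in range(weight)]
)

routed = {"1": 0, "2": 0}
for _flowfile in range(1000):        # pretend 1,000 FlowFiles arrive
    routed[next(cycle)] += 1

print(routed)                        # {'1': 800, '2': 200}
```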

Now, all we have to do is go to the Settings tab
and choose to Auto-Terminate Relationship 1. This will throw away 80% of our data. (Not to worry - we don't actually make copies of the data just to throw 80% of it away. The work required
to "clone" a FlowFile is very small, because it doesn't copy any of the content; it simply creates a new pointer to the content.) Now we add a Connection from DistributeLoad to InvokeHTTP and
use Relationship "2." Start the Processors, and we've now got 20% of the data being pushed to our staging environment:

20% of data going to staging area
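That parenthetical about cloning is worth a second look. A simplified model of the idea - again, not NiFi's actual code - is that a clone gets its own copy of the small attribute map while sharing the same pointer to the content, so the payload itself is never duplicated:

```python
from dataclasses import dataclass

# A toy model of a FlowFile: a small map of attributes plus a reference to the
# content. "Cloning" copies the attributes but reuses the same content
# reference, so no payload bytes are copied.
@dataclass
class FlowFile:
    attributes: dict
    content_claim: bytes   # stand-in for a pointer into the content repository

    def clone(self) -> "FlowFile":
        return FlowFile(attributes=dict(self.attributes),
                        content_claim=self.content_claim)

original = FlowFile({"filename": "data.csv"}, content_claim=b"...many megabytes...")
cloned = original.clone()

# Both FlowFiles share the very same content object, so the clone is cheap
# no matter how large the payload is.
print(cloned.content_claim is original.content_claim)   # True
```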

Now, we have just one more concern to think about. Since we're sending data to our staging area, which may go down pretty often as we debug and test things, won't the data
back up in NiFi on our production dataflow? At what point is this going to cause a problem?

This is where my earlier comment about NiFi providing "different reliability guarantees and qualities of service to different parts of our flow" comes in. For this endpoint, we just want to make a best effort to send the data; if the staging environment can't keep up,
we don't want to risk causing issues in our production environment. To make sure of that, we right-click on the connection that feeds the new InvokeHTTP processor and click "Configure..." (You'll first have to stop
the connection's source and destination processors in order to modify it.) In the Settings tab, we have an option for "FlowFile Expiration." The default is "0 sec," which means that the data will never
age off. Let's change this value to 3 minutes.

Now, when we click Apply, we can see that the Connection's label has a small "clock" icon on it, indicating that the connection has an expiration set.

Age off after 3 minutes

Any FlowFile in the connection that becomes more than 3 minutes old will automatically be deleted. This means that we will buffer up to three minutes' worth of data destined for the staging environment in our production instance, but no more. We still will not expire any data that is waiting to go to the production instance.

Because of this capability, we can extend our example a bit to perform load testing as well. While in this example we decided that we only wanted to send 20% of our data to the staging environment, we could just as easily remove the DistributeLoad processor altogether. That way, we send 100% of our production data to the staging environment for as long as it can keep up with the data rate. If it falls behind, it won't hurt our production flow, because NiFi will simply expire any data that is more than 3 minutes old. And if any concerns arise, we can disable the staging feed in a matter of seconds: simply stop the DistributeLoad processor and the SplitText processor that feeds it, remove the connection between them, and restart the SplitText processor.
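If it helps to picture what that expiration buys us, here is one last tiny Python sketch of the age-off behavior: anything that has been sitting in the queue longer than the expiration window is simply dropped instead of being delivered. Again, this is just an illustration of the concept, not NiFi's implementation:

```python
import time
from collections import deque

EXPIRATION_SECONDS = 3 * 60   # the 3-minute FlowFile Expiration we configured

# Each queued item remembers when it was enqueued. Before handing data to the
# (possibly slow or offline) staging endpoint, anything older than the
# expiration window is discarded rather than delivered.
queue = deque()               # items are (enqueue_time, flowfile)

def enqueue(flowfile):
    queue.append((time.time(), flowfile))

def next_unexpired():
    while queue:
        enqueued_at, flowfile = queue.popleft()
        if time.time() - enqueued_at > EXPIRATION_SECONDS:
            continue          # too old: age it off instead of sending it
        return flowfile
    return None               # nothing waiting that is still fresh enough
```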

As always, I'd love to hear feedback in the Comments section about how we could improve, or how you've solved similar problems.