By Michael Russo

Based on recent testing at Apigee, the upcoming Apache Usergrid 2 release is set to be the most scalable open-source Backend as a Service available. We were able to drive Usergrid to 10,000 transactions per second and, more importantly, found that Usergrid can scale horizontally. Here's the story of how we got there.

What is Usergrid?

Apache Usergrid is a software stack that enables you to run a (Backend-as-a-Service) BaaS that can store, index, and query JSON objects. It also enables you to manage assets and provide authentication, push notifications, and a host of other features useful to developers—especially those working on mobile apps.

The project recently graduated from the Apache Incubator and is now a top-level project of the Apache Software Foundation (ASF). Usergrid is new at Apache, but Apigee has been using it in production for three years now as the foundation for Apigee's API BaaS product.

What's new in Usergrid 2?

Usergrid 2 features the same REST API as Usergrid 1, but under the hood just about everything has changed. Usergrid 2 includes a completely new persistence engine backed by Apache Cassandra and a query/indexing service backed by ElasticSearch. We brought ElasticSearch into Usergrid because the query/index service in Usergrid 1 was not performing well and was complex and difficult to maintain. ElasticSearch does a much better job of query/index than we could have done ourselves. Additionally, separating key-value persistence from index/query allows us to scale each concern separately.

As the architecture of Usergrid changed drastically, we needed to have a new baseline performance benchmark to make sure the system scaled as well as, if not better than, it did before. Let's talk about how we tested.

Our testing framework and approach

The Usergrid team has invested a lot of time building repeatable test cases using the Gatling load-testing framework. Performance is a high priority for us and we need a way to validate performance metrics for every release candidate.

As Usergrid is open source, so are our Usergrid-specific Gatling scenarios, which you can find here: stack/loadtests (on Github).

Usergrid application benchmark

One of our goals was to prove that we had the ability to scale more requests per second with more hardware, so we started small and worked our way up.

As the first in our series of new benchmarking for Usergrid, we wanted to start with a trivial use case to establish a solid baseline for the application. All testing scenarios use the HTTP API and test the concurrency and performance of the requests. We inserted a few million entities that we could later read from the system. The test case itself was simple. Each entity has a UUID (universally unique identifier) property. For all the entities we had inserted, we randomly read them out by their UUID:

GET /organization/application/collection/:entityUUID

First, we tried scaling the Usergrid application by its configuration. We configured a higher number of connections to use for Cassandra and a higher number of threads for Tomcat to use. This actually yielded higher latencies and system resource usage for marginally the same throughput. We saw better throughput when there was less concurrency allowed. This made sense, but we needed more, and immediately added more Usergrid servers to verify horizontal scalability. What will it take to get to 10,000 RPS?

# Usergrid Servers # Cassandra Nodes Peak Requests Per Second
6 4 1420
6 6 2248
10 6 3324
20 6 3820
Switch to nine c3.2xlarge instances for Cassandra
20 9 6321
Switch to nine c3.4xlarge instances for Cassandra
20 9 7237
30 9 9120
35 9 10268

Cassandra Performance

It was time to see if Cassandra was keeping up. As we scaled up the load we found Cassandra read operation latencies were also increasing. Shouldn't Cassandra handle more, though? We observed a single Usergrid read by UUID was translating to about 10 read operations to cassandra. Optimization #1: reduce the number of read operations from Cassandra on our most trivial use case. Given what we know, we still decided to test up to a peak 10,000 RPS in the current state.

RPS screen shot

RPS statistics screen shot/></p>
<p>The cluster was scaled horizontally (more nodes) until we needed to vertically scale (bigger nodes) Cassandra due to high CPU usage.  We stopped at 10,268 Requests Per Second with 35 c3.xlarge Usergrid servers and 9 c3.4xlarge Cassandra nodes. By this point numerous opportunities for improvement were identified in the codebase, and we had already executed on some. We fully expect to reach the same throughput with much less infrastructure in the coming weeks. In fact, we've already reached ~7,800 RPS with only 15 Usergrid servers since our benchmarking.</p>
<h3>Deployment/Runtime Architecture</h3>
<p>Here are the components that we used in our Usergrid performance testing:</p>
<li>Tomcat 7.0.62 where the Usergrid WAR file is deployed</li>
<li>Cassandra 2.0.15 w/ Astyanax client</li>
<li>Elasticsearch 1.4.4 (not utilized in these tests)</li>
<p>As part of benchmarking, we wanted to ensure that all configurations and deployment scenarios exactly matched how we would run a production cluster.  The main configurations that are recommended for production use of Usergrid are:</p>
<table style= Usergrid (Application) Tomcat (Container) Cassandra (Database) 1 LOCAL QUORUM read and write consistency set for Cassandra operations Blocking IO connector ( required in Usergrid) 6+ node cluster size 2 Configure separate Keyspace used for Locks vs. Main Usergrid application Use HTTP 1.1 and ensure keepAlive is configured 3 Configure max # of connections used per Cassandra node to be 15 Non-SSL connector (SSL typically handled by a load balancer) Replication Factor = 3

What's Next?

As part of this testing, not only did we identify code optimizations that we can quickly fix for huge performance gains, we also learned more about tuning our infrastructure to handle high concurrency. Having this baseline gives us the motivation to continually improve performance of the Usergrid application, reducing the cost for operating a BaaS platform at huge scale.

This post is just the start of our performance series. Stay tuned, as we’ll be publishing more results in the future for the following Usergrid scenarios:

  • Query performance - this includes complex graph and geo-location queries
  • Write performance - performance of directly writing entities as well as completing indexing
  • Push Notification performance - this is a combination of query and write performance

See you next time!