Scaling Usergrid to over 10,000 requests/second - Part 1
By Michael Russo
Based on recent testing at Apigee, the upcoming Apache Usergrid 2 release is set to be the most scalable open-source Backend as a Service available. We were able to drive Usergrid to 10,000 requests per second and, more importantly, found that Usergrid can scale horizontally. Here's the story of how we got there.
What is Usergrid?
Apache Usergrid is a software stack that enables you to run a Backend-as-a-Service (BaaS) that can store, index, and query JSON objects. It also enables you to manage assets and provide authentication, push notifications, and a host of other features useful to developers, especially those working on mobile apps.
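For example, a client can create and query JSON entities with simple REST calls along the lines of the two requests below (the organization, application, collection, and property names here are placeholders, not taken from our tests):
POST /my-org/my-app/pets (JSON body: { "name": "max", "color": "black" })
GET /my-org/my-app/pets?ql=select * where name = 'max'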
The project recently graduated from the Apache Incubator and is now a top-level project of the Apache Software Foundation (ASF). Usergrid is new at Apache, but Apigee has been using it in production for three years now as the foundation for Apigee's API BaaS product.
What's new in Usergrid 2?
Usergrid 2 features the same REST API as Usergrid 1, but under the hood just about everything has changed. Usergrid 2 includes a completely new persistence engine backed by Apache Cassandra and a query/indexing service backed by ElasticSearch. We brought ElasticSearch into Usergrid because the query/index service in Usergrid 1 was not performing well and was complex and difficult to maintain. ElasticSearch does a much better job of query/index than we could have done ourselves. Additionally, separating key-value persistence from index/query allows us to scale each concern separately.
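Conceptually, the split looks something like the sketch below. These are not Usergrid's actual internal interfaces, just an illustration of the two concerns: key-value persistence (backed by Cassandra) is the source of truth, while indexing and querying (backed by ElasticSearch) is a separate service, so each can be scaled independently.

```scala
import java.util.UUID

// Conceptual sketch only; not Usergrid's internal API.
// Key-value persistence and index/query are separate concerns
// that can be scaled independently of each other.
trait EntityStore {                              // backed by Cassandra
  def write(id: UUID, json: String): Unit        // source of truth for entity data
  def readByUuid(id: UUID): Option[String]       // direct key lookup, no index involved
}

trait EntityIndex {                              // backed by ElasticSearch
  def index(id: UUID, json: String): Unit        // index the JSON document
  def query(ql: String): Seq[UUID]               // return the ids of matching entities
}
```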
Because the architecture of Usergrid changed so drastically, we needed a new baseline performance benchmark to make sure the system scaled as well as, if not better than, it did before. Let's talk about how we tested.
Our testing framework and approach
The Usergrid team has invested a lot of time building repeatable test cases using the Gatling load-testing framework. Performance is a high priority for us and we need a way to validate performance metrics for every release candidate.
Like Usergrid itself, our Gatling scenarios are open source; you can find them at stack/loadtests (on GitHub).
Usergrid application benchmark
One of our goals was to prove that we could serve more requests per second by adding more hardware, so we started small and worked our way up.
As the first in our new series of Usergrid benchmarks, we wanted to start with a trivial use case to establish a solid baseline for the application. All test scenarios exercise the HTTP API and measure the concurrency and performance of requests. We inserted a few million entities that we could later read back from the system. The test case itself was simple: each entity has a UUID (universally unique identifier) property, and we read the inserted entities back at random by that UUID:
GET /organization/application/collection/:entityUUID
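The published scenarios in stack/loadtests are more elaborate, but a minimal Gatling simulation for this read-by-UUID test looks roughly like the sketch below. The host, CSV file, field names, and injection rate are placeholders for illustration, not the values we actually used.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Minimal sketch of a read-by-UUID load test; not the actual simulation
// from stack/loadtests. Host, CSV file, and field names are placeholders.
class RandomReadByUuidSimulation extends Simulation {

  // Usergrid REST endpoint under test (placeholder host).
  val httpConf = http.baseURL("http://usergrid.example.com")

  // CSV file with a header line "entityUUID" listing the UUIDs of the
  // previously inserted entities; ".random" picks rows at random.
  val uuidFeeder = csv("entity-uuids.csv").random

  val readByUuid = scenario("Read entity by UUID")
    .feed(uuidFeeder)
    .exec(
      http("GET entity by UUID")
        .get("/my-org/my-app/my-collection/${entityUUID}")
        .check(status.is(200))
    )

  // Hold a constant arrival rate and read latencies off the Gatling report.
  setUp(
    readByUuid.inject(constantUsersPerSec(500) during (5 minutes))
  ).protocols(httpConf)
}
```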
First, we tried scaling the Usergrid application through configuration alone. We configured a higher number of connections for Cassandra and a higher number of threads for Tomcat. This actually yielded higher latencies and higher system resource usage for roughly the same throughput; we saw better throughput when less concurrency was allowed. That made sense, but it wasn't enough, so we immediately added more Usergrid servers to verify horizontal scalability. What would it take to get to 10,000 RPS?
| # Usergrid Servers | # Cassandra Nodes | Peak Requests Per Second |
|---|---|---|
| 6 | 4 | 1420 |
| 6 | 6 | 2248 |
| 10 | 6 | 3324 |
| 20 | 6 | 3820 |
| Switch to nine c3.2xlarge instances for Cassandra | | |
| 20 | 9 | 6321 |
| Switch to nine c3.4xlarge instances for Cassandra | | |
| 20 | 9 | 7237 |
| 30 | 9 | 9120 |
| 35 | 9 | 10268 |
Cassandra Performance
It was time to see whether Cassandra was keeping up. As we scaled up the load, we found that Cassandra read operation latencies were also increasing. Shouldn't Cassandra handle more, though? We observed that a single Usergrid read by UUID translated into roughly 10 read operations against Cassandra. Optimization #1: reduce the number of Cassandra read operations for our most trivial use case. Even knowing this, we decided to keep testing up to a peak of 10,000 RPS in the current state.
What's Next?
As part of this testing, we not only identified code optimizations that we can quickly make for huge performance gains, we also learned more about tuning our infrastructure to handle high concurrency. Having this baseline gives us the motivation to continually improve the performance of the Usergrid application, reducing the cost of operating a BaaS platform at huge scale.
This post is just the start of our performance series. Stay tuned, as we’ll be publishing more results in the future for the following Usergrid scenarios:
- Query performance - this includes complex graph and geo-location queries
- Write performance - performance of directly writing entities as well as completing indexing
- Push Notification performance - this is a combination of query and write performance
See you next time!