Announcing the release of Apache Samza 0.11.0

We are excited to announce that Apache Samza 0.11.0 has been released.

Samza is a stable and mature stream processing framework that has been powering real-time applications in production at various companies for a few years now. Samza has industry-leading support for stateful stream processing, with cutting-edge features like:

Support for RocksDB-based local state.
Incremental state checkpointing: This feature is unique compared to existing stream processing frameworks and allows Samza to support applications with large state very elegantly.
Minimal impact during application upgrades by minimizing state movement.

Deep support for local state allows a stateful application to scale up to 1.1 million events/sec on a single SSD-based machine.

The 0.11.0 release packs in several large improvements in runtime performance, operational stability and ease of use. Some of the key highlights include:

Asynchronous API and processing (SAMZA-863, doc): Prior to this release, Samza only supported a synchronous, single-threaded processing model. Increasing parallelism meant increasing the number of containers (processes), which required a lot more memory. This inefficiency was most obvious for applications that make remote calls to external services or databases. With this new feature, an application can increase parallelism very efficiently within a single container (process). In addition to a parallel processing model, we now also support a purely asynchronous processing model, which makes remote I/O a lot more efficient. Without this support, Samza applications that wanted to process messages asynchronously also had to handle the additional complexity of managing checkpointing themselves (by disabling auto-checkpointing in Samza). With the new support for async processing, Samza makes sure that checkpointing is automatically performed only after the async calls have completed.
Separate Samza framework deployment from user jobs (SAMZA-849, doc): Typically in a large organization the team that manages the Samza cluster is not the same as the teams that are running applications on top of Samza. This feature allows upgrading the Samza framework without forcing developers to explicitly upgrade their running applications. With simple config changes, it supports canary, upgrade and rollback scenarios commonly required in organizations that run tens or hundreds of jobs.
Samza REST API (SAMZA-865, doc): The REST API provides a rich set of operations for users to interact with their running jobs. It allows you to start, stop and list jobs, and also to run periodic monitoring scripts. This API can be integrated with deployment tooling and job dashboards for better job management.
Disk monitoring (SAMZA-924): A Samza YARN cluster is used to run several stream processing applications on a shared set of physical machines. In such a multi-tenant environment, it is critical to limit the amount of disk space used by each job, especially for storing application state. This feature introduces measurement of disk usage for selected job directories. Disk space usage is gathered periodically and reported to Samza metrics. In the next release, this feature will be extended to also enforce disk quotas.
New metrics to troubleshoot and monitor performance issues: SAMZA-972 added holistic monitoring of memory in Samza applications. With SAMZA-963 we added the ability to troubleshoot performance issues better by isolating the time spent in the application from the time spent in accessing state.
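
To make the asynchronous processing highlight above more concrete, here is a minimal, self-contained sketch of the underlying idea: messages are dispatched without blocking, and the checkpoint step is gated on every in-flight callback having completed. It does not use Samza's classes. The names `AsyncCheckpointSketch`, `Callback`, `processAsync` and `runAndCheckpoint` are hypothetical stand-ins chosen for illustration, and the "remote call" is simulated with a thread pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCheckpointSketch {

    /** Hypothetical completion callback; an illustration, not Samza's interface. */
    interface Callback {
        void complete();
        void failure(Throwable t);
    }

    /** Simulates a non-blocking remote call that signals the callback when done. */
    static void processAsync(String message, ExecutorService remote, Callback callback) {
        CompletableFuture
            .supplyAsync(() -> message.toUpperCase(), remote) // pretend remote lookup
            .whenComplete((result, error) -> {
                if (error != null) {
                    callback.failure(error);
                } else {
                    callback.complete();
                }
            });
    }

    /**
     * Dispatches every message without waiting for the previous one, then
     * "checkpoints" only once all callbacks have fired. Returns the number of
     * messages that had completed at checkpoint time.
     */
    static int runAndCheckpoint(List<String> messages) {
        ExecutorService remote = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();
        CompletableFuture<?>[] pending = new CompletableFuture<?>[messages.size()];
        try {
            for (int i = 0; i < messages.size(); i++) {
                CompletableFuture<Void> done = new CompletableFuture<>();
                pending[i] = done;
                processAsync(messages.get(i), remote, new Callback() {
                    @Override public void complete() {
                        completed.incrementAndGet();
                        done.complete(null);
                    }
                    @Override public void failure(Throwable t) {
                        done.completeExceptionally(t);
                    }
                });
            }
            // The checkpoint is gated on all in-flight callbacks.
            CompletableFuture.allOf(pending).join();
            return completed.get();
        } finally {
            remote.shutdown();
        }
    }

    public static void main(String[] args) {
        int checkpointed = runAndCheckpoint(Arrays.asList("a", "b", "c"));
        System.out.println("checkpointed after " + checkpointed + " completed messages");
    }
}
```

In this sketch the application code only has to invoke the callback; tracking which messages are still in flight and deciding when it is safe to checkpoint is handled by the surrounding loop, which mirrors the division of responsibility described for the new async support.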

Overall, 37 JIRAs were resolved in this release.

A source download of the 0.11.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

Project Status

A total of 62 contributors have contributed to the Samza Project so far. In this release 21,473 lines of code were added/changed.

With this release we also add 3 new committers to the Apache Samza community.

Recent Community Activities

There has been a lot of activity in the community during this release time frame. Here are links to some of it.

Conferences:
- Stream processing Meetup @ LinkedIn
  - Scalable Complex Event Processing on Samza @Uber (Uber)
  - How to convert a legacy Hadoop Map/Reduce ETL systems to Samza Streaming (TripAdvisor)
  - Air Traffic Controller: Using Samza to Manage Communications with Members (LinkedIn)
  - Nearline Topic Tagging with Apache Samza (LinkedIn)
- Detailed list of links to other presentations can be found here

Blogs:
- Streaming Processing Hard Problems - Killing Lambda
- Streaming Processing Hard Problems - Data Access

Contribute!

There are a lot more exciting features to expect in our future releases. Some of them are:

Samza operators API (SAMZA-914)
HDFS system consumer (SAMZA-967)
Support for standalone Samza jobs (SAMZA-516)
Disk quotas enforcement (SAMZA-956)

It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss the upcoming meetup on November 2. Sign up now!

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.