Announcing the release of Apache Samza 0.13.0

We are very excited to announce the release of Apache Samza 0.13.0.

Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for years now. Samza provides leading support for large-scale stateful stream processing with:

• First-class support for local state (with a RocksDB store). This allows a stateful application to scale up to 1.1 million events/sec on a single machine with an SSD.

• Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.

• A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams) and output systems (e.g. HDFS, Kafka, ElastiCache).

• A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.

• Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.

New features

The 0.13.0 release contains previews for the following highly anticipated features:

High Level API

With the new high-level API you can express complex stream processing pipelines concisely in a few lines of code and accomplish what previously required multiple jobs. The new API supports common operations like re-partitioning, windowing, and joining streams. Check out some examples to see the high-level API in action here.
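To make the terms "re-partitioning" and "windowing" concrete, here is a minimal sketch in plain Java. It deliberately does not use the Samza API; the class, the PageView type, and both method names are hypothetical and exist only for illustration. It shows the two ideas on an in-memory list: picking a partition from a key so that equal keys land together, and counting events per key in fixed, non-overlapping (tumbling) windows. Timestamps are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {

    /** Hypothetical event type: a key (a page id) plus an event time in milliseconds. */
    static final class PageView {
        final String pageId;
        final long timestampMs;

        PageView(String pageId, long timestampMs) {
            this.pageId = pageId;
            this.timestampMs = timestampMs;
        }
    }

    /** Re-partitioning idea: derive the partition from the key so equal keys land together. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Windowing idea: count events per key within fixed, non-overlapping windows. */
    static Map<String, Integer> countPerWindow(List<PageView> events, long windowMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (PageView e : events) {
            // Start of the tumbling window that contains this event (assumes timestampMs >= 0).
            long windowStart = e.timestampMs - (e.timestampMs % windowMs);
            counts.merge(e.pageId + "@" + windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> events = new ArrayList<>();
        events.add(new PageView("home", 1_000L));
        events.add(new PageView("home", 4_000L));
        events.add(new PageView("jobs", 4_500L));
        events.add(new PageView("home", 11_000L));

        // Prints {home@0=2, home@10000=1, jobs@0=1}
        System.out.println(countPerWindow(events, 10_000L));
    }
}
```

In a real streaming job these steps run continuously over unbounded input and the framework manages the state and the shuffling between partitions; the sketch only shows the per-event arithmetic behind each operation.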

Flexible Deployment Model

Samza now provides the flexibility to run your application in any hosting environment and with cluster managers other than YARN. Samza can also be run as a lightweight stream processing library embedded inside your application. Your processes can coordinate task distribution among themselves using ZooKeeper or static partition assignments out of the box.

See more details and code examples here.

Enhancements, Upgrades and Bug Fixes

This release also includes the following enhancements to existing features:

SAMZA-871 adds a heart-beat mechanism between JobCoordinator and all running containers to prevent orphaned containers.
SAMZA-1140 enables non-blocking commit in the AsyncRunloop.
SAMZA-1143 adds configurations for localizing general resources in YARN.
SAMZA-1145 provides the ability to configure the default number of changelog replicas.
SAMZA-1154 adds a tasks endpoint to samza-rest to get information about all tasks in a job.
SAMZA-1158 adds a samza-rest monitor to clean up stale local stores from completed containers.

This release also includes several bug fixes and improvements for operational stability. Some notable ones are:

SAMZA-1083 prevents loading task stores that are older than delete tombstones during container startup.
SAMZA-1100 fixes an exception when using an empty stream as both bootstrap and broadcast.
SAMZA-1112 fixes BrokerProxy to log fatal errors.
SAMZA-1121 fixes StreamAppender so that it doesn't propagate exceptions to the caller.
SAMZA-1157 fixes logging for serialization/deserialization errors.

We've also made the following dependency changes:

Scala 2.12 is now supported.
Kafka is upgraded to 0.10.1.1.
Elasticsearch is upgraded to 2.2.0.

Community Developments

We've made great community progress since the previous release. We showcased how Samza is powering stream processing at LinkedIn at Kafka Summit 2017 and O’Reilly Strata 2017. We also presented Samza use cases and case studies from several large companies at ApacheCon Big Data 2017. In addition, the Samza talk at LinkedIn's Stream Processing Meetup in Sunnyvale was well received, with over 200 attendees. Here are links to some of these events:

March 15, 2017 - Processing millions of events per second without breaking the bank - Kartik Paramasivam (Video)
May 8, 2017 - Data Processing at LinkedIn with Apache Kafka and Apache Samza (Kafka Summit NYC 2017) (Slides)
May 16, 2017 - What it takes to process a trillion events a day? Case studies in scaling stream processing at LinkedIn - Jagadish Venkatraman (ApacheCon Big Data '17) (Slides)
May 16, 2017 - The continuing story of Batching to Streaming analytics at Optimizely - Michael Borsuk (ApacheCon Big Data '17) (Slides)
May 24, 2017 - Managed or stand alone, streaming or batch; Unified processing with the Samza Fluent API - Yi Pan (LinkedIn Stream Processing Meetup) (Slides)
May 25, 2017 - How companies are using Apache Samza - Jagadish Venkatraman (ApacheCon podcast)

Future

We'll continue improving the new High Level API and flexible deployment features with your feedback.

It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.