We are excited to announce that Apache Samza 0.12.0 has been released.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for a few years now. Samza provides leading support for large-scale stateful stream processing with features such as:
- First-class support for local state (with a RocksDB store). This allows a stateful application to scale up to 1.1 million events/sec on a single SSD-based machine.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- Minimal impact during application maintenance.
In addition to general stream processing capabilities, Samza also supports:
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams) and outputs (e.g. HDFS, Kafka, ElastiCache). This allows applications to process data directly from various event sources without first moving the data into Kafka.
- A fully async programming model. This allows applications that make remote calls to increase parallelism very efficiently.
- Features like canaries, upgrades and rollbacks that support extremely large deployments.
This 0.12.0 release adds several features to Samza to improve stability, performance and ease of use. Here are some highlights of this release.
Convergence of Batch and Real-time processing in Samza:
End of Stream support: Samza has always supported streaming input sources like Kafka. For such sources, the incoming stream of data is assumed to be infinite. Samza now has an ‘end-of-stream’ notion to support consuming from input sources that are finite (for example, on-disk files). This enables a Samza job to shut down gracefully once it has finished consuming all of its data.
HDFS Consumer: Samza now provides first-class support for consuming data from HDFS files. This enables developers to define their processing logic once, and run it in both batch and streaming environments. This feature also allows for rapid experimentation with ETL’d HDFS data using Samza without the need to write a separate Hadoop job. (SAMZA-967)
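To make the end-of-stream idea concrete, here is a minimal sketch in plain Java. It does not use Samza's API: the class `EndOfStreamSketch`, the `Processor` interface and the `run` method are hypothetical names made up for illustration. It only shows the shape of the behavior described above, where processing logic is written once and the run loop exits when every finite partition has been fully consumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, NOT Samza's API: each partition is a finite list, and the
// run loop stops once every partition has reported end-of-stream.
class EndOfStreamSketch {

    /** Processing logic written once; it does not know whether the input is bounded. */
    interface Processor {
        void process(String partition, String message);
    }

    /** Drains all partitions round-robin and returns the number of messages processed. */
    static int run(Map<String, List<String>> partitions, Processor processor) {
        Map<String, Iterator<String>> cursors = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : partitions.entrySet()) {
            cursors.put(e.getKey(), e.getValue().iterator());
        }
        Set<String> finished = new HashSet<>();
        int processed = 0;
        // An unbounded source would never satisfy this exit condition; a finite one does.
        while (finished.size() < cursors.size()) {
            for (Map.Entry<String, Iterator<String>> e : cursors.entrySet()) {
                Iterator<String> it = e.getValue();
                if (it.hasNext()) {
                    processor.process(e.getKey(), it.next());
                    processed++;
                } else {
                    finished.add(e.getKey()); // end-of-stream reached for this partition
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("file-0", Arrays.asList("a", "b", "c"));
        input.put("file-1", Arrays.asList("d"));
        List<String> seen = new ArrayList<>();
        int count = run(input, (partition, message) -> seen.add(partition + ":" + message));
        System.out.println(count + " messages processed: " + seen);
    }
}
```

The same `Processor` could be handed to a loop over an unbounded source; only the termination condition differs, which is the point of the convergence described here.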
Samza can now notify the SystemConsumer when performing a checkpoint. This allows Samza to support consumers for systems such as Amazon Kinesis, Amazon SQS, Azure Service Bus Queues/Topics, Google Cloud Pub/Sub and ActiveMQ, each of which manages checkpointing on its own. It also enables consumers to implement smart retention policies (such as deleting data once it has been consumed). (SAMZA-1042)
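As a rough illustration of why a checkpoint notification is useful, the sketch below models a consumer for a queue-like system that deletes messages only after they have been checkpointed. It is plain Java and does not use Samza's classes: `CheckpointCallback`, `AckingConsumer` and their methods are hypothetical, and treating the checkpointed offset as inclusive is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model, NOT Samza's API: shows a retention policy driven by
// checkpoint notifications.
class CheckpointNotifySketch {

    /** Stand-in for whatever hook the framework invokes after writing a checkpoint. */
    interface CheckpointCallback {
        void onCheckpoint(Map<String, Long> checkpointedOffsets);
    }

    /** Keeps delivered messages until a checkpoint covers them, then deletes them. */
    static class AckingConsumer implements CheckpointCallback {
        // partition -> (offset -> message) still retained by the simulated external system
        private final Map<String, TreeMap<Long, String>> retained = new HashMap<>();

        void deliver(String partition, long offset, String message) {
            retained.computeIfAbsent(partition, k -> new TreeMap<>()).put(offset, message);
        }

        @Override
        public void onCheckpoint(Map<String, Long> checkpointedOffsets) {
            for (Map.Entry<String, Long> e : checkpointedOffsets.entrySet()) {
                TreeMap<Long, String> messages = retained.get(e.getKey());
                if (messages != null) {
                    // headMap(..., true) is a live view; clearing it removes every
                    // offset up to and including the checkpointed one (assumption: inclusive).
                    messages.headMap(e.getValue(), true).clear();
                }
            }
        }

        int retainedCount(String partition) {
            TreeMap<Long, String> messages = retained.get(partition);
            return messages == null ? 0 : messages.size();
        }
    }

    public static void main(String[] args) {
        AckingConsumer consumer = new AckingConsumer();
        consumer.deliver("queue-0", 1L, "a");
        consumer.deliver("queue-0", 2L, "b");
        consumer.deliver("queue-0", 3L, "c");
        Map<String, Long> checkpoint = new HashMap<>();
        checkpoint.put("queue-0", 2L);
        consumer.onCheckpoint(checkpoint);
        System.out.println("retained after checkpoint: " + consumer.retainedCount("queue-0"));
    }
}
```

Deleting only on checkpoint means a restart replays the messages not yet covered, which keeps at-least-once behavior for systems that would otherwise drop data as soon as it is read.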
Support for Yarn Node Labels:
Samza YARN clusters often have machines that are not homogeneous. For example, nodes could differ in memory, CPUs, or storage (spinning disks versus SSDs). With this feature, users can assign “labels” to nodes in their YARN cluster and use them to specify where their Samza job should run. This allows flexibility in scheduling jobs based on trade-offs between resource requirements, performance and hardware costs. For example, stateful jobs can be configured to run on nodes with SSDs while stateless jobs can be configured to run on nodes with spinning disks. (SAMZA-1013)
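The placement idea can be sketched in a few lines of plain Java. This is a deliberately simplified model, not YARN's scheduler or Samza's configuration: `NodeLabelSketch` and `eligibleNodes` are hypothetical names, each node carries at most one label, and a request without a label is assumed to match only unlabeled nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of label-based placement; NOT YARN's scheduler.
class NodeLabelSketch {

    /**
     * Returns the nodes whose label equals the requested label.
     * A null or empty label means "unlabeled" on both sides (assumption of this sketch).
     */
    static List<String> eligibleNodes(Map<String, String> nodeLabels, String requestedLabel) {
        String wanted = requestedLabel == null ? "" : requestedLabel;
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : nodeLabels.entrySet()) {
            String label = e.getValue() == null ? "" : e.getValue();
            if (label.equals(wanted)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("node-1", "ssd");
        cluster.put("node-2", "spinning");
        cluster.put("node-3", "ssd");
        cluster.put("node-4", "");
        // A stateful job asks for SSD nodes; a stateless job asks for spinning disks.
        System.out.println("stateful job:  " + eligibleNodes(cluster, "ssd"));
        System.out.println("stateless job: " + eligibleNodes(cluster, "spinning"));
    }
}
```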
This release also includes several critical bug-fixes and improvements for operational stability.
Some notable ones include:
- HttpFileSystem timeout for blocking reads when localizing containers (SAMZA-1079).
- SamzaContainer should catch all Throwables instead of only exceptions (SAMZA-1077).
- Deadlock between KafkaSystemProducer and KafkaProducer from kafka-clients lib (SAMZA-1069).
- Change the commit order to support at-least-once processing when deduping with local store (SAMZA-1065).
- Upgraded Kafka version to 0.10. This enables us to take advantage of the critical fixes and improvements in Kafka.
- Upgraded to Jetty 9 from Jetty 8.
- Full support for Scala 2.11. All Samza jars now include the Scala version (2.11) as part of their file name. For example, samza-yarn_2.11-0.12.jar.
- Samza is now source compatible with JDK 8 and above. Older JDKs are no longer supported.
We have made great community progress since the last release. We held two successful meetups where we presented Samza’s roadmap and how Optimizely uses Samza. Several Samza use cases at Uber and LinkedIn were featured at QCon 2016.
- Conferences and talks:
- QCon November 2016: Scaling up Near real-time Analytics
- Samza meetup Nov 2016: Apache Samza: Past, Present, and Future
- Samza meetup Feb 2017: Batch to Streaming analytics at Optimizely
- Samza meetup Feb 2017: Async processing and multi-threading in Samza
- The entire list of links to other presentations can be found here
There are a lot of exciting features to expect in our upcoming releases. Here are some highlights:
- Support for Disk quota enforcement and throttling (SAMZA-956)
- Support for high-level programming API for stream processing (SAMZA-1073)
- Support for running Samza in stand-alone mode (SAMZA-516)
It’s a great time to get involved. You can start by reviewing the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.