I am excited to announce that Apache Samza 0.10.1 has been released. This is our fourth release as an Apache Top-level Project!
Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It was originally created at LinkedIn, where it continues to be used in production. The project is under active development, with contributions from a diverse group of contributors and committers. Samza is also used in production by many other companies in the industry (such as Netflix, Uber, and TripAdvisor; see PoweredBy).
Overall, 72 JIRAs were resolved in this release. This is a minor release consisting of bug fixes and robustness improvements to features such as the coordinator stream and host affinity. Samza continues to require Java 1.7+ and YARN 2.6.1+.
A few notable enhancements are:
- Support static partition assignment in ProcessJobFactory (SAMZA-41)
- Slow start of Samza jobs with large number of containers (SAMZA-843)
- Change log not working properly with In memory Store (SAMZA-889)
- Refactor and fix Container allocation logic (SAMZA-866)
- Detect partition count changes in input streams (SAMZA-882)
- Host Affinity - State restore doesn't work if the previous shutdown was uncontrolled (continuous offset) (SAMZA-905)
- Broadcast stream is not added properly in the prioritized tiers in the DefaultChooser (SAMZA-944)
Some notable performance improvements are:
- Improve the performance of the continuous OFFSET checkpointing for logged stores (SAMZA-964)
- Host Affinity - Minimize task reassignment when container count changes (SAMZA-906)
- Improve event loop timing metrics (SAMZA-951)
- Avoid unnecessary flushes in CachedStore (SAMZA-873)
Known issues in this release:
- Incompatible change in Kafka producer that does not honor custom partitioners (SAMZA-839)
We've also made a lot of community progress during this release:
- We held two successful meetups, one in February and the other in June. The next meetup is scheduled for August 23.
- Apache Samza was presented at the Apache Big Data (North America) conference in May 2016 and at the Hadoop Summit in June 2016. Check out the content here.
- Samza papers were also accepted at notable academic conferences:
  - "SamzaSQL: Scalable Fast Data Management with Streaming SQL", presented at the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) in May 2016
  - "Effective Multi-stream Joining in Apache Samza Framework", presented at the 5th IEEE International Congress on Big Data, June 27 - July 2, 2016, San Francisco, USA
- 380 emails were sent to the developer mailing list in the past 3 months
There are many more exciting features to expect in future releases. Some of them are:
- Support multi-threading in samza tasks (SAMZA-863)
- Disk Quotas: Add throttler and disk quota enforcement (SAMZA-956)
- REST API for starting and stopping Samza jobs (SAMZA-865)
- Samza standalone mode (SAMZA-516)
- High-level language for Samza (SAMZA-390)
It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss the upcoming meetup on August 23. Sign up now!
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.