We are thrilled to announce the release of Apache Samza 1.1.0.

Today Samza forms the backbone of hundreds of real-time production applications across a multitude of companies, such as LinkedIn, VMware, Slack, and Redfin, among many others.
This release of Samza adds a variety of features and capabilities to Samza’s existing arsenal, coupled with improved documentation, code snippets, and examples.
Samza provides leading support for large-scale stateful stream processing with:

  • First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
  • Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
  • A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
  • High level API for expressing complex stream processing pipelines in a few lines of code.
  • Beam Samza Runner that marries Beam’s best in class support for EventTime based windowed processing and sophisticated triggering with Samza’s stable and scalable stateful processing model.
  • A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
  • A Table API that provides a common abstraction for accessing remote or local databases and allows developers to "join" an input event stream with such a Table.
  • Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
  • Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.

New Features, Upgrades and Bug Fixes

This release brings the following features, upgrades, and capabilities:

  • We have created a new Samza Stream Processing video series on YouTube.
  • New and improved documentation, code snippets, and examples for using the latest version of Samza with Apache Beam (code samples are here: https://github.com/apache/samza-beam-examples).

API enhancements and simplifications:

  • SAMZA-1981 Consolidate table descriptors to samza-api.
  • SAMZA-1998 Table API refactoring.
  • SAMZA-1980 Rename LocalStoreBackedTable to LocalTable.
  • SAMZA-2043 Consolidate ReadableTable and ReadWriteTable.
  • SAMZA-2012 Add API for wiring an external context through to application processing.
  • SAMZA-2026 Refactor remote table API to separate retry policy settings.
  • SAMZA-2041 Add system descriptors for HDFS and Kinesis.
  • SAMZA-2081 Samza SQL: Type system for Samza SQL.
  • SAMZA-2106 Samza App & Job Config Refactor.

State Store Restoration:

  • SAMZA-2018 State restore improvements using RocksDB writebatch API.

Standalone Improvements:

  • SAMZA-1973 Unify the TaskNameGrouper interface for yarn and standalone.
  • SAMZA-1952 StreamPartitionCountMonitor for standalone.

Other Upgrades and Bug-fixes:

  • SAMZA-1638 Recreate SystemProducer on KafkaCheckpointManager.writeCheckpoint failure.
  • SAMZA-1946 Problem with Race between TimerListener initialization and timers fired from init().
  • SAMZA-2004 Add ability to disable table metrics.
  • SAMZA-2013 Account for cycles in graph traversal within Execution Planner.
  • SAMZA-2015 Refactor timer handling in tables to be consistent with stores.
  • SAMZA-2072 Update guava to 23.0.
  • SAMZA-2090 Fix flush behavior for remote and hybrid tables.
  • SAMZA-2108 Check for host affinity config before resolving preferred host matching.
  • SAMZA-2109 Reduce default-buffer sizes for per-partition queues.
  • SAMZA-2118 Improve the shutdown sequence of AsyncRunLoop.
  • SAMZA-2119 Upgrading yarn-client version to 2.7.1.
  • SAMZA-2122 Fix the task caught-up logic which doesn't handle no incoming messages.

The complete list of resolved Jira tickets for this release is found here.

This release also includes improvements such as durable state in the high-level API, ZooKeeper-based deployment stability, and multi-stage batch processing, as well as bug fixes such as KafkaSystemProducer concurrent sends and flushes.

API Updates

The following imports for Table API have been updated:

  • Rename the import org.apache.samza.storage.kv.descriptors.BaseLocalStoreBackedTableDescriptor to org.apache.samza.storage.kv.descriptors.BaseLocalTableDescriptor
  • Rename the import org.apache.samza.table.remote.descriptors.RemoteTableDescriptor to org.apache.samza.table.descriptors.RemoteTableDescriptor
  • Rename the import org.apache.samza.table.caching.descriptors.CachingTableDescriptor to org.apache.samza.table.descriptors.CachingTableDescriptor
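
For codebases with many affected files, the three renames above can be applied mechanically. The following is a minimal sketch, not part of Samza: the class name TableImportMigration and its rewrite method are hypothetical, and the approach is a plain text substitution of the fully qualified names listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Samza class): rewrites the old Table API
// fully qualified names to the new ones listed in this announcement.
public class TableImportMigration {
    // Old fully qualified name -> new fully qualified name.
    public static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put(
            "org.apache.samza.storage.kv.descriptors.BaseLocalStoreBackedTableDescriptor",
            "org.apache.samza.storage.kv.descriptors.BaseLocalTableDescriptor");
        RENAMES.put(
            "org.apache.samza.table.remote.descriptors.RemoteTableDescriptor",
            "org.apache.samza.table.descriptors.RemoteTableDescriptor");
        RENAMES.put(
            "org.apache.samza.table.caching.descriptors.CachingTableDescriptor",
            "org.apache.samza.table.descriptors.CachingTableDescriptor");
    }

    // Returns the source text with every old qualified name replaced.
    public static String rewrite(String source) {
        String result = source;
        for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
            result = result.replace(rename.getKey(), rename.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String before =
            "import org.apache.samza.table.remote.descriptors.RemoteTableDescriptor;\n";
        System.out.print(rewrite(before));
    }
}
```

Because this only substitutes fully qualified names, code that refers to BaseLocalStoreBackedTableDescriptor by its simple class name still needs to be updated by hand or with an IDE refactoring.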

Configuration Updates

The job.name and job.id configs are now deprecated in favor of app.name and app.id configs respectively.
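
As an illustration of the key change, here is a minimal sketch that moves the deprecated keys to their replacements in a java.util.Properties object. The class ConfigKeyMigration and the sample values are hypothetical and not part of Samza. The sketch assumes an existing app.name or app.id should win over the deprecated key, and it drops the old keys after copying; whether to keep both during a transition is a deployment decision.

```java
import java.util.Properties;

// Hypothetical helper (not a Samza class): moves deprecated job.* keys
// to the app.* keys named in this announcement.
public class ConfigKeyMigration {
    // Returns a copy of the input with job.name/job.id moved to app.name/app.id.
    public static Properties migrate(Properties input) {
        Properties output = new Properties();
        output.putAll(input);
        move(output, "job.name", "app.name");
        move(output, "job.id", "app.id");
        return output;
    }

    // Copies oldKey's value to newKey unless newKey is already set, then removes oldKey.
    private static void move(Properties props, String oldKey, String newKey) {
        String value = props.getProperty(oldKey);
        if (value == null) {
            return; // nothing to migrate
        }
        if (props.getProperty(newKey) == null) {
            props.setProperty(newKey, value);
        }
        props.remove(oldKey);
    }

    public static void main(String[] args) {
        Properties old = new Properties();
        old.setProperty("job.name", "page-view-counter"); // sample value, made up
        old.setProperty("job.id", "1");
        Properties migrated = migrate(old);
        System.out.println("app.name=" + migrated.getProperty("app.name"));
        System.out.println("app.id=" + migrated.getProperty("app.id"));
    }
}
```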

A source download of Samza 1.1.0 is available here, and is also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features.

Community Developments

A Stream Processing with Apache Kafka & Apache Samza meetup/symposium was held on March 20th and included the following presentation on Samza:

  • Apache Samza 1.0: Recent Advances and our plans for future in Stream Processing

Contribute

It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs.

I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.