Announcing the release of Samza 1.0
We’re thrilled to announce the release of Apache Samza 1.0.
Today Samza forms the backbone of hundreds of real-time production
applications across a multitude of companies, such as LinkedIn, VMware,
Slack, and Redfin, among many others. This release adds a variety of
features and capabilities to Samza’s existing arsenal, coupled with new
and improved documentation, code snippets, examples, and a
brand-new website design! Here are a few selected highlights:
- Stable high level APIs that allow creating complex processing
  pipelines with ease.
- Beam Samza Runner now marries Beam’s best in class support for
  EventTime based windowed processing and sophisticated triggering
  with Samza’s stable and scalable stateful processing model.
- Table API that provides a common abstraction for accessing
  remote or local databases. Developers are now able to “join” an
  input event stream with such a Table.
- Integration Test Framework to enable effortless testing of Samza
  jobs without deploying a Kafka, Yarn, or Zookeeper cluster.
- Support for Apache Log4j2 allowing improved logging performance,
  customization, and efficiency.
- Upgraded Kafka client and consumer.
- An interactive shell for Samza SQL for seamless formulation,
  development, and testing of SamzaSQL queries.
- Side-input support that allows using log-compacted data sources
  to populate KV state for Samza applications.
- An improved website with detailed documentation and lots of code
  samples!
In addition, Samza 1.0 brings numerous bug-fixes, upgrades, and
improvements listed below.
New features
Samza 1.0 brings full-feature support for the following:
Improved Stable High Level APIs
Samza 1.0 brings Descriptor APIs that allow applications to specify
their input and output systems and streams in code. Samza’s new
Context APIs give applications unified access to job-level,
container-level, task-level, and application-level context and
capabilities. These changes also simplify Samza’s ApplicationRunner
interface.
This API evolution requires a few simple modifications to application
code, which we describe in detail in our upgrade steps.
Beam Runner Support
Samza’s Beam Runner enables executing Beam pipelines over Samza. This
enables Samza applications to create complex processing pipelines that
require event-time based processing, varying types of event-time based
windowing, and more. This feature is supported in both the YARN and
standalone deployment models.
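To make "event-time based windowing" concrete, here is a minimal conceptual sketch in plain Java. It is not the Beam or Samza API; the class `EventTimeWindowSketch`, the `Event` type, and the method names are hypothetical and exist only for illustration. It assigns each event to a fixed-size window based on the timestamp the event carries, so the result does not depend on arrival order. Triggers, watermarks, and late-data handling, which Beam provides, are not modeled.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: plain Java collections, not the Beam or Samza API.
class EventTimeWindowSketch {

    // Hypothetical event type: a payload plus the time the event occurred.
    static final class Event {
        final String payload;
        final long eventTimeMs;

        Event(String payload, long eventTimeMs) {
            this.payload = payload;
            this.eventTimeMs = eventTimeMs;
        }
    }

    // Start of the fixed window that contains the given timestamp.
    // Assumes non-negative timestamps and a positive window size.
    static long windowStart(long eventTimeMs, long windowSizeMs) {
        return eventTimeMs - (eventTimeMs % windowSizeMs);
    }

    // Counts events per fixed window, keyed by window start time.
    // Grouping uses the event's own timestamp, so arrival order is irrelevant.
    static Map<Long, Integer> countPerWindow(List<Event> events, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            counts.merge(windowStart(e.eventTimeMs, windowSizeMs), 1, Integer::sum);
        }
        return counts;
    }
}
```

With one-minute windows, an event stamped 61,000 ms lands in the window starting at 60,000 ms even if it arrives before an event stamped 5,000 ms.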
Joining Streams and Tables
Samza’s Table API provides developers with unified access to local
and remote data sources such as Key-Value stores or web services,
while providing features such as rate-limiting, throttling, and
caching capabilities. This provides first-class API primitives for
building Stream-Table join jobs. Learn more about the use, semantics,
and examples for Table API here.
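The idea behind a Stream-Table join can be sketched in plain Java. This is a conceptual illustration only, not the Samza Table API: a `Map` stands in for the table, a `List` stands in for the stream, and every name here (`StreamTableJoinSketch`, `PageView`, `EnrichedPageView`, `join`) is hypothetical. Each incoming event is looked up in the table by key, and events with no matching row are dropped, as in an inner join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: a Map stands in for a Samza Table, a List for a stream.
class StreamTableJoinSketch {

    // Hypothetical stream event: a page view keyed by member ID.
    static final class PageView {
        final String memberId;
        final String pageUrl;

        PageView(String memberId, String pageUrl) {
            this.memberId = memberId;
            this.pageUrl = pageUrl;
        }
    }

    // Hypothetical join output: the event enriched with a value from the table.
    static final class EnrichedPageView {
        final String memberId;
        final String pageUrl;
        final String company;

        EnrichedPageView(String memberId, String pageUrl, String company) {
            this.memberId = memberId;
            this.pageUrl = pageUrl;
            this.company = company;
        }
    }

    // Looks up each event's key in the table; unmatched events are dropped (inner join).
    static List<EnrichedPageView> join(List<PageView> stream, Map<String, String> profileTable) {
        List<EnrichedPageView> joined = new ArrayList<>();
        for (PageView view : stream) {
            String company = profileTable.get(view.memberId);
            if (company != null) {
                joined.add(new EnrichedPageView(view.memberId, view.pageUrl, company));
            }
        }
        return joined;
    }
}
```

In a real job the lookup may hit a remote store, which is where the rate-limiting and caching features mentioned above become relevant; the sketch ignores both.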
Test Samza without ZK, Yarn or Kafka
Samza 1.0 brings a test framework that allows testing Samza applications
using in-memory input and output. Users can now set up tests and
test pipelines for their applications without needing to set up any
other services, such as Kafka, YARN, or Zookeeper.
Log4j2 support
Samza now supports Apache Log4j 2 for system and application logging.
Log4j 2 is an upgrade to Log4j that provides significant improvements
over its predecessor, Log4j 1.x, such as better throughput and latency,
custom log levels, and a pluggable logging architecture.
Kafka upgrade
This release upgrades Samza to use Kafka’s high-level consumer (Kafka
v0.11.1.62). This brings latency and throughput benefits for Samza
applications that consume from Kafka, in addition to bug-fixes. It
also means Samza applications can now make better use of the
underlying Kafka cluster.
SamzaSQL Shell
SamzaSQL now provides a shell where users can type in their SQL queries,
while Samza does the heavy lifting of wiring the inputs and outputs and
sizing the application in the background. This is great for testing and
experimenting with queries while formulating your application logic,
and is especially suited to data scientists and tinkerers.
Side-inputs
Samza 1.0 brings the ability to leverage existing log-compacted data
sources (e.g., Kafka topics) to populate KV state for Samza
applications. If your data processing pipeline involves Hadoop-to-Kafka
push, this feature removes the need for your Samza job to create
separate Kafka topics to back KV state.
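How a log-compacted source can populate key-value state is easiest to see in a small sketch. The code below is plain Java and not Samza's side-input API; the class `SideInputSketch` and the method `materialize` are hypothetical names. It assumes Kafka-style compaction semantics, where the latest record for a key wins and a record with a null value acts as a delete. How Samza itself processes side-input records is not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: replays a keyed log into an in-memory key-value store.
class SideInputSketch {

    // Builds KV state from a keyed log, reading records in log order.
    // Assumption: a later record for a key replaces the earlier one,
    // and a null value (a tombstone) removes the key.
    static Map<String, String> materialize(List<Map.Entry<String, String>> log) {
        Map<String, String> store = new HashMap<>();
        for (Map.Entry<String, String> record : log) {
            if (record.getValue() == null) {
                store.remove(record.getKey());
            } else {
                store.put(record.getKey(), record.getValue());
            }
        }
        return store;
    }
}
```

Because the store is fully determined by the log, the log itself serves as the durable copy of the state, which is why no separate backing topic is needed in this model.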
Improved website, documentation, and samples
We’ve redesigned the Samza website, making it easier to find details on
key Samza concepts and patterns. All documentation has been revised and
rewritten with our customers’ feedback in mind. We’ve
revised and added sample application code to showcase Samza 1.0 and the
use of its new APIs.
Enhancements and Upgrades
This release brings the following enhancements, upgrades, and
capabilities:
API enhancements and simplifications
SAMZA-1789: unify ApplicationDescriptor and ApplicationRunner for high-
and low-level APIs in YARN and standalone environment
SAMZA-1804: System and stream descriptors
SAMZA-1858: Public APIs for shared context
SAMZA-1763: Add async methods to Table API
SAMZA-1786: Introduce the metadata store abstraction
SAMZA-1859: Zookeeper implementation of MetadataStore
SAMZA-1788: Add the LocationIdProvider abstraction
Upgrades and Bug-fixes
SAMZA-1768: Handle corrupted OFFSET file
SAMZA-1817: Long classpath support for non-split deployments
SAMZA-1719: Add caching support to table-API
SAMZA-1783: Add Log4j2 functionality in Samza
SAMZA-1868: Refactor KafkaSystemAdmin from using SimpleConsumer
SAMZA-1776: Refactor KafkaSystemConsumer to remove the usage of
deprecated SimpleConsumer client
SAMZA-1730: Adding state validation in StreamProcessor before any
lifecycle operation and group coordination
SAMZA-1695: Clear events in ScheduleAfterDebounceTime on session
expiration
SAMZA-1647: Fix race conditions in StreamProcessor
SAMZA-1371: Some Samza Containers get stuck at “Starting BrokerProxy”
SAMZA-1648: Integration Test Framework & Collection Stream Impl
SAMZA-1748: Failure tests in the standalone deployment
A source download of Samza 1.0 is available here, and in Apache’s Maven repository.
Community Developments
Symposia on stream processing with Apache Samza and Apache Kafka were
held on July 19th and on October 23rd. Both were attended by more than
350 participants from across the industry. They featured in-depth talks
on Samza’s Beam integration, its use at LinkedIn for real-time
notifications, Kafka replication at Uber, and Kafka Cruise Control,
among many others.
Samza was also the subject of a talk at Strange Loop '18 that covered
its scalability, performance, extensibility, and programmability in
depth.