Blog ›Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

By Ensar Basri Kahveci

Distinguished Engineer

Ensar’s primary areas of interest are distributed data, replication, consistency, and storage. He has more than seven years of hands-on expertise in designing, developing, and testing distributed algorithms, with solid experience in concurrency. He has authored a number of articles on distributed data and stream processing, and is a frequent speaker at industry conferences on topics such as replication and distributed systems. Several of his talks can be found on YouTube, including Replication Distilled, Distributed Systems for Mere Mortals, and Replication in the Wild. Ensar is a Ph.D. candidate in computer science at Bilkent University in Ankara, Turkey.

View all blogs by the author

Dec 6, 2017

Back to Blog

Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

Stream processing is an emerging computational paradigm for on-the-fly processing of live data feeds, targeting low latency and high throughput. Streaming applications are usually deployed on multiple servers to achieve these requirements. Since even a single failure may lead to incorrect results or long interruptions in result delivery, fault tolerance is of paramount importance in such long-running distributed applications.

With the new 0.5 release, Hazelcast Jet capitalizes on the Hazelcast IMDG platform to simplify fault-tolerant execution of streaming computations. In this blog post, we briefly describe the new high availability and fault tolerance features of Hazelcast Jet.

Highly Available Streaming Computations

Stream processing clusters contain components with heterogeneous roles, wherein a master node coordinates the execution of submitted jobs and possibly multiple worker nodes perform the actual execution. In this setting, a number of availability concerns arise for different components. For instance, a set of master nodes are deployed to eliminate single point of failure. One of these nodes is elected as leader and other nodes wait in standby until the failure of the leader. Relatedly, multiple worker nodes are run for parallel and distributed execution of the jobs. In case of a worker failure, its workload is re-assigned to a standby worker or distributed to other active workers.

In this model, stream processing engines generally depend on external coordination services, such as Apache ZooKeeper, for leader election, and replication of cluster and job information. Although coordination services are a good fit for such use cases, they bring additional complexity into the system. Hazelcast Jet frees the developers from the burden of configuring and maintaining coordination services for different environments and deployment settings. Hazelcast Jet does not depend on any external service and maintains job information in internal Hazelcast IMDG data structures, namely IMap instances. When a job is submitted, the job information is distributed between a set of IMap instances. Since the IMap data structure replicates data entries to multiple servers, Hazelcast Jet enables fault tolerance for job information. The presence of the master node is not visible to the end users. Users do not designate any Hazelcast Jet node for job coordination. Hazelcast Jet clusters coordinate execution of the submitted jobs transparently and automatically. Internally, a single Hazelcast Jet node, called coordinator, takes the responsibility of coordinating job executions. If the coordinator node fails for any reason, another node is automatically promoted. Relatedly, jobs are run on multiple nodes. If a node fails during execution, the job is automatically restarted on the remaining nodes.

So with these mechanisms, we transparently handle job execution and worker failover. But what about data processing guarantees?

Enabling Fault Tolerant Stateful Computations

Hazelcast Jet realizes fault tolerant stateful streaming computations with a new snapshotting mechanism. The new mechanism takes a snapshot of the state of a running streaming job at regular intervals. A snapshot contains a consistent view of the state of all processors at some point in time. In case of a node failure, the coordinator stops the current execution and restarts the job with the latest successful snapshot. When a job is restarted from a snapshot, the nodes recover processor state efficiently from local memory. The processors are initialized with the state recovered from the snapshot and the input streams are rolled back. The snapshotting feature is an efficient implementation of Lamport and Chandy’s “Distributed Snapshots: Determining Global States of a Distributed System” paper.

New methods were added to the Processor API for snapshotting. Upon a snapshot persistence call, a processor passes its internal state to the engine as key-value objects. Then, the processor state is put into internal IMap data structures as part of the snapshot. The new fault tolerance mechanism provides a choice between, none, exactly-once and at-least-once semantics for event processing. Source and sink processors must cooperate with the snapshotting mechanism in order to achieve the exactly-once semantics.

Conclusions

The 0.5 release of Hazelcast Jet opens new doors for highly available and fault tolerant streaming systems. Hazelcast Jet does not depend on any other external storage system for high availability and snapshotting. Additionally, in-memory snapshots pave the way for minimally disruptive snapshotting and fast recovery after failures. For the next release, we will be working on improvements and optimizations in the snapshotting mechanism.

Keep Reading

Upcoming Webinar

Hazelcast Office Hours: Using the Hazelcast Platform to implement gRPC based microservices

Apr 24, 2024 / 8:00am PDT / 11:00am EDT / 3:00pm GMT

Blog

Announcing Hazelcast Platform 5.4 Release

Introduction The impact of solutions built on the Hazelcast Platform is visible in many aspects of our daily lives. It…

Upcoming Webinar

Embracing the demands of an AI-Centric future with Hazelcast Platform 5.4

May 9, 2024 / 10:00am PDT / 1:00pm EDT / 5:00pm GMT

Unlock AI’s future with Hazelcast! Join our webinar on May 9, 2024, to explore how Hazelcast Platform 5.4 transforms AI workloads. Learn to manage data with accuracy & innovate while reducing costs. Register now!

Blog

3 Techniques to Boost Event-Driven Microservices Architectures

In the ever-changing world of software development, the event-driven microservices architecture has emerged as a game-changer for its ability to…

Case Study

PSA Antwerp Cuts Operational Costs by 33% by Optimizing Their Business in Real Time

Webinar

/ Video

/ 60 min

Modernizing Payment Processing Architectures

In this webinar explore how payment processing architectures are adapting to meet customer demands and regulatory standards, and learn why businesses using outdated platforms risk losing market share.

Why Hazelcast?

Forrester names Hazelcast as a Strong Performer

Platform

Introducing Hazelcast Platform 5.4

Solutions

By Industry

By Use Case

By Architecture

Join us for a deep dive into Hazelcast Platform's capabilities

Resource Center

Learn

The Gartner®️ Market Guide for Event Stream Processing

Developers

Community

Learn

Toolbox

By Ensar Basri Kahveci

Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

Highly Available Streaming Computations

Enabling Fault Tolerant Stateful Computations

Conclusions

Keep Reading

Hazelcast Office Hours: Using the Hazelcast Platform to implement gRPC based microservices

Announcing Hazelcast Platform 5.4 Release

Embracing the demands of an AI-Centric future with Hazelcast Platform 5.4

3 Techniques to Boost Event-Driven Microservices Architectures

PSA Antwerp Cuts Operational Costs by 33% by Optimizing Their Business in Real Time

Modernizing Payment Processing Architectures

Why Hazelcast

About Us

Platform

Solutions

Developers

Learn

Connect

Why Hazelcast?

Forrester names Hazelcast as a Strong Performer

Platform

Introducing Hazelcast Platform 5.4

Solutions

By Industry

By Use Case

By Architecture

Join us for a deep dive into Hazelcast Platform's capabilities

Resource Center

Learn

The Gartner®️ Market Guide for Event Stream Processing

Developers

Community

Learn

Toolbox

By Ensar Basri Kahveci

Spread the Word

Hazelcast Jet 0.5: Fault Tolerant Stateful Stream Processing Made Easy

Highly Available Streaming Computations

Enabling Fault Tolerant Stateful Computations

Conclusions

Keep Reading

Hazelcast Office Hours: Using the Hazelcast Platform to implement gRPC based microservices

Announcing Hazelcast Platform 5.4 Release

Embracing the demands of an AI-Centric future with Hazelcast Platform 5.4

3 Techniques to Boost Event-Driven Microservices Architectures

PSA Antwerp Cuts Operational Costs by 33% by Optimizing Their Business in Real Time

Modernizing Payment Processing Architectures

Why Hazelcast

About Us

Platform

Solutions

Developers

Learn

Connect