In-Memory, Distributed Data Structures for the masses

*Rolando Santamaria Maso recently published a blog post entitled “In-Memory, Distributed Data Structures for the masses.”*


This article is about in-memory data structures. But this time, about the “magic” ones: Lists, Maps, Queues, Locks, Streams… the ones that work distributed across multiple nodes and across different runtimes (Java, C#, C++, Node.js, Python, Go).

Motivations

I had the opportunity to attend #AWSSummit2018 Berlin this past June 6. There were many great talks and an awesome environment… with special people!

In one of the talks, “Reactive Microservices Architecture on AWS”, I noticed how the speaker was using the Vert.x framework in his examples and promoting the use of multiple caching layers… Long story short, this was my favorite lesson:

“Even if you are rich enough, you should NOT waste computing resources…, use caching!”

But what is the relation between Vert.x and the topic of this story? Hazelcast: The Operational In-Memory Computing Platform.

Meet the Hazelcast IMDG project — A developer introduction https://www.youtube.com/watch?v=y6Bc3XOpJQ8

Running a Hazelcast single-server instance using Docker: docker run -p 5701:5701 --rm -ti hazelcast/hazelcast

In-Memory Data Grids
“In-memory processing has been a pretty hot topic lately. Many companies that historically would not have considered using in-memory technology because it was cost prohibitive are now changing their core systems’ architectures to take advantage of the low-latency transaction processing that in-memory technology offers. This is a consequence of the fact that the price of RAM is dropping significantly and rapidly and as a result, it has become economical to load the entire operational dataset into memory with performance improvements of over 1000x faster. In-Memory Compute and Data Grids provide the core capabilities of an in-memory architecture…” https://www.gridgain.com/resources/blog/in-memory-data-grid-explained

Distributed Data Structures

“In computer science, a data structure is a data organization and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.” https://en.wikipedia.org/wiki/Data_structure

Leaving aside all the theory and extensive literature related to data structures, if you are a software developer you are already using data structures such as Maps, Lists/Arrays or Sets; they are simply part of any process of retrieval, mapping or aggregation of data.

With Hazelcast, you still have access to these kinds of structures, with the big benefit that they are very efficiently distributed across a network of nodes and accessible from multiple runtimes and programming languages.
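
As a minimal sketch (assuming the hazelcast-client Node.js package used in the examples below, connecting to the single Docker member started above; the structure names here are illustrative), obtaining a distributed structure takes one line per structure:

// minimal sketch: connect a Node.js client and obtain distributed structures
// (the client connects to localhost:5701 by default; names are arbitrary)
const { Client } = require('hazelcast-client')

const hazelcast = await Client.newHazelcastClient()
const users = hazelcast.getMap('users')    // distributed Map
const tasks = hazelcast.getQueue('tasks')  // distributed Queue

await users.put('42', { name: 'Ada' })
await tasks.offer({ type: 'send-notification', userId: '42' })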

Let’s illustrate this with an example: “The distributed counter”

I need to count the number of realtime notifications my services have pushed to remote Web clients. Counter retrieval needs to be fast, as multiple other services check it on a per-second basis…

Classic solution using a SQL/No-SQL persistent database:

  1. Create a “PushLogs” table/collection to store the logs related to message push activity. The model schema will look like this: {senderId: String, room: String, timestamp: Date, msgId: String}
  2. On every push request, log the action by adding a new entry to the collection.
  3. On count request, execute a count query against the collection.
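
A rough sketch of steps 2 and 3, assuming a Mongoose-style PushLogs model (the require path and function names are illustrative, not part of the original article):

// hypothetical Mongoose-style model matching the schema above
const PushLogs = require('./models/push-logs')

// 2. on every push request, log the action
async function onPush ({ senderId, room, msgId }) {
  await PushLogs.create({ senderId, room, msgId, timestamp: new Date() })
}

// 3. on count request, run a count query against the collection
async function getPushCount () {
  return PushLogs.count({})
}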

This solution is simple and even works when I scale my application to multiple distributed instances…, but it is not fast/efficient enough!

Extending the solution with Hazelcast:

  1. On instance start, also create/reference a distributed “Atomic Long” counter. If its value is 0, initialize it with the existing PushLogs table/collection entries count.
  2. On every push request, also increment the counter by 1.
  3. On count request, return the counter value.

This solution remains simple and even works when I scale my application to multiple distributed instances…, but this time it is fast/efficient enough!

// on application instance start...
const counter = hazelcast.getAtomicLong('distributed-pushes-counter')
// seed the counter from the persistent store only if it was never set before
await counter.compareAndSet(0, await PushLogs.count({}))

// on push...
await counter.incrementAndGet()

// on count request
await counter.get()

Hazelcast AtomicLong example usage in Node.js

The data structures Hazelcast supports for each runtime are summarized at https://hazelcast.org

Distributed Locks

A Distributed Lock is a very special data structure: it allows us to request and receive exclusive read/write access to a shared resource in a distributed environment. The following example is self-explanatory: “Nobody else touches the console!”

const lock = hazelcast.getLock('my-distributed-lock')
lock.lock().then(() => {
  // the execution of this function will happen once
  // at a time across all distributed instances
  // ...
  console.log('Happy Locking!')
}).finally(() => lock.unlock()) 

Hazelcast Lock example using Node.js
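
One design note on the sketch above: lock() blocks until the lock becomes available, so a slow or crashed holder can stall every other instance. A bounded wait is usually safer; the snippet below assumes the Node.js lock proxy exposes tryLock with a timeout, mirroring the Java ILock API (an assumption, not something shown in the original article):

// sketch: bounded wait instead of blocking forever
// (assumes lock.tryLock(timeoutMillis) exists, mirroring the Java ILock API)
const acquired = await lock.tryLock(5000) // wait at most 5 seconds
if (acquired) {
  try {
    console.log('Happy Locking!')
  } finally {
    await lock.unlock()
  }
} else {
  console.log('Could not acquire the lock within 5s, skipping this round')
}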

Near Cache (in-RAM synced Maps)

“If you ask a mathematician how slow a network I/O operation is compared to RAM-based data retrieval, he will just answer: it is infinitely slower!”

“Map or Cache entries in Hazelcast are partitioned across the cluster members. Hazelcast clients do not have local data at all. Suppose you read the key k a number of times from a Hazelcast client or k is owned by another member in your cluster. Then each map.get(k) or cache.get(k) will be a remote operation, which creates a lot of network trips. If you have a data structure that is mostly read, then you should consider creating a local Near Cache, so that reads are sped up and less network traffic is created.” http://docs.hazelcast.org/docs/latest-development/manual/html/Performance/Near_Cache/index.html

Let’s illustrate this with another example: “The Locations Management API”

“Within my organization X, almost all business processes revolve around locations. The locations API must have the lowest possible response time on its data retrieval endpoints, so it does not impact the performance of dependent APIs…”

The “locations” Map is kept in sync in RAM across all application instances.

// on application instance start, init near cache Map
const {Client, Config} = require('hazelcast-client')
const nearCachedMapName = 'my-holy-fast-map'

// near cache for this map, invalidated automatically when entries change remotely
const nearCacheConfig = new Config.NearCacheConfig()
nearCacheConfig.name = nearCachedMapName
nearCacheConfig.invalidateOnChange = true

// register the near cache in the client configuration, keyed by map name
const clientConfig = new Config.ClientConfig()
clientConfig.nearCacheConfigs[nearCachedMapName] = nearCacheConfig

const hazelcast = await Client.newHazelcastClient(clientConfig)
const locations = hazelcast.getMap(nearCachedMapName)

// on application instance start, fill map with locations once
if (await locations.size() === 0) {
  const allLocations = await LocationsService.findAll()
  // store each location under its id and wait for all puts to complete
  await Promise.all(allLocations.map(location => locations.put(location._id, location)))
}

// on location change
await locations.put(location._id, location)

// on locations retrieval "GET /locations"
await locations.values()

Read access is blazing fast…

Polyglot

Hazelcast Data Structures are “magically” accessible between different runtimes. This means the Distributed Lock, the Near Cache Map and many others can be used and shared between the programming languages and runtimes you love. So, let’s get that Node.js Distributed Lock instance from C++:

C++ Lock Data Structure example. Source: https://hazelcast.org

The data structures Hazelcast supports for each runtime are summarized at https://hazelcast.org

Conclusions

“Caching is the last mile for performance optimisations, and the one that brings you the biggest improvements… when implemented right!”

Out of laziness, many developers (including me) tend to stress SQL/No-SQL databases by using them as central storage for our applications’ “distributed state” management. When using Hazelcast IMDG, “GOD” mode Data Structures just become available, everywhere.

Give it a try, that is my invitation!