Monday 14 June 2021

Event-Trees and data structure selection – (notes)

 

Event chains, a representation of events stored using a Block-Chain/Merkle tree can have many representation strategies. An expansion of the event chains , which is comparable to a linked list; event trees are comparable with tree data structure representation of an event chain with many child nodes and levels.

It would be desirable to have a means to dynamically switch between storage and query strategies for event trees while making sure the complexity parameters are maintained to be the most optimal. In addition to selecting the most desired, it would be essential to monitor the the usage patterns/trends and storage solutions in the market for distributed graph databases and related in-memory operations for theses event trees/chains.

Dynamically switching between the representations in live/production environments would require a detailed study into data structures that could be use for querying and storage ; in a single node and distributed multi node environments.

Couple of possible in-memory and related storage representation are listed below.

Logical Representations

There are many possibilities for representing event trees as listed below :

1 : Event Chain — Time Series

Within the Merkle tree, we can assume a single single level links such that the immediate child is added based on the time the event was generated.

Event A happening at T1 and Event B happening at T2=(T1 + 1) would be represented in the Merkle tree as :

Event-Chain-Time Series

In this representation note that events get added based on the time they arrive and hence it would typically be a single chain typically. Events getting generated at the same nano/minuscule-second would be added one after while making sure the total level count is always level 0; the same level as the parent/root.

2: Event Trees — Cause effect

This representation tries to capture cause effect in a tree fashion such that any event that caused another would be maintained as a child of the parent event.

level based representation — CauseEvents
level based representation — CauseEvents
cause-effect-tree

In the above example, the levels with the tree are based on the causation, such that EventAA and Event AB are the same level 2, while event AAA, ABA, ABB are at level 3 etc.

Further, note that the event cause time is not considered at all in this representation. If event ABA happened a couple of days after event AB, it would be anyways maintained as a child link under Event AB.

3 : Event Trees — Time Leveled

This representation maintains a tree but makes sure events happening at the same time are maintained at a different level.

A sample representation :

time-leveled-tree

In this example, note that cause effects are not considered, but just the time an event happened. Event B and Event C happening at the same time ends up as a child level 2.

4 : Event Trees —Multi Dimensional

Merging the above two representations of cause and time, we could have a multi dimensional graph such that cause is on dimension z-axis, time on x-axis and event itself on y-axis.

Representation assessment

In each of the above representations indicated above, there needs an in-memory representation and a related persistence strategy. In-memory representation could have a 1:1 relation as above while storage approach could be different, based on the internal implementation of the vendor. In typical cases, one of the above representation strategies is selected by the software designer based on the identified (non) functional requirements.

Rather than have one single representation, what if there was a strategy to dynamically switch between representations both in-memory and at the storage (CRUD+Search) layer based on the active observations in live environments while making sure the current functional and non-functional requirements are not affected ?

An option is to predict optimal representation models based on the query patterns, the consistency & performance requirements. Furthermore, it could be assessed whether it is practical to clone all the representations in memory and switch over based on earlier observed patterns when required; dynamically. This would require more than one representation simultaneously maintained at any point in time. It would not be recommended to maintain the alternate representations for long duration, but enabled only during the assessment phase. Assessment phase to be enabled/disabled any point in the lifetime of the the product.

Space and Time complexity requirements together with the CAP are usually the driving factors and its desired to have a means to predict optimal memory representation and related persistence layer exploiting machine learning/data-science. A possible means would be to observe the performance against usage patterns for a couple of months before making a call.

It’s definitely required to research various possibilities around this in the computing field while assessing their impact on time, space and related cost efficacy.

Though production environments could be used for capturing the logs for some of the graph databases and their performance, it would be desired to create an extensive set of test cases including the test data to mimic the environments. Environments with 10 to 1000 nodes could represent distributed graphs with 1–1000 levels.

In addition to the space and time complexity assessment, with the many offering from cloud vendors today, its desired to periodically re-model the representation while considering the pricing strategy across couple of cloud vendors in the industry.