Saturday, 28 October 2023

Destructibility Unveiled: From Artifacts to Cosmos

Prelude:


The concept of destructibility is a universal and profound theme that touches both tangible and intangible dimensions of our existence. It reminds us of the inherent impermanence and transience not only of human-crafted artifacts but also of life itself. To delve deep into this concept, we embark on a journey that starts with exploring its philosophical roots. In particular, we draw inspiration from the Buddhist concept of "Anicca," which emphasizes the impermanence of all things. This prelude sets the stage for our exploration of destructibility, emphasizing its significance and relevance in both the world of human creations and the broader human experience.


Destructibility in Artifacts Forged by Humanity:


The concept of destructibility in human-crafted items is a multi-faceted one that warrants a closer look. At its core lies the motivation behind designing products and software with planned obsolescence or recyclability in mind. We delve into the intricacies of this design philosophy and consider its implications for the environment and sustainability. Of particular interest is the practice of designing commodities and software with recyclability in mind, incorporating sustainable design principles that encourage the use of recyclable or biodegradable materials in physical products. Shifting our focus to the software realm, we explore the concept of modular design, which allows for the reuse and recycling of software components, emerging as a key technique to create "destructible" software. By introducing these dimensions, our discussion broadens to encompass not only the end of an artifact's life but also its potential for renewal and repurposing through recycling practices.


Software and Destructibility:


In the digital age, the concept of destructibility extends to the realm of software development and management. This section uncovers the layers of this concept within the world of code and algorithms. We delve into the heart of software architecture and discuss the critical concept of IDisposable and resource management. It becomes clear that this is not merely a technicality but a pivotal aspect of creating "destructible" software. Efficient memory allocation and deallocation are paramount for the proper functioning of software and the prevention of memory leaks. Furthermore, we explore the techniques and best practices for designing software that is effectively "destructible." Such software can handle resources efficiently, minimizing memory leaks, optimizing performance, and enhancing system stability. We also consider the potential implications of indestructible software, which extend beyond degraded performance to encompass security vulnerabilities. By the end of this section, it becomes evident that the idea of software destructibility is not confined to technical discussions but extends its influence into the realms of security, stability, and, ultimately, user experience.
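To make the IDisposable point concrete, here is a minimal C# sketch of the dispose pattern; the class name and the file-based resource are hypothetical and serve only to show deterministic release of a held resource rather than waiting for the garbage collector.

using System;
using System.IO;

// Hypothetical example: a log writer that owns a resource (a file stream) it must release.
public sealed class DestructibleLogWriter : IDisposable
{
    private readonly StreamWriter _writer;
    private bool _disposed;

    public DestructibleLogWriter(string path)
    {
        _writer = new StreamWriter(path, append: true);
    }

    public void WriteLine(string message)
    {
        if (_disposed) throw new ObjectDisposedException(nameof(DestructibleLogWriter));
        _writer.WriteLine(message);
    }

    // Deterministic "destruction": release the underlying resource exactly once.
    public void Dispose()
    {
        if (_disposed) return;
        _writer.Dispose();
        _disposed = true;
    }
}

// Usage: the 'using' statement guarantees Dispose is called even if an exception is thrown.
// using (var log = new DestructibleLogWriter("app.log"))
// {
//     log.WriteLine("resource acquired and released deterministically");
// }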


Philosophical Facets of Destructibility in Hinduism:


Within the context of Hinduism, the philosophical aspects of destructibility take on a profound and ancient character. Our journey through this philosophical realm centers on Lord Shiva, one of the most revered deities in Hinduism, celebrated as the harbinger of destruction and transformation. Within the Hindu Trinity of Brahma, Vishnu, and Shiva, it is Shiva's role to preside over the destruction and renewal of the universe. This section delves into Shiva's multifaceted character, unveiling his significance as both the annihilator and transformer. Shiva's destructive potency is not an act of wanton destruction but a means of rejuvenation and rebirth, symbolizing the cyclical nature of existence. The profound philosophy underlying Shiva's role invites us to contemplate the idea that destruction within Hinduism is not about mere annihilation but rather an integral part of a grand cosmic cycle.


Interconnections and Parallels:


Our exploration continues as we uncover the interconnections and striking parallels that exist between the concept of destructibility prevalent in human-crafted items and the philosophical essence of Shiva's destruction within Hinduism. The cyclical nature of destruction and creation in Hinduism resonates with the life cycle of human-made products and software. Furthermore, the idea of rebirth or renewal transcends both realms, drawing parallels between the transformative aspects of destruction in Hindu philosophy and the potential for innovation and progress in human creations. It is in these parallels that we find a unifying thread that transcends cultural and disciplinary boundaries, revealing that the concept of destructibility, in its essence, resonates with the human experience of transformation and renewal.


Challenges and Ethical Considerations:


While the concept of destructibility offers profound insights, it also presents important ethical and practical considerations. This section addresses these multifaceted dimensions. We delve into the ethical implications of planned obsolescence and the disposable culture, raising essential questions about sustainability and responsible consumption. The section also navigates through the potential conflicts that may arise between software destructibility and data privacy and security concerns. Technology's ever-expanding role in our lives brings forth significant ethical considerations, and this discussion takes them head-on. Moreover, we turn our attention to the ethical dimensions surrounding depictions of destruction in religious symbolism and their interpretation in modern contexts. As we contemplate these ethical facets, we encourage thoughtful reflection on the implications of destructibility in various domains, underlining the need for a balance between renewal and sustainability.


Epilogue:


Our exploration of destructibility reaches its conclusion in an epilogue that brings together the threads of our journey. It offers a concise summary of the key points discussed throughout the manuscript, highlighting the interconnectedness between destructibility within human-crafted artifacts, software development, and Hindu philosophy. As we ponder the insights gathered, the epilogue prompts readers to contemplate their own perspectives on destructibility, impermanence, and renewal, inviting them to explore the transformative power of these concepts in their own lives.


Future Trajectories:


As our journey through the realms of destructibility concludes, we look to the future and propose potential areas for further exploration. We encourage the pursuit of sustainable design practices that align with the ideals of recyclability and eco-friendliness. In the realm of software development, we advocate for innovative approaches that incorporate built-in destructibility to enhance user experiences. Additionally, we invite deeper philosophical investigations into the interplay between creation, destruction, and renewal in various cultural and religious contexts. These avenues of exploration promise to enrich our understanding of destructibility and its enduring significance in our ever-evolving world.


Destructibility and Transformation in the Cosmic Context:


The concept of destructibility and transformation is not limited to human-crafted artifacts or software; it extends to the universe as a whole. The cyclical nature of destruction and creation is a fundamental aspect of the cosmos, as evidenced by phenomena such as supernovas, black holes, and the Big Bang itself. In this expanded section, we delve into the concept of destructibility and transformation on a cosmic scale. We examine how the cycle of creation and destruction is a fundamental aspect of the universe's evolution, shaping the cosmos as we know it. We illustrate the cyclical nature of destruction and creation through examples of cosmic phenomena, such as the explosive death of stars in supernovas, the mysterious gravitational behemoths known as black holes, and the cataclysmic birth of the universe in the Big Bang. Philosophically, we ponder how this cosmic cycle reflects broader notions of impermanence, change, and renewal in the human experience. Additionally, we draw connections between these cosmic principles of destructibility and the Hindu concept of Lord Shiva's role as the god of destruction, unveiling profound insights into the alignment of these concepts on both cosmic and philosophical levels. By incorporating this cosmic dimension, we offer a more holistic perspective on the concept of destructibility, emphasizing its ubiquity and timeless significance in the grand narrative of our universe.



Monday, 14 June 2021

Event-Trees and data structure selection – (notes)

 

Event chains, a representation of events stored using a blockchain/Merkle tree, can have many representation strategies. An event chain is comparable to a linked list; event trees, an expansion of event chains, are comparable to a tree data structure, with many child nodes and levels.

It would be desirable to have a means to dynamically switch between storage and query strategies for event trees while making sure the complexity parameters remain optimal. In addition to selecting the most desirable strategy, it would be essential to monitor the usage patterns/trends as well as the storage solutions in the market for distributed graph databases and related in-memory operations for these event trees/chains.

Dynamically switching between the representations in live/production environments would require a detailed study of the data structures that could be used for querying and storage, in both single-node and distributed multi-node environments.

A couple of possible in-memory and related storage representations are listed below.

Logical Representations

There are many possibilities for representing event trees, as listed below:

1: Event Chain — Time Series

Within the Merkle tree, we can assume single-level links such that the immediate child is added based on the time the event was generated.

Event A happening at T1 and Event B happening at T2 = (T1 + 1) would be represented in the Merkle tree as:

(figure: Event-Chain — Time Series)

In this representation, note that events get added based on the time they arrive, and hence it would typically be a single chain. Events generated at the same nano/minuscule-second would be added one after the other while making sure the total level count stays at level 0, the same level as the parent/root.
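As an illustration (hypothetical types, not tied to any particular store), a minimal C# sketch of this single-level, time-ordered chain: each entry carries the hash of its predecessor, and new events are simply appended in arrival order.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical single-level event chain: entries are appended strictly in arrival order.
public sealed record ChainedEvent(string Payload, DateTimeOffset OccurredAt, string PreviousHash, string Hash);

public sealed class EventChainTimeSeries
{
    private readonly List<ChainedEvent> _chain = new();

    public ChainedEvent Append(string payload, DateTimeOffset occurredAt)
    {
        // The genesis entry links to an empty hash; every other entry links to the previous one.
        var previousHash = _chain.Count == 0 ? string.Empty : _chain[^1].Hash;
        var hash = ComputeHash($"{previousHash}|{occurredAt:O}|{payload}");
        var evt = new ChainedEvent(payload, occurredAt, previousHash, hash);
        _chain.Add(evt);   // always appended at the tail: one level, ordered by arrival time
        return evt;
    }

    private static string ComputeHash(string input) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(input)));
}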

2: Event Trees — Cause effect

This representation tries to capture cause and effect in a tree fashion, such that any event caused by another is maintained as a child of the causing (parent) event.

(figure: cause-effect tree — level-based representation of cause events)

In the above example, the levels within the tree are based on causation, such that Event AA and Event AB are at the same level 2, while Events AAA, ABA and ABB are at level 3, and so on.

Further, note that the time an event occurred is not considered at all in this representation. Even if Event ABA happened a couple of days after Event AB, it would still be maintained as a child link under Event AB.
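A minimal C# sketch of the cause-effect representation, with hypothetical names: each node keeps the events it caused as children, and an event's level is its depth from the root cause, independent of when it occurred.

using System;
using System.Collections.Generic;

public sealed class CauseEffectNode
{
    public string EventId { get; }
    public DateTimeOffset OccurredAt { get; }            // recorded, but never used for placement
    public List<CauseEffectNode> Effects { get; } = new();

    public CauseEffectNode(string eventId, DateTimeOffset occurredAt)
    {
        EventId = eventId;
        OccurredAt = occurredAt;
    }

    // An effect is always attached under its cause, no matter how much later it happened.
    public CauseEffectNode AddEffect(string eventId, DateTimeOffset occurredAt)
    {
        var child = new CauseEffectNode(eventId, occurredAt);
        Effects.Add(child);
        return child;
    }

    // Level = depth from the root cause (root is level 1, its direct effects level 2, and so on).
    public static int LevelOf(CauseEffectNode root, CauseEffectNode target, int level = 1)
    {
        if (ReferenceEquals(root, target)) return level;
        foreach (var effect in root.Effects)
        {
            var found = LevelOf(effect, target, level + 1);
            if (found > 0) return found;
        }
        return -1;   // target is not part of this tree
    }
}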

3: Event Trees — Time Leveled

This representation maintains a tree but makes sure events happening at the same time are kept together at their own level.

A sample representation:

(figure: time-leveled tree)

In this example, note that cause and effect are not considered, only the time an event happened. Event B and Event C, happening at the same time, both end up as children at level 2.

4: Event Trees — Multi-Dimensional

Merging the above two representations of cause and time, we could have a multi-dimensional graph with cause on the z-axis, time on the x-axis and the event itself on the y-axis.

Representation assessment

Each of the representations indicated above needs an in-memory representation and a related persistence strategy. The in-memory representation could map 1:1 to the logical one as above, while the storage approach could be different, based on the internal implementation of the vendor. In typical cases, one of the above representation strategies is selected by the software designer based on the identified functional and non-functional requirements.

Rather than have one single representation, what if there was a strategy to dynamically switch between representations, both in-memory and at the storage (CRUD + Search) layer, based on active observations in live environments, while making sure the current functional and non-functional requirements are not affected?

An option is to predict optimal representation models based on the query patterns and the consistency and performance requirements. Furthermore, it could be assessed whether it is practical to clone all the representations in memory and switch over dynamically, based on earlier observed patterns, when required. This would require more than one representation to be maintained simultaneously at any point in time. It would not be recommended to maintain the alternate representations for long durations; they should be enabled only during an assessment phase, which can itself be enabled or disabled at any point in the lifetime of the product.
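One hypothetical way to frame this in code is a common contract that every representation (time-series, cause-effect, time-leveled, ...) satisfies, plus a switcher that mirrors writes into candidate representations only while the assessment phase is enabled; the names and the switch-over trigger below are illustrative only.

using System;
using System.Collections.Generic;

// Common contract every representation would satisfy.
public interface IEventTreeRepresentation
{
    void Add(string eventId, DateTimeOffset occurredAt, string? causedBy = null);
    IReadOnlyList<string> Query(Func<string, bool> predicate);
}

// Hypothetical switcher: keeps the active representation, and during the assessment phase
// mirrors writes into candidate representations so that observed metrics can drive a switch-over.
public sealed class RepresentationSwitcher
{
    private IEventTreeRepresentation _active;
    private readonly List<IEventTreeRepresentation> _candidates = new();

    public bool AssessmentEnabled { get; set; }

    public RepresentationSwitcher(IEventTreeRepresentation initial) => _active = initial;

    public void AddCandidate(IEventTreeRepresentation candidate) => _candidates.Add(candidate);

    public void Add(string eventId, DateTimeOffset occurredAt, string? causedBy = null)
    {
        _active.Add(eventId, occurredAt, causedBy);
        if (!AssessmentEnabled) return;
        foreach (var candidate in _candidates)           // mirrored only while assessing
            candidate.Add(eventId, occurredAt, causedBy);
    }

    public IReadOnlyList<string> Query(Func<string, bool> predicate) => _active.Query(predicate);

    // A switch-over would be triggered by observed query/storage metrics (not modelled here).
    public void SwitchTo(IEventTreeRepresentation next) => _active = next;
}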

Space and time complexity requirements, together with CAP, are usually the driving factors, and it is desirable to have a means to predict the optimal memory representation and related persistence layer by exploiting machine learning/data science. A possible means would be to observe the performance against usage patterns for a couple of months before making a call.

It’s definitely required to research various possibilities around this in the computing field while assessing their impact on time, space and related cost efficacy.

Though production environments could be used for capturing logs for some of the graph databases and their performance, it would be desirable to create an extensive set of test cases, including test data, to mimic those environments. Environments with 10 to 1000 nodes could represent distributed graphs with 1–1000 levels.

In addition to the space and time complexity assessment, with the many offerings from cloud vendors today, it is desirable to periodically re-model the representation while considering the pricing strategies across a couple of cloud vendors in the industry.

Monday, 19 October 2020

Distributed state management — refresher

  • Ease of Stateless Services

    Designing stateless distributed systems is relatively easy. You raise an event/message once the service has done its processing. You typically are not worried about how the other systems consume your data. In fact, in the majority of these scenarios, you don’t care what happens to the message/event after you are done.

    Think of fire and forget. This can easily be achieved using typical message broker queues & topics. Sadly, this stateless approach is quite bookish and not practical for most enterprise application needs.

    Pains of Stateful Services

    State could be as straightforward as an entity update that another service is dependent on, or something trickier such as an amount being debited from one account and credited to another. This information/state needs to be maintained somewhere.

    In 2/3 tier applications, we had a central transaction/state coordinator/server through which all transactions had to flow. In case of errors, it was the task of this coordinator to roll back all the child tasks relevant in the transaction context.

    One of the problems with this approach was the single point of failure (SPoF): what if this server went down — what happens to the workflow/process state?

    In distributed architectures, we would require a distributed state log which is virtually “centralized”, but practically distributed across nodes/pods/VMs for enabling both availability and scalability.

    One of the main challenges raised in this distributed situation is state consistency:

    • How do we keep the state synced across the nodes such that all nodes return the same state even if queried separately? — strong consistency.
    • Staleness/freshness — is the state returned older than the state at other nodes?
    • During a retry operation (say, when part of the workflow failed), what is the impact of executing the service again? Processing duplicate messages should never affect the underlying entity — think of 2 debit messages against the same account during a retry. Is the old state overwritten with the new, or should it be ignored? (A minimal idempotency sketch follows this list.)
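    As an illustration of the duplicate-message concern above, here is a minimal, purely hypothetical C# sketch of an idempotent consumer: processed message IDs are recorded, so a retried debit message is ignored rather than applied twice.

    using System.Collections.Concurrent;

    // Hypothetical idempotent consumer: a retry carrying the same message ID must not debit twice.
    public sealed class DebitHandler
    {
        private readonly ConcurrentDictionary<string, bool> _processedMessageIds = new();

        public void Handle(string messageId, string accountId, decimal amount)
        {
            // TryAdd returns false if this message ID was already processed: ignore the duplicate.
            if (!_processedMessageIds.TryAdd(messageId, true))
                return;

            // In a real system the dedup record and the debit would be persisted in one transaction.
            Debit(accountId, amount);
        }

        private void Debit(string accountId, decimal amount)
        {
            // persist the debit against the account (store-specific, omitted)
        }
    }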

    Approaches

    There are many approaches employed these days for state management:

    Coordinators/Orchestrators: a set of system(s) that manages the state — think of the conductor of a symphony orchestra.

    • Distributed locks: What if the nodes in a cluster could elect a leader among themselves such that the leader is the only one that can change the state? Paxos and Raft are some of the prominent algorithms here (with Chubby a well-known lock service built on Paxos), with many implementations.
    • 2-Phase commit: Employed mostly in systems migrating from 2/3 tier to the cloud; requires two cycles of requests across ALL participants: the first cycle preparing for the commit (PrepareForCommit) and the second the commit itself (CommitNow).
    • Eventual Consistency: state, as received, is processed as long as it is newer than the one the node is aware of. All nodes are NOT required to be aware of the most recent state at any point in time. Once all of the applicable processing completes, the eventual state is available in the DB/cache. The UI depicts the information available to it, even if stale. For a specific workflow, the UI would depict the state as ‘InProgress’ and later as ‘Completed’ once all the tasks in the workflow are identified as completed or after a timeout. This approach is applicable for information that is not too business critical, where clients are OK with information that is stale.
    • Optimistic/Timestamp based: If the timestamp of the state received is newer than the one available in the data store, the node applies it as the new state (a minimal sketch follows this list). In most cases, locks might still need to be applied on the DB record to make sure no one else is applying a change simultaneously.
    • Event Source Logs: Event Sourcing depends on an append-only store for all events. This distributed event log can be used to derive the state of an entity at a point in time by replaying the events up to that point. Snapshots of the state at a point in time can definitely speed up this evaluation.
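    A minimal sketch of the optimistic/timestamp-based approach (hypothetical entity and an in-memory store standing in for the real DB): an incoming state is applied only if its timestamp is newer than what is already held; otherwise it is ignored as stale.

    using System;
    using System.Collections.Concurrent;

    public sealed record EntityState(string EntityId, string Value, DateTimeOffset UpdatedAt);

    public sealed class TimestampedStateStore
    {
        private readonly ConcurrentDictionary<string, EntityState> _store = new();

        // Returns true if the incoming state was applied, false if it was stale and ignored.
        public bool Apply(EntityState incoming)
        {
            while (true)
            {
                if (_store.TryGetValue(incoming.EntityId, out var current))
                {
                    if (incoming.UpdatedAt <= current.UpdatedAt)
                        return false;                                   // stale update: ignore, do not overwrite
                    if (_store.TryUpdate(incoming.EntityId, incoming, current))
                        return true;
                    // another writer raced us; loop, re-read and re-check
                }
                else if (_store.TryAdd(incoming.EntityId, incoming))
                {
                    return true;                                        // first state seen for this entity
                }
            }
        }
    }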

    Depending on the consistency requirement, it’s usually a mix of the above approaches used for distributed state management.

    Technology Options

    Writing frameworks/libraries from scratch that address the above challenges is complex and not recommended unless you are a software company with a serious software research focus.

    Across the spectrum, there are interesting frameworks and stacks that can assist the ‘common engineer’ (derived from ‘common man’). As each framework has its (dis)advantages for adoption, hybrid solutions too can be looked into, based on the technical capability of the team.

    • Azure Durable Functions based orchestration: the OrchestrationTrigger, when applied on services, exploits the capabilities of Durable Azure Functions to automatically orchestrate the state. The fundamental idea is that the context/state is available to all of the functions (services) participating in the workflow. For developers, this is like writing a 2-tier application with a single try-catch block to handle any error/state across libraries. Instead of libraries, you are calling services, with the entire context of execution available to you across the services. From the AWS world, we now have Step Functions, which behave similarly.
    • Reliable Actors and Reliable Collections, built over Service Fabric/Azure Service Fabric (SF), support orchestration for getting/updating state information across the distributed nodes while hiding away the intricacies of how nodes internally communicate and keep themselves in sync. This is a pretty good option if you have no plans to support multi-cloud. Though it’s conceptually possible to host SF on AWS, there is no assurance on the level of compatibility today. Possibly satisfying CP in CAP.
    • The Orleans research project from Microsoft needs a special reference here, as both SF and Orleans are based on the actor model, though the designs are different.
    • Akka and the related Akka.Net: an actor model where a conductor/parent actor is internally aware of its child actors within a cluster. The model can be exploited for various distributed state management needs together with its supported persistent actor and singleton models. In comparison, Service Fabric does not support this parent-child relationship. (PS: if you are from a .NET background, do check Akka.net. Further ahead, together with Azure, Akka.net deployed on AKS pods is a cool experiment for massively scalable state management needs.) CAP here is deterministic based on the storage/persistence store selected.
    • Dapr.io: supporting strong consistency (all nodes must be in sync) for state management, Dapr is typically deployed as a sidecar and does not disturb the service code (unlike Akka.net, where the runtime is part of the service). (Do check out similar service mesh offerings like Istio too.)
    • Kafka Streams, Hazelcast Jet: acting as orchestrators, these stream processing engines have intrinsic support to make sure a message in the cluster is processed ‘exactly-once’. This out-of-the-box feature can be exploited to manage state, as the intricacies of how the nodes in the cluster message each other over queues to reach an agreement are completely abstracted away. Possibly satisfying CP in CAP.
    • Axon, Eventuate.io, Camunda, Netflix Conductor: similar to the way Service Fabric/Akka/Kafka Streams above function, these too hide away the internals of inter-node state synchronization and could be looked into.
    • NServiceBus: the supporting framework requires an explicit transaction start call such that it can handle the rest of the related messages in the transaction. Internally, it can lock all related messages for a forced sequencing, like a funnel, in case there are too many consumers. Possibly satisfying CP in CAP.
    • Redis Locks: use Redis to acquire a distributed lock before changing state (a minimal sketch follows this list).
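    A hedged sketch of the Redis lock option using the StackExchange.Redis client's LockTake/LockRelease helpers (an equivalent can be built with SET NX plus an expiry); the key naming, token and expiry below are illustrative only.

    using System;
    using StackExchange.Redis;

    public static class RedisStateLock
    {
        // Acquire a short-lived distributed lock, mutate the state, then release the lock.
        public static bool TryChangeState(IDatabase db, string entityId, Action changeState)
        {
            var lockKey = $"lock:{entityId}";               // illustrative key naming
            var lockToken = Guid.NewGuid().ToString();      // only the holder of this token may release
            var expiry = TimeSpan.FromSeconds(10);          // expiry guards against a crashed holder

            if (!db.LockTake(lockKey, lockToken, expiry))
                return false;                               // someone else holds the lock; caller may retry

            try
            {
                changeState();                              // the state mutation guarded by the lock
                return true;
            }
            finally
            {
                db.LockRelease(lockKey, lockToken);
            }
        }
    }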

    It’s highly recommended to check with your architect team, who could weigh the features while considering the characteristics/NFRs and KPIs of your system. The underlying storage/persistence layer for each of the above frameworks/services directly affects CAP. In cases where the framework allows choosing a persistence store, you must review whether it is CP or AP of CAP that you are planning to satisfy for your service.


Tuesday, 13 October 2020

M&A and TOGAF

During an interesting online discussion on Mergers & Acquisitions, a basic question arose: if we consolidate the technologies & tools used across the two merging companies, would that suffice for most of the architecture needs of the new company?

Maybe; but in most cases, No.

A more formal approach is required to make sure we do not end up with a half-cooked chowder served in a platinum goblet. We need a means to formalize a recipe that takes care of most of the business & stakeholder concerns while making sure we have added essential quantities of innovation and budget to the recipe.

What if we could exploit learnings from TOGAF and its 4 domain pillars (BDAT) as the baseline?

As the first essential requirement, management, stakeholders and technology leaders must define and agree upon an architecture vision. The vision must represent the desired state of architecture that cuts across the BDAT (Business, Data, Application and Technology) pillars. Furthermore, the vision must act as the means to communicate to partner leaders where the new company is headed in the next 3–5 years.

Think of the vision as a simple but appealing menu at a Michelin-starred restaurant: just enough to interest the diner. For typical small to medium enterprises going through a merger, think of at least 2–3 months to define this, as it would essentially become the guiding star for the rest of the architecture detailing exercise in the coming days.

Once we have defined the vision, it’s critical to have the current state of the architectures ranked against the vision (think of maturity models). This could be as simple as a 1-to-5 scale, with the vision ranked at 5 and the current architectures at rank 1; especially if we just adopt everything from both companies as-is before going through the exercise below.

Each cycle of the TOGAF ADM in the coming months should help us get to rank 5 as we reassess our rank periodically — every quarter/year. This is similar to going from Michelin Star 1 up to 3.

The TOGAF ADM fits our need quite well while detailing each of the BDAT pillars.

The Business (common processes such as procurement, operations …), Data (kinds, tools, policies …), Application (toolsets, policies) and Technology (service registry, SOA, microservices, neural networks …) pillars require many viewpoints to be created as needed.

In addition, typical cross-cutting viewpoints like DevOps, infrastructure and HR must also be assessed and detailed during the ADM.

Carrying ahead with the BDAT definition, the ADM does provide the means to define the governance model (who, how, what) and when/who can change or refine the governance model itself.

Now is the perfect opportunity to define the roadmap for the next couple of years for the merged company, one that also helps improve the targeted architecture rank.

As we observe, the ADM does force us into adopting a formal mechanism to identify the perfect recipe for our new architecture. The ADM compels us to look into opportunities (even across innovation programs active in the two companies) that could pop up during the merger and could further lead to defining new business processes, tools, use cases and products too.

Once the first cycle of the ADM is complete, we could have reference enterprise architectures that partner businesses can consider. All documentation including the reference models, process changes, viewpoints, governance models, recipes and principles could now be captured in the TOGAF enterprise continuum.

Soup is now being served. This time it was well cooked and served in a proper china soup bowl.

EventChain

Applying Blockchain to Event Sourcing

The Event Sourcing pattern at its core requires an event store to maintain the events. What if we add these events, as they arrive, into a blockchain? This should effectively make sure the events have not been tampered with. The plan would be to initiate typical blockchain mining, after which the event is added to the “block-chain of events” — an “EventChain”.

The definite side effect is that until the mining is complete, the business transaction cannot be internally marked as complete. Considering the time typically taken for mining, this would probably be an offline job.

Tamper Proof

A typical challenge faced by organizations that employ event sourcing and an event store is securing the events. What if the DB admin for the event store manages to inject or remove events? The replayed events and the resulting projections are no longer valid in this case. Event chains should solve this issue for typical event stores.
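A minimal sketch of the EventChain idea (illustrative only; mining/proof-of-work is omitted): each block's hash covers the previous block's hash and the event payload, so injecting or removing an event breaks every subsequent hash and is caught on verification.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed record EventBlock(int Index, string EventPayload, string PreviousHash, string Hash);

public sealed class EventChain
{
    private readonly List<EventBlock> _blocks = new();

    public void Append(string eventPayload)
    {
        var index = _blocks.Count;
        var previousHash = index == 0 ? string.Empty : _blocks[^1].Hash;
        _blocks.Add(new EventBlock(index, eventPayload, previousHash, Hash(index, eventPayload, previousHash)));
    }

    // Tamper check: recompute every hash and confirm each block still points at its predecessor.
    public bool Verify() =>
        _blocks.Select((block, i) =>
            block.Hash == Hash(block.Index, block.EventPayload, block.PreviousHash) &&
            block.PreviousHash == (i == 0 ? string.Empty : _blocks[i - 1].Hash))
        .All(ok => ok);

    private static string Hash(int index, string payload, string previousHash) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes($"{index}|{previousHash}|{payload}")));
}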

Exploit the distributed infrastructure

For private event chains, where businesses do not want the chain or the events to be exposed, existing distributed systems/hosts can be exploited for mining. Your event store DB cluster hosts, event sourcing service hosts, API hosts, cache cluster hosts and others that are spread across geographies could be exploited for the same.

GDPR Challenges

There are cases where regulations require personal data to be removed from all data stores. In our case, this is about removing the related set of events from the event chain. Without the event chain, removing events from the event store was quick and easy.

Resetting the event chain when events are required to be deleted is challenging, especially if there have been many events after the event(s) concerned. This would require re-mining the rest of the events, after removing the event(s) that had personal data, all the way down to the most recent event. As this is an extremely time- and compute-intensive operation, it’s not recommended to store events that contain personal data in the event chain.

Snapshots

As the events from the event store can be played back to recreate a state at a point in time (“projections”), we could in fact have “snapshots” to identify a specific projection in time. We could link this snapshot as a child branch to the main event chain tree such that it’s not required to recalculate the projections each time; while making sure the projections themselves have not been tampered with.

We could look at having many child branches/trees for the different filters/conditions too.

Monday, 12 October 2020

Kafka Streams has an edge over Service Fabric?

Compared against the .NET/Azure offerings, the level of abstraction enabled by Kafka Streams for event processing while exploiting underlying Kafka message-topic-queue patterns is pretty neat. 
 
Did come across an interesting framework by @tonysneed on GitHub that uses C# libraries over Kafka Streams: https://wp.me/pWU98-1v2

Hope Service Fabric Mesh Reliable Actors or similar offerings from Azure catch up with Kafka Streams in terms of seamless integration for distributed event processing.

For a start, assuring messages are processed 'exactly-once' is a basic requirement for most distributed systems. I am yet to come across native frameworks in the .NET world that use Azure, Akka.NET Streams, Service Fabric Mesh or the like to enable essential distributed capabilities like 'exactly-once' and others with minimal developer effort.

#azure #kafka #confluent #kafkastreams #eventsourcing #akka #distributedcomputing #cloudarchitecture

Saturday, 29 June 2019

Software Engineering lost in the cloud?

It would seem the cloud is making you a lazy software engineer. Engineers these days have a ready answer for most architectural and design concerns: "it's taken care of at the cloud". This perception is scary and appears to turn any Tom, Dick or Harry with minimal to zero computer/software knowledge into a "master" software engineer overnight. This halo is bothering. Whatever happened to the clean code and patterns that were essential to designing your software in the days of distributed computing on local clusters? Perhaps no one today cares about minimizing traffic across nodes, syncing time across nodes, or time-sharing and optimizing resources during your minimal time at the node. Not sure of the solution for this, until you are choked into becoming yet another Harry.

Saturday, 12 January 2019

ServerFULL deployments

Moving away from Typical service deployments

Rather than having services tied to a set of machines and load balanced as-is today in the SOA/SaaS/microservices world, what if we could just throw in a set of servers and have services assigned/allocated to them dynamically, and, more specifically, attain tight packing of services on the same hardware?

Though mostly exploited on the cloud with AWS Lambdas and Azure Functions, serverless as a pattern is awesome for on-premise deployments too. An interesting set of options for serverless on-premise is available at this list. Though "serverless" is quite a misnomer in the case of on-premise deployments, where we really need to bother about extreme and efficient hardware utilization of the server, it is preferable to call this approach ServerFULL, as the desired effect is to fill up the server to the FULL ;)

Once the Docker images (and perhaps, later, memory images) are available on shared in-memory/SSD drives, any of the machines/VMs could be dynamically chosen for deploying the service, which is finally un-deployed once done, making space for the next.

OpenFaaS/OpenWhisk seem to be at the top of the list, with both exploiting Docker containers. Though there are still constraints on elasticity (bringing up new VMs that finally run the containers is time consuming, while adding more physical machines could take days), it is still an exciting means to efficiently exploit what is available on-premise at the moment.

Just like in the serverless world on the cloud, services that consume high resources (CPU/RAM) for long durations, and the ones that take comparatively longer to spawn, might not be candidates for the ServerFULL environment, as these tend to block up the VMs/containers for too long.

Think of designing typical business workflows with events, triggers, logic, nested flows and actions that span in/out, with these getting mapped into services by developers and further mapped to the ServerFULL world of machines dynamically - quite exciting times.

References: 


  1. https://martinfowler.com/articles/serverless.html
  2. https://winderresearch.com/a-comparison-of-serverless-frameworks-for-kubernetes-openfaas-openwhisk-fission-kubeless-and-more/


Wednesday, 11 July 2018

Structural Imbalance - In Software Systems


We come across many instances in the industry where "code-lumps" get deployed as software services/products with a beautiful UI included to cover up all the ugliness underneath. The design documents too look fancy, with the software patterns used neatly listed. After all this stunt, these modules end up with a short life-span, and before long there are innumerable critical issues being raised.
In the majority of these cases, product owners were forced into releasing these "code-lumps" that just weren't ready, while in other cases the anointed "architect" had no clue as to why the "code-lumps" exist and why the pattern was used. At first look, the software does appear to function as desired, with all the components "working" great in the demos.

How could these be avoided in the first place?

Just like the broken buildings we see across the road, structural imbalance refers to modules that don't make sense together. Individually, the chosen components/patterns appear perfect for the problem at hand, but they just don't fit together; structurally.
Right from a bird's-eye/logical view down to the drilled-down/code view, it's critical that a dedicated team of architects reach consensus on the many choices being made every day by engineers.
If only the team of architects had identified and defined the applicable Non-Functional Requirements (NFRs) initially, the architects and the team of software designers could have drilled down into one or more modules for a detailed design before coding.
Though check-ins could be allowed from all engineers, none of them should reach the release pipeline until all the "code-lumps" were removed. Architects & designers must agree that the code complies with the agreed NFRs before promoting the code up its life-cycle.
Working closely with the architects, the product owner would now be more confident in communicating with the stakeholders.
Do look forward to a related article on "Why all software engineers must NOT automatically become an 'Architect'", which, in addition to looking at skill & interest, also touches upon the essential philosophical outlook required of any upcoming architect.

Friday, 9 August 2013

Self-optimization in Distributed caches.



Distributed caches are systems where the cached data/objects are stored across distributed nodes/machines. When data is stored or retrieved by the consuming application, one or more systems in the distributed setup serve the request. This paper attempts to identify self-optimization techniques that could be applied to such a distributed cache. For a base implementation of the distributed cache, the open source project HoC (herd of cache @ http://hoc.codeplex.com) is referred to. This project implements the distributed cache in .NET using the concepts of consistent hashing.
Self-optimization in distributed computing refers to the capability of distributed systems to optimize themselves independent of any intervention, machine or human. In a typical decentralized and cooperative system like HoC, this means the nodes in the distributed cache can make decisions either independently or together. The latter would require various consensus algorithms to be applied by the distributed cache.
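For reference, a minimal sketch of the consistent hashing scheme such a cache is built on (illustrative only, not the HoC implementation): nodes are placed on a hash ring via virtual nodes, and a key is owned by the first node clockwise from the key's hash, so adding or removing a node remaps only a fraction of the keys.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative consistent-hash ring: each key is owned by the first node whose ring position
// is greater than or equal to the key's hash, wrapping around to the start of the ring.
public sealed class ConsistentHashRing
{
    private readonly SortedDictionary<uint, string> _ring = new();
    private const int VirtualNodesPerNode = 100;   // virtual nodes smooth out the load distribution

    public void AddNode(string nodeName)
    {
        for (var i = 0; i < VirtualNodesPerNode; i++)
            _ring[Hash($"{nodeName}#{i}")] = nodeName;
    }

    public void RemoveNode(string nodeName)
    {
        foreach (var point in _ring.Where(kv => kv.Value == nodeName).Select(kv => kv.Key).ToList())
            _ring.Remove(point);
    }

    public string GetNodeForKey(string key)
    {
        if (_ring.Count == 0) throw new InvalidOperationException("No nodes in the ring.");
        var hash = Hash(key);
        foreach (var kv in _ring)            // SortedDictionary iterates in ascending key order
            if (kv.Key >= hash) return kv.Value;
        return _ring.First().Value;          // wrap around to the first virtual node
    }

    private static uint Hash(string input) =>
        BitConverter.ToUInt32(SHA256.HashData(Encoding.UTF8.GetBytes(input)), 0);
}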

Self-Optimization: Candidate Use Cases:
1.) Optimization of node load: decisions made internally by the hosting nodes
In a typical consistent hash implementation, there is a possibility that the number of objects stored in the cache of some nodes is high compared to the neighboring nodes. This requires some of the data to be moved to the neighboring nodes. A node would first ask the neighboring node for its load; if it detects that its own total count is considerably higher, it would partition the objects stored and move the selected objects.
Locating an item in the cache could then require multiple hops to reach the target node where the data is stored. Whenever a node gets a request for an item that has been moved to a neighboring node, the call would need to be routed to that neighboring node. Each node is expected to maintain a list of the objects that were moved and the target neighboring node to which each object was moved.
During each fetch, the path/nodes traversed to reach the target node could be returned to the caller so that the next call for the same object goes directly to the target server, avoiding the intermediary traversal across nodes.
The end result of this approach would be a more balanced store of objects across nodes.
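An illustrative sketch of this rebalancing decision (the threshold, the neighbour API and the choice of keys to move are all hypothetical): a node compares its object count with a neighbour's, moves a slice of objects when its own count is considerably higher, and records where each moved object went so that lookups can be forwarded.

using System;
using System.Collections.Generic;
using System.Linq;

public sealed class CacheNode
{
    private readonly Dictionary<string, object> _items = new();
    private readonly Dictionary<string, CacheNode> _forwardedKeys = new();   // key -> node it was moved to
    private const double ImbalanceFactor = 1.5;                              // "considerably higher" threshold

    public int Count => _items.Count;

    public void RebalanceWith(CacheNode neighbour)
    {
        if (Count <= neighbour.Count * ImbalanceFactor) return;   // load is acceptable, do nothing

        var toMove = (Count - neighbour.Count) / 2;
        foreach (var key in _items.Keys.Take(toMove).ToList())
        {
            neighbour._items[key] = _items[key];
            _items.Remove(key);
            _forwardedKeys[key] = neighbour;                      // remember where the object went
        }
    }

    public object? Get(string key)
    {
        if (_items.TryGetValue(key, out var value)) return value;
        // Moved object: forward the call (the traversed path could be returned to the caller
        // so that its next call goes straight to the target node).
        return _forwardedKeys.TryGetValue(key, out var target) ? target.Get(key) : null;
    }
}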

2.) Self-Optimizing Consistent Hash Algorithm for load balancing
In a consistent hash implementation, similar to a hash bucket, the target node is selected based on the hash key returned by the underlying hashing algorithm. A typical problem would be that data could accumulate at a specific server. An alternative to approach 1 indicated above would be to apply machine learning such that the change/adjustment -> fn(load distribution) required to tune the hash algorithm can be identified. In this case, it should be noted that the fn(load distribution) required to normalize the overall load is specific to each system. A pattern could be detected for a specific system/installation and the load pattern for that system could be derived.
Applying this change to the underlying hash key algorithm would require a possible reset of the distributed system. Once reset, the adjustment learned/deduced by the system, fn(load distribution), would need to be applied each time a new object is saved/retrieved. This adjustment function itself could be tweaked further down the line automatically by the system, such that a new adjustment function is derived for the next run.
To monitor the overall usage pattern/load across nodes, it would be required to have a data store where the node vs. storage vs. load factor could be recorded. Each piece of data stored into the cache system would require its statistics to be stored into this data store. The next reset would require fn(load distribution) to be derived and applied to the underlying hash algorithm such that the load is more spread out in the next run.
This optimization technique assumes that the kind of data including its type, format, locale etc. does not vary considerably across resets.
3.) Optimized resource utilization on nodes
The CPU, RAM and other resources of each node would need to be used in a highly optimized fashion. Assuming these are not dedicated nodes but machines shared by other processes too, it would be required to make sure the cache service does not overuse/bloat the machine resources. Optimized usage would require continuous monitoring of these resources and adjusting the internal parameters accordingly. These parameters could be thread counts, memory allocated from the heap, priority of the thread/process (to free up CPU), receive/send buffer sizes, etc.
Each node should have the capability to derive the optimal usage of resources on a continuous basis, refined after each optimization run. Parameter dependence (e.g. thread priority vs. memory) would be a factor that would need to be derived again based on a basic statistical record of resource usage. If the nodes are similar in deployment, learnings from an individual machine/node could be shared with the other nodes.
4.) Optimization of node hit rate using duplicate stores.
If it is seen that the hit rate of a particular object (or objects) is high on a specific node, it would be desirable to have duplicates of the same object stored across nodes, or across duplicate nodes, such that a virtual relay/routing mechanism could be employed to divert the underlying request call. A virtual software relay could be employed just before this set of nodes such that it could route/direct to one of the clone/duplicate nodes. This mechanism assumes custom relay code that determines whether the data has been duplicated and then diverts accordingly.
For this self-optimization, the system needs to have a knowledge base that records whether a duplicate of an item is being stored and its hit rate. Each node would need to determine, based on the object's hit rate over a time window, whether to duplicate the object. In addition to the basic object hit-rate frequency, the system can learn from patterns in object usage – a specific group of objects might see a high hit rate on Mondays, and the system might assign duplicate nodes automatically on Mondays based on the learned hit-rate pattern.
This method of storage can be exploited as a disaster recovery option too. If one of the nodes in the duplicate set goes down, we are assured that the system continues to work, as the requests can now be taken care of by the other nodes in the duplicate set.
5.) Optimization for near geography store.
Enterprise applications hosted on the cloud today are distributed on a global scale, and when distributed caches are hosted on a cloud, it would be desirable to have the most commonly used items geographically near the consumer.
Dynamic cache clusters (not just cache groups, but a cache within a cache in a consistent hash implementation), wherein each target node internally maintains another distributed cache, could be employed. The dynamic cache cluster creation would be based on the geo usage statistics and would require the nodes to group themselves into a cluster and allocate one of themselves as a node in the parent cluster.
E.g. when the usage from Bangalore is seen to be high for a specific object, this object could be moved to a cluster/node near Bangalore. Internal routing tables would need to be updated accordingly to point to the new target node.
More than likely, in typical implementations, it would be required to derive geo usage statistics for a group of objects rather than independent objects. The group of objects could be based on an ID or even a derivative function of a record.
6.) Optimized Network utilization
Similar to point 3, optimal usage of the network is of high importance in any distributed system. Whenever routing happens (cases 1, 4 and 5 mentioned above), each node could internally maintain a spanning tree with weighted paths, the weights directly reflecting the historical usage of that particular network path, for better optimized usage of the network. Physical routers could be programmed to use a specific path based on the learning by each node.

Highly optimized Systems

Highly optimized caches would require one or more of the above strategies to be applied together wherever applicable. This would also require the fn(optimization parameters) to be derived on the go by the system independent of any additional input.