Correlation, Causality, and Developer Tooling

May 13, 2024

As we go about our daily lives, almost every action we take involves identifying correlations, drawing causal inferences, or a mix of both. Identifying correlations, or associating one observation with another, is a primal response across species in the animal kingdom: a flock of birds takes flight when startled by a sound. In human cultures, correlations are popularized through idioms such as “where there's smoke, there's fire”.

[Image: Smoke]

Causal reasoning, on the other hand, is thought to be unique to humans, as it requires a fundamental understanding of how the different components and actors in a given scenario interact with each other. This understanding lets us answer questions such as “What would happen if I do X?” and look back to ask, “What would have happened if I had done Y instead?”, which sets us apart from the rest of the animal kingdom. Consider this example: most people would agree that warmer weather drives more ice cream sales, but would be bewildered if told that selling more ice cream causes warm weather. This is because most of us understand the world at a reasonable resolution. We know that people like to cool down with an ice cream when it is warm outside. We also know of no mechanism by which ice cream sales could warm up the weather. Unfortunately, other cases are not as simple, such as understanding the effectiveness of a new drug. It is common to arrive at incorrect conclusions because we have an incomplete or incorrect understanding of the world, or are unable to reason with the facts at hand. At other times, people give up on drawing causal inferences altogether because it is simply too hard.
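The ice cream example can be sketched as a toy simulation. The snippet below is only an illustration under made-up assumptions: the temperature distribution, the sales formula, and the helper names `simulate` and `pearson` are all hypothetical. It shows that the observed correlation looks the same in both directions, while forcing one variable (an intervention) reveals which way the cause actually runs.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean(xs):
    return sum(xs) / len(xs)


def simulate(days, rng, force_temp=None, force_sales=None):
    """Toy world in which temperature drives sales, never the reverse.

    Every constant here is an illustrative assumption, not real data.
    """
    temps, sales = [], []
    for _ in range(days):
        # Temperature comes from the weather alone, unless we intervene on it.
        temp = rng.gauss(20, 8) if force_temp is None else force_temp
        # Sales respond to temperature, unless we intervene and fix them.
        if force_sales is None:
            sold = 50 + 5 * temp + rng.gauss(0, 10)
        else:
            sold = force_sales
        temps.append(temp)
        sales.append(sold)
    return temps, sales


rng = random.Random(0)

# Observation only: the two variables move together.
temps, sales = simulate(5000, rng)
print("observed correlation:", round(pearson(temps, sales), 2))

# "What if I do X?" Force sales low or high: the weather does not react.
temps_low, _ = simulate(5000, rng, force_sales=0)
temps_high, _ = simulate(5000, rng, force_sales=500)
print("mean temp, sales forced low vs high:",
      round(mean(temps_low), 1), round(mean(temps_high), 1))

# Force the temperature instead: sales follow.
_, sales_cool = simulate(5000, rng, force_temp=10)
_, sales_warm = simulate(5000, rng, force_temp=30)
print("mean sales, temp forced to 10 vs 30:",
      round(mean(sales_cool)), round(mean(sales_warm)))
```

The correlation alone cannot tell the two stories apart; only the interventions, which rely on knowing how the world is wired, separate "weather drives sales" from "sales drive weather".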

A similar challenge exists in the software industry. Most software products running at scale today are complex and need thousands of engineers to manage them. The need to manage this complexity has led to the rise of the developer tools industry, which includes cloud providers for running applications at scale, observability platforms for understanding how systems are behaving, IDEs for developing applications faster, and project management tools, among others. However, these tools generally do not offer causal insights that explain how doing X would affect the user experience, or what would have happened had the team done Y instead. At best, developer tools ingest a large quantity of your data and present a clean interface listing correlated events in chronological order, leaving it to an engineer to draw causal conclusions. The current state of the art works for smaller use cases but often falls flat when dealing with time-sensitive issues in complex systems.

[Image: Uptime report]

Despite using state-of-the-art tools, systems go offline regularly, as seen in the uptime report above for a major Silicon Valley tech company. In my experience, no single tool allows engineers to tie business rules based on user requirements to the abstractions they deal with, such as applications, endpoints, system architecture, infrastructure, and third-party services. Most engineers try to overcome this problem by spending a significant portion of their time setting up complex runbooks and alerting systems. Even with runbooks and alerts in place, the average outage takes multiple engineering hours to resolve. This is precisely the problem we are addressing at Hoistr: we are building a developer tool that aims to help resolve incidents faster than any human-driven process. As with the simple case of warm weather and ice cream, we make this possible by acquiring a deep understanding of our users’ software systems and building the ability to reason with the available facts using industry best practices. Our vision is that in a few years, Hoistr's users will have 10x fewer outages, and when outages do happen, they will be resolved 10x faster.

How do we go about doing this? Stay tuned to learn more!

Deepak

Co-founder, Hoistr
