Why automated causal reasoning is possible for incident investigation

June 8, 2024

In the previous post, we discussed how a developer tool with causal reasoning is key to running reliable and robust systems. Now, we discuss why automated causal reasoning is easier to implement in some domains than in others. We will use two use cases to explore the challenges and factors that complicate automating causal reasoning. Finally, we will look at why it is possible to implement it for incident investigation.

To establish a causal relationship between two variables (X caused Y), three conditions must be met:

  • Chronological ordering: X happened before Y
  • Strong association: X and Y are usually found together
  • Elimination of alternative causes: no variable Z exists that could have caused Y

Chronological ordering is the easiest to establish, as modern timekeeping has enabled us to record timestamps with microsecond precision. Detecting whether two variables are strongly associated is straightforward too: in most cases, the association is already documented in the literature, and in rare cases one can run statistical experiments to establish it. Eliminating alternative causes is the most difficult of all, as it requires a complete understanding of the environment in which the two variables exist. As a rule of thumb, the more closed an environment is, the easier it is to eliminate alternative causes. Conversely, the more open an environment is, the harder it becomes to eliminate them.

Let’s consider a simple case first: the temperature rising in an unplugged refrigerator. Applying our three conditions for causal inference, we see that there is a clear chronological ordering between unplugging the refrigerator and its internal temperature rising, i.e. the temperature rises after we unplug it. Second, we know from how a refrigerator works that its internal temperature goes up when the power is cut. Lastly, this is a closed system: all conditions are under our control, so we can confidently rule out other factors such as the door being left ajar or a mechanical malfunction. Therefore, building an automated causal reasoning engine that accurately infers whether unplugging a refrigerator will increase its internal temperature is relatively straightforward.

Now let's move on to a more complex scenario: predicting whether a new drug can cure a disease. Applying the three conditions again, the chronological condition is the easiest to satisfy: the patient must get better after taking the drug. Second, there must be strong theoretical or statistical evidence that the drug indeed cures the disease; this is typically established through existing knowledge and often supplemented by animal testing. Lastly, one needs to ensure that no other variables are at play. This is the hardest of all because our understanding of the human body is incomplete, and it is simply not possible to “look” at every variable in the human body given the prohibitive cost of medical imaging. As a result, it is almost impossible to build an automated causal reasoning engine to predict whether a drug can cure a disease. In practice, this is overcome with Randomized Controlled Trials (RCTs), the gold standard for eliminating alternative causes (confounding variables).

Where does a causal reasoning engine for incident investigation lie on this spectrum? Applying the three conditions to incident investigation, we see that chronological ordering is possible thanks to microsecond timestamps on observability data, and distributed tracing takes this a step further by delineating what happened first. Determining the association between two variables can be accomplished by building an in-house correlation engine. Lastly, eliminating alternative causes is possible because, despite all their complexity, software systems are closed systems. This is due to the efforts of the software engineering community over the last few years to standardize applications and infrastructure. Using infrastructure from cloud providers, containerization, and industry-standard application development frameworks limits alternative causes to a manageable, finite number. Based on this theoretical evaluation, I believe that automated causal reasoning is possible for incident investigation.
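To make the chronological-ordering step concrete, here is a minimal sketch of how microsecond timestamps alone prune the search space during an investigation. The event schema, service names, and example data are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts_us: int    # timestamp, microseconds since epoch
    source: str   # emitting service or resource
    message: str

def candidate_causes(events: list[Event], symptom: Event) -> list[Event]:
    """Condition 1 (chronological ordering): an event can only have
    caused the symptom if it happened strictly before it."""
    return sorted((e for e in events if e.ts_us < symptom.ts_us),
                  key=lambda e: e.ts_us)

# Illustrative incident: a latency spike, with three nearby events.
symptom = Event(1_000_500, "checkout-api", "p99 latency spike")
events = [
    Event(1_000_200, "db-primary", "connection pool exhausted"),
    Event(1_000_700, "autoscaler", "scaled out to 6 pods"),
    Event(1_000_100, "deploy-svc", "new version rolled out"),
]

for e in candidate_causes(events, symptom):
    print(e.ts_us, e.source, e.message)
# The deploy and the pool exhaustion precede the spike and remain
# candidates; the autoscaler event is ruled out by ordering alone.
```

The surviving candidates would then be ranked by a correlation engine (condition two) and narrowed further using knowledge of the system's architecture (condition three).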

However, being theoretically possible is different from working in reality. Want to know what it would look like in practice? Stay tuned!

Deepak

Co-founder, Hoistr
