📊Reading Notes for "LEMMA-RCA"

Jul 9, 2024 · 6 min read

Overview

This paper introduces LEMMA-RCA, a new large dataset for diverse RCA tasks across multiple domains and modalities. The dataset covers real-world IT and OT operation systems. The authors also evaluate eight baseline methods on it to demonstrate the quality of LEMMA-RCA. The official website is https://lemma-rca.github.io/.

  1. What problem does the paper try to solve?

    Automated root cause analysis is crucial, but the field lacks a mainstream dataset, so methods cannot be compared fairly.

  2. What is the proposed solution?

    They propose LEMMA-RCA, a rich dataset containing multiple sub-datasets.

  3. What are the key experimental results in this paper?

    The paper tests the performance of eight baseline models on the LEMMA-RCA dataset.

  4. What are the main contributions of the paper?

    They propose the LEMMA-RCA dataset and evaluate eight baseline models on it.

  5. What are the strong points and weak points in this paper?

    • Strong Point: Proposed a new dataset and conducted extensive evaluation.
    • Weak Point: All baseline methods are causal-graph-based; no other family of RCA methods is evaluated.

Background

Root cause analysis (RCA) is essential for identifying the underlying causes of system failures and ensuring the reliability and robustness of real-world systems. However, traditional manual RCA is labor-intensive, costly, and prone to errors, so data-driven methods are needed. Despite significant progress in RCA techniques, large-scale public datasets remain scarce.

Some important terms in the RCA field:

  • Key Performance Indicator (KPI) is a time series indicating the system status, such as latency and service response time in microservice systems.
  • Entity Metrics are multivariate time series collected by monitoring numerous system entities or components, such as CPU/Memory utilization in microservice systems.
  • Data-driven Root Cause Analysis Problem. Given the monitoring data of system entities and system KPIs, identify the top K system entities that are relevant to KPIs when the system fails.
    • Offline/Online: Offline RCA uses only historical data to diagnose past failures; Online RCA operates in real time on current data streams to address issues promptly.
    • Single-modal/multi-modal: Single-modal RCA relies solely on one type of data for a focused analysis; Multi-modal RCA uses multiple data sources for a comprehensive assessment.
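
To make the input and output of the problem concrete, here is a minimal sketch that ranks entities by the absolute Pearson correlation between their metric and the KPI. This naive correlation ranking is only a stand-in for illustration and is not one of the methods evaluated in the paper; the names `pearson`, `top_k_entities`, and the pod names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_k_entities(entity_metrics, kpi, k):
    """Return the k entity names whose metric correlates most strongly with the KPI.

    entity_metrics: dict mapping entity name -> metric time series (assumed layout).
    kpi: KPI time series of the same length.
    """
    scores = {name: abs(pearson(series, kpi)) for name, series in entity_metrics.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: the latency KPI spikes at the end, and so does pod-a's metric.
kpi = [1, 1, 1, 5, 9]
entity_metrics = {
    "pod-a": [10, 10, 10, 50, 90],
    "pod-b": [3, 1, 2, 3, 1],
    "pod-c": [5, 5, 5, 5, 5],
}
print(top_k_entities(entity_metrics, kpi, 2))
```

Correlation says nothing about causal direction, which is one reason the baselines discussed later learn a causal graph instead.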

RCA workflow

Dataset

Base Information

LEMMA-RCA is a multi-domain, multi-modal dataset that includes textual system logs with millions of event records and time series metric data collected from real system faults. It covers both IT and OT scenarios, such as microservice platforms and water treatment systems.

Existing datasets for root cause analysis.

Collection

The dataset is collected from two domains and divided into four sub-datasets:

  • IT operations

    • Product Review

      Platform: Composed of six OpenShift nodes and 216 system pods.

      The architecture of Product Review Platform

      Faults: out-of-memory, high-CPU-usage, external-storage-full, DDoS attack.

      Metrics: Prometheus records eleven types of node-level metrics and six types of pod-level metrics; ElasticSearch collects log data including timestamp, pod name, log message, etc.; JMeter collects the system status information.

      KPI: Latency is used as the system KPI, because a system failure causes latency to increase significantly.

    • Cloud Computing

      Platform: Eleven system nodes.

      Faults: six different types of faults, such as cryptojacking, configuration change failure, etc.

      Metrics: System metrics are extracted from CloudWatch Metrics on EC2 instances; three log types (log messages, API debug logs, and MySQL logs) are extracted from CloudWatch Logs; JMeter records error rate and utilization rate as KPIs.

    Data statistics of IT operation sub-datasets.

  • OT operations

    • SWaT: Collected over an 11-day period from a water treatment testbed equipped with 51 sensors. The system operated normally during the first 7 days, followed by attacks over the last 4 days, resulting in 16 system faults.
    • WADI: Gathered from a water distribution testbed over 16 days, featuring 123 sensors and actuators. The system maintained normal operations for the first 14 days before experiencing attacks in the final 2 days, with 15 system faults recorded.

Data statistics of OT operation sub-datasets.

Preprocessing

Some non-stationary data are unpredictable and cannot be modeled effectively, so they should be excluded. The paper therefore introduces several methods to preprocess the data.

Log Feature Extraction. Because the log data is unstructured and part of it carries no useful information, the paper transforms the logs into a time-series format. First, they use log-parsing tools to structure the log messages. Then they segment the data using 10-minute windows with 30-second intervals and calculate the occurrence frequency as the first feature type, denoted as $X_1^L\in \mathbb{R}^T$. Next, they introduce a second feature type based on “Golden signals” derived from domain knowledge, such as the frequency of abnormal logs associated with system failures like DDoS attacks, storage failures, and resource over-utilization. This feature is denoted as $X_2^L\in \mathbb{R}^T$. Finally, they segment the logs using the same time windows and apply PCA to reduce feature dimensionality, selecting the most significant component as $X_3^L\in \mathbb{R}^T$. Together, the three features form the matrix $X^L=[X_1^L,X_2^L,X_3^L]\in \mathbb{R}^{3\times T}$.
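
The first feature type, the log occurrence frequency, can be sketched as a sliding-window count. This is a rough illustration and not the authors' code. It assumes that "10-minute windows with 30-second intervals" means a 600-second window moved forward in 30-second steps, that timestamps are given in seconds, and that each window is half-open, `[t, t + window)`. The function name `log_frequency_series` is hypothetical.

```python
from bisect import bisect_left


def log_frequency_series(timestamps, start, end, window=600, step=30):
    """Count log events in sliding windows.

    timestamps: event times in seconds, in any order.
    Returns one count per window [t, t + window) for
    t = start, start + step, ... as long as the window fits before `end`.
    """
    ordered = sorted(timestamps)
    counts = []
    t = start
    while t + window <= end:
        first = bisect_left(ordered, t)           # first event at or after t
        last = bisect_left(ordered, t + window)   # first event at or after the window end
        counts.append(last - first)
        t += step
    return counts


# Six log events observed over 660 seconds give three overlapping windows.
events = [0, 10, 599, 600, 629, 630]
print(log_frequency_series(events, start=0, end=660))  # [3, 3, 4]
```

Sorting once and locating the window bounds with binary search keeps each window lookup cheap even when the log contains millions of events.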

KPI Construction. For the SWaT and WADI datasets, they apply anomaly detection algorithms to model the data and transform the discrete values into a continuous format.

Experiments

Metrics

Precision@K (PR@K): It measures the probability that the top $K$ predicted root causes are real, formulated as:

$$ \text{PR@K}=\frac{1}{|\mathbb{A}|}\sum_{a\in\mathbb{A}}\frac{\sum_{i\le K}\mathbb{1}[R_a(i)\in V_a]}{\min(K,|V_a|)} $$

Where $\mathbb{A}$ is the set of system faults, $a$ is one fault, $V_a$ is the set of real root causes of $a$, $R_a$ is the ranked list of predicted root causes of $a$, $R_a(i)$ is the $i$-th predicted cause in $R_a$, and $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise.
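
As a quick check of the PR@K definition, here is a small sketch on toy data. The function name `precision_at_k`, the dictionary layout, and the fault and entity names are all made up for illustration.

```python
def precision_at_k(predictions, truths, k):
    """PR@K averaged over all faults.

    predictions: dict mapping fault -> ranked list of predicted root causes.
    truths: dict mapping fault -> non-empty set of real root causes.
    """
    total = 0.0
    for fault, ranked in predictions.items():
        true_set = truths[fault]
        # Number of the top-k predictions that are real root causes.
        hits = sum(1 for entity in ranked[:k] if entity in true_set)
        total += hits / min(k, len(true_set))
    return total / len(predictions)


predictions = {"fault-1": ["a", "b", "c"], "fault-2": ["x", "y", "z"]}
truths = {"fault-1": {"b"}, "fault-2": {"z"}}

print(precision_at_k(predictions, truths, 1))  # 0.0
print(precision_at_k(predictions, truths, 2))  # 0.5
print(precision_at_k(predictions, truths, 3))  # 1.0
```

The `min(K, |V_a|)` denominator means a fault with a single real root cause can still reach a score of 1 at any K, as long as that cause appears in the top K.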

Mean Average Precision@K (MAP@K): It assesses the top $K$ predicted causes from the overall perspective, formulated as:

$$ \text{MAP@K}=\frac{1}{K|\mathbb{A}|}\sum_{a\in \mathbb{A}}\sum_{1\le j\le K}\text{PR@j} $$
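
A small sketch of MAP@K on toy data follows. It reads the PR@j inside the double sum as the precision of a single fault $a$ at cutoff $j$, which is my interpretation of the formula; the function name `map_at_k` and the data are hypothetical.

```python
def map_at_k(predictions, truths, k):
    """MAP@K: per-fault precision at cutoffs 1..k, averaged over cutoffs and faults.

    predictions: dict mapping fault -> ranked list of predicted root causes.
    truths: dict mapping fault -> non-empty set of real root causes.
    """
    total = 0.0
    for fault, ranked in predictions.items():
        true_set = truths[fault]
        for j in range(1, k + 1):
            hits = sum(1 for entity in ranked[:j] if entity in true_set)
            total += hits / min(j, len(true_set))
    return total / (k * len(predictions))


predictions = {"fault-1": ["a", "b", "c"], "fault-2": ["x", "y", "z"]}
truths = {"fault-1": {"b"}, "fault-2": {"z"}}

# fault-1 scores 0, 1, 1 at cutoffs 1..3; fault-2 scores 0, 0, 1.
print(map_at_k(predictions, truths, 3))  # 0.5
```

Because every cutoff up to K contributes, a model that places the real root cause earlier gets a higher MAP@K even when its PR@K is the same.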

Mean Reciprocal Rank (MRR): It evaluates the ranking capability of models, formulated as:

$$ \text{MRR}=\frac{1}{|\mathbb{A}|}\sum_{a\in \mathbb{A}}\frac{1}{\text{rank}_{R_a}} $$

Where $\text{rank}_{R_{a}}$ is the rank number of the first correctly predicted root cause for system fault $a$.
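
MRR can be sketched the same way. The function name `mean_reciprocal_rank` and the data are hypothetical, and letting a fault with no correct prediction contribute 0 is an assumption; the notes above do not say how that case is handled.

```python
def mean_reciprocal_rank(predictions, truths):
    """MRR: average of 1 / rank of the first correct prediction per fault.

    predictions: dict mapping fault -> ranked list of predicted root causes.
    truths: dict mapping fault -> set of real root causes.
    A fault with no correct prediction contributes 0 (assumption).
    """
    total = 0.0
    for fault, ranked in predictions.items():
        true_set = truths[fault]
        for position, entity in enumerate(ranked, start=1):
            if entity in true_set:
                total += 1.0 / position
                break
    return total / len(predictions)


predictions = {"fault-1": ["a", "b", "c"], "fault-2": ["x", "y", "z"]}
truths = {"fault-1": {"b"}, "fault-2": {"z"}}

# First correct predictions sit at ranks 2 and 3, so MRR = (1/2 + 1/3) / 2.
print(mean_reciprocal_rank(predictions, truths))
```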

Baselines

Causal-graph-based RCA methods can provide deeper insights into system failures, thus all baseline methods fall into this category.

  • PC: A classic constraint-based causal discovery algorithm that identifies the causal graph’s skeleton using independence tests.
  • Dynotears: Constructs dynamic Bayesian networks through vector autoregression models.
  • C-LSTM: Utilizes LSTM to model temporal dependencies and capture nonlinear Granger causality.
  • GOLEM: Relaxes the hard Directed Acyclic Graph (DAG) constraint of NOTEARS with a scoring function.
  • REASON: An interdependent network model learning both intra-level and inter-level causal relationships.
  • Nezha: A multi-modal method designed to identify root causes by detecting abnormal patterns.
  • MULAN: A multi-modal RCA method that learns the correlation between different modalities and co-constructs a causal graph for root cause identification.
  • CORAL: An online single-modal RCA method based on incremental disentangled causal graph learning.

Results

Results for offline RCA baselines with multiple modalities on the Product Review dataset.

Results for offline RCA baselines on the SWaT and WADI datasets.

Results for online root cause analysis baselines on all sub-datasets.