📊 Reading Notes for "LEMMA-RCA"
Overview
This paper introduces LEMMA-RCA, a new large dataset for diverse RCA tasks across multiple domains and modalities. The dataset covers real-world IT and OT operation systems. The authors also evaluate eight baseline methods on it to demonstrate the quality of LEMMA-RCA. The official website is https://lemmarca.github.io/.

What problem does the paper try to solve?
Automated root cause analysis is crucial, but the field currently lacks a mainstream dataset, so methods cannot be compared fairly.

What is the proposed solution?
They propose LEMMA-RCA, a rich dataset containing multiple sub-datasets.

What are the key experimental results in this paper?
The authors test the performance of eight baseline models on the LEMMA-RCA dataset.

What are the main contributions of the paper?
They propose the LEMMA-RCA dataset and evaluate eight baseline models on it.

What are the strong points and weak points in this paper?
 Strong Point: Proposes a new dataset and conducts an extensive evaluation.
 Weak Point: All baselines are causal-graph-based methods; no other family of RCA method is evaluated.
Background
Root cause analysis (RCA) is essential for identifying the underlying causes of system failures and ensuring the reliability and robustness of real-world systems. However, traditional manual RCA is labor-intensive, costly, and prone to errors, so data-driven methods are needed. Despite significant progress in RCA techniques, large-scale public datasets remain limited.
Some important terms in the RCA field:
 Key Performance Indicator (KPI): A time series indicating the system status, such as latency and service response time in microservice systems.
 Entity Metrics: Multivariate time series collected by monitoring numerous system entities or components, such as CPU/memory utilization in microservice systems.
 Data-driven Root Cause Analysis Problem: Given the monitoring data of system entities and system KPIs, identify the top K system entities that are most relevant to the KPIs when the system fails.
 Offline/Online: Offline RCA uses only historical data to diagnose past failures; online RCA operates in real time on current data streams to address issues promptly.
 Single-modal/Multi-modal: Single-modal RCA relies on one type of data for a focused analysis; multi-modal RCA uses multiple data sources for a comprehensive assessment.
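To make the problem statement concrete, the following minimal Python sketch ranks entities by the absolute Pearson correlation between each entity metric and the KPI, then returns the top K. Correlation is only a naive stand-in chosen for illustration, not a method from the paper; the function name `rank_entities_by_correlation` and the toy series are hypothetical.

```python
import numpy as np


def rank_entities_by_correlation(metrics, kpi, k):
    """Return the k entity names whose metric correlates most with the KPI.

    metrics: dict mapping entity name -> 1-D series (same length as kpi).
    This is a naive illustration, not one of the paper's baselines.
    """
    kpi = np.asarray(kpi, dtype=float)
    scores = {}
    for name, series in metrics.items():
        series = np.asarray(series, dtype=float)
        if series.std() == 0 or kpi.std() == 0:
            scores[name] = 0.0  # a constant series carries no signal
        else:
            scores[name] = abs(np.corrcoef(series, kpi)[0, 1])
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): latency rises together with CPU usage on pod A.
kpi = [1, 1, 1, 5, 6, 7]
metrics = {
    "cpu_pod_a": [0.2, 0.2, 0.2, 0.9, 0.95, 1.0],
    "mem_pod_b": [0.5, 0.4, 0.5, 0.4, 0.5, 0.4],
}
print(rank_entities_by_correlation(metrics, kpi, k=1))
```

The baselines discussed later replace this correlation score with rankings derived from a learned causal graph.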
Dataset
Base Information
LEMMA-RCA is a multi-domain, multi-modal dataset that includes textual system logs with millions of event records and time-series metric data collected from real system faults. It covers both IT and OT scenarios, such as microservice platforms and water treatment systems.
Collection
The dataset is collected from two domains and divided into four sub-datasets:

IT operations

Product Review
Platform: Composed of six OpenShift nodes and 216 system pods.
Faults: out-of-memory, high CPU usage, external storage full, DDoS attack.
Metrics: Prometheus records eleven types of node-level metrics and six types of pod-level metrics; ElasticSearch collects log data including timestamp, pod name, log message, etc.; JMeter collects the system status information.
KPI: Latency is used as the system KPI, because a system failure causes latency to increase significantly.

Cloud Computing
Platform: Eleven system nodes.
Faults: Six different types of faults, such as cryptojacking and configuration change failure.
Metrics: System metrics are extracted from CloudWatch Metrics on EC2 instances; three log types (log messages, API debug logs, and MySQL logs) are extracted from CloudWatch Logs; JMeter records error rate and utilization rate as KPIs.


OT operations
 SWaT: Collected over an 11-day period from a water treatment testbed equipped with 51 sensors. The system operated normally during the first 7 days, followed by attacks over the last 4 days, resulting in 16 system faults.
 WADI: Gathered from a water distribution testbed over 16 days, featuring 123 sensors and actuators. The system maintained normal operations for the first 14 days before experiencing attacks in the final 2 days, with 15 system faults recorded.
Preprocessing
Some non-stationary data are unpredictable and cannot be modeled effectively, so they should be excluded. The paper therefore introduces several methods for preprocessing the data.
Log Feature Extraction. Because log data are unstructured and some entries carry no useful information, the paper transforms the log data into time-series format. First, the authors use log-parsing tools to structure the log messages. They then segment the data using 10-minute windows with 30-second intervals and compute the occurrence frequency as the first feature type, denoted $X_1^L\in \mathbb{R}^T$. Next, they introduce a second feature type based on “golden signals” derived from domain knowledge, such as the frequency of abnormal logs associated with system failures like DDoS attacks, storage failures, and resource over-utilization; this feature is denoted $X_2^L\in \mathbb{R}^T$. Finally, they segment the logs using the same time windows and apply PCA to reduce feature dimensionality, selecting the most significant component as $X_3^L\in \mathbb{R}^T$. Stacking the three features gives the matrix $X^L=[X_1^L,X_2^L,X_3^L]\in \mathbb{R}^{3\times T}$.
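The three log features can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: logs are assumed to be already parsed into (timestamp, template id) pairs, the set of abnormal templates is assumed to be supplied by a domain expert, and PCA is assumed to be applied to the per-window template count matrix (the notes do not say which features PCA receives). The name `log_features` and all parameters are hypothetical.

```python
import numpy as np

WINDOW_S = 600  # 10-minute window (from the notes)
STEP_S = 30     # 30-second interval between window starts (from the notes)


def log_features(timestamps, template_ids, abnormal_templates, n_templates):
    """Return a (3, T) matrix [X1, X2, X3] built from parsed log events.

    timestamps: event times in seconds.
    template_ids: integer template id per event, each < n_templates.
    abnormal_templates: ids counted as "golden signal" events (assumption).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    template_ids = np.asarray(template_ids, dtype=int)
    t0, t_end = timestamps.min(), timestamps.max()
    n_windows = max(1, int((t_end - t0 - WINDOW_S) // STEP_S) + 1)
    starts = t0 + STEP_S * np.arange(n_windows)

    # Per-window count of each template: shape (T, n_templates).
    counts = np.zeros((n_windows, n_templates), dtype=int)
    for w, start in enumerate(starts):
        in_window = (timestamps >= start) & (timestamps < start + WINDOW_S)
        counts[w] = np.bincount(template_ids[in_window], minlength=n_templates)

    x1 = counts.sum(axis=1)                                 # all events
    x2 = counts[:, sorted(abnormal_templates)].sum(axis=1)  # abnormal events

    # First principal component of the centered count matrix, via SVD.
    centered = counts - counts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    x3 = centered @ vt[0]

    return np.vstack([x1, x2, x3])
```

Each window is half-open, `[start, start + 600)`, so consecutive windows overlap by 570 seconds and every row of the result has one value per window start.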
KPI Construction. Anomaly detection algorithms are used to model the SWaT and WADI datasets and to transform the discrete values into a continuous format.
Experiments
Metrics
Precision@K (PR@K): It measures the probability that the top $K$ predicted root causes are real, formulated as:
$$ \text{PR@K}=\frac{1}{|\mathbb{A}|}\sum_{a\in\mathbb{A}}\frac{\sum_{i\le K}\mathbb{1}[R_a(i)\in V_a]}{\min(K,|V_a|)} $$
where $\mathbb{A}$ is the set of system faults, $a$ is one fault, $V_a$ is the set of real root causes of $a$, $R_a$ is the ranked list of predicted root causes of $a$, and $R_a(i)$ is the $i$-th predicted cause in $R_a$.
Mean Average Precision@K (MAP@K): It assesses the top $K$ predicted causes from the overall perspective, formulated as:
$$ \text{MAP@K}=\frac{1}{K|\mathbb{A}|}\sum_{a\in \mathbb{A}}\sum_{1\le j\le K}\text{PR@j} $$
Mean Reciprocal Rank (MRR): It evaluates the ranking capability of models, formulated as:
$$ \text{MRR}=\frac{1}{|\mathbb{A}|}\sum_{a\in \mathbb{A}}\frac{1}{\text{rank}_{R_a}} $$
where $\text{rank}_{R_{a}}$ is the rank of the first correctly predicted root cause for system fault $a$.
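The three metrics can be computed with a few lines of Python. The sketch below assumes each fault case is a pair of (ranked list of predicted entities, set of true root causes); the function names and toy cases are hypothetical. Scoring a fault whose true root cause never appears in the ranking with a reciprocal rank of 0 is an assumption, since the notes do not cover that case.

```python
def precision_at_k(ranked, truth, k):
    """Share of the top-k predictions that are true root causes for one fault."""
    hits = sum(1 for entity in ranked[:k] if entity in truth)
    return hits / min(k, len(truth))


def pr_at_k(cases, k):
    """PR@K: per-fault precision averaged over all fault cases."""
    return sum(precision_at_k(r, t, k) for r, t in cases) / len(cases)


def map_at_k(cases, k):
    """MAP@K: mean of PR@j for j = 1..K."""
    return sum(pr_at_k(cases, j) for j in range(1, k + 1)) / k


def mrr(cases):
    """MRR: mean reciprocal rank of the first correct prediction."""
    total = 0.0
    for ranked, truth in cases:
        rank = next((i for i, e in enumerate(ranked, 1) if e in truth), None)
        total += 1.0 / rank if rank else 0.0  # assumption: a miss scores 0
    return total / len(cases)


# Two toy faults: the first is ranked perfectly, the second is found at rank 2.
cases = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"y"}),
]
print(pr_at_k(cases, 1), map_at_k(cases, 3), mrr(cases))
```

For the toy cases, PR@1 is 0.5 (only the first fault is hit at rank 1), PR@2 and PR@3 are both 1.0, so MAP@3 is 2.5/3, and MRR is (1 + 1/2)/2 = 0.75.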
Baselines
Causal-graph-based RCA methods can provide deeper insights into system failures, so all baseline methods fall into this category.
 PC: A classic constraint-based causal discovery algorithm that identifies the causal graph’s skeleton using independence tests.
 Dynotears: Constructs dynamic Bayesian networks through vector autoregression models.
 C-LSTM: Utilizes LSTM to model temporal dependencies and capture nonlinear Granger causality.
 GOLEM: Relaxes the hard Directed Acyclic Graph (DAG) constraint of NOTEARS with a scoring function.
 REASON: An interdependent network model that learns both intra-level and inter-level causal relationships.
 Nezha: A multi-modal method designed to identify root causes by detecting abnormal patterns.
 MULAN: A multi-modal RCA method that learns the correlation between different modalities and co-constructs a causal graph for root cause identification.
 CORAL: An online single-modal RCA method based on incremental disentangled causal graph learning.