A practical guide to the OTel GenAI Semantic Conventions, for engineers.

The Problem: Agents Made Trajectories Complicated

Two years ago, instrumenting an LLM app was simple: one round-trip, prompt in, completion out.

Today is different. A single agent task often contains a dozen tool calls, the creation and activation of several sub-agents, cross-agent handoffs, and even parallel branches or recursion. Capturing the full trajectory now takes more engineering than a traditional web trace.

The result is a mess. Every framework has its own trace format; every backend has its own extension fields. Switching frameworks means rewriting instrumentation; switching backends means rewriting exporters; comparing data across sources means writing a pile of adapters.

The OpenTelemetry GenAI Semantic Conventions (OTel GenAI for short) were created to clean up this mess. Maintained by the GenAI Observability SIG (founded April 2024) under CNCF, they are the closest thing to a de facto standard we have today.

The spec is still in Development status: field names may still shift, but the skeleton is essentially stable. Datadog, LangChain, AutoGen, CrewAI, OpenAI Agents SDK, and MLflow have all aligned in practice. Entry point: GenAI semconv homepage

A Standardized Skeleton: Five Span Types

The first thing OTel GenAI gives you is a standard skeleton for agent trajectories. It abstracts all execution into five span types, distinguished by the field gen_ai.operation.name:

  • invoke_workflow: the root of a multi-agent collaboration
  • invoke_agent: a single activation of one agent
  • create_agent: agent instantiation
  • chat: a single LLM call
  • execute_tool: a single tool call

These spans link up via parent-child relations to form a call tree: as data, a trajectory is a span tree.

invoke_workflow (root)
├── invoke_agent[Planner]
│   ├── chat
│   └── execute_tool[search]
├── invoke_agent[Coder]
│   ├── chat
│   ├── execute_tool[write_file]
│   └── execute_tool[run_tests]
└── invoke_agent[Reviewer]
    └── chat

OTel does not explicitly model agent-to-agent handoff. When Planner hands control to Coder, in the data it just looks like two sibling child spans; figuring out “how the previous agent’s output became the next agent’s input” requires inspecting message content. This is a pragmatic trade-off: it avoids assuming any particular framework’s DAG abstraction, at the cost of leaving cross-agent causal analysis to you.
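
To make the span-tree idea concrete, here is a minimal sketch that rebuilds a tree like the one above from flat span records, then pairs sibling invoke_agent spans in start order as candidate handoffs. The record layout (span_id, parent_id, start, op, label) is a simplified stand-in I made up for illustration, not the SDK's own types. Ordering siblings by start time is only a heuristic, since the conventions do not record handoff edges.

```python
from collections import defaultdict

# Flat span records, roughly as an exporter might hand them over.
# Field names are illustrative assumptions, not an OTel wire format.
SPANS = [
    {"span_id": "1", "parent_id": None, "start": 0, "op": "invoke_workflow", "label": ""},
    {"span_id": "2", "parent_id": "1", "start": 1, "op": "invoke_agent", "label": "Planner"},
    {"span_id": "3", "parent_id": "2", "start": 2, "op": "chat", "label": ""},
    {"span_id": "4", "parent_id": "2", "start": 3, "op": "execute_tool", "label": "search"},
    {"span_id": "5", "parent_id": "1", "start": 4, "op": "invoke_agent", "label": "Coder"},
    {"span_id": "6", "parent_id": "5", "start": 5, "op": "chat", "label": ""},
    {"span_id": "7", "parent_id": "1", "start": 6, "op": "invoke_agent", "label": "Reviewer"},
]


def build_children(spans):
    """Index spans by parent id, with each sibling list sorted by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start"])
    return children


def render(children, parent_id=None, depth=0):
    """Return the tree as indented lines, depth-first."""
    lines = []
    for span in children.get(parent_id, []):
        label = f"[{span['label']}]" if span["label"] else ""
        lines.append("  " * depth + span["op"] + label)
        lines.extend(render(children, span["span_id"], depth + 1))
    return lines


def candidate_handoffs(children, workflow_id):
    """Pair consecutive sibling agents; a heuristic, not a recorded edge."""
    agents = [
        s["label"] for s in children.get(workflow_id, []) if s["op"] == "invoke_agent"
    ]
    return list(zip(agents, agents[1:]))


children = build_children(SPANS)
tree_lines = render(children)
handoffs = candidate_handoffs(children, "1")
print("\n".join(tree_lines))
print(handoffs)
```

Confirming that Planner's output really fed Coder's input still means reading the message content, as noted above.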

A Standardized Vocabulary: Common Attributes

The skeleton determines the shape of a trajectory. What goes inside each span node is the job of a shared set of fields.

You have already met one: gen_ai.operation.name, the field used to distinguish the five span types in the previous section. It belongs to the broader gen_ai.* namespace, which uniformly describes four kinds of information across providers:

  • Request parameters: model, provider, temperature, max_tokens, etc.
  • Response content: input / output messages, finish_reasons, response id
  • Usage: input / output token counts
  • Errors: error.type, hooking into OTel’s standard error namespace

Agent-related fields (gen_ai.agent.name / id / description / version) live on invoke_agent; tool-related fields (gen_ai.tool.name, etc.) live on execute_tool.
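
As an illustration of what a shared vocabulary buys you, the sketch below totals token usage per model and finds truncated responses using the same code for two providers. The attribute keys are the gen_ai.* names as I understand the current conventions. The sample values, the model names, and both helper functions are hypothetical.

```python
from collections import Counter

# Attribute sets as they might appear on chat spans from two providers.
# Values and model names are made up for illustration.
CHAT_SPANS = [
    {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "openai",
        "gen_ai.request.model": "model-a",
        "gen_ai.request.temperature": 0.2,
        "gen_ai.response.finish_reasons": ["stop"],
        "gen_ai.usage.input_tokens": 120,
        "gen_ai.usage.output_tokens": 30,
    },
    {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "anthropic",
        "gen_ai.request.model": "model-b",
        "gen_ai.request.max_tokens": 512,
        "gen_ai.response.finish_reasons": ["length"],
        "gen_ai.usage.input_tokens": 200,
        "gen_ai.usage.output_tokens": 512,
    },
]


def usage_by_model(spans):
    """Sum input and output tokens per requested model."""
    totals = Counter()
    for attrs in spans:
        if attrs.get("gen_ai.operation.name") != "chat":
            continue
        model = attrs.get("gen_ai.request.model", "unknown")
        totals[model] += attrs.get("gen_ai.usage.input_tokens", 0)
        totals[model] += attrs.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)


def truncated_models(spans):
    """Models whose response stopped because it hit the length limit."""
    return [
        attrs.get("gen_ai.request.model", "unknown")
        for attrs in spans
        if "length" in attrs.get("gen_ai.response.finish_reasons", [])
    ]


totals = usage_by_model(CHAT_SPANS)
truncated = truncated_models(CHAT_SPANS)
print(totals, truncated)
```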

What this directly enables: anomaly detection can simply read error.type or finish_reasons; agent attribution can walk up the parent chain to the nearest invoke_agent and read agent.name. There is no need to design your own fields.
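
The parent-chain walk described above can be sketched in a few lines. The span records here are plain dictionaries keyed by a made-up span_id, a simplification for illustration rather than the SDK's span type. The attribute keys are the gen_ai.* and error.type names mentioned in this section.

```python
# Simplified span store: span_id -> {"parent_id", "attributes"}.
# The layout is an illustrative assumption, not an OTel data structure.
SPANS_BY_ID = {
    "wf": {"parent_id": None, "attributes": {"gen_ai.operation.name": "invoke_workflow"}},
    "a1": {
        "parent_id": "wf",
        "attributes": {
            "gen_ai.operation.name": "invoke_agent",
            "gen_ai.agent.name": "Coder",
        },
    },
    "t1": {
        "parent_id": "a1",
        "attributes": {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": "run_tests",
            "error.type": "timeout",
        },
    },
}


def nearest_agent(span_id, spans_by_id):
    """Walk parent links until an invoke_agent span is found."""
    current = spans_by_id.get(span_id)
    while current is not None:
        attrs = current["attributes"]
        if attrs.get("gen_ai.operation.name") == "invoke_agent":
            return attrs.get("gen_ai.agent.name")
        current = spans_by_id.get(current["parent_id"])
    return None  # no agent ancestor, e.g. the workflow root


def failed_spans(spans_by_id):
    """List (span_id, error.type, owning agent) for every span that failed."""
    return [
        (span_id, span["attributes"]["error.type"], nearest_agent(span_id, spans_by_id))
        for span_id, span in spans_by_id.items()
        if "error.type" in span["attributes"]
    ]


failures = failed_spans(SPANS_BY_ID)
print(failures)
```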

An Easily Overlooked Capability: Event vs Attribute

The spec recommends that input / output messages go through an Event. Here “Event” isn’t the everyday word; it’s a specific concept in OTel. Getting this straight matters, because it directly determines how you manage PII in message bodies.

A span is not just a log line; think of it as a small dossier covering a time interval. Two kinds of things can be attached to that dossier (simplified):

  • Attribute: a label on the dossier’s cover. Once written, it travels with the span and can’t be switched off separately.
  • Event: a timestamped sticky note inside the dossier. It can be filtered out independently before transport.

For LLM applications, message bodies are a real problem: they’re long, they may contain PII, and you’d like to be able to turn them off without touching business code. OTel therefore recommends putting messages on Events rather than Attributes. That way one line of SDK / Processor configuration toggles message capture, and your business code stays untouched.

This is OTel GenAI’s built-in answer to privacy management at the spec level, with no need to wait for backend redaction.
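
A minimal sketch of the toggle, using a plain dictionary as a stand-in for a span: metadata always goes on attributes, while the message body is attached as an event only when capture is switched on. The environment variable name is one that some GenAI instrumentations use for this purpose. Treat it, the event name, and the record_chat helper as assumptions, and check what your instrumentation actually reads.

```python
import os

# Variable some GenAI instrumentations use to gate content capture.
# The exact name is an assumption here; confirm it for your instrumentation.
CAPTURE_ENV = "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"


def capture_enabled(env=None):
    """Message capture stays off unless explicitly enabled."""
    env = os.environ if env is None else env
    return env.get(CAPTURE_ENV, "false").strip().lower() == "true"


def record_chat(model, prompt, env=None):
    """Build a simplified chat span: metadata as attributes, body as an event."""
    span = {
        "attributes": {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": model,
        },
        "events": [],
    }
    if capture_enabled(env):
        # Event name and payload shape are illustrative only.
        span["events"].append({"name": "example.message", "body": prompt})
    return span


prompt = "Call me back on 555-0100"
with_content = record_chat("model-a", prompt, env={CAPTURE_ENV: "true"})
without_content = record_chat("model-a", prompt, env={})
print(len(with_content["events"]), len(without_content["events"]))
```

The calling code is identical in both cases; only configuration decides whether the message body leaves the process.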

Collection: Three Layers

Collecting OTel GenAI data happens at three layers, and in practice you almost never need to write instrumentation from scratch.

Layer 1: native framework emission, the ideal case. The major frameworks of 2025-2026 (OpenAI Agents SDK, AutoGen, CrewAI, LangChain / LangGraph, LlamaIndex, MLflow) have all aligned in practice. Once you adopt one of these frameworks, all you need to do is set the environment variable OTEL_EXPORTER_OTLP_ENDPOINT; business code needs no changes.

Layer 2: SDK auto-instrumentation, for low-level LLM clients:

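# Assumes the opentelemetry-instrumentation-openai package is installed
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# Patch the OpenAI client once, at application startup
OpenAIInstrumentor().instrument()
# all openai.chat.completions.create() calls now auto-emit a chat span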

Layer 3: manual instrumentation, to cover custom logic that frameworks / SDKs miss.

Transport: Push Model and Collector

OTel uses a native push model: the SDK pushes spans out as they are produced rather than waiting for a backend to pull them.

For real-time analysis, monitoring, or anomaly detection, the key is the two hooks that SpanProcessor provides, on_start and on_end:

from opentelemetry.sdk.trace import SpanProcessor

class MyStreamingProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None): ...   # span has just started
    def on_end(self, span): ...                          # span has just ended (read-only here)

Each span gives you two chances to intervene, which means you can begin downstream processing before the trajectory has finished. This is what sets OTel apart from “export the full trace, then batch process” approaches.

Collector: A Programmable Middle Layer

The Collector is a programmable middle layer between the SDK and the backend. A single instrumentation setup can fan out to multiple destinations simultaneously:

                       ┌─> file (JSONL, offline archive)
SDK ─OTLP─> Collector ─┼─> Kafka / Redis (real-time analysis)
                       └─> commercial backend (debug visualization)

Business code emits once, and the SDK only speaks OTLP; which backend the data lands in, or which queue it is pushed to, is purely a deployment-time decision made in Collector configuration.

Current State and Recommendations

Coverage today:

  • LLM call portion is essentially stable; industrial implementations are everywhere
  • Agent portion is still marked Experimental, but the skeleton has been adopted in practice by the major frameworks
  • MCP sub-spec is a newer addition; its key function is compatibility with execute_tool spans, avoiding double recording

Open issues not yet fully closed:

  • Agent-to-agent handoff edges can only be inferred from message content
  • Provider-specific features (extended thinking, prompt caching strategies, etc.) are not fully covered
  • No timeline yet for Stable status

Practical recommendations:

  • Start using it now; route messages via Event for easy PII management
  • Set OTEL_SEMCONV_STABILITY_OPT_IN to control the spec version; don’t let instrumentation auto-jump versions
  • Put custom fields in a new namespace; don’t pollute gen_ai.*
  • If the spec is missing a field you need, file an issue โ€” that’s more valuable than quietly extending it. The SIG is actively iterating.
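
Two of these recommendations are easy to enforce in code. The sketch below pins the convention version through the environment and merges custom fields under a separate namespace, refusing anything that tries to write into gen_ai.*. The opt-in value gen_ai_latest_experimental is the one I believe the spec documents, so verify it against the version you target. The acme.agent namespace and the with_custom_fields helper are hypothetical.

```python
import os

# Pin the GenAI convention version instead of following instrumentation defaults.
# The value is an assumption; check the spec for the version you target.
os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "gen_ai_latest_experimental")

CUSTOM_NAMESPACE = "acme.agent"  # hypothetical organisation-specific namespace
RESERVED_PREFIX = "gen_ai."


def with_custom_fields(standard_attrs, custom_fields):
    """Return standard attributes plus custom ones under CUSTOM_NAMESPACE."""
    merged = dict(standard_attrs)
    for name, value in custom_fields.items():
        if name.startswith(RESERVED_PREFIX):
            raise ValueError(f"{name!r} belongs to the reserved gen_ai.* namespace")
        merged[f"{CUSTOM_NAMESPACE}.{name}"] = value
    return merged


attrs = with_custom_fields(
    {"gen_ai.operation.name": "invoke_agent"},
    {"retry_budget": 3},
)
print(attrs)
```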

Extension: MCP, No Double Recording for Tool Calls

Any discussion of GenAI tool calling eventually hits MCP (Model Context Protocol), an open protocol Anthropic proposed in 2024. Think of it as USB-C for LLM tool calls: whether it’s an IDE, a database, or an API, anything that speaks MCP plugs straight into an LLM app.

Inside OTel GenAI, the MCP sub-spec is functionally solving one critical problem: don’t record the same thing twice.

Imagine: you’re running an agent in LangChain, and the agent invokes a tool via MCP. LangChain itself emits an execute_tool span. If the MCP library also emits its own MCP Client span, the trajectory ends up with two nested copies of the same tool call: anomaly detectors fire false positives, call counts double, cost attribution drifts.

The MCP sub-spec spells it out:

If an outer GenAI framework is already tracing tool execution, the MCP framework should add MCP attributes to the existing span rather than open a new one.

One tool call, one span, all the info you need. This looks unremarkable but is functionally crucial: it directly determines trajectory clarity and prevents observability data from being skewed.
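
The enrich-or-create rule can be sketched with plain dictionaries standing in for spans. If the active span is already an execute_tool span for the same tool, MCP attributes are merged into it; otherwise a new span is opened. The record_mcp_tool_call helper and the span layout are hypothetical, and mcp.method.name is the attribute name as I understand the MCP conventions, so treat it as an assumption.

```python
def record_mcp_tool_call(active_span, tool_name, mcp_attributes, emitted):
    """Enrich the active execute_tool span if one exists; otherwise open a new span."""
    if active_span is not None:
        attrs = active_span["attributes"]
        same_call = (
            attrs.get("gen_ai.operation.name") == "execute_tool"
            and attrs.get("gen_ai.tool.name") == tool_name
        )
        if same_call:
            attrs.update(mcp_attributes)  # one tool call, one span
            return active_span
    span = {
        "name": f"execute_tool {tool_name}",
        "attributes": {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": tool_name,
            **mcp_attributes,
        },
    }
    emitted.append(span)
    return span


MCP_ATTRS = {"mcp.method.name": "tools/call"}  # attribute name is an assumption

# Case 1: the framework already opened an execute_tool span for this call.
framework_span = {
    "name": "execute_tool search",
    "attributes": {"gen_ai.operation.name": "execute_tool", "gen_ai.tool.name": "search"},
}
with_framework = [framework_span]
record_mcp_tool_call(framework_span, "search", MCP_ATTRS, with_framework)

# Case 2: no outer framework span, so the MCP layer opens its own.
standalone = []
record_mcp_tool_call(None, "search", MCP_ATTRS, standalone)

print(len(with_framework), len(standalone))
```

Either way the trajectory ends up with exactly one span per tool call, which keeps call counts and cost attribution consistent.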

Extension: Anthropic vs OpenAI, Two Provider Styles

The common gen_ai.* namespace can only cover features that every provider has. The moment a vendor introduces something unique, the common conventions fall short. The SIG’s response is provider-specific sub-specs.

Anthropic and OpenAI happen to represent two contrasting styles, worth comparing.

Anthropic’s restraint: the Anthropic sub-spec introduces almost no proprietary new fields. It just specifies gen_ai.provider.name="anthropic" and how prompt-cache hit / write tokens should be expressed using common fields. The reasoning: caching isn’t Anthropic-only (OpenAI does automatic cache, Gemini has context cache), so put it in the common namespace where other providers can reuse it later.

OpenAI’s expansion: the OpenAI sub-spec opens its own openai.* namespace outright, defining fields like service tier, system fingerprint, and API variant (chat completions vs responses). These are tied to OpenAI’s commercial model and deployment mechanics; they can’t be generalized, so they have to be provider-specific.

Takeaway for engineers: if you’re writing your own instrumentation, first ask whether the field you want to add is really “only this provider has it.” If it can be generalized, don’t put it in a vendor namespace. Otherwise, three years from now, when someone else defines the same field in the common namespace, yours becomes the legacy one.

Summary

OTel GenAI delivers four things at the capability level:

  1. A standardized execution skeleton: five span types abstract a trajectory into a span tree
  2. A standardized vocabulary of fields: gen_ai.* describes request / response / usage / error consistently across providers
  3. PII management for message bodies: the Event vs Attribute distinction lets you toggle PII at the SDK layer
  4. Unified tool-call recording: the MCP sub-spec requires attaching to an existing execute_tool span, preventing double recording

It is not perfect: still in Development, agent spans still Experimental, handoff-edge modeling still contested. But it is the most realistic option we have today: CNCF-neutral, ecosystem active, major frameworks already aligned. Aligning your data schema with gen_ai.* now is a low-risk, high-reward decision.

Further reading: