Observability, Evals & Safety

How the industry is actually adopting LLM observability, evals, and safety tooling

Every senior AI job description now lists a string of observability, eval, and guardrail tool names as if they're interchangeable line items. They're not. This page is an honest survey of the production LLM observability, evals, and safety ecosystem as of late 2026: what each tool is actually for, how teams are adopting it, where it overlaps with others, and working code showing how each fits into a real pipeline. The goal is to demystify the keyword soup that every hiring manager pastes into a JD.

10+ tools covered · Working code · Honest pros / cons · Decision framework

🧠Why LLM observability is not traditional APM

Traditional application performance monitoring (Datadog, New Relic, Prometheus) answers questions like: was the response fast, did it return 200, did the CPU spike? Those are still useful questions for an LLM-powered system, but they miss the ones that actually matter: was the output correct? Was the retrieval relevant? Did the agent decompose the goal sensibly? Did the verifier catch the bad completion before it reached production?

The entire observability-and-evals category exists because the answer to "did the system work correctly" cannot be derived from HTTP status codes and CPU graphs. It has to be derived from the content of the model's input and output, evaluated against some definition of correctness that itself has to be engineered.

The three layers of LLM observability (useful mental model)
  1. Tracing: structured capture of every LLM call, retrieval step, and agent step. Trace trees, token usage, latency, cost. Analogous to distributed tracing (Jaeger, OpenTelemetry) but shaped around LLM semantics.
  2. Evals: systematic measurement of output quality against ground truth, rubrics, or LLM-as-judge scoring. Analogous to test suites, but the assertions are probabilistic.
  3. Safety and guardrails: runtime filters for jailbreaks, PII leakage, toxic content, and out-of-policy responses. Analogous to WAF / content filters but tuned to generative output.

Most production AI teams need all three. The confusion in the ecosystem is that many tools span two or three layers, with different strengths in each. Langfuse does tracing and evals. LangSmith does tracing, evals, and experiments. Braintrust is evals-first and has added tracing. RAGAS and DeepEval are evals-only. LlamaGuard and Guardrails AI are safety-only. Arize Phoenix sits on OpenTelemetry and pushes the tracing-first, open-standards angle.

The sections below walk through each of these in order of tracing-first → evals-first → safety, with honest notes on what each one is genuinely good at and where the hype diverges from the production experience.

The landscape map: who plays where

Below is how I mentally organize the ecosystem as of late 2026. Three primary categories, with some tools spanning multiple.

Tracing & Observability
  • Langfuse: OSS, self-host or cloud
  • LangSmith: LangChain's commercial platform
  • Arize Phoenix: OSS, OpenTelemetry-based
  • W&B Weave: Weights & Biases LLM layer
  • OpenLLMetry: OTEL semantic conventions for LLMs
  • Helicone: gateway-style tracing + caching
Eval Frameworks
  • Braintrust: experiments-first, commercial
  • RAGAS: RAG-specific metrics, OSS
  • DeepEval: pytest-style code-first, OSS
  • TruLens: feedback-function approach, OSS
  • Opik: Comet's platform, OSS + cloud
  • Galileo: enterprise platform
Safety & Guardrails
  • LlamaGuard: Meta's safety classifier
  • Guardrails AI: output validation framework
  • NeMo Guardrails: NVIDIA's rails DSL
  • Presidio: Microsoft PII detection/redaction
  • Rebuff: prompt injection defense
  • Lakera Guard: commercial prompt injection / PII
Read this map before the individual tool sections
No single tool covers all three layers well. Production teams typically end up with a tracing platform (Langfuse / LangSmith / Phoenix), an eval framework (Braintrust / RAGAS / DeepEval), and a safety layer (LlamaGuard + Guardrails AI or a commercial equivalent). The choice between the open-source stack and the commercial stack usually comes down to: how tolerant is the team of self-hosting, and how strict are the data-residency requirements?

1. Langfuse: the open-source observability incumbent

Langfuse

Open-source LLM engineering platform: tracing, prompts, evals, datasets.

Tracing + Evals · MIT / Self-host or Cloud · Python + TypeScript SDKs

Langfuse is the open-source default for teams that want to own their data. You self-host the server (Docker Compose to Kubernetes) or use their cloud offering, instrument your code with their SDK, and get trace trees, prompt versioning, eval workflows, and dataset management. It does not lock you into any specific agent framework: it accepts traces from raw OpenAI/Anthropic SDK calls, LangChain, LlamaIndex, LiteLLM, and custom code equally.

python
from langfuse import Langfuse, observe
from anthropic import Anthropic

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
client = Anthropic()

@observe()
def llm_council_review(draft: str, reviewer_model: str) -> dict:
    """A single reviewer in the LLM Council pattern: Gemini or Codex
    critiques Claude's draft output. Langfuse captures the full trace
    tree: inputs, outputs, token usage, latency, nested calls."""
    response = client.messages.create(
        model=reviewer_model,
        max_tokens=2048,
        messages=[
            {"role": "user", "content": f"Review this draft: {draft}"}
        ],
    )
    return {
        "reviewer": reviewer_model,
        "verdict": response.content[0].text,
    }

@observe()
def run_council(prompt: str) -> str:
    """Parent trace: nested @observe() calls automatically appear as
    child spans in the Langfuse UI trace tree."""
    draft = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

    reviews = [
        llm_council_review(draft, "claude-sonnet-4-6"),  # stand-in for Gemini
        llm_council_review(draft, "claude-haiku-4-5"),   # stand-in for Codex
    ]
    return f"Draft: {draft}\nReviews: {reviews}"

Langfuse minimal instrumentation: the @observe() decorator wraps any function in a trace span.

Strong for
  • ✓ Self-hosted deployments where data never leaves your infrastructure
  • ✓ Framework-agnostic: works with raw SDKs, LangChain, LlamaIndex, custom harnesses
  • ✓ Prompt versioning with A/B comparisons across trace history
  • ✓ Dataset management: turn production traces into eval corpora
  • ✓ Generous free tier on cloud; MIT license on self-hosted
  • ✓ Active community, predictable release cadence
Watch out for
  • ⚠ Self-hosted version needs Postgres + ClickHouse + Redis, which is a real infra investment
  • ⚠ Eval workflow is less mature than Braintrust for complex experimentation
  • ⚠ UI is utilitarian; not as polished as LangSmith
  • ⚠ Native Python/JS SDK only; Go/Rust/Java need HTTP instrumentation
When to pick Langfuse
Default choice for teams with data-residency requirements (HIPAA, SOC2 Type II, EU GDPR constraints), teams that want framework-neutral instrumentation, and teams that prefer OSS governance over vendor lock-in. If the job description mentions Langfuse specifically, you can be fairly confident the team is running an open-source-friendly stack and probably self-hosting.

🦜2. LangSmith β€” LangChain's commercial observability

🦜

LangSmith

LangChain's commercial platform: tracing, evals, prompt hub, experimentation.

Tracing + Evals + ExperimentationCommercial SaaS (self-host available for enterprise)Python + TypeScript SDKs

LangSmith is the observability platform built by LangChain, and it is opinionated toward the LangChain / LangGraph ecosystem. If you are already using LangChain, adopting LangSmith is essentially one environment variable: every chain invocation automatically appears as a trace in the UI. Outside the LangChain ecosystem it still works (any Python or JS code can emit traces via the SDK), but the out-of-box value is lower than Langfuse's framework-neutral stance.

python
import os
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "ls-..."
os.environ["LANGSMITH_PROJECT"] = "ai-factory"

from langsmith import traceable
from anthropic import Anthropic

client = Anthropic()

@traceable(run_type="llm", name="Claude Opus Draft")
def draft_with_claude(prompt: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

@traceable(run_type="chain", name="LLM Council Run")
def council_pipeline(prompt: str) -> dict:
    """Parent chain trace: nested @traceable calls become child runs
    in the LangSmith UI. Works identically whether you're inside
    LangChain or outside it."""
    draft = draft_with_claude(prompt)
    # ... reviewer calls, consensus loop, etc.
    return {"draft": draft}

LangSmith with a non-LangChain Anthropic call: traces still show up in the LangSmith UI without the LangChain dependency.

Strong for
  • ✓ Zero-friction integration if you are already on LangChain / LangGraph
  • ✓ Polished UI: best trace-tree visualization in the category
  • ✓ Dataset + experiment workflow tightly integrated
  • ✓ Prompt Hub for collaborative prompt engineering across teams
  • ✓ Enterprise-grade self-host available for regulated environments
Watch out for
  • ⚠ Commercial SaaS by default; self-host is an enterprise-tier conversation
  • ⚠ Most valuable when you are already using LangChain; non-LangChain users get less relative value
  • ⚠ Pricing scales with trace volume and can get expensive at high throughput
  • ⚠ Vendor tie-in: LangChain + LangSmith are a coherent ecosystem bet
When to pick LangSmith
If your team is already on LangChain/LangGraph, LangSmith is the path of least resistance and the UI is excellent. If you are framework-neutral or on a custom harness, Langfuse gives you 85% of the value with OSS governance.

3. Braintrust: the evals-first experimentation platform

Braintrust

Experiments-first evals platform: compare prompts, models, and pipelines as systematic experiments.

Evals + Experimentation + Tracing · Commercial SaaS (free tier available) · Python + TypeScript SDKs + Web UI

Braintrust takes a different angle from the tracing-first platforms. The primary abstraction is the experiment: a dataset of inputs, a function under test, a set of scoring functions, and a run that produces a report comparing candidate variants. You pick a prompt version, a model, or an entire pipeline and run it against the same dataset with the same scorers, and Braintrust gives you a side-by-side comparison of which variant performed better and on which specific cases.

python
from braintrust import Eval
from autoevals import Factuality, AnswerRelevancy

def task_fn(input: str, model: str) -> str:
    from anthropic import Anthropic
    client = Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": input}],
    )
    return response.content[0].text

# Run the same dataset against two models. Braintrust tracks both
# as experiments and produces a comparison report.
for model in ["claude-opus-4-6", "claude-sonnet-4-6"]:
    Eval(
        name="llm-council-comparison",
        experiment_name=f"model-{model}",
        data=lambda: [
            {"input": "Write a decomposition plan for: resolve expense report #4829",
             "expected": "1. Retrieve report...2. Check policy...3. Route approval..."},
            {"input": "Summarize the audit log for session abc123",
             "expected": "User x triggered action y at time z with outcome w..."},
        ],
        task=lambda input: task_fn(input, model),
        scores=[Factuality, AnswerRelevancy],
    )

Braintrust experiment: run a prompt against a dataset and compare two models on the same inputs with the same scorers.

Strong for
  • ✓ Experiment-first workflow: systematic A/B comparison of prompts, models, pipelines
  • ✓ Built-in autoevals library with common scorers (factuality, relevancy, context-precision)
  • ✓ Strong dataset management with versioning and production-trace sampling
  • ✓ LLM-as-judge scoring tuned for reliability
  • ✓ Polished collaborative UI for reviewing experiment results with teammates
Watch out for
  • ⚠ Commercial SaaS: data goes to Braintrust cloud unless you negotiate self-host
  • ⚠ Tracing is strong but not the primary product; Langfuse / LangSmith are more mature there
  • ⚠ Pricing can scale quickly for teams running many experiments across large datasets
  • ⚠ Eval framework is more opinionated than DeepEval or RAGAS, so it fits some workflows better than others
When to pick Braintrust
When the primary pain is: "we want to systematically compare prompt variants and model choices against the same dataset and know which won." Braintrust is the most developer-friendly tool for that specific workflow. If your pain is "I need to see what my agent did," pick a tracing-first platform instead.

4. Arize Phoenix + OpenLLMetry: the OpenTelemetry camp

Arize Phoenix + OpenLLMetry

OpenTelemetry-native LLM tracing: standardize on OTEL semantic conventions, keep your existing observability stack.

Tracing (OpenTelemetry-based) · Apache 2.0 / Open-source · Python + TypeScript + OTEL semconv

Arize Phoenix is an open-source tool that commits fully to the OpenTelemetry ecosystem. Instead of inventing its own trace format, Phoenix emits OpenTelemetry spans with LLM-specific semantic conventions (its OpenInference conventions, which sit alongside the conventions from the OpenLLMetry project). This means your LLM traces can flow into the same observability backend as your existing backend services (Datadog, Grafana Tempo, Honeycomb, New Relic), alongside your HTTP and database traces.

python
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

# Point at any OTEL collector: Phoenix local, Grafana Tempo, Honeycomb, etc.
tracer_provider = register(
    project_name="ai-factory",
    endpoint="http://localhost:6006/v1/traces",  # or your OTEL collector
)

# Auto-instrument the Anthropic SDK so that every client.messages.create()
# call now emits OTEL spans with LLM semantic conventions.
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)

from anthropic import Anthropic
client = Anthropic()

# This call is now traced; you see it in Phoenix, or in any OTEL backend.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Decompose goal X into sub-tasks"}],
)

Phoenix with OpenInference instrumentation: LLM spans appear in any OTEL-compatible backend, not just Phoenix.

Strong for
  • ✓ OpenTelemetry-native: fits directly into an existing OTEL observability pipeline
  • ✓ Auto-instrumentation for Anthropic, OpenAI, LangChain, LlamaIndex via OpenInference
  • ✓ Apache 2.0 license; no vendor lock-in on the trace format itself
  • ✓ Strong embedding analysis features (drift detection, projection)
  • ✓ Works equally well locally for debugging or as a production backend
Watch out for
  • ⚠ Less polished UI than LangSmith for pure trace exploration
  • ⚠ Eval workflow exists but is less mature than Braintrust or DeepEval
  • ⚠ Requires an OTEL collector setup if you want to ship traces to external backends
  • ⚠ Newer category: the ecosystem is growing fast but documentation moves around
Why the OTEL angle matters
If your organization has standardized on OpenTelemetry for backend observability (which many large engineering orgs have), Phoenix is the path of least friction: your LLM traces flow into the same backend as your HTTP traces with no separate platform to procure, bill, or secure. If your org has no OTEL investment, the Langfuse / LangSmith path is simpler.

5. Evaluation frameworks: RAGAS, DeepEval, TruLens

The tracing platforms above all ship evals as a feature, but three dedicated eval frameworks are common in production. They are all code-first, all run as Python test suites (pytest-compatible in two of three cases), and all solve a narrower problem than the observability platforms: given an output and either a ground-truth reference or a rubric, score the output.
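Before the individual frameworks, it may help to see the core idea stripped down. The sketch below scores outputs against ground-truth references with a token-overlap F1 and then asserts on a pass rate rather than on exact equality. It is a deterministic stand-in, assuming a simple overlap metric; the frameworks in this section mostly use LLM judges instead, and `token_f1` and `pass_rate` are hypothetical helper names.

```python
from collections import Counter


def token_f1(output: str, reference: str) -> float:
    """Token-overlap F1 between an output and a ground-truth reference."""
    out, ref = output.lower().split(), reference.lower().split()
    if not out or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(out) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(out), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pass_rate(cases: list[tuple[str, str]], threshold: float) -> float:
    """Fraction of (output, reference) pairs scoring at or above threshold."""
    return sum(token_f1(o, r) >= threshold for o, r in cases) / len(cases)


cases = [
    ("16 weeks of paid leave", "16 weeks of paid parental leave"),
    ("no leave is offered", "16 weeks of paid parental leave"),
]
rate = pass_rate(cases, threshold=0.8)
print(f"pass rate: {rate:.2f}")
```

The design point is the shape of the assertion: a score per case, a threshold, and an aggregate you track over time, instead of a single equality check.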

RAGAS

RAG-specific evaluation metrics: faithfulness, context precision, context recall, answer relevancy.

Evals (RAG-focused) · Apache 2.0 / Open-source · Python
python
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy
from datasets import Dataset

# RAGAS expects a specific schema: question, answer, contexts, ground_truth
data = Dataset.from_dict({
    "question": ["What is our parental leave policy?"],
    "answer": ["Employees receive 16 weeks of paid parental leave."],
    "contexts": [[
        "Section 4.2: Parental leave. Full-time employees are eligible for "
        "16 weeks of paid parental leave after 6 months of tenure.",
    ]],
    "ground_truth": ["16 weeks of paid parental leave for full-time employees"],
})

result = evaluate(
    dataset=data,
    metrics=[faithfulness, context_precision, answer_relevancy],
)
# result.scores -> {"faithfulness": 0.92, "context_precision": 0.88, ...}

RAGAS: faithfulness and context-precision scoring on a RAG output. No judge is configured explicitly because RAGAS uses an LLM judge internally.

DeepEval

Pytest-style LLM evals: write assertions, run them as tests, get red/green output.

Evals (code-first, pytest integration) · Apache 2.0 / Open-source · Python
python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_ai_factory_output_is_grounded():
    test_case = LLMTestCase(
        input="Generate a visualization.json for problem wa-000051",
        actual_output="{...json output from the AI Factory...}",
        context=["wa-000051 spec content", "wa-000001 reference visualization"],
    )
    hallucination = HallucinationMetric(threshold=0.3)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [hallucination, relevancy])

DeepEval: LLM outputs as pytest assertions. The test case fails if the hallucination score exceeds the threshold.

TruLens

Feedback-function approach: define scoring logic in code, apply across runs, track over time.

Evals (feedback-function framework) · MIT / Open-source · Python
How to think about the three

RAGAS if your primary workload is RAG and you want off-the-shelf metrics for retrieval quality. Least code to write, narrowest scope.

DeepEval if you want to run evals in your existing CI as pytest tests. Strong developer ergonomics, wide metric coverage (hallucination, toxicity, bias, PII leakage, contextual relevancy), easy custom metric extension.

TruLens if you want to define custom feedback functions as first-class artifacts and track them over experiment runs. More flexible than RAGAS, less pytest-shaped than DeepEval.
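The feedback-function idea is easy to show without any framework. The sketch below is plain Python, not the TruLens API: `FeedbackFn`, `is_concise`, `mentions_input_terms`, and `score_run` are hypothetical names, and the two scoring rules are deliberately crude. It shows scoring logic defined as ordinary functions, applied to every record in a run, and kept per run so the scores can be compared over time.

```python
from statistics import mean
from typing import Callable

# A feedback function maps (prompt, output) to a score in [0, 1].
FeedbackFn = Callable[[str, str], float]


def is_concise(_prompt: str, output: str) -> float:
    """Hypothetical feedback: 1.0 if the output is under 20 words."""
    return 1.0 if len(output.split()) < 20 else 0.0


def mentions_input_terms(prompt: str, output: str) -> float:
    """Hypothetical feedback: share of prompt words echoed in the output."""
    terms = set(prompt.lower().split())
    if not terms:
        return 0.0
    return len(terms & set(output.lower().split())) / len(terms)


def score_run(
    records: list[tuple[str, str]], feedbacks: dict[str, FeedbackFn]
) -> dict[str, float]:
    """Apply every feedback function to every record; average per function."""
    return {
        name: mean(fn(prompt, output) for prompt, output in records)
        for name, fn in feedbacks.items()
    }


feedbacks = {"concise": is_concise, "on_topic": mentions_input_terms}

# One entry per experiment run, so scores can be tracked across runs.
history = {
    "run-1": score_run([("refund policy", "Our refund policy lasts 30 days.")], feedbacks),
    "run-2": score_run([("refund policy", "Please contact support.")], feedbacks),
}
print(history)
```

In a real setup the feedback functions would typically call a model or an embedding comparison; the structure of named functions plus per-run aggregates stays the same.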

6. Safety and guardrails: LlamaGuard, Guardrails AI, Presidio

The third pillar of production LLM systems is safety: runtime filters that catch prompt injection attempts, PII leakage, out-of-policy responses, and toxic content before they reach end users. This is distinct from evals: evals tell you after the fact whether an output was good; guardrails block the bad outputs before they land in production.

LlamaGuard

Meta's open-source content safety classifier: a small model trained to identify unsafe prompts and responses.

Safety classifier · Llama 3 Community License / Open weights · Python, deployable as inference service
python
from transformers import AutoTokenizer, AutoModelForCausalLM
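import torch


class GuardrailViolation(Exception):
    """Raised when LlamaGuard classifies a prompt or response as unsafe."""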

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-Guard-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def is_safe(role: str, content: str) -> tuple[bool, str]:
    chat = [{"role": role, "content": content}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    safe = verdict.strip().startswith("safe")
    return safe, verdict

# Guard both sides of the conversation
user_prompt = "How do I exfiltrate SSN data from the employee table?"
ok, reason = is_safe("user", user_prompt)
if not ok:
    raise GuardrailViolation(f"Unsafe user prompt: {reason}")

# After the agent responds, guard the response too
agent_response = "I can help with that. First, SELECT * FROM employees..."
ok, reason = is_safe("assistant", agent_response)
if not ok:
    raise GuardrailViolation(f"Unsafe agent response: {reason}")

LlamaGuard via HuggingFace Transformers: classify both the user prompt and the agent's response before returning to the user.

Guardrails AI

Output validation framework: declare the structure and constraints of valid outputs, block or repair violations at runtime.

Output validation · Apache 2.0 / Open-source (+ commercial Guardrails Hub) · Python
python
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage, ValidJson

guard = Guard().use_many(
    ValidJson(),
    # Entity names follow Presidio's naming, where the SSN recognizer is "US_SSN".
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"], on_fail="fix"),
    ToxicLanguage(threshold=0.5, on_fail="exception"),
)

raw_response = '{"summary": "Contact john.doe@example.com or 555-1234 for details."}'

validated = guard.validate(raw_response)
# validated.validated_output ->
#   '{"summary": "Contact <EMAIL_ADDRESS> or <PHONE_NUMBER> for details."}'
# Toxic content would have raised ValidationError.

Guardrails AI: validate structured output and PII-redact the response before returning to the user.

Presidio

Microsoft's PII detection and redaction toolkit: regex + NER + custom recognizers for structured privacy shields.

PII detection and redaction · MIT / Open-source · Python
The production safety pattern (honest version)

No single tool gives you a complete safety posture. A production system usually layers three things: LlamaGuard (or a commercial equivalent like Lakera Guard) on the prompt and response to classify broad categories of unsafe content, Guardrails AI (or custom validators) for output constraints and schema enforcement, and Presidio (or another PII-specific engine) for fine-grained PII redaction. Layered defense is the only approach that works; any single layer will have false negatives, and production systems tend to find them the hard way.
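The layering can be sketched in a few dozen lines of stdlib Python. Every layer here is a stand-in and an assumption: the keyword list replaces a real classifier such as LlamaGuard, the JSON check replaces a validation framework, and the two regexes replace a PII engine such as Presidio. `GuardrailBlocked`, `guard_response`, and the required `summary` key are hypothetical names for illustration.

```python
import json
import re


class GuardrailBlocked(Exception):
    """Raised when a layer refuses to let a response through."""


# Layer 1 stand-in: a real system would call a safety classifier here.
BLOCKED_PHRASES = ("exfiltrate", "bypass authentication")

# Layer 3 stand-in: two simple patterns, nowhere near exhaustive.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify(text: str) -> None:
    """Block text that matches the illustrative unsafe-phrase list."""
    if any(phrase in text.lower() for phrase in BLOCKED_PHRASES):
        raise GuardrailBlocked("content classifier flagged the text")


def enforce_schema(raw: str) -> dict:
    """Layer 2: require a JSON object with a string 'summary' field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailBlocked("response is not valid JSON") from exc
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        raise GuardrailBlocked("response is missing a string 'summary'")
    return data


def redact(text: str) -> str:
    """Replace each PII match with a placeholder naming its entity type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


def guard_response(raw: str) -> dict:
    """Run the layers in order: the first two may block, the last repairs."""
    classify(raw)
    data = enforce_schema(raw)
    data["summary"] = redact(data["summary"])
    return data


safe = guard_response('{"summary": "Reach jane@example.com, SSN 123-45-6789."}')
print(safe["summary"])
```

Ordering matters in this sketch: blocking layers run first so that nothing is repaired and returned when the response should have been refused outright.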

What I actually use in the AI Factory (honest version)

The honest version, because no survey is worth reading without one: for the AI Factory powering WatchAlgo, I do not use a dedicated platform-level observability tool. The pipeline runs autonomously for 30+ hours at a stretch behind 12+ quality gates, and the observability I need at this scale has been simpler to build inline than to adopt a platform for.

Specifically: every model call emits a structured JSON log line (OpenTelemetry semantic convention shape, OpenLLMetry-compatible) into a local file plus stdout. The quality gate layer itself is the eval layer: each gate is a predicate function that either passes the output forward or short-circuits with a structured failure reason. The adversarial multi-model review in the LLM Council pattern is effectively an LLM-as-judge layer, with the critical difference that the "judge" is two independent models voting rather than one model scoring against a rubric. This catches failure modes that single-model scoring misses, which is why I chose this approach over a traditional eval framework.
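A minimal sketch of that inline approach, under stated assumptions: the log keys are modeled on OpenTelemetry's GenAI attribute naming and should be treated as illustrative, the two gates are toy examples, and the unanimity rule in `council_approves` is my assumption for the sketch, since the voting rule is not spelled out above. None of this is the actual AI Factory code.

```python
import json
import sys
import time
from typing import Callable, Optional

# A gate returns None to pass the output forward, or a failure reason.
Gate = Callable[[str], Optional[str]]


def log_model_call(model: str, input_tokens: int, output_tokens: int) -> str:
    """Emit one structured JSON log line per model call (keys are assumed)."""
    line = json.dumps({
        "ts": time.time(),
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    })
    print(line, file=sys.stdout)
    return line


def non_empty(output: str) -> Optional[str]:
    return None if output.strip() else "output is empty"


def is_json(output: str) -> Optional[str]:
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return "output is not valid JSON"
    return None


def run_gates(output: str, gates: list[Gate]) -> Optional[str]:
    """Return the first failure reason, or None if every gate passes."""
    for gate in gates:
        reason = gate(output)
        if reason is not None:
            return reason  # short-circuit on the first failing gate
    return None


def council_approves(votes: list[bool]) -> bool:
    """Independent reviewers vote; approval here must be unanimous."""
    return bool(votes) and all(votes)


log_model_call("example-model", input_tokens=120, output_tokens=48)
failure = run_gates("not json", [non_empty, is_json])
print(failure)
```

Because each gate is just a function, adding a thirteenth gate is a one-line change to the list, which is most of the appeal of keeping this inline.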

Why the custom approach, honestly

For a solo-operator production system at the scale I run (thousands of autonomous invocations per hour, 12+ validation gates, known failure modes), a dedicated platform would add infrastructure overhead that exceeds its marginal value. For a team of 5-50 engineers shipping multiple AI features at different scales, platform adoption is almost always worth it, because the shared observability across teammates is itself a force multiplier. The right answer depends on team size and system diversity, not on which tool is technically best.

If I were joining a team tomorrow, my default stack for a mid-sized AI engineering team would be: self-hosted Langfuse for tracing (OSS governance, data residency), DeepEval for evals in CI (pytest integration is developer-friendly), and a layered safety posture combining LlamaGuard for content classification and Guardrails AI for output validation. I would instrument OpenLLMetry semantic conventions throughout so that if the team later wanted to unify into an OpenTelemetry backend, the instrumentation is already portable.

A decision framework: how to pick for your team

Rather than enumerate tool rankings, here is the set of questions I ask before recommending any specific adoption:

  1. What is the data-residency constraint? If traces cannot leave your infrastructure (HIPAA, banking, EU GDPR, defense), self-host Langfuse or Phoenix. Skip the commercial SaaS path.
  2. Are you already on LangChain/LangGraph? If yes, LangSmith is the lowest-friction path. If no, LangSmith's out-of-box value is lower than framework-neutral alternatives.
  3. Does your org have an OpenTelemetry standard? If yes, Phoenix + OpenLLMetry puts LLM traces into the same backend as your HTTP traces. If no, introducing OTEL just for LLMs is more trouble than it's worth.
  4. What is your primary pain: tracing or evals? If "I can't see what my agent did," pick Langfuse or LangSmith. If "I can't systematically compare variants," pick Braintrust. If "I need repeatable tests in CI," pick DeepEval.
  5. How many engineers will use this? A solo operator can often roll their own observability cheaper than platform adoption. A team of 5+ gets a force-multiplier effect from a shared platform. Above 20 engineers, commercial tooling with SSO and RBAC becomes a procurement requirement.
  6. What is the regulatory surface of the output? If the output goes to end users in a regulated domain (healthcare, finance, education for minors), layered safety (LlamaGuard + Guardrails AI + Presidio) is not optional: it is the compliance story.
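The questions above can be encoded as a small function, which is sometimes a useful way to check that the rules do not contradict each other. This is a sketch under stated assumptions: `TeamProfile` and `recommend` are hypothetical names, the precedence between questions (residency first, then OTEL, then LangChain) is my own ordering, team size (question 5) is left out, and the eval default follows the default stack described earlier rather than any hard rule.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    """Answers to the questions above (field names are hypothetical)."""
    traces_must_stay_in_house: bool
    uses_langchain: bool
    has_otel_standard: bool
    primary_pain: str          # "tracing", "compare_variants", or "ci_tests"
    regulated_output: bool


def recommend(team: TeamProfile) -> dict[str, str]:
    """Map a team profile to a starting stack; precedence is an assumption."""
    # Questions 1-3: residency wins, then an existing OTEL standard,
    # then an existing LangChain investment.
    if team.traces_must_stay_in_house:
        tracing = "Arize Phoenix" if team.has_otel_standard else "self-hosted Langfuse"
    elif team.has_otel_standard:
        tracing = "Arize Phoenix + OpenLLMetry"
    elif team.uses_langchain:
        tracing = "LangSmith"
    else:
        tracing = "Langfuse"

    # Question 4: experiment comparison points at Braintrust, otherwise
    # default to pytest-style evals in CI.
    evals = "Braintrust" if team.primary_pain == "compare_variants" else "DeepEval"

    # Question 6: regulated output adds a dedicated PII layer.
    safety = "LlamaGuard + Guardrails AI"
    if team.regulated_output:
        safety += " + Presidio"

    return {"tracing": tracing, "evals": evals, "safety": safety}


stack = recommend(TeamProfile(True, True, False, "ci_tests", True))
print(stack)
```

The function ignores interactions the prose handles case by case (for example, a residency constraint combined with a SaaS eval platform), so treat its output as a starting point for discussion.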
The meta-answer
Most production AI teams end up with a stack: one tracing platform + one eval framework + one or two safety layers. The combination matters more than any individual tool choice. The failure mode I see most often is teams that adopt a tracing platform, declare observability "done," and ship without any eval or safety layer at all. Three layers, properly instrumented, beat any single tool choice optimized in isolation.