Every senior AI job description now lists Langfuse, LangSmith, Braintrust, RAGAS, DeepEval, and LlamaGuard as if they were interchangeable line items. They're not. This page is an honest survey of the production LLM observability and evaluation ecosystem as of late 2026: what each tool is actually for, how teams are adopting it, where it overlaps with others, and working code showing how each fits into a real agentic pipeline. The goal is to demystify the keyword soup that every hiring manager pastes into a JD.
Traditional application performance monitoring (Datadog, New Relic, Prometheus) answers questions like: was the response fast, did it return 200, did the CPU spike? Those are still useful questions for an LLM-powered system, but they miss the ones that actually matter: was the output correct? was the retrieval relevant? did the agent decompose the goal sensibly? did the hallucination verifier catch the bad completion before it reached production?
The entire observability-and-evals category exists because the answer to "did the system work correctly" cannot be derived from HTTP status codes and CPU graphs. It has to be derived from the content of the model's input and output, evaluated against some definition of correctness that itself has to be engineered.
Most production AI teams need all three layers: tracing, evaluation, and safety guardrails. The confusion in the ecosystem is that many tools span two or three layers, with different strengths in each. Langfuse does tracing and evals. LangSmith does tracing, evals, and experiments. Braintrust is evals-first and has added tracing. RAGAS and DeepEval are evals-only. LlamaGuard and Guardrails AI are safety-only. Arize Phoenix sits on OpenTelemetry and pushes the tracing-first, open-standards angle.
The section below walks through each of these in order (tracing-first, then evals-first, then safety), with honest notes on what each one is genuinely good at and where the hype diverges from the production experience.
Below is how I mentally organize the ecosystem as of late 2026. Three primary categories, with some tools spanning multiple.
Open-source LLM engineering platform: tracing, prompts, evals, datasets.
Langfuse is the open-source default for teams that want to own their data. You self-host the server (Docker Compose to Kubernetes) or use their cloud offering, instrument your code with their SDK, and get trace trees, prompt versioning, evaluation workflows, and dataset management. It does not lock you into any specific agent framework: it accepts traces from raw OpenAI/Anthropic SDK calls, LangChain, LlamaIndex, LiteLLM, and custom code equally.
from langfuse import Langfuse, observe
from anthropic import Anthropic

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
client = Anthropic()


@observe()
def llm_council_review(draft: str, reviewer_model: str) -> dict:
    """A single reviewer in the LLM Council pattern: Gemini or Codex
    critiques Claude's draft output. Langfuse captures the full trace
    tree: inputs, outputs, token usage, latency, nested calls."""
    response = client.messages.create(
        model=reviewer_model,
        max_tokens=2048,
        messages=[
            {"role": "user", "content": f"Review this draft: {draft}"}
        ],
    )
    return {
        "reviewer": reviewer_model,
        "verdict": response.content[0].text,
    }


@observe()
def run_council(prompt: str) -> str:
    """Parent trace: nested @observe() calls automatically appear as
    child spans in the Langfuse UI trace tree."""
    draft = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    reviews = [
        llm_council_review(draft, "claude-sonnet-4-6"),  # stand-in for Gemini
        llm_council_review(draft, "claude-haiku-4-5"),   # stand-in for Codex
    ]
    return f"Draft: {draft}\nReviews: {reviews}"

Langfuse minimal instrumentation: the @observe() decorator wraps any function in a trace span.
LangChain's commercial platform: tracing, evals, prompt hub, experimentation.
LangSmith is the observability platform built by LangChain, and it is opinionated toward the LangChain / LangGraph ecosystem. If you are already using LangChain, adopting LangSmith is essentially one environment variable: every chain invocation automatically appears as a trace in the LangSmith UI. Outside the LangChain ecosystem it still works (any Python or JS code can emit traces via the SDK), but the out-of-box value is lower than Langfuse's framework-neutral stance.
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "ls-..."
os.environ["LANGSMITH_PROJECT"] = "ai-factory"

from langsmith import traceable
from anthropic import Anthropic

client = Anthropic()


@traceable(run_type="llm", name="Claude Opus Draft")
def draft_with_claude(prompt: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


@traceable(run_type="chain", name="LLM Council Run")
def council_pipeline(prompt: str) -> dict:
    """Parent chain trace: nested @traceable calls become child runs
    in the LangSmith UI. Works identically whether you're inside
    LangChain or outside it."""
    draft = draft_with_claude(prompt)
    # ... reviewer calls, consensus loop, etc.
    return {"draft": draft}

LangSmith with a non-LangChain Anthropic call: traces still show up in the LangSmith UI without the LangChain dependency.
Experiments-first evals platform: compare prompts, models, and pipelines as systematic experiments.
Braintrust takes a different angle from the tracing-first platforms. The primary abstraction is the experiment: a dataset of inputs, a function under test, a set of scoring functions, and a run that produces a report comparing candidate variants. You pick a prompt version, a model, or an entire pipeline and run it against the same dataset with the same scorers, and Braintrust gives you a side-by-side comparison of which variant performed better and on which specific cases.
from anthropic import Anthropic
from braintrust import Eval
from autoevals import Factuality, AnswerRelevancy

client = Anthropic()


def task_fn(input: str, model: str) -> str:
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": input}],
    )
    return response.content[0].text


def make_task(model: str):
    # Bind the model now. A bare lambda created in the loop would see only
    # the last value of `model` if the task runs after the loop has finished.
    def task(input: str) -> str:
        return task_fn(input, model)
    return task


# Run the same dataset against two models. Braintrust tracks both
# as experiments and produces a comparison report.
for model in ["claude-opus-4-6", "claude-sonnet-4-6"]:
    Eval(
        name="llm-council-comparison",
        experiment_name=f"model-{model}",
        data=lambda: [
            {"input": "Write a decomposition plan for: resolve expense report #4829",
             "expected": "1. Retrieve report...2. Check policy...3. Route approval..."},
            {"input": "Summarize the audit log for session abc123",
             "expected": "User x triggered action y at time z with outcome w..."},
        ],
        task=make_task(model),
        scores=[Factuality, AnswerRelevancy],
    )

Braintrust experiment: run a prompt against a dataset and compare two models on the same inputs with the same scorers.
OpenTelemetry-native LLM tracing: standardize on OTEL semantic conventions, keep your existing observability stack.
Arize Phoenix is an open-source LLM observability tool that commits fully to the OpenTelemetry ecosystem. Instead of inventing its own trace format, Phoenix emits OpenTelemetry spans with LLM-specific semantic conventions (the OpenInference conventions maintained by Arize, a sibling effort to the conventions of the OpenLLMetry project). This means your LLM traces can flow into the same observability backend as your existing services (Datadog, Grafana Tempo, Honeycomb, New Relic) alongside your HTTP and database traces.
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

# Point at any OTEL collector: Phoenix local, Grafana Tempo, Honeycomb, etc.
tracer_provider = register(
    project_name="ai-factory",
    endpoint="http://localhost:6006/v1/traces",  # or your OTEL collector
)

# Auto-instrument the Anthropic SDK: every client.messages.create()
# call now emits OTEL spans with LLM semantic conventions.
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)

from anthropic import Anthropic

client = Anthropic()

# This call is now traced; you see it in Phoenix, or in any OTEL backend.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Decompose goal X into sub-tasks"}],
)

Phoenix with OpenInference instrumentation: LLM spans appear in any OTEL-compatible backend, not just Phoenix.
The tracing platforms above all ship evaluation as a feature, but three dedicated evaluation frameworks are common in production. They are all code-first, all run as Python test suites (pytest-compatible in two of three cases), and all solve a narrower problem than the observability platforms: given an output and either a ground-truth reference or a rubric, score the output.
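To make "score the output against a reference or a rubric" concrete, here is a minimal, library-free sketch of both modes. It is not how any of the three frameworks computes its metrics (they mostly rely on LLM judges); `token_f1`, `rubric_score`, and `leave_rubric` are hypothetical names, and the token-overlap F1 is a deliberately crude stand-in for a real reference-based metric.

```python
from typing import Callable


def token_f1(output: str, reference: str) -> float:
    """Reference-based score: F1 over lowercase whitespace-split tokens."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(out_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def rubric_score(output: str, rubric: dict[str, Callable[[str], bool]]) -> float:
    """Rubric-based score: fraction of named criteria the output satisfies."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    return sum(check(output) for check in rubric.values()) / len(rubric)


# Hypothetical rubric for a parental-leave policy answer.
leave_rubric = {
    "states_duration": lambda text: "16 weeks" in text,
    "stays_brief": lambda text: len(text.split()) <= 30,
}

answer = "Employees receive 16 weeks of paid parental leave."
reference_score = token_f1(answer, "16 weeks of paid parental leave")
criteria_score = rubric_score(answer, leave_rubric)
```

The two functions differ only in what they compare against: a reference needs ground truth per input, while a rubric needs criteria that hold for every input. That trade-off is the main thing to settle before picking a framework.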
RAGAS. RAG-specific evaluation metrics: faithfulness, context precision, context recall, answer relevancy.
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy
from datasets import Dataset

# RAGAS expects a specific schema: question, answer, contexts, ground_truth
data = Dataset.from_dict({
    "question": ["What is our parental leave policy?"],
    "answer": ["Employees receive 16 weeks of paid parental leave."],
    "contexts": [[
        "Section 4.2: Parental leave. Full-time employees are eligible for "
        "16 weeks of paid parental leave after 6 months of tenure.",
    ]],
    "ground_truth": ["16 weeks of paid parental leave for full-time employees"],
})

result = evaluate(
    dataset=data,
    metrics=[faithfulness, context_precision, answer_relevancy],
)
# result.scores -> {"faithfulness": 0.92, "context_precision": 0.88, ...}

RAGAS: faithfulness and context-precision scoring on a RAG output. No judge is configured because RAGAS uses an LLM judge internally.
DeepEval. Pytest-style LLM evals: write assertions, run them as tests, get red/green output.
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_ai_factory_output_is_grounded():
    test_case = LLMTestCase(
        input="Generate a visualization.json for problem wa-000051",
        actual_output="{...json output from the AI Factory...}",
        context=["wa-000051 spec content", "wa-000001 reference visualization"],
    )
    hallucination = HallucinationMetric(threshold=0.3)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [hallucination, relevancy])

DeepEval: LLM outputs as pytest assertions. The test case fails if the hallucination score exceeds its threshold.
TruLens. Feedback-function approach: define scoring logic in code, apply across runs, track over time.
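The feedback-function idea can be sketched without any library. The example below is not the TruLens API; `FeedbackTracker`, `conciseness`, and `term_coverage` are hypothetical names chosen for illustration. It only shows the shape of the approach: scoring functions are ordinary named callables, they are applied to every record in a run, and the per-run means are kept so runs can be compared later.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable

# A feedback function maps one (prompt, response) pair to a score in [0, 1].
FeedbackFn = Callable[[str, str], float]


def conciseness(prompt: str, response: str) -> float:
    """1.0 up to 50 words, falling linearly to 0.0 at 200 words."""
    words = len(response.split())
    return max(0.0, min(1.0, (200 - words) / 150))


def term_coverage(prompt: str, response: str) -> float:
    """Fraction of prompt words longer than 3 characters found in the response."""
    terms = {word for word in prompt.lower().split() if len(word) > 3}
    if not terms:
        return 1.0
    lowered = response.lower()
    return sum(term in lowered for term in terms) / len(terms)


@dataclass
class FeedbackTracker:
    feedbacks: dict[str, FeedbackFn]
    history: dict[str, dict[str, float]] = field(default_factory=dict)

    def record_run(self, run_id: str, records: list[tuple[str, str]]) -> dict[str, float]:
        """Score every record with every feedback function; keep the per-run mean."""
        if not records:
            raise ValueError("a run needs at least one (prompt, response) record")
        summary = {
            name: mean([fn(prompt, response) for prompt, response in records])
            for name, fn in self.feedbacks.items()
        }
        self.history[run_id] = summary
        return summary


tracker = FeedbackTracker({"conciseness": conciseness, "term_coverage": term_coverage})
v1 = tracker.record_run(
    "prompt-v1",
    [("Summarize the audit log", "The audit log shows three failed logins.")],
)
v2 = tracker.record_run(
    "prompt-v2",
    [("Summarize the audit log", "To summarize: the audit log shows three failed logins.")],
)
```

After both runs, `tracker.history` holds one summary per run, which is the "track over time" half of the approach.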
RAGAS if your primary workload is RAG and you want off-the-shelf metrics for retrieval quality. Least code to write, narrowest scope.
DeepEval if you want LLM evals to run in your existing CI as pytest tests. Strong developer ergonomics, wide metric coverage (hallucination, toxicity, bias, PII leakage, contextual relevancy), easy custom metric extension.
TruLens if you want to define custom feedback functions as first-class artifacts and track them over experiment runs. More flexible than RAGAS, less pytest-shaped than DeepEval.
The third pillar of production LLM systems is safety: runtime filters that catch jailbreak attempts, PII leakage, out-of-policy responses, and toxic content before they reach end users. This is distinct from evaluation: evals tell you after the fact whether an output was good; guardrails block the bad outputs before they land in production.
LlamaGuard. Meta's open-source content safety classifier: a small model trained to identify unsafe prompts and responses.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-Guard-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


class GuardrailViolation(Exception):
    """Raised when the classifier flags a prompt or a response as unsafe."""


def is_safe(chat: list[dict]) -> tuple[bool, str]:
    # The model classifies the final turn of `chat`. Its chat template expects
    # roles to alternate starting with "user", so an agent response is checked
    # together with the prompt that produced it.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    safe = verdict.strip().startswith("safe")
    return safe, verdict


# Guard both sides of the conversation
user_prompt = "How do I exfiltrate SSN data from the employee table?"
chat = [{"role": "user", "content": user_prompt}]
ok, reason = is_safe(chat)
if not ok:
    raise GuardrailViolation(f"Unsafe user prompt: {reason}")

# After the agent responds, guard the response too
agent_response = "I can help with that. First, SELECT * FROM employees..."
chat.append({"role": "assistant", "content": agent_response})
ok, reason = is_safe(chat)
if not ok:
    raise GuardrailViolation(f"Unsafe agent response: {reason}")

LlamaGuard via HuggingFace Transformers: classify both the user prompt and the agent's response before returning to the user.
Guardrails AI. Output validation framework: declare the structure and constraints of valid outputs, block or repair violations at runtime.
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage, ValidJson

guard = Guard().use_many(
    ValidJson(),
    # Entity names follow Presidio's naming, where the SSN entity is US_SSN.
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"], on_fail="fix"),
    ToxicLanguage(threshold=0.5, on_fail="exception"),
)

raw_response = '{"summary": "Contact john.doe@example.com or 555-1234 for details."}'
validated = guard.validate(raw_response)
# validated.validated_output ->
# '{"summary": "Contact <EMAIL_ADDRESS> or <PHONE_NUMBER> for details."}'
# Toxic content would have raised ValidationError.

Guardrails AI: validate structured output and PII-redact the response before returning to the user.
Presidio. Microsoft's PII detection and redaction toolkit: regex + NER + custom recognizers for structured privacy shields.
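The regex-plus-custom-recognizer half of that design can be illustrated with the standard library alone. This is not Presidio's API (Presidio exposes its own `AnalyzerEngine` and `AnonymizerEngine` and adds NER and context scoring on top); `Recognizer`, `redact`, and the `EMPLOYEE_ID` pattern are hypothetical, and the patterns are simplified for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognizer:
    entity: str          # label written into the redacted text
    pattern: re.Pattern  # what counts as an instance of that entity


BUILT_IN = [
    Recognizer("EMAIL_ADDRESS", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    Recognizer("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]

# A custom recognizer for an organization-specific identifier (hypothetical format).
EMPLOYEE_ID = Recognizer("EMPLOYEE_ID", re.compile(r"\bEMP-\d{6}\b"))


def redact(text: str, recognizers: list[Recognizer]) -> str:
    """Replace every match of every recognizer with its <ENTITY> placeholder."""
    for recognizer in recognizers:
        text = recognizer.pattern.sub(f"<{recognizer.entity}>", text)
    return text


redacted = redact(
    "Mail jane@example.com, SSN 123-45-6789, filed by EMP-004829.",
    BUILT_IN + [EMPLOYEE_ID],
)
```

Pure regex matching misses names, addresses, and anything without a fixed format, which is why a dedicated engine pairs patterns like these with an NER model.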
No single tool gives you a complete safety posture. A production system usually layers: LlamaGuard (or a commercial equivalent like Lakera Guard) on the prompt and response to classify broad categories of unsafe content, Guardrails AI (or custom validators) for structured output constraints and schema enforcement, and Presidio (or another PII-specific engine) for fine-grained PII redaction. Layered defense is the only approach that works; any single layer will have false negatives, and production systems tend to find them the hard way.
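A layered posture is mostly a matter of composition. The sketch below assumes each layer is a function that either returns the (possibly repaired) text or raises; the three layer bodies are trivial stand-ins, not calls into LlamaGuard, Guardrails AI, or Presidio, and every name in it is hypothetical.

```python
import json
import re
from typing import Callable


class GuardrailViolation(Exception):
    """Carries which layer blocked the text and why."""

    def __init__(self, layer: str, reason: str) -> None:
        super().__init__(f"{layer}: {reason}")
        self.layer = layer
        self.reason = reason


# A layer returns the text to pass on, or raises GuardrailViolation.
Layer = Callable[[str], str]


def classify_content(text: str) -> str:
    """Stand-in for a content classifier (keyword match only)."""
    if "exfiltrate" in text.lower():
        raise GuardrailViolation("content", "matched blocked term 'exfiltrate'")
    return text


def enforce_json(text: str) -> str:
    """Stand-in for structured-output validation: the text must parse as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        raise GuardrailViolation("schema", exc.msg) from exc
    return text


def redact_emails(text: str) -> str:
    """Stand-in for a PII engine: repairs the text instead of blocking it."""
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<EMAIL_ADDRESS>", text)


def apply_layers(text: str, layers: list[Layer]) -> str:
    """Run layers in order; the first violation stops the chain."""
    for layer in layers:
        text = layer(text)
    return text


RESPONSE_LAYERS = [classify_content, enforce_json, redact_emails]
clean = apply_layers('{"summary": "Contact john.doe@example.com"}', RESPONSE_LAYERS)
```

Ordering matters here: blocking layers run before repairing ones, so a response that should never ship is rejected before any effort goes into redacting it.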
The honest version, because no survey is worth reading without one: for the AI Factory powering WatchAlgo, I do not use a dedicated platform-level observability tool. The pipeline runs autonomously for 30+ hours at a stretch behind 12+ quality gates, and the observability I need at this scale has been simpler to build inline than to adopt a platform for.
Specifically: every model call emits a structured JSON log line (OpenTelemetry semantic convention shape, OpenLLMetry-compatible) into a local file plus stdout. The quality gate layer itself is the eval layer: each gate is a predicate function that either passes the output forward or short-circuits with a structured failure reason. The adversarial multi-model review in the LLM Council pattern is effectively an LLM-as-judge eval layer, with the critical difference that the "judge" is two independent models voting rather than one model scoring against a rubric. This catches failure modes that single-model scoring misses, which is why I chose this approach over a traditional eval framework.
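The gate pattern described above can be sketched in a few lines. This is an illustration rather than the AI Factory's actual code: the gate names, the `GateFailure` shape, and the rule that the council passes only on unanimous approval are all assumptions, and the two reviewers are plain functions standing in for independent model calls.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GateFailure:
    gate: str
    reason: str


# A gate returns None to pass the output forward, or a structured failure.
Gate = Callable[[str], Optional[GateFailure]]
Reviewer = Callable[[str], bool]


def non_empty(output: str) -> Optional[GateFailure]:
    return None if output.strip() else GateFailure("non_empty", "output is blank")


def parses_as_json(output: str) -> Optional[GateFailure]:
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return GateFailure("parses_as_json", exc.msg)
    return None


def council_gate(reviewers: dict[str, Reviewer]) -> Gate:
    """Build a gate that passes only when every reviewer approves (assumed rule)."""
    def gate(output: str) -> Optional[GateFailure]:
        rejecting = [name for name, approve in reviewers.items() if not approve(output)]
        if rejecting:
            return GateFailure("council", f"rejected by {', '.join(rejecting)}")
        return None
    return gate


def run_gates(output: str, gates: list[Gate]) -> Optional[GateFailure]:
    """Apply gates in order and short-circuit on the first failure."""
    for gate in gates:
        failure = gate(output)
        if failure is not None:
            print(json.dumps({"event": "gate_failed", **asdict(failure)}))
            return failure
    return None


# Stand-in reviewers; in a real pipeline each would call a different model.
reviewers = {
    "reviewer_a": lambda out: "TODO" not in out,
    "reviewer_b": lambda out: len(out) < 10_000,
}
gates = [non_empty, parses_as_json, council_gate(reviewers)]
```

Cheap deterministic gates come first so that the expensive model-backed gate only sees outputs that already parse.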
For a solo-operator production system at the scale I run (thousands of autonomous invocations per hour, 12+ validation gates, known failure modes), a dedicated platform would add infrastructure overhead that exceeds its marginal value. For a team of 5-50 engineers shipping multiple AI features at different scales, platform adoption is almost always worth it; the shared observability across teammates is itself a force multiplier. The right answer depends on team size and system diversity, not on which tool is technically best.
If I were joining a team tomorrow, my default stack for a mid-sized AI engineering team would be: Langfuse self-hosted for tracing (OSS governance, data residency), DeepEval for evals in CI (pytest integration is developer-friendly), and a layered safety posture combining LlamaGuard for content classification and Guardrails AI for output validation. I would instrument with OpenLLMetry semantic conventions throughout so that if the team later wanted to unify into an OpenTelemetry backend, the instrumentation would already be portable.
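As a rough sketch of what "portable" means in practice: if every model call is logged under attribute names taken from the OpenTelemetry GenAI semantic conventions, the same records can later be mapped onto spans without renaming. The attribute keys below follow those conventions as I understand them and should be checked against the current spec; `llm_log_record` and `emit` are hypothetical helpers, not part of any SDK.

```python
import json
import sys
import time
from typing import TextIO


def llm_log_record(
    model: str,
    input_tokens: int,
    output_tokens: int,
    duration_s: float,
    operation: str = "chat",
) -> dict:
    """Build one log record whose attribute keys mirror OTel GenAI naming."""
    return {
        "timestamp": time.time(),
        "duration_s": round(duration_s, 3),
        "attributes": {
            "gen_ai.operation.name": operation,
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }


def emit(record: dict, stream: TextIO = sys.stdout) -> None:
    """Write the record as a single JSON line."""
    stream.write(json.dumps(record) + "\n")


record = llm_log_record("claude-opus-4-6", input_tokens=120, output_tokens=480, duration_s=2.3456)
emit(record)
```

Keeping the attribute names stable is the part that pays off later; the transport (file, stdout, OTLP exporter) can change without touching call sites.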
Rather than enumerate tool rankings, here is the set of questions I ask before recommending any specific adoption: