Agent Frameworks & Harnesses

How I think about LangChain, CrewAI, AutoGen β€” and why I built my own harness

Every AI job description lists β€œexperience with LangChain, CrewAI, AutoGen” as if they're interchangeable. They're not. This page walks through the major agent frameworks, shows you what each one actually does, tells you honestly when to reach for one, and ends with the case for skipping the framework entirely and writing a 200-line harness over the raw SDK β€” which is what I did for WatchAlgo, and what let it run autonomously for 30+ hours across 1,600+ AI-authored solutions.

8 frameworks compared · Real code for each · Honest pros / cons · Opinionated take

πŸ—ΊοΈThe landscape β€” eight ways to build an agent

The agent framework space exploded between 2023 and 2025. Every few months a new library appeared claiming to solve agent orchestration once and for all, which is how we ended up with LangChain, LangGraph, LlamaIndex, CrewAI, AutoGen, Swarm, Pydantic AI, Mastra, Smolagents, Haystack, Semantic Kernel, and at least a dozen others. The honest truth is that most of them solve the same problem with different amounts of abstraction, and the difference between them matters less than the fundamentals they all sit on top of.

Below is a map of the eight frameworks I'll walk through on this page, plotted on two axes that matter more than anything a marketing page will tell you: how much machinery the framework forces on you (lightweight to heavyweight) and whether the native unit of work is a single agent or a team (single-agent to multi-agent).

[Figure: The Agent Framework Landscape β€” 8 frameworks plotted by weight (lightweight ↔ heavyweight) and agent count (single-agent ↔ multi-agent). Lightweight side: βš™οΈ Raw Anthropic SDK (the primitive), πŸ”’ Pydantic AI (typed, newer), 🐝 Swarm (minimal handoff), πŸ› οΈ WatchAlgo Harness (~200 lines, custom). Heavyweight side: πŸ‘₯ CrewAI (role-based teams), 🀝 AutoGen (conversational), πŸ¦™ LlamaIndex (RAG-focused), 🦜 LangGraph (state-machine). No axis is "good" β€” the question is what your project needs; lightweight + custom harness is my default for deterministic pipelines.]
πŸ”‘Read this map before you read the rest of the page
The axes aren't β€œgood vs bad.” Lightweight isn't better than heavyweight, and single-agent isn't better than multi-agent. The question is what your project needs. A 200-line harness over the raw SDK is perfect for deterministic pipelines where you want full control (WatchAlgo). A heavyweight framework like LangGraph is worth the learning curve when you need durable state, graph branching, and multi-user session management. There is no single right answer, and any candidate who tells you there is has only used one framework.

βš™οΈThe primitive β€” what every framework is wrapping

Before we look at any framework, you need to see what they're all abstracting. An agent, at the most fundamental level, is a while loop around an LLM call with tool use. That's the whole mental model. Every framework on this page is a different way to dress up this loop β€” sometimes adding real value, sometimes adding ceremony.

python
from anthropic import Anthropic

client = Anthropic()

tools = [{
    "name": "search_kb",
    "description": "Search the HR knowledge base",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "search_kb":
        return f"Policy document for: {args['query']}"
    raise ValueError(f"Unknown tool: {name}")

messages = [{"role": "user", "content": "What's our parental leave policy?"}]

# The agent loop β€” every framework is wrapping this
while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )

    if response.stop_reason == "end_turn":
        print(response.content[0].text)
        break

    if response.stop_reason == "tool_use":
        messages.append({"role": "assistant", "content": response.content})
        # The API requires a tool_result for every tool_use block in the turn,
        # and Claude can emit more than one β€” so handle them all.
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": run_tool(block.name, block.input),
                })
        messages.append({"role": "user", "content": tool_results})

The Anthropic SDK agent loop in 30 lines β€” this is what LangChain, CrewAI, and the rest all wrap.

πŸ’‘Why this framing matters
Most explanations of β€œhow an agent works” start with a framework β€” that's backwards. The framework is an abstraction; the primitive is a while loop with LLM call + tool dispatch + message history append + termination check. Understand the primitive first, and every framework on this page becomes a straightforward variation on the same theme. Explain it that way to anyone who asks, and the conversation immediately skips past the usual framework-religion arguments into the actual engineering.

🦜1. LangChain / LangGraph β€” the heavyweight incumbent

🦜

LangChain + LangGraph

The framework everyone either loves or has strong opinions about not using.

Heavyweight orchestration · Most mature, most criticised · Python + TypeScript

LangChain is the framework that started the wave. It's essentially a massive toolkit of wrappers β€” around LLMs, vector stores, document loaders, memory abstractions, prompt templates, and output parsers β€” plus an orchestration layer. The orchestration layer has evolved twice: first with LCEL (the pipe-operator expression language), and then with LangGraph, which treats agents as explicit state machines with nodes and edges. LangGraph is the modern pattern and the one worth studying β€” it's what the company itself now positions as the serious production story.

python
from typing import TypedDict, Annotated
import operator
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], operator.add]

@tool
def search_kb(query: str) -> str:
    """Search the HR knowledge base."""
    return f"Policy document for: {query}"

llm = ChatAnthropic(model="claude-sonnet-4-5")
llm_with_tools = llm.bind_tools([search_kb])

def agent_node(state: AgentState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

def should_continue(state: AgentState):
    return "tools" if state["messages"][-1].tool_calls else END

graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode([search_kb]))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")

app = graph.compile()
result = app.invoke({"messages": [("user", "What's our parental leave policy?")]})

LangGraph β€” the modern state-machine pattern. The agent loop is now a graph with an 'agent' node, a 'tools' node, and a conditional edge.

Strengths
  • βœ“Massive ecosystem β€” integrations for every vector DB, LLM provider, document loader, and tool imaginable. If you need to prototype fast, LangChain has a wrapper for it.
  • βœ“LangGraph is genuinely powerful for stateful, branching workflows β€” think human-in-the-loop approval gates, checkpoint/resume, and multi-user session management.
  • βœ“Built-in observability via LangSmith β€” trace every LLM call, every tool call, token usage, latency. Hard to replicate this cheaply with a custom harness.
  • βœ“Largest hiring pool β€” if you need to staff a team, LangChain engineers are the easiest to find.
Trade-offs
  • ⚠Abstraction leak β€” debugging a failing LangChain pipeline often means reading the library’s source code to understand what's actually being sent to the LLM. The abstraction saves time on day one and costs time on day thirty.
  • ⚠Rapid API churn β€” LangChain has rewritten its core abstractions at least three times (legacy Chains, then LCEL/Runnables, now LangGraph). Code written 18 months ago rarely works unchanged today.
  • ⚠Heavyweight dependency footprint β€” pulls in dozens of transitive dependencies. Startup time is noticeably slower than raw SDK code.
  • ⚠Over-engineered for simple pipelines β€” a 50-line use case can become a 200-line graph with nodes and edges that adds ceremony without value.
πŸ’‘When to pick LangGraph

Pick it when you have durable state requirements (checkpointing, resume after crash), or complex branching with conditional edges, or you need LangSmith observability and don't want to build your own tracing. Skip it for deterministic pipelines that don't need those features.

πŸ’¬My honest take

LangGraph is a genuinely good piece of software β€” the graph abstraction fits certain problems perfectly. But it's overkill for 80% of the agent use cases I've seen, and the API churn has burned enough teams that I understand why people are suspicious. I keep it in my toolbox for complex stateful workflows and reach for raw SDK + a custom harness everywhere else.
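The graph abstraction itself is small enough to sketch without the library: a dict of node functions, a router that implements the conditional edge, and a driver loop. The names below are illustrative, not LangGraph's API; it's a sketch of the pattern under the framework, with a stubbed-out "LLM" node.

```python
# Framework-free sketch of the state-machine pattern LangGraph formalises:
# state flows through named nodes, and a router picks the next node.

END = "__end__"

def agent_node(state: dict) -> dict:
    # Stub LLM: requests a tool until it has a result, then answers.
    if "kb_result" not in state:
        return {**state, "pending_tool": ("search_kb", state["question"])}
    return {**state, "answer": f"Based on: {state['kb_result']}"}

def tool_node(state: dict) -> dict:
    name, arg = state.pop("pending_tool")
    return {**state, "kb_result": f"Policy document for: {arg}"}

def router(state: dict) -> str:
    # The conditional edge: go to tools if a call is pending, else finish.
    return "tools" if "pending_tool" in state else END

NODES = {"agent": agent_node, "tools": tool_node}
EDGES = {"agent": router, "tools": lambda state: "agent"}  # tools loops back

def run_graph(state: dict, entry: str = "agent") -> dict:
    node = entry
    while node != END:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

final = run_graph({"question": "parental leave policy"})
```

What LangGraph adds on top of this skeleton is exactly the hard part: checkpointing the state dict, resuming after a crash, and multi-user sessions.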

πŸ¦™2. LlamaIndex β€” the data-first framework

πŸ¦™

LlamaIndex

Started as a RAG toolkit, grew into a full agent framework β€” still most opinionated about the data side.

Data-first RAG · Mature · Python + TypeScript

LlamaIndex started as GPT-Index β€” a way to let LLMs query your documents β€” and evolved into a general agent framework with a strong focus on everything data-adjacent: ingestion, indexing, chunking, retrieval, query engines. If LangChain is the Swiss Army knife of LLM integrations, LlamaIndex is the specialty knife for RAG-heavy systems. Its agent layer (ReAct agents, OpenAIAgent, FunctionCallingAgent) is capable but the framework's real gravity is in the data pipeline.

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.anthropic import Anthropic

# Ingestion and indexing β€” the LlamaIndex native strength
documents = SimpleDirectoryReader("./hr_policies").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=5)

# Wrap the query engine as a tool the agent can call
hr_kb = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="hr_knowledge_base",
        description="Search HR policies, benefits, and compliance documents",
    ),
)

# The ReAct agent β€” reason-then-act loop with the KB tool
agent = ReActAgent.from_tools(
    tools=[hr_kb],
    llm=Anthropic(model="claude-sonnet-4-5"),
    verbose=True,
)

response = agent.chat("What's our parental leave policy for California employees?")
print(response)

LlamaIndex ReAct agent backed by a vector index β€” the data pipeline is the primary abstraction.

Strengths
  • βœ“Best-in-class data ingestion β€” loaders for 100+ document types, chunking strategies, embedding pipeline, incremental indexing. If your problem is 90% data wrangling, LlamaIndex saves the most time.
  • βœ“Strong opinions on retrieval β€” hybrid search, reranking, sentence-window retrieval, auto-merging retrieval. The right defaults are often already picked.
  • βœ“Query engine abstraction is powerful β€” you can compose query engines into tool-calling agents without re-writing the retrieval code.
  • βœ“Less API churn than LangChain β€” the core data abstractions have been relatively stable.
Trade-offs
  • ⚠Agent layer feels bolted on β€” it works but the framework's heart is clearly the data pipeline, not agent orchestration.
  • ⚠Heavy imports β€” similar dependency footprint to LangChain, slow startup time.
  • ⚠Less community momentum on the agent side β€” when people say 'we use LangGraph for agents,' they often mean they picked LangGraph specifically because LlamaIndex agents felt secondary.
  • ⚠Over-abstraction on the data side can make debugging retrieval tricky β€” you have to know which layer to poke at.
πŸ’‘When to pick LlamaIndex

Pick it when your project is primarily a RAG system and the data pipeline is the hard part β€” many source types, complex chunking requirements, hybrid search, incremental indexing. Skip it when the data side is simple and the hard problem is agent orchestration.

πŸ’¬My honest take

LlamaIndex is the framework I'd reach for if I were building an enterprise RAG system from zero and didn't want to hand-write the ingestion pipeline. Its defaults for chunking and retrieval are sensible. But for agent orchestration specifically, I'd still use something else β€” or raw SDK.
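To make "the data pipeline is the hard part" concrete, here is the naive baseline you'd hand-write without it: fixed-size chunking with overlap. This is my own sketch, not LlamaIndex code; LlamaIndex ships smarter variants of this (sentence-window, auto-merging) with sensible defaults already picked.

```python
# Naive fixed-size chunking with overlap: the hand-rolled baseline that
# LlamaIndex's retrieval strategies improve on.

def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already covered the tail
    return chunks

doc = "a" * 250
chunks = chunk_text(doc, chunk_size=100, overlap=20)
# 250 chars with step 80 -> three chunks of 100, 100, 90 characters
```

Multiply this by document parsing, embedding, incremental re-indexing, and hybrid search, and the build-vs-buy math tips toward the framework quickly.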

πŸ‘₯3. CrewAI β€” role-based multi-agent teams

πŸ‘₯

CrewAI

Pretend your agents are a team of humans with roles and backstories β€” and let the framework orchestrate them.

Role-based multi-agent · Rapidly growing · Python

CrewAI took off in 2024 because its abstraction is emotionally compelling: you define agents with roles (β€œResearch Analyst”), goals (β€œFind information about X”), and backstories (β€œYou are a senior analyst with 15 years of experience”), then you define tasks, then you assemble a crew that runs the tasks sequentially or in parallel. Under the hood it's just the primitive loop with multi-agent message passing β€” but the mental model of β€œhiring a team” is sticky enough that engineers and product people can reason about it together.

python
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

search_tool = SerperDevTool()

researcher = Agent(
    role="HR Policy Researcher",
    goal="Find accurate, up-to-date information about {topic}",
    backstory=(
        "You are a senior HR analyst with 15 years of experience. "
        "You always cite specific policy documents and verify effective dates."
    ),
    tools=[search_tool],
    verbose=True,
)

writer = Agent(
    role="Policy Explainer",
    goal="Turn research briefs into clear, citable answers for employees",
    backstory="You write HR responses that are clear to non-experts.",
    verbose=True,
)

research_task = Task(
    description="Research {topic} and gather 3-5 authoritative sources",
    expected_output="A detailed research brief with inline citations",
    agent=researcher,
)

writing_task = Task(
    description="Write a clear 2-paragraph answer based on the research brief",
    expected_output="Final response with citations",
    agent=writer,
    context=[research_task],  # writer sees the researcher's output
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,
)

result = crew.kickoff(inputs={"topic": "parental leave in California"})

CrewAI β€” two agents, two tasks, sequential execution. The role/backstory pattern is the distinguishing feature.

Strengths
  • βœ“Ergonomic mental model β€” non-engineers can read a CrewAI script and understand what's happening. 'Here's the researcher, here's the writer, here's their tasks.'
  • βœ“Fastest way to prototype multi-agent workflows β€” minimal boilerplate, sensible defaults, agents just work.
  • βœ“Great for delegated workflows β€” research β†’ analyse β†’ write, or triage β†’ investigate β†’ respond.
  • βœ“Active community, lots of examples, growing ecosystem of crewai_tools.
Trade-offs
  • ⚠The role/backstory abstraction hides the prompt engineering from you β€” great until you need to debug why an agent is ignoring instructions, at which point you need to dig into how CrewAI actually assembles prompts.
  • ⚠Less expressive for non-linear flows β€” sequential and hierarchical are easy, but complex branching or human-in-the-loop feels bolted on.
  • ⚠Framework-authored instructions can bloat your prompts β€” CrewAI adds its own system prompt scaffolding on top of yours, which you don't control directly.
  • ⚠Still young β€” API is stabilising but expect breaking changes.
πŸ’‘When to pick CrewAI

Pick it when you need to ship a multi-agent prototype fast, when the stakeholders reviewing the code include non-engineers, or when the workflow maps cleanly onto β€œhumans with roles delegating to each other.” Skip it when you need tight control over the prompt, or when the flow is a single-agent loop.

πŸ’¬My honest take

CrewAI is the framework I'd recommend to someone building their first multi-agent system, specifically because the mental model is intuitive. I've seen teams go from β€œwe should try multi-agent” to a working prototype in a day with CrewAI, and that kind of velocity matters. The escape hatch is easy: once you outgrow it, you can translate to a custom harness without rewriting your prompts.
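That escape hatch is worth spelling out: underneath, role/goal/backstory is just material for a system prompt, so migrating a CrewAI agent to a raw-SDK harness can start as string assembly. This is a sketch of the idea, not CrewAI's actual prompt template, which you should inspect before migrating so behaviour doesn't silently change.

```python
# Sketch: flattening CrewAI-style role/goal/backstory into a plain system
# prompt for a raw-SDK harness. CrewAI's real template adds more scaffolding.

def to_system_prompt(role: str, goal: str, backstory: str) -> str:
    return (
        f"You are {role}.\n"
        f"{backstory}\n"
        f"Your goal: {goal}"
    )

prompt = to_system_prompt(
    role="HR Policy Researcher",
    goal="Find accurate, up-to-date information about parental leave",
    backstory="You are a senior HR analyst with 15 years of experience.",
)
```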

🀝4. AutoGen β€” conversational multi-agent

🀝

AutoGen (Microsoft / ag2)

Agents talk to each other in turn-based conversations β€” with optional human-in-the-loop.

Conversational multi-agent · Mature, forking · Python

AutoGen is Microsoft Research's bet on multi-agent systems, built around the idea that agents should have conversations with each other β€” literally. You instantiate `ConversableAgent` objects and call `initiate_chat` to kick off a turn-based exchange between them. It's a different mental model from CrewAI's β€œroles and tasks”: AutoGen thinks in terms of β€œwho is in the conversation and who talks next.” (Note: in 2024 the project forked as β€œag2” alongside Microsoft's original AutoGen β€” the community is still sorting out which branch to follow.)

python
from autogen import ConversableAgent, UserProxyAgent

config_list = [{"model": "claude-sonnet-4-5", "api_type": "anthropic"}]

analyst = ConversableAgent(
    name="analyst",
    system_message=(
        "You are an HR analyst. When asked questions, ask the "
        "researcher to look up specifics before answering."
    ),
    llm_config={"config_list": config_list},
)

researcher = ConversableAgent(
    name="researcher",
    system_message="You search policy documents and return findings with citations.",
    llm_config={"config_list": config_list},
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
)

# Kick off a conversation β€” agents now talk to each other in turns
user_proxy.initiate_chat(
    analyst,
    message="What's our parental leave policy in California?",
    max_turns=5,
)

AutoGen β€” two agents having a conversation with a user proxy as the orchestrator.

Strengths
  • βœ“Conversation mental model is powerful for scenarios where agents genuinely need to negotiate or clarify β€” e.g., a coder and a reviewer going back and forth on a change.
  • βœ“First-class human-in-the-loop β€” UserProxyAgent can ask a real human for input mid-flow, which is non-trivial to build from scratch.
  • βœ“Group chat patterns β€” more than two agents in the same conversation with an orchestrator that picks who speaks next.
  • βœ“Research credentials β€” comes out of Microsoft Research with solid papers backing the design.
Trade-offs
  • ⚠Conversational framing can be wasteful β€” many agent flows aren't actually conversations, and forcing them into one adds token cost and latency.
  • ⚠The ag2 fork has fractured momentum β€” you now have to pick which branch to use, and the two will drift.
  • ⚠API is less ergonomic than CrewAI β€” more ceremony for simple cases.
  • ⚠Configuration sprawl β€” llm_config, config_list, code_execution_config, human_input_mode β€” lots of dials to turn before you get to work.
πŸ’‘When to pick AutoGen

Pick it when your workflow is genuinely a back-and-forth conversation β€” code review, negotiation, iterative refinement with human checkpoints. Skip it when the flow is a one-shot pipeline.

πŸ’¬My honest take

AutoGen solves a narrower problem than it thinks it does. The conversation metaphor fits some use cases beautifully and others awkwardly. The fork between Microsoft AutoGen and ag2 is a real concern for anyone adopting it today β€” I'd wait for the dust to settle before starting a new production project with it.
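The turn-based core of the conversation model is itself small. Here is a framework-free sketch: speakers alternate, each producing a reply from the transcript so far, until a termination marker or a turn cap. The names and the TERMINATE convention are illustrative, not AutoGen's API (though AutoGen uses a similar termination-string idea).

```python
# Framework-free sketch of AutoGen's turn-based model: alternate speakers,
# append each message to a shared transcript, stop on a marker or max_turns.

def analyst(transcript: list[str]) -> str:
    if len(transcript) == 1:
        return "researcher, please look up the specifics."
    return "TERMINATE: here is the answer, with citations."

def researcher(transcript: list[str]) -> str:
    return "Found it in the 2024 policy handbook."

def run_chat(opening: str, speakers, max_turns: int = 5) -> list[str]:
    transcript = [opening]
    for turn in range(max_turns):
        speaker = speakers[turn % len(speakers)]
        message = speaker(transcript)
        transcript.append(message)
        if message.startswith("TERMINATE"):
            break
    return transcript

log = run_chat("What's our parental leave policy?", [analyst, researcher])
```

Everything AutoGen layers on (group chat, speaker selection, human-in-the-loop) is variations on who gets the next turn in this loop.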

🐝5. OpenAI Swarm β€” minimal handoff pattern

🐝

OpenAI Swarm

OpenAI's experimental bet on 'keep it minimal and let agents hand off to each other.'

Lightweight routing · Experimental · Python

Swarm is OpenAI's minimalist counterpoint to everyone else β€” released in late 2024 as an intentionally tiny library (under 500 lines) to demonstrate one pattern: agents handing off to other agents by returning them from functions. An agent has instructions and a list of functions; if a function returns another Agent, control transfers to that one. That's the whole abstraction. It's marked as experimental and educational, not production-ready, but the pattern it teaches is genuinely useful.

python
from swarm import Swarm, Agent

# Define the experts and handoff functions β€” returning an Agent transfers control
benefits_expert = Agent(
    name="Benefits Expert",
    instructions="You answer questions about benefits and eligibility.",
)

def transfer_to_benefits_expert():
    return benefits_expert

policy_expert = Agent(
    name="Policy Expert",
    instructions=(
        "You answer questions about HR policies with citations. "
        "If asked about benefits, hand off to the benefits expert."
    ),
    functions=[transfer_to_benefits_expert],  # so it can actually hand off
)

def transfer_to_policy_expert():
    return policy_expert

# The triage agent is the entry point
triage_agent = Agent(
    name="Triage",
    instructions=(
        "You route HR queries. For policies, hand off to the policy expert. "
        "For benefits, hand off to the benefits expert."
    ),
    functions=[transfer_to_policy_expert, transfer_to_benefits_expert],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "What's my parental leave policy?"}],
)
print(response.messages[-1]["content"])

Swarm β€” the whole framework is one idea: functions can return agents, and that triggers a handoff.

Strengths
  • βœ“Tiny surface area β€” you can read the entire framework in 20 minutes and understand exactly what it does.
  • βœ“The handoff pattern is legitimately useful and works with any LLM provider, not just OpenAI.
  • βœ“Educational value β€” studying Swarm's source is one of the best ways to understand what any agent framework is actually doing.
  • βœ“No dependencies to speak of β€” minimal footprint if you want a lightweight multi-agent starter.
Trade-offs
  • ⚠Officially marked experimental β€” OpenAI explicitly says not to use it in production.
  • ⚠OpenAI-branded but the recommended production story is now the Responses API and the Agents SDK, not Swarm.
  • ⚠No durable state, no checkpointing, no observability β€” it's just the handoff pattern.
  • ⚠Largely superseded β€” the OpenAI Agents SDK is the newer official library that builds on the same ideas; Swarm itself is more of a teaching tool now.
πŸ’‘When to pick Swarm

Pick it to learn β€” it's the fastest way to see what a minimal agent framework looks like in code. For production, skip it in favor of a custom harness that uses the same handoff pattern, or use OpenAI's newer Agents SDK directly.

πŸ’¬My honest take

Swarm isn't really a framework β€” it's a demonstration. I recommend reading the source code cover to cover because it clarifies what every other agent framework is doing under the hood. But I wouldn't deploy it.
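Since the whole framework is one idea, it is worth seeing that idea with no library at all: a dispatcher that, when a tool returns an agent, swaps the active agent. This is a sketch of the pattern Swarm demonstrates, with my own made-up Agent type and function names.

```python
# The handoff pattern, framework-free: if a tool call returns an Agent,
# the dispatcher makes that agent the active one.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    instructions: str
    functions: list[Callable] = field(default_factory=list)

def dispatch(active: Agent, fn_name: str) -> Agent:
    # Look up the tool on the active agent and call it; an Agent return
    # value means "hand off", anything else stays with the current agent.
    for fn in active.functions:
        if fn.__name__ == fn_name:
            result = fn()
            return result if isinstance(result, Agent) else active
    raise ValueError(f"{active.name} has no function {fn_name}")

benefits = Agent("Benefits Expert", "Answer benefits questions.")

def transfer_to_benefits():
    return benefits

triage = Agent("Triage", "Route queries.", functions=[transfer_to_benefits])
active = dispatch(triage, "transfer_to_benefits")  # control moves to benefits
```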

πŸ”’6. Pydantic AI β€” typed, production-first, newer

πŸ”’

Pydantic AI

Type-safe agents with structured outputs as first-class citizens β€” built by the Pydantic team, released 2024.

Typed production · Newer but mature authors · Python

Pydantic AI is the newest entry on this list and the one I'm most personally interested in. It comes from the Pydantic team β€” the same people who authored the data validation library that half of Python now depends on β€” and the framework's distinguishing feature is type-safe agent outputs. You declare a Pydantic model for what the agent should return, and the framework enforces that shape. It feels like what FastAPI did for web handlers, applied to agents.

python
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

class HRAnswer(BaseModel):
    answer: str
    policy_citations: list[str]
    effective_date: str
    confidence: float

class HRContext(BaseModel):
    employee_id: str
    work_state: str

agent = Agent(
    "anthropic:claude-sonnet-4-5",
    result_type=HRAnswer,
    deps_type=HRContext,
    system_prompt=(
        "You are an HR assistant. Answer questions with citations. "
        "Always flag low-confidence answers for human review."
    ),
)

@agent.tool
async def search_policies(ctx: RunContext[HRContext], query: str) -> str:
    return f"Searched for '{query}' (employee in {ctx.deps.work_state})"

result = agent.run_sync(
    "What's my parental leave?",
    deps=HRContext(employee_id="u-8821", work_state="CA"),
)

# result.data is typed as HRAnswer β€” no JSON parsing needed
print(f"Confidence: {result.data.confidence}")
print(f"Citations: {result.data.policy_citations}")

Pydantic AI β€” structured output is a Pydantic model. No JSON parsing, no string-to-object ceremony.

Strengths
  • βœ“Type safety all the way through β€” if your result_type is HRAnswer, that's what you get. No JSON.loads, no try/except on parse errors.
  • βœ“Dependency injection via RunContext β€” clean way to pass request-scoped data (user, session) into tools without globals.
  • βœ“Great observability via structured validation errors β€” you can immediately see when the LLM returned something that violated the schema.
  • βœ“From the Pydantic team β€” the people most likely to get the API ergonomics right, and unlikely to thrash the API every six months.
Trade-offs
  • ⚠Newer β€” smaller community, fewer examples, less Stack Overflow coverage when you hit edge cases.
  • ⚠Python-only β€” if your stack is TypeScript-native, Pydantic AI isn't an option.
  • ⚠Multi-agent story is less developed than CrewAI or AutoGen β€” it's primarily a great single-agent framework right now.
  • ⚠Pydantic dependency can be heavy if you're not already using it elsewhere.
πŸ’‘When to pick Pydantic AI

Pick it when your downstream code cares about the exact shape of the agent's output β€” typed fields, validated values, enumerated options β€” and you'd otherwise be writing manual JSON parsing and error handling. Skip it for free-form text responses or TypeScript stacks.
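To make that concrete, here is roughly the ceremony you write by hand without it: parse the model's text as JSON, check required fields and types, and surface a usable error when the shape is wrong. The field names follow the HRAnswer example above; the helper itself is my own, not anything from Pydantic AI.

```python
# Manual structured-output handling: the boilerplate a typed framework
# like Pydantic AI collapses into a single result_type declaration.
import json

REQUIRED = {"answer": str, "policy_citations": list, "confidence": float}

def parse_hr_answer(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {exc}") from exc
    for name, expected in REQUIRED.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"{name} should be {expected.__name__}")
    return data

ok = parse_hr_answer(
    '{"answer": "12 weeks", "policy_citations": ["PL-2024"], "confidence": 0.9}'
)
```

And this still doesn't handle re-prompting the model with the validation error, which typed frameworks also do for you.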

πŸ’¬My honest take

Pydantic AI is the framework I'd reach for on a new Python project today if I wanted the convenience of a framework and typed outputs mattered. It's opinionated in the ways FastAPI is opinionated β€” the defaults are sensible and the API feels cohesive. The single-agent focus fits most production needs better than the multi-agent frameworks admit.

πŸ› οΈ7. My hand-built harness β€” the WatchAlgo AI Factory

πŸ› οΈ

WatchAlgo AI Factory (custom harness)

200 lines of Python over raw anthropic SDK β€” the framework I reach for most often.

Custom harness · Shipped, 1,600+ solutions · Python

When I built the AI Factory for WatchAlgo β€” the pipeline that generated visualizations for 1,600+ algorithm problems, running autonomously for 30+ hours at a stretch β€” I evaluated every framework on this page and chose none of them. Instead, I wrote a ~200-line harness over the raw Anthropic SDK with three components: a single-agent loop (while loop + tool dispatch + validation), a multi-agent orchestrator (ThreadPoolExecutor running N parallel single-agents), and a tool SDK with dual registration (one source of truth for Anthropic API schema + Python implementation).

python
from anthropic import Anthropic
from concurrent.futures import ThreadPoolExecutor
from tools import TOOL_DEFINITIONS, TOOL_IMPLEMENTATIONS

MAX_TURNS = 20

def run_agent(problem_id: str, model: str = "claude-sonnet-4-5") -> dict:
    """Single-agent loop β€” generates one visualization.json end-to-end."""
    client = Anthropic()
    messages = [{
        "role": "user",
        "content": f"Generate visualization.json for problem {problem_id}",
    }]

    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            tools=TOOL_DEFINITIONS,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            return {"status": "success", "problem_id": problem_id}

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = TOOL_IMPLEMENTATIONS[block.name](**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    return {"status": "max_turns_exceeded", "problem_id": problem_id}


def run_batch(problem_ids: list[str], workers: int = 5) -> list[dict]:
    """Multi-agent orchestrator β€” N parallel workers via ThreadPoolExecutor.

    This is all it takes to go from 'one agent' to 'team of N agents running
    the same task in parallel.' No framework required.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(run_agent, problem_ids))

The core WatchAlgo agent β€” the agent loop and the orchestrator, both shipped, both running in production.
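The tools module the harness imports isn't shown, but the "dual registration" idea can be sketched: one decorator is the single source of truth for both the Anthropic tool schema and the Python implementation. The decorator and the write_file example are my reconstruction, not WatchAlgo's actual code; only the TOOL_DEFINITIONS / TOOL_IMPLEMENTATIONS names come from the harness above.

```python
# Sketch of dual registration: one decorator populates both the API schema
# list (sent to the model) and the dispatch table (used by the loop).
TOOL_DEFINITIONS: list[dict] = []
TOOL_IMPLEMENTATIONS: dict = {}

def tool(name: str, description: str, input_schema: dict):
    def register(fn):
        TOOL_DEFINITIONS.append({
            "name": name,
            "description": description,
            "input_schema": input_schema,
        })
        TOOL_IMPLEMENTATIONS[name] = fn
        return fn
    return register

@tool(
    name="write_file",
    description="Write visualization.json to disk",
    input_schema={
        "type": "object",
        "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
        "required": ["path", "content"],
    },
)
def write_file(path: str, content: str) -> str:
    # Illustrative body: a real implementation would write to disk.
    return f"wrote {len(content)} bytes to {path}"
```

Because schema and implementation live in one place, they can't drift apart, which matters when an unattended run depends on tool calls dispatching correctly.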

Strengths
  • βœ“Zero abstraction between you and the LLM β€” when something goes wrong, the stack trace points directly at your code, not a framework's internals.
  • βœ“No API churn β€” you own the API. The only dependency is the official anthropic SDK, which has a stable contract.
  • βœ“Trivial to customise β€” want to add retry logic, or log every tool call to Postgres, or swap models mid-flow? Edit 5 lines. In LangChain you'd be writing a custom callback handler.
  • βœ“Composes with ThreadPoolExecutor for multi-agent β€” 6 lines of standard-library code gets you parallel execution. No need for a multi-agent framework.
Trade-offs
  • ⚠You write everything yourself β€” observability, retries, cost tracking, structured output parsing. Frameworks give you these 'for free' even if they also give you other problems.
  • ⚠No pre-built integrations β€” if you need a Weaviate adapter or a specific document loader, you're writing it.
  • ⚠Less resume-friendly than 'I used LangChain' β€” the hiring pool rewards name recognition, fairly or not.
  • ⚠Multi-agent patterns beyond simple parallelism require you to design them yourself (e.g., handoff, delegation, voting).
πŸ’‘When to pick a custom harness

Pick it when you know exactly what you want the agent to do, the workflow is deterministic or near-deterministic, and the cost of a framework's API churn outweighs the value of its abstractions. This is more workflows than the industry admits.

πŸ’¬My honest take β€” the case for rolling your own

After a year of shipping agents in production, my strongest-held opinion is this: most agent workflows do not need a framework. They need a while loop, a tool dispatcher, and discipline. The frameworks exist because engineers reach for abstraction when they're uncertain about the problem, and abstraction feels like progress. But every framework adds weight you'll pay for when debugging, and takes control away from you in exchange for velocity you only needed on day one.

When someone asks β€œwhy didn't you use LangChain,” the honest answer is: I evaluated it, I understand what it gives me, I understand what it takes away, and for this workload the trade was wrong. That's a much stronger position than β€œI used LangChain because everyone does.”

🎯Decision framework β€” which one should I pick?

The honest answer depends on your project, but here's a decision table that captures the heuristics I actually use. Read the scenario in the left column, see what I'd reach for in the middle, and see the reasoning in the right.

Scenario
Pick
Why
Deterministic batch pipeline (e.g., generate N artifacts from N inputs with validation)
Custom harness
You know exactly what each agent should do. Framework ceremony adds no value, API churn adds risk. This is what I used for WatchAlgo.
Durable stateful workflows with checkpoint/resume and branching
LangGraph
This is the one use case where LangGraph's graph abstraction earns its weight. Durable state is hard to build yourself.
RAG-heavy system where data ingestion is the hard part
LlamaIndex
Its data pipeline primitives save weeks of work. Wrap it in a custom agent if you don't love the ReAct agent.
Multi-agent prototype that non-engineers need to understand
CrewAI
The role/backstory abstraction is legible to stakeholders. Easiest framework to demo.
Iterative conversation between agents with human checkpoints
AutoGen (or ag2)
The conversation metaphor and human-in-the-loop primitives are AutoGen's strongest fit.
Single-agent production workload with strictly-typed outputs
Pydantic AI
Structured output is the hardest thing to get right by hand. Pydantic AI makes it a type declaration.
You want to learn what a framework is actually doing under the hood
OpenAI Swarm
Tiny, readable source. One weekend of reading it teaches you more than a month of using LangChain.
You need to understand a framework quickly to evaluate it for a new project
LangGraph hello-world + this page
Read this page, write a LangGraph hello-world, understand the primitive. That gives you more signal than a week of reading LangChain documentation.

🧭The meta-take β€” what the ecosystem gets right and wrong

After a year of shipping agents and evaluating every major framework in this space, here's the pattern I see: the agent framework ecosystem is simultaneously the most important and the most over-built part of the LLM-application stack.

πŸ”‘What the frameworks get right

Every framework on this page is genuinely trying to solve a real problem: how do you turn an LLM call + tools into a reliable, observable, composable, testable component of a larger system? That's a hard question and the answers are genuinely non-trivial. LangGraph's state-machine pattern, Pydantic AI's type safety, CrewAI's role-based abstraction, LlamaIndex's data pipeline primitives β€” each of these is a real insight worth knowing about.

⚠️What they get wrong

Most projects using these frameworks would be simpler, faster to debug, and more maintainable if they used the raw SDK instead. The frameworks promise β€œwrite less code” and usually deliver on day one, then charge you back on day thirty when you have to debug why your prompts are being silently modified, or why the framework's retry logic is interfering with your rate limiter, or why upgrading a minor version broke the import path. The ecosystem's biggest failure is convincing engineers that β€œI used LangChain” is the same thing as β€œI understand agents,” when it's often the opposite.

πŸ’¬The answer I actually give

When someone asks which agent framework I use, my answer is: β€œI've evaluated LangChain, CrewAI, LlamaIndex, AutoGen, and Pydantic AI, and for the workloads I ship I typically write a thin harness over the raw SDK. Here's what each framework is good at, and here's why I picked differently.” That's the posture of an engineer who has evaluated the landscape rather than one who picked whatever Google suggested. Every framework on this page has a legitimate use case β€” the skill is knowing which one.