🏭 Production Agentic System

The AI Factory
Multi-Agent Orchestration at Production Scale

A three-layer agentic framework I built from scratch that autonomously generates, validates, and self-corrects structured content — running for 30+ hours without human intervention, across provider quota cycles, with per-output cost tracking baked in.

1,600+
AI-authored solutions
30+ hrs
autonomous operation
12+
quality criteria
3 langs
per output
3 flavors
content variants
$0.03
per generation

🎯 The Problem

I needed to generate structured educational content for 3,247 algorithm problems across three programming languages and three content flavors to match different learning styles. Doing this manually would require a team of 10+ content engineers working for a year.

The naive approach — prompt an LLM with “generate content for problem X” and accept the output — fails at production scale. Outputs are inconsistent, schema violations creep in, edge cases break downstream renderers, and cost runs unbounded. I've seen teams burn tens of thousands of dollars on “AI content pipelines” that produce unusable output because they treated the LLM as a magic box instead of architecting around its failure modes.

🔑 The goal
Prove that a single developer with the right agentic architecture can produce what traditionally requires an engineering team — with deterministic quality gates, cost discipline, and the ability to run autonomously for days.

πŸ—οΈThree-Layer Architecture

The AI Factory separates concerns across three distinct layers. Each layer has a single responsibility, making the system debuggable, extensible, and easy to reason about under failure.

System Overview
Multi-Agent Orchestrator — orchestrator.py: ThreadPoolExecutor, model routing, retry logic
  Workers 1 through 4, dispatched in parallel
Single-Agent Loop — agent.py: observe → decide → act → loop
  LLM Call (Anthropic SDK) → Tool Dispatch (execute & capture) → Result Feedback (append & loop)
Modular Tool SDK — tools.py: dual-registration, API schema + runtime implementation
  read_file / write_file (sandboxed I/O), validate (Report Card)
🔁

Layer 1: Single-Agent Loop

The core observe → decide → act loop. One LLM, one conversation state, tool use via the Anthropic SDK. This is the atomic unit of work.

🧭

Layer 2: Multi-Agent Orchestrator

A ThreadPoolExecutor dispatches N parallel workers, each running an independent single-agent loop with isolated state. Handles model routing and retry logic.

🔌

Layer 3: Modular Tool SDK

Each tool is dual-registered: an API schema for LLM consumption and a runtime implementation for execution. Adding a new tool requires zero changes to the orchestration layer.

πŸ”Layer 1: The Single-Agent Loop

Every agent system in the world is a variation on this pattern:

python
def run_agent(problem_code: str, *, model: str, dry_run: bool, verbose: bool) -> dict:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": initial_prompt(problem_code)}]
    turns = 0

    while turns < MAX_TURNS:
        turns += 1

        # 1. LLM decides what to do
        response = client.messages.create(
            model=model,
            system=SYSTEM_PROMPT,
            messages=messages,
            tools=TOOL_DEFINITIONS,
            max_tokens=16384,
        )

        # 2. Agent signals completion
        if response.stop_reason == "end_turn":
            return {"success": True, "problem": problem_code, "turns": turns}

        # 3. Execute tool calls
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = TOOL_IMPLEMENTATIONS[block.name](block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })

            # 4. Feed results back and loop
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    return {"success": False, "problem": problem_code, "turns": turns,
            "message": f"MAX_TURNS ({MAX_TURNS}) exceeded"}

agent.py — the core agentic loop

🔑 Why this structure matters
Framework abstractions (LangChain, CrewAI, AutoGen) hide this loop behind class hierarchies and YAML configs. That's fine for prototyping, but it turns production debugging into an archaeology expedition. Keeping the loop explicit means I can read the code and know exactly what the agent is doing, what it saw last, and why it made the decision it did. For enterprise SLAs, that transparency is worth more than any convenience feature.

Four Invariants I Enforce in Every Agent Loop

🛑

MAX_TURNS Limit

Hard cap on loop iterations (20 by default). Prevents runaway agents that get stuck in validation-repair spirals.

🔒

Sandboxed Tool Execution

File I/O tools enforce path containment. The agent literally cannot escape its designated directory.

📊

Token and Cost Tracking

Every call logs input tokens, output tokens, and dollar cost. Unit economics by default, not an afterthought.

🧪

Dry-Run Mode

The --dry-run flag short-circuits destructive tool calls. I can validate the full loop without side effects.
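A minimal sketch of how such a flag could be wired in. The `DESTRUCTIVE_TOOLS` set, `make_dispatcher` helper, and tool stubs are illustrative assumptions, not the project's actual code:

```python
# Hypothetical sketch of a dry-run guard: destructive tools are short-circuited
# while read-only tools run normally. Names here are illustrative.
DESTRUCTIVE_TOOLS = {"write_file"}

def make_dispatcher(implementations, dry_run=False):
    """Wrap a tool map so destructive calls become no-ops in dry-run mode."""
    def dispatch(name, args):
        if dry_run and name in DESTRUCTIVE_TOOLS:
            return f"DRY-RUN: skipped {name}({sorted(args)})"
        return implementations[name](args)
    return dispatch

impls = {"read_file": lambda a: "contents", "write_file": lambda a: "written"}
dry = make_dispatcher(impls, dry_run=True)
print(dry("read_file", {"path": "a.json"}))   # read runs for real
print(dry("write_file", {"path": "a.json"}))  # write is skipped in dry-run mode
```

Because the guard lives in the dispatcher, the agent loop itself stays identical in both modes.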

🧭 Layer 2: The Multi-Agent Orchestrator

One agent is a toy. Production throughput comes from running many in parallel, with shared failure handling and intelligent model routing. The orchestrator is where that happens.

python
def run_orchestrator(*, start, limit, workers, model, dry_run, verbose):
    problems = json.loads(list_problems(start=start, limit=limit))
    if not problems:
        return {"total": 0, "success": 0, "failed": 0}

    results = {"total": len(problems), "success": 0, "failed": 0,
               "succeeded": [], "failed_list": []}

    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {
            executor.submit(run_agent, problem,
                            model=model, dry_run=dry_run, verbose=verbose): problem
            for problem in problems
        }

        for future in as_completed(futures):
            problem = futures[future]
            try:
                result = future.result()
                if result["success"]:
                    results["success"] += 1
                    results["succeeded"].append(problem)
                else:
                    results["failed"] += 1
                    results["failed_list"].append({
                        "problem": problem,
                        "error": result.get("message", "agent reported failure")
                    })
            except Exception as e:
                results["failed"] += 1
                results["failed_list"].append({"problem": problem, "error": str(e)})

    return results

orchestrator.py — parallel worker dispatch

Model Routing: Where Economics Becomes Architecture

Not every task deserves the most expensive model. I benchmarked each capability against cost and made routing decisions per workload type:

Model Routing Decision Matrix
Incoming task (from the orchestrator):
Complexity high (reference generation, novel format)? → Claude Opus — ~$0.21/gen, ~100% success.
SLA-critical, retry-friendly cost needed? → Claude Sonnet + retry — ~$0.03/gen × up to 3 retries, 60% first-pass → ~95% final. DEFAULT: the optimal blend.
Otherwise → Claude Haiku — ~$0.005/gen, ~85% success.
Cost is measured per successful output, not per call. The retry layer changes the math.
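The decision matrix can be sketched as a routing function. The task fields (`complexity`, `sla_critical`) and the short model names are placeholder assumptions, not the production config:

```python
# Illustrative routing sketch following the decision matrix; field names and
# model identifiers are placeholders for the real routing table.
def route_model(task: dict) -> str:
    if task.get("complexity") == "high":   # reference generation, novel formats
        return "opus"                      # ~$0.21/gen, ~100% success
    if task.get("sla_critical"):           # retry-friendly cost profile
        return "sonnet"                    # ~$0.03/gen, retried up to 3x (default blend)
    return "haiku"                         # ~$0.005/gen, ~85% success

print(route_model({"complexity": "high"}))   # opus
print(route_model({"sla_critical": True}))   # sonnet
print(route_model({}))                       # haiku
```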
🔑 The optimal blend
Sonnet at $0.03/generation with 60% success + auto-retry became my default. The 40% failure rate sounds bad until you do the math: even with two retries, blended cost stayed under $0.10/output with ~95% final success. Opus would have cost $0.21/generation up front. On 1,600 outputs, that's a $192 savings — not enormous in absolute terms, but it's the discipline that matters. Every production AI system needs this kind of routing from day one, not as an optimization later.

🔌 Layer 3: The Modular Tool SDK

Tools are how LLMs interact with the world. I use a dual-registration pattern: each tool is defined once as an API schema (for the LLM) and once as a runtime implementation (for execution). This separation keeps the LLM's view of the tool decoupled from how it's actually wired up.

python
# Runtime implementations
def read_file(path: str) -> str:
    """Read a file and return its contents."""
    full = (CONTENT_DIR / path).resolve()
    # Enforce sandbox: resolved path must stay within CONTENT_DIR.
    # (A plain str.startswith check would wrongly allow sibling dirs like
    # CONTENT_DIR-evil; is_relative_to compares whole path components.)
    if not full.is_relative_to(CONTENT_DIR.resolve()):
        return f"ERROR: Access denied — path must be under {CONTENT_DIR}"
    try:
        return full.read_text(encoding='utf-8')[:50000]  # Cap at 50K chars
    except FileNotFoundError:
        return f"ERROR: File not found: {path}"

def write_file(path: str, content: str) -> str:
    """Write content to a file (sandboxed)."""
    full = (CONTENT_DIR / path).resolve()
    if not full.is_relative_to(CONTENT_DIR.resolve()):
        return "ERROR: Access denied"
    full.parent.mkdir(parents=True, exist_ok=True)
    full.write_text(content, encoding='utf-8')
    return f"OK: Written {len(content)} chars to {path}"

# API schemas β€” what the LLM sees
TOOL_DEFINITIONS = [
    {
        "name": "read_file",
        "description": "Read a file from the content directory.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative path"}
            },
            "required": ["path"]
        }
    },
    # ... more tools
]

# Runtime dispatch map β€” what actually executes
TOOL_IMPLEMENTATIONS = {
    "read_file": lambda args: read_file(args["path"]),
    "write_file": lambda args: write_file(args["path"], args["content"]),
    "validate_visualization": lambda args: validate_visualization(args["path"]),
}

tools.py — dual-registration pattern

💡 Why dual-registration matters
The LLM's view of a tool (name, description, parameters) drives what it decides to call. The runtime implementation decides what actually happens. Keeping these separate means I can evolve the implementation (add sandboxing, add logging, swap backends) without touching anything the LLM sees. It's the same separation of concerns that REST APIs use: interface vs implementation.
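As a sketch of what zero-touch extension looks like, a single helper could feed both registries at once. The `register_tool` helper and the toy `count_chars` tool are hypothetical, not part of the actual codebase:

```python
# Hypothetical illustration of the dual-registration pattern: one helper adds
# both the schema (the LLM's view) and the implementation (the runtime view).
TOOL_DEFINITIONS = []       # API schemas the LLM sees
TOOL_IMPLEMENTATIONS = {}   # callables the dispatcher executes

def register_tool(schema: dict, impl) -> None:
    """Dual-register a tool; the agent loop and orchestrator never change."""
    TOOL_DEFINITIONS.append(schema)
    TOOL_IMPLEMENTATIONS[schema["name"]] = impl

# Adding a new (toy) tool is a single call against the two registries.
register_tool(
    {"name": "count_chars",
     "description": "Count characters in a string.",
     "input_schema": {"type": "object",
                      "properties": {"text": {"type": "string"}},
                      "required": ["text"]}},
    lambda args: str(len(args["text"])),
)
print(TOOL_IMPLEMENTATIONS["count_chars"]({"text": "abc"}))  # 3
```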

📋 Report Cards: Automated Quality Gates

This is the piece everyone skips when they build AI pipelines, and it's why their pipelines fail in production. Every output must pass a structured validation check before it reaches downstream consumers.

What a Report Card Checks

📝

Schema Compliance

Top-level keys, nested structure, required fields all present.

🔒

Cardinality

Minimum counts enforced: 3+ test cases, 3 languages, 3 flavors.

🌐

Language Coverage

JavaScript, Python, and Java variants all present and parseable.

📝

Content Completeness

Every required field populated — no empty strings, no placeholders.

πŸ›

Known Failure Patterns

Specific checks for bugs we've seen before (valueSlots nesting, etc.).

✨

Content Quality Rubric

LLM-as-judge scoring for helpfulness and pedagogical correctness.

python
def validate_visualization(path: str) -> str:
    with open(path) as f:
        data = json.load(f)
    issues = []

    # Top-level schema check
    for key in ['testCases', 'annotatedCode', 'algorithmMeta', 'thinkingContent']:
        if key not in data:
            issues.append(f"Missing required key: {key}")

    # Cardinality: at least 3 test cases
    tcs = data.get('testCases', [])
    if len(tcs) < 3:
        issues.append(f"Need 3+ test cases, found {len(tcs)}")

    # Per-state structural completeness
    for i, tc in enumerate(tcs):
        for j, state in enumerate(tc.get('states', [])):
            for field in ['step', 'codeLineId', 'phase', 'description',
                          'variables', 'dataStructureState', 'pointers',
                          'annotation', 'calculation']:
                if field not in state:
                    issues.append(f"testCase[{i}].states[{j}] missing '{field}'")

    # Language coverage
    ac = data.get('annotatedCode', {})
    for lang in ['javascript', 'python', 'java']:
        if lang not in ac:
            issues.append(f"Missing annotatedCode.{lang}")
        elif 'valueSlots' not in ac[lang]:
            # Known bug: LLMs sometimes put valueSlots at the wrong nesting level
            issues.append(f"annotatedCode.{lang} missing valueSlots "
                          f"(should be inside lang dict)")

    # Thinking content flavors
    thinking = data.get('thinkingContent', {})
    for flavor in ['technical', 'fun', 'spiritual']:
        if flavor not in thinking:
            issues.append(f"Missing thinkingContent.{flavor}")

    if issues:
        return "VALIDATION FAILED:\n" + "\n".join(f"  • {i}" for i in issues)

    total_states = sum(len(t.get('states', [])) for t in tcs)
    return f"VALID: {len(tcs)} test cases, {total_states} states"

validate_visualization — a slice of the Report Card logic

💬 The payoff
Zero manual review needed for 1,600+ generated solutions. The agent produces output, Report Cards validate, failures trigger automated self-correction, and only persistently failing cases escalate. The reason I could walk away from this pipeline for 30+ hours is that the quality gate was tighter than any human reviewer would have been.

πŸ›‘οΈSafety: Circuit Breakers

Autonomous systems need explicit stop conditions. I built circuit breakers at four levels, each addressing a different failure mode.

🗂️

File System Sandboxing

All I/O tools enforce path containment. The agent literally cannot read or write outside its designated directory.

⏱️

Turn Limits

MAX_TURNS = 20 per problem. Prevents runaway loops where the agent keeps retrying a broken pattern.

🔄

Retry Limits

MAX_RETRIES = 3 per validation failure. Stops the "validate β†’ fix β†’ re-validate β†’ fix" death spiral.

πŸ‘οΈ

Human-in-the-Loop Gates

External actions (publishing, deletion, sending) require explicit human approval. Nothing destructive runs autonomously.
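One way such a gate could be structured: external actions queue for review instead of executing. The `gated` wrapper, `APPROVAL_QUEUE`, and `approve_all` names are illustrative assumptions, not the project's actual code:

```python
# Illustrative human-in-the-loop gate: external actions are queued for review;
# nothing destructive runs without an explicit approval step.
APPROVAL_QUEUE = []

def gated(action_name: str, impl):
    """Wrap an external-facing action behind an approval queue."""
    def call(args):
        APPROVAL_QUEUE.append({"action": action_name, "args": args, "impl": impl})
        return f"PENDING: {action_name} queued for human approval"
    return call

def approve_all():
    """Run by a human reviewer: execute everything that was queued."""
    results = [item["impl"](item["args"]) for item in APPROVAL_QUEUE]
    APPROVAL_QUEUE.clear()
    return results

publish = gated("publish", lambda args: f"published {args['id']}")
print(publish({"id": 42}))   # queued; nothing has run yet
print(approve_all())         # executes only after explicit approval
```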

βš–οΈDesign Decisions and Trade-offs

Every architectural decision is a trade-off. Here are the choices I made and the alternatives I rejected.

Custom Framework vs LangChain / CrewAI / AutoGen

Chose: Custom 300-line Python framework. Rejected: LangChain, CrewAI, AutoGen.

Framework abstractions obscure what's happening, making production debugging painful. I needed explicit control over retry logic, model fallback, cost tracking per request, and per-tenant observability — all of which required working around frameworks rather than through them. For a team new to agents, I'd recommend LangGraph. For my production use case, custom was the right call. The trade-off is velocity versus control, and you earn the right to build custom by first understanding why the frameworks exist.

Threading vs Async/Await

Chose: ThreadPoolExecutor. Rejected: asyncio.

Both work. Threading is simpler when the Anthropic SDK handles I/O blocking internally and each worker runs a sequential agent loop. Async adds complexity (coroutines, event loops, context propagation) without proportional benefit for this workload. If I were building a high-concurrency HTTP server I'd pick async. For a batch pipeline with N independent workers, threading wins on simplicity.

In-Memory State vs Persistent State

Chose: File system as the source of truth. No database. Rejected: Postgres or Redis for agent state.

The content directory (where agents read/write) doubles as the audit log and the resume-from-failure state. After a crash I can see exactly which problems completed, which partially completed, and which never started — by listing files. Adding a database would add operational surface area for no benefit. If this were a multi-tenant production system I'd add a database for coordination; for a single-tenant pipeline, plain files win.
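Resume-from-crash then reduces to a set difference over a directory listing, which a short sketch makes concrete. The `pending_problems` helper and the `.json` naming convention are assumptions for illustration:

```python
from pathlib import Path
import tempfile

# Sketch of file-system-as-checkpoint: a problem is "done" iff its output file
# exists, so resuming after a crash is just a directory listing.
def pending_problems(all_problems, content_dir: Path):
    done = {p.stem for p in content_dir.glob("*.json")}
    return [prob for prob in all_problems if prob not in done]

# Toy run in a temp dir (problem names are illustrative).
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "two-sum.json").write_text("{}")  # completed before the "crash"
    print(pending_problems(["two-sum", "lru-cache"], root))  # ['lru-cache']
```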

✋ Why I Paused Generation

After generating 1,600+ AI-authored solutions and proving the full visualization pipeline on 200+ problems, I stopped generation. Not because it failed — because the infrastructure was proven and continuing would have been a pure token-budget decision with no new learning.

⚠️ Engineering discipline over vanity metrics
The point of the project was always to prove that the framework works. It does. Running it on the full 3,247-problem catalog would have been a scaling exercise, not an architectural one — and the cost would have been meaningful without producing any additional signal. As a self-funded founder, I allocate capital carefully. If I were at a funded startup with a clear revenue signal from the content, I'd keep going. Knowing when to stop is part of the discipline.

📊 Production Metrics

Numbers from real runs, rounded for clarity, measured at the pipeline level (not cherry-picked from best runs):

Throughput
~80/hr
AI-authored solutions at 10 parallel workers
First-Pass Success Rate
~60%
Sonnet, before auto-retry layer
After Retry Layer
~95%
With up to 3 auto-retries on validation failure
Cost per Output
~$0.05
Blended cost including retries
Longest Autonomous Run
30+ hrs
Across provider quota cycles, zero human intervention
Manual Review Required
0
Report Cards caught and auto-fixed structural issues
🎯

Leadership Takeaway

What this removes for a team: the bottleneck of human review on high-volume AI-generated output. With Report Cards and sandboxed tool execution, agents can run overnight without someone watching them — which means the team's time goes to the work AI can't do yet, not to babysitting the work AI can.

How it scales beyond a solo build: every layer of this architecture is reusable across products. The same orchestrator powered both live products at Zen Algorithms — CosmicKeys and WatchAlgo — plus an MVP build of an AI-first professional network (ZoomedIn.us), demonstrating reusability across fundamentally different domains. Within a company, the same pattern would let one platform team serve many product teams — each one writing its own tool definitions and quality criteria while sharing the orchestration core. That's how an AI platform multiplies across an engineering organization instead of being duplicated per team.

The leadership insight: AI-native development isn't about one person being 10x faster. It's about building reusable agentic infrastructure that makes every engineer on a team 10x faster at the work that matters. The AI Factory is the shape that infrastructure takes when you design it for production, not for demos.