Back to AI/ML Overview
📊 AI-Native Content Platform

WatchAlgo: Spec-Driven AI Content Generation at Scale

An algorithm learning platform with 3,247 problems defined, 1,600+ AI-authored solutions generated, and a full RAG-based content pipeline with Report Card validation. A stress-test case study for the AI Factory — proving spec-driven AI-native development produces production-grade output.

3,247
problems defined
1,600+
AI-authored solutions
3 langs
JS / Python / Java
3 flavors
tech / fun / spiritual
200+
problems with end-to-end visualization proof
12+
Report Card criteria

🎯 The Thesis

WatchAlgo is the proof that spec-driven AI-native development outperforms every lower tier of AI-assisted coding. It's not "a typing app with a few AI features." It's a complete learning platform whose content — visualizations, solutions, explanations, multi-flavor narratives — was generated, validated, and self-corrected by AI agents running under an architecture I designed specifically to prove that this kind of output is possible at production quality.

🔑 The specific thing WatchAlgo proves
A single developer, using a well-architected agentic framework, can produce educational content across 3,247 algorithm problems × 3 programming languages × 3 content flavors — with schema consistency, pedagogical accuracy, and zero manual review — in weeks instead of years. The platform is the artifact; the methodology is the lesson.

πŸ—‚οΈThe Content Taxonomy

Before I wrote a single agent, I defined the shape of the content. Every piece of output had to conform to this structure, and the structure became the contract that validation enforced.

Content Structure per Problem
• Problem Definition: problem.json (shared input across all variants)
• Language variants: JavaScript, Python, and Java, each with an annotated solution
• Flavor variants: technical, fun, and spiritual narratives

Per-flavor, per-language structured output:
• Annotated solution code
• Step-by-step visualization (9 fields per state)
• Test cases (3+ required)
• Explanation / reasoning
• Complexity analysis
💡 Why three flavors?
Different learners engage with different narrative framings. The "technical" flavor reads like a clean engineering explanation. The "fun" flavor uses analogies and humor. The "spiritual" flavor connects the algorithm to broader principles — patience, persistence, seeing structure. Same algorithm, three narrative paths, one for each learner's brain.
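The 3 × 3 content grid above can be sketched in code. This is a minimal, hypothetical illustration of the per-problem contract; the function name and field names here are illustrative stand-ins, not WatchAlgo's actual schema.

```python
from itertools import product

# Illustrative labels matching the article's taxonomy.
LANGUAGES = ["javascript", "python", "java"]
FLAVORS = ["technical", "fun", "spiritual"]

def empty_content_shell(problem_id: str) -> dict:
    """Build the 3 x 3 grid of variants every problem must fill."""
    return {
        "problemId": problem_id,
        "variants": {
            f"{lang}/{flavor}": {
                "annotatedCode": None,   # annotated solution code
                "visualization": [],     # step-by-step states (9 fields each)
                "testCases": [],         # 3+ required
                "explanation": None,
                "complexity": None,
            }
            for lang, flavor in product(LANGUAGES, FLAVORS)
        },
    }

shell = empty_content_shell("wa-000042")
print(len(shell["variants"]))  # 9 variants: 3 languages x 3 flavors
```

Every generated file either fills one cell of this grid or fails validation, which is what makes the structure a contract rather than a convention.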

πŸ—οΈThe Generation Pipeline

The heart of WatchAlgo is a multi-stage content pipeline. Each problem flows through five stages, each of which can fail, retry, and self-correct.

Per-Problem Generation Flow
1. READ SPEC: load problem.json + metadata (constraints, examples, schema)
2. RETRIEVE REFERENCES: fetch Golden Examples (RAG), with wa-000001 as the canonical pattern
3. GENERATE: agent loop (LLM + tool use) producing structured JSON output
4. VALIDATE: Report Card with 12+ criteria (schema, cardinality, coverage); PASS moves on, FAIL auto-retries while under MAX_RETRIES
5. STORE: persist to the content directory, zero manual review

Self-correction loop: failures feed back into generation with explicit error context.

Stage 1: Read Spec

Load the problem definition from problem.json — includes the problem statement, constraints, examples, and the desired output shape for each language and flavor.

Stage 2: Retrieve References (RAG)

Fetch Golden Examples — reference visualizations from wa-000001/ — as in-context examples. This is classic RAG: retrieve high-quality prior work, inject it into the prompt, let the model use it as a pattern for the new output.
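The retrieval step can be sketched as a pure prompt-building function. This is a hedged illustration: the function name and prompt wording are assumptions, not WatchAlgo's real template, and in the real pipeline the golden example would be loaded from wa-000001/visualization.json.

```python
import json

def build_prompt(problem_spec: dict, golden_example: dict) -> str:
    """Inject a Golden Example as an in-context pattern (classic RAG).
    Prompt wording is an illustrative sketch, not the real template."""
    return (
        "You generate algorithm visualizations as structured JSON.\n"
        "Here is a canonical reference output; imitate its structure exactly:\n"
        f"{json.dumps(golden_example, indent=2)}\n\n"
        "Now produce the same structure for this new problem:\n"
        f"{json.dumps(problem_spec, indent=2)}"
    )
```

The key property is that the reference is injected on every call, so improving the one golden file improves every future generation with no retraining.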

Stage 3: Generate

Agent runs its loop with the spec, references, and tool access. It generates structured output matching the expected schema, typically in one or two turns.

Stage 4: Validate (Report Card)

The Report Card runs 12+ checks on the generated output: schema compliance, cardinality, language coverage, structural completeness, and known failure patterns. If any check fails, the pipeline loops back to Stage 3 with an error-aware retry.
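Stages 3 and 4 together form a simple control loop. A minimal sketch, assuming `generate` and `report_card` stand in for the real agent call and validator (both hypothetical names):

```python
MAX_RETRIES = 3  # retry budget per the article's pipeline

def generate_until_valid(spec, generate, report_card, max_retries=MAX_RETRIES):
    """Run generate -> validate, feeding failures back until PASS."""
    errors: list[str] = []
    for _ in range(max_retries + 1):
        candidate = generate(spec, errors)  # prior errors shape the retry prompt
        errors = report_card(candidate)     # empty list means PASS
        if not errors:
            return candidate                # -> Stage 5: store
    raise RuntimeError(f"exhausted retries; last errors: {errors}")
```

Because the validator returns a list of specific error strings rather than a boolean, the same return value drives both the pass/fail branch and the retry prompt.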

📚 RAG with Golden Examples

Every generated output looks back at a canonical reference. This is the simplest, highest-leverage RAG pattern I know: pick a small number of exemplary outputs and use them as in-context examples for every future generation. The model learns by imitation.

Why Golden Examples Beat Fine-Tuning for This Use Case

💰

Zero Training Cost

Fine-tuning would have cost time, API credits, and iteration cycles. RAG with Golden Examples works instantly with no preparation.

🔄

Instant Iteration

To improve output quality, I update the Golden Example and re-run. Fine-tuning would require a new training run for each change.

🎛️

Per-Problem Context

Different problem types can pull different Golden Examples. Fine-tuning produces one global model; RAG produces per-problem adaptation.

🧭

Auditability

I can inspect exactly what context the model saw. Fine-tuned behavior is opaque; in-context learning is transparent.

🔑 The Golden Example I use
wa-000001/visualization.json is the canonical reference. I spent disproportionate time making it excellent — schema-clean, structurally complete, pedagogically strong. Every subsequent generation reads it first, which means the quality of all 1,600+ outputs is anchored to the quality of that one reference. Investing heavily in one Golden Example paid back across the entire pipeline.

📋 The Report Card — 12+ Validation Criteria

This is the piece that turns "AI generated some output" into "AI generated verified production output." Every generated file must pass a structured Report Card before it reaches storage.

Schema Compliance (5 checks)

Top-level keys present. testCases, annotatedCode, algorithmMeta, thinkingContent, and nested structure per the content taxonomy.

Cardinality (2 checks)

At least 3 test cases per problem. At least N states per visualization. Prevents "one token and done" outputs that technically parse but have no substance.

Language Coverage (3 checks)

JavaScript, Python, and Java variants all present in annotatedCode. Each language must include its valueSlots field in the correct nesting position — dropping it is a known LLM failure mode that bypasses basic schema checks.

Content Structural Completeness (3 checks)

Every visualization state must include all nine required fields: step, codeLineId, phase, description, variables, dataStructureState, pointers, annotation, calculation. Missing any of these would break downstream rendering.

Content Flavor Coverage (1 check)

All three narrative flavors (technical / fun / spiritual) must be present with non-empty content.
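A few of the criteria above can be sketched as a validator function. This is an illustrative subset, not the full 12+ checks; the field names follow the article, but the function shape and exact error strings are assumptions.

```python
# Nine required fields per visualization state, per the article.
REQUIRED_STATE_FIELDS = {
    "step", "codeLineId", "phase", "description", "variables",
    "dataStructureState", "pointers", "annotation", "calculation",
}

def report_card(output: dict) -> list[str]:
    """Return a machine-readable list of failures (empty list = PASS)."""
    errors = []
    # Schema compliance: top-level keys present.
    for key in ("testCases", "annotatedCode", "algorithmMeta", "thinkingContent"):
        if key not in output:
            errors.append(f"missing top-level key '{key}'")
    # Cardinality: at least 3 test cases.
    if len(output.get("testCases", [])) < 3:
        errors.append("fewer than 3 test cases")
    # Language coverage: all three variants present.
    for lang in ("javascript", "python", "java"):
        if lang not in output.get("annotatedCode", {}):
            errors.append(f"annotatedCode missing language '{lang}'")
    # Structural completeness: all nine fields on every state.
    for i, tc in enumerate(output.get("testCases", [])):
        for j, state in enumerate(tc.get("states", [])):
            for field in sorted(REQUIRED_STATE_FIELDS - state.keys()):
                errors.append(
                    f"testCases[{i}].states[{j}] missing required field '{field}'"
                )
    return errors
```

Returning specific strings instead of a boolean is what makes the later self-correction loop possible: each string names exactly what to fix.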

🧭 Model Routing: Cost-Quality Tradeoffs

I benchmarked Claude Sonnet and Opus against the Report Card and made a deliberate routing decision based on cost-per-successful-output, not cost-per-call.

Model         | Cost/call | First-pass success | Effective cost
Claude Sonnet | ~$0.03    | ~60%               | ~$0.05 (with retries)
Claude Opus   | ~$0.21    | ~100%              | ~$0.21
💡 The routing decision
Sonnet-plus-retry became the default for bulk generation. Opus is far more reliable per call, but at roughly 7× Sonnet's per-call price; the blended cost of Sonnet with auto-retry (~$0.05) stayed well under Opus's per-call cost. For the Golden Example itself — where quality matters enormously and there's no retry fallback — I used Opus. Different models for different jobs, routed by the orchestrator automatically.
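The blended-cost arithmetic behind that decision can be made explicit. This sketch assumes retries succeed independently at the same per-call rate, which is a simplification (error-aware retries do better in practice); the numbers are the benchmarks from the table above.

```python
def effective_cost(cost_per_call: float, first_pass: float,
                   max_retries: int = 3) -> float:
    """Expected cost per output: sum cost over attempts, weighted by
    the probability each attempt is actually reached."""
    expected_calls = 0.0
    p_reach = 1.0  # probability we reach this attempt
    for _ in range(max_retries + 1):
        expected_calls += p_reach
        p_reach *= (1 - first_pass)  # only failures trigger the next attempt
    return cost_per_call * expected_calls

sonnet = effective_cost(0.03, 0.60)  # ~$0.05 with retries
opus = effective_cost(0.21, 1.00)    # $0.21, no retries ever needed
```

With Sonnet, the expected call count is 1 + 0.4 + 0.16 + 0.064 ≈ 1.62 calls, so the blended cost lands near $0.05 per successful output, still well under a single Opus call.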

πŸ”Self-Correction: Learning From Report Card Failures

When a Report Card check fails, the pipeline doesn't throw the output away. It feeds the specific failure back to the agent and asks it to fix the problem. This is how success rates climb from ~60% to ~95% without changing models.

Self-Correction Flow (on validation failure)
1. Generated Output: the agent produces a candidate structured JSON.
2. Report Card Validation: 12+ checks run in parallel. On pass, STORE and exit; on fail, continue.
3. Extract Specific Failures as Text: a machine-readable error list, e.g., "testCases[2].states[4] missing required field 'calculation'".
4. Re-prompt Agent with Error Context: "Your output had these issues: ... Fix and retry." The agent now has exact targeting for the regeneration.
5. Agent Regenerates with Error Focus: the Report Card runs again, up to MAX_RETRIES. ~60% first-pass success climbs to ~95% after the retry layer.
🔑 Why specific errors beat generic ones
The difference between "validation failed" and "testCases[2].states[4] missing required field 'calculation'" is enormous. Specific errors tell the agent exactly what to fix. Generic errors lead to speculative regeneration that often re-introduces the same bug. The pipeline surfaces exact, machine-readable failure messages — which is another reason for the dual-registration tool pattern: the validator function owns both the error format and the regeneration prompt.
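The error-aware re-prompt step can be sketched in a few lines. The function name and prompt wording here are illustrative assumptions, not the pipeline's actual template:

```python
def retry_prompt(errors: list[str]) -> str:
    """Turn the Report Card's machine-readable failures into a
    targeted regeneration prompt (illustrative wording)."""
    bullet_list = "\n".join(f"- {e}" for e in errors)
    return (
        "Your previous output failed validation with these specific issues:\n"
        f"{bullet_list}\n"
        "Regenerate the complete JSON, fixing exactly these problems "
        "and changing nothing else."
    )
```

Because the validator owns the error format, this prompt builder never has to parse or guess; it forwards the failures verbatim, which is what gives the retry its precision.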

✋ Why I Stopped at 1,600+ Solutions

After generating 1,600+ solutions and proving the full visualization pipeline on 200+ problems, I stopped. The framework is proven. The output is live. And continuing would be pure token-budget spending without new architectural learning.

⚠️ Knowing when to stop is part of the discipline
The point of WatchAlgo was never to ship 3,247 live visualizations. The point was to prove that a spec-driven, multi-agent pipeline could produce production-quality content at scale. It does. Running it on the full catalog would be a scaling exercise — more throughput, more cost, no new lessons. A disciplined CTO stops when the hypothesis is validated, not when the runway is exhausted.

If WatchAlgo were inside a funded company with clear revenue signals from the content, I'd keep going. The pipeline is ready to scale the moment that signal appears. That readiness — not the current output volume — is what the architecture is designed to prove.

🎯

Leadership Takeaway

What this removes for a content team: WatchAlgo demonstrates that the content pipeline bottleneck — the reason content teams always miss deadlines — can be broken with the right agentic architecture. One engineer can produce what used to require a dedicated content team, provided the validation layer is tight and the reference examples are curated.

The generalizable pattern: any domain where content has a well-defined schema, a clear quality rubric, and canonical reference examples can be automated this way. Educational content, product descriptions, documentation, API references, compliance reports β€” the list of use cases is enormous. The methodology transfers; only the tools and Report Cards change.

The leadership insight: the hardest part of AI-native content generation isn't the generation step. It's the validation step. Teams that skip building Report Cards produce unusable output at scale and conclude "AI isn't ready for production." Teams that invest in the validation layer produce reliable output and ship real products. The difference is architectural, not technological.