An algorithm learning platform with 3,247 problems defined, 1,600+ AI-authored solutions generated, and a full RAG-based content pipeline with Report Card validation. A stress-test case study for the AI Factory, proving that spec-driven AI-native development produces production-grade output.
WatchAlgo is the proof that spec-driven AI-native development outperforms every lower tier of AI-assisted coding. It's not "a typing app with a few AI features." It's a complete learning platform whose content (visualizations, solutions, explanations, multi-flavor narratives) was generated, validated, and self-corrected by AI agents running under an architecture I designed specifically to prove that this kind of output is possible at production quality.
Before I wrote a single agent, I defined the shape of the content. Every piece of output had to conform to this structure, and the structure became the contract that validation enforced.
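A minimal skeleton of what that contract looks like, assembled from the fields the validation checks below enumerate. The exact nesting shown here is an assumption for illustration; the field names come from the Report Card checks described later:

```python
# Skeleton of the content contract every generated output must satisfy.
# Top-level keys, per-language code blocks, the nine visualization-state
# fields, and the three narrative flavors are taken from the Report Card
# checks; the precise nesting is illustrative, not the real schema file.
CONTENT_CONTRACT = {
    "testCases": [],                    # at least 3 required per problem
    "annotatedCode": {
        "javascript": {"valueSlots": []},
        "python": {"valueSlots": []},
        "java": {"valueSlots": []},
    },
    "algorithmMeta": {},
    "thinkingContent": {},
    "visualization": {
        "states": [{                    # every state needs all nine fields
            "step": 1, "codeLineId": "", "phase": "", "description": "",
            "variables": {}, "dataStructureState": {}, "pointers": {},
            "annotation": "", "calculation": "",
        }],
    },
    "narratives": {"technical": "", "fun": "", "spiritual": ""},
}
```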
The heart of WatchAlgo is a multi-stage content pipeline. Each problem flows through five stages, each of which can fail, retry, and self-correct.
Load the problem definition from problem.json, which includes the problem statement, constraints, examples, and the desired output shape for each language and flavor.
Fetch Golden Examples (reference visualizations from wa-000001/) as in-context examples. This is classic RAG: retrieve high-quality prior work, inject it into the prompt, let the model use it as a pattern for the new output.
Agent runs its loop with the spec, references, and tool access. It generates structured output matching the expected schema, typically in one or two turns.
The Report Card runs 12+ checks on the generated output. Schema compliance, cardinality, language coverage, structural completeness, known failure patterns. If any check fails, the pipeline loops back to Stage 3 with an error-aware retry.
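The stage flow can be sketched as a single retry loop. This is illustrative, not the actual pipeline code: the function names are placeholders, and the agent call, Report Card, and storage write are injected as callables so the sketch stays self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    """Result of a Report Card run: empty failure list means pass."""
    failures: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures

def run_pipeline(problem, generate, validate, store, max_retries=3):
    """Stages 3-5 with error-aware retry.

    `generate(problem, error_context)` is the agent call, `validate(output)`
    the Report Card, `store(problem, output)` the final write. On failure,
    the specific check failures are fed back into the next generation.
    """
    error_context = None
    for _ in range(max_retries + 1):
        output = generate(problem, error_context)   # Stage 3: agent loop
        report = validate(output)                   # Stage 4: Report Card
        if report.passed:
            store(problem, output)                  # Stage 5: verified output only
            return output
        error_context = report.failures             # loop back with the errors
    raise RuntimeError(f"exhausted retries: {error_context}")
```

The key design choice is that storage only ever sees output that has passed validation; a failed check never silently reaches the content store.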
Every generated output looks back at a canonical reference. This is the simplest, highest-leverage RAG pattern I know: pick a small number of exemplary outputs and use them as in-context examples for every future generation. The model learns by imitation.
Fine-tuning would have cost time, API credits, and iteration cycles. RAG with Golden Examples works immediately, with no training run to prepare.
To improve output quality, I update the Golden Example and re-run. Fine-tuning would require a new training run for each change.
Different problem types can pull different Golden Examples. Fine-tuning produces one global model; RAG produces per-problem adaptation.
I can inspect exactly what context the model saw. Fine-tuned behavior is opaque; in-context learning is transparent.
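The prompt-assembly side of this pattern is small. A minimal sketch, assuming the Golden Example has already been loaded as a dict (the prompt wording and function name are illustrative):

```python
import json

def build_prompt(problem: dict, golden: dict) -> str:
    """In-context RAG: inject the canonical Golden Example verbatim,
    then the new problem spec. The model imitates the reference's
    structure when producing the new output."""
    return (
        "Here is a reference visualization that passed all quality checks:\n"
        + json.dumps(golden, indent=2)
        + "\n\nProduce a visualization with the same structure for this problem:\n"
        + json.dumps(problem, indent=2)
    )
```

Because the reference is injected as plain text, the full context the model saw can be logged and inspected per generation, which is exactly the transparency advantage over fine-tuning.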
wa-000001/visualization.json is the canonical reference. I spent disproportionate time making it excellent: schema-clean, structurally complete, pedagogically strong. Every subsequent generation reads it first, which means the quality of all 1,600+ outputs is anchored to the quality of that one reference. Investing heavily in one Golden Example paid back across the entire pipeline.

This is the piece that turns "AI generated some output" into "AI generated verified production output." Every generated file must pass a structured Report Card before it reaches storage.
Top-level keys present. testCases, annotatedCode, algorithmMeta, thinkingContent, and nested structure per the content taxonomy.
At least 3 test cases per problem. At least N states per visualization. Prevents "one token and done" outputs that technically parse but have no substance.
JavaScript, Python, and Java variants all present in annotatedCode. Each language must include its valueSlots field in the correct nesting position; misplacing it is a known LLM failure mode that slips past basic schema checks.
Every visualization state must include all nine required fields: step, codeLineId, phase, description, variables, dataStructureState, pointers, annotation, calculation. Missing any of these would break downstream rendering.
All three narrative flavors (technical / fun / spiritual) must be present with non-empty content.
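The checks above translate directly into code. A condensed sketch covering the checks just described (the real Report Card runs 12+; the nesting of the visualization states and narratives is an assumption for illustration):

```python
REQUIRED_STATE_FIELDS = {
    "step", "codeLineId", "phase", "description", "variables",
    "dataStructureState", "pointers", "annotation", "calculation",
}

def report_card(output: dict, min_test_cases: int = 3) -> list:
    """Return a list of failure messages; an empty list means pass."""
    failures = []
    # Top-level keys present.
    for key in ("testCases", "annotatedCode", "algorithmMeta", "thinkingContent"):
        if key not in output:
            failures.append(f"missing top-level key: {key}")
    # Cardinality: enough test cases to have substance.
    if len(output.get("testCases", [])) < min_test_cases:
        failures.append(f"fewer than {min_test_cases} test cases")
    # Language coverage, including the valueSlots nesting check.
    for lang in ("javascript", "python", "java"):
        block = output.get("annotatedCode", {}).get(lang)
        if block is None:
            failures.append(f"missing language: {lang}")
        elif "valueSlots" not in block:
            failures.append(f"{lang}: valueSlots missing or misplaced")
    # All nine fields in every visualization state.
    for i, state in enumerate(output.get("visualization", {}).get("states", [])):
        missing = REQUIRED_STATE_FIELDS - state.keys()
        if missing:
            failures.append(f"state {i}: missing {sorted(missing)}")
    # All three narrative flavors, non-empty.
    for flavor in ("technical", "fun", "spiritual"):
        if not output.get("narratives", {}).get(flavor):
            failures.append(f"empty or missing flavor: {flavor}")
    return failures
```

Returning structured failure messages, rather than a bare pass/fail, is what makes the error-aware retry possible: each message can be fed straight back into the next prompt.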
I benchmarked Claude Sonnet and Opus against the Report Card and made a deliberate routing decision based on cost-per-successful-output, not cost-per-call.
| Model | Cost/call | First-pass success | Effective cost |
|---|---|---|---|
| Claude Sonnet | ~$0.03 | ~60% | ~$0.05 (with retries) |
| Claude Opus | ~$0.21 | ~100% | ~$0.21 |
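The "effective cost" column is just cost-per-call divided by first-pass success rate: with independent retries, the expected number of calls until a pass is 1/p, and every attempt costs the same. A one-line check of the table's arithmetic:

```python
def effective_cost(cost_per_call: float, success_rate: float) -> float:
    """Expected spend per *accepted* output, assuming each retry is an
    independent attempt with the same per-call cost and success rate."""
    return cost_per_call / success_rate

sonnet = effective_cost(0.03, 0.60)  # ~$0.05 per accepted output
opus = effective_cost(0.21, 1.00)    # $0.21 per accepted output
```

This is why routing on cost-per-call alone is misleading: a cheap model with a low pass rate quietly pays for its own retries.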
When a Report Card check fails, the pipeline doesn't throw the output away. It feeds the specific failure back to the agent and asks it to fix the problem. This is how success rates climb from ~60% to ~95% without changing models.
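The feedback step itself is just prompt construction: the failed checks are appended to the original prompt so the model fixes specific problems instead of regenerating blind. A sketch (function name and wording are illustrative):

```python
def retry_prompt(original_prompt: str, failures: list) -> str:
    """Error-aware retry: tell the model exactly which Report Card
    checks failed and ask for a targeted fix of the full output."""
    bullet_list = "\n".join(f"- {f}" for f in failures)
    return (
        original_prompt
        + "\n\nYour previous output failed these validation checks:\n"
        + bullet_list
        + "\nFix only these issues and return the full corrected output."
    )
```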
After generating 1,600+ solutions and proving the full visualization pipeline on 200+ problems, I stopped. The framework is proven. The output is live. And continuing would be pure token-budget spending without new architectural learning.
If WatchAlgo were inside a funded company with clear revenue signals from the content, I'd keep going. The pipeline is ready to scale the moment that signal appears. That readiness, not the current output volume, is what the architecture is designed to prove.
What this removes for a content team: WatchAlgo demonstrates that the content pipeline bottleneck (a main reason content teams miss deadlines) can be broken with the right agentic architecture. One engineer can produce what used to require a dedicated content team, provided the validation layer is tight and the reference examples are curated.
The generalizable pattern: any domain where content has a well-defined schema, a clear quality rubric, and canonical reference examples can be automated this way. Educational content, product descriptions, documentation, API references, compliance reports: the list of use cases is enormous. The methodology transfers; only the tools and Report Cards change.
The leadership insight: the hardest part of AI-native content generation isn't the generation step. It's the validation step. Teams that skip building Report Cards produce unusable output at scale and conclude "AI isn't ready for production." Teams that invest in the validation layer produce reliable output and ship real products. The difference is architectural, not technological.