12 Feb 2025 3 min read Updated 2 Mar 2025

How we cut exam creation from days to 30 minutes using AI

The approach, what broke, and what I would do differently when building a multi-provider AI pipeline in production.

Tags: ai, nextjs, engineering

The problem was not “use AI.” The problem was that manual exam creation took multiple days per test and the content team could not scale output without burning time on repetitive formatting, validation, and rewriting work.


Where the process actually broke

The raw pain point was obvious once I looked at the workflow end to end:

  1. choose the exam structure
  2. draft questions manually
  3. normalize difficulty and topic coverage
  4. review formatting inconsistencies
  5. rebuild the final paper into something the platform could ingest

Every handoff introduced delay. Even when the content quality was strong, the pipeline around it was slow.

The system we built

I built a multi-provider AI pipeline that treated model calls as one step in a larger production workflow, not the workflow itself. The system coordinated:

  • prompt templates scoped to section and difficulty
  • provider fallback across OpenAI, Anthropic, and Groq
  • validation layers before content was accepted
  • structured outputs that mapped cleanly to the platform schema
  • human review where quality still needed a tighter loop
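To make the validation-layer idea concrete, here is a minimal sketch of a type guard that rejects malformed questions before they enter the pipeline. The `ExamQuestion` shape and field names are illustrative, not the actual platform schema:

```typescript
// Hypothetical shape of a generated exam question; field names are
// illustrative, not the real platform schema.
interface ExamQuestion {
  prompt: string;
  options: string[];
  answerIndex: number;
  difficulty: "easy" | "medium" | "hard";
}

// Reject anything that does not match the expected shape before it is
// accepted into the pipeline.
function isValidQuestion(value: unknown): value is ExamQuestion {
  if (typeof value !== "object" || value === null) return false;
  const q = value as Record<string, unknown>;
  return (
    typeof q.prompt === "string" &&
    q.prompt.length > 0 &&
    Array.isArray(q.options) &&
    q.options.length >= 2 &&
    q.options.every((o) => typeof o === "string") &&
    typeof q.answerIndex === "number" &&
    Number.isInteger(q.answerIndex) &&
    q.answerIndex >= 0 &&
    q.answerIndex < (q.options as string[]).length &&
    (q.difficulty === "easy" ||
      q.difficulty === "medium" ||
      q.difficulty === "hard")
  );
}
```

A hand-rolled guard like this is deliberately boring; the point is that nothing reaches the platform schema without passing an explicit check.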

The most important design decision was keeping provider-specific logic at the edges. That let me switch models without rewriting the entire generation pipeline.
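A rough sketch of that boundary, assuming a hypothetical `Provider` interface (the real OpenAI, Anthropic, and Groq SDK calls are omitted and replaced with placeholders):

```typescript
// Each provider implements the same narrow interface; everything
// provider-specific (auth, request shape, response parsing) lives inside it.
interface Provider {
  name: string;
  generate(prompt: string): Promise<string>;
}

// Hypothetical adapters: a real implementation would call the provider
// SDK here instead of returning a placeholder string.
const openaiProvider: Provider = {
  name: "openai",
  generate: async (prompt) => `openai:${prompt}`,
};

const anthropicProvider: Provider = {
  name: "anthropic",
  generate: async (prompt) => `anthropic:${prompt}`,
};

// The core pipeline only ever sees the interface, so swapping models
// never touches the generation logic.
async function generateSection(
  provider: Provider,
  prompt: string,
): Promise<string> {
  return provider.generate(prompt);
}
```

The design choice is the classic adapter pattern: the pipeline depends on the interface, and each adapter absorbs one provider's quirks.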

Reliability mattered more than novelty

The temptation with AI systems is to optimise for the most impressive demo. In production, the real win is consistent output under constraints. A boring but dependable pipeline beats a spectacular one that breaks under load or drifts in quality.

Fallbacks were mandatory

Providers fail differently. Some time out, some over-explain, some quietly ignore structure. Having multiple providers only helps if the rest of the pipeline is designed to tolerate that variability.

```typescript
// Try each provider task in order and return the first success.
export async function generateWithFallback(tasks: Array<() => Promise<string>>) {
  for (const task of tasks) {
    try {
      return await task();
    } catch {
      continue; // this provider failed; fall through to the next one
    }
  }

  throw new Error("All providers failed");
}
```

That pattern by itself is simple. The harder part is attaching evaluation, schema validation, retries, and operator visibility around it.
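As one hedged sketch of that wrapping, a single provider call can be given a validation predicate and a bounded retry count before the fallback loop ever sees it. The `validate` predicate and `maxRetries` knob are illustrative, not the production configuration:

```typescript
// Wrap one provider task with output validation and bounded retries.
// A result that parses but fails validation is treated the same as a
// thrown error: retry, then give up.
async function generateValidated(
  task: () => Promise<string>,
  validate: (output: string) => boolean,
  maxRetries = 2,
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const output = await task();
      if (validate(output)) return output;
      lastError = new Error("Output failed validation");
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Composing this with the fallback loop above gives per-provider retries inside cross-provider fallback, which is roughly the shape the pipeline needs.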

What broke

A few things failed in predictable ways:

  • prompts that worked on one provider degraded badly on another
  • structured output assumptions collapsed when question complexity increased
  • retries amplified latency if validation rules were too loose
  • manual review still got blocked if generated content was hard to scan quickly
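The retry-latency problem in particular has a simple mitigation: give every attempt a hard deadline so retries cannot stack unbounded wait time. A minimal sketch, with an illustrative timeout value rather than the real one:

```typescript
// Race a promise against a timer; reject if the deadline passes first.
function withDeadline<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

Wrapping each provider task in `withDeadline` turns a slow provider into a fast failure, which the fallback loop already knows how to handle.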

The lesson was that model quality was only one variable. Presentation, validation, and failure handling mattered just as much.

What I would do differently

If I rebuilt the pipeline now, I would push even harder on:

  • explicit schema guarantees at every stage
  • narrower prompts with clearer section ownership
  • evaluation datasets earlier in the process
  • a stronger operator dashboard for failure inspection

The headline metric is easy to quote: exam creation dropped from multiple days to under 30 minutes. But the real engineering value came from turning a fragile manual process into a repeatable system the team could trust.

Written by

Himanshu Agarwal

Founding engineer based in Bengaluru, building product systems that connect web, mobile, backend, and AI without handoff gaps.

