TL;DR: Pasting a hard bug into one AI prompt feels productive until it isn’t. Single-model inference hits a ceiling fast; if the model misses the root cause on the first pass, it will cheerfully validate its own wrong answer forever. One way out is to act as human middleware between multiple, architecturally different LLMs: generate parallel hypotheses, swap their outputs for cross-review, and force them to argue until the overlapping signal emerges. It’s more labor than a single chat window, but it’s the difference between a confident hallucination and a fix that actually ships.
There was a ticket sitting in the queue for a throughput chart that kept breaking. The customer’s description was vague enough that our support person couldn’t fully parse it at first: something about the graph going haywire every thirty seconds, metrics scrambling, then correcting itself. A coworker eventually spotted intermittent NaN errors in the console, which gave us a thread to pull. But I didn’t know this codebase, and nobody else had picked the ticket up.
I’d been doing some prompt engineering work on the side and wanted to test how far I could push AI-assisted debugging on a real problem. So I grabbed the ticket. A single-model prompt gave me a reasonable hypothesis, but I wasn’t confident enough in it to start cutting code in a codebase I’d never touched. The bug didn’t really crack until I started running a loop between multiple LLMs; that’s when the hypothesis sharpened into something I could trust. That’s the process worth sharing.
Why a single model isn’t enough
The instinct is reasonable: paste the error, paste some code, ask the model what’s wrong. And for a lot of bugs, that works fine. But intermittent failures in unfamiliar code are a different species. The model doesn’t know the codebase any better than you do, and if it latches onto the wrong causal theory in its first pass, every follow-up response reinforces the mistake. Researchers call this self-anchoring, where the model’s confidence drifts upward even as its accuracy stays flat. It’s an echo chamber of one.
Homogeneous scaling doesn’t help either. Running three instances of the same model and asking them to debate produces redundant reasoning and artificial consensus. The errors are correlated because the training data and architecture are the same. You get three copies of the same blind spot.
The way past the plateau is architectural diversity. Models trained on different data, with different reinforcement pipelines, produce uncorrelated error distributions. A hallucination or logical miss by one model is structurally likely to be caught by another. But the models can’t coordinate on their own; you have to do that part.
Step 1: Gather your clues before you prompt anything
Resist the urge to open a chat window immediately. You need a baseline, something observable, reproducible, and specific enough that any model can reason about it without guessing.
For my throughput bug, the baseline was:
- A temporal pattern: the chart broke on a strict 30-second cycle. It fractured around second 17 and restored around second 47.
- A console artifact: intermittent NaN values appearing in the data pipeline.
- A visual symptom: SVG path elements rendering with corrupted coordinates.
That’s three concrete anchors. Without them, you’re asking the model to theorize in the dark, and it will happily oblige with something plausible and wrong.
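Cheap instrumentation is often enough to turn a vague report into those anchors. Here’s a minimal sketch of the kind of console logging that would pin down both the NaN artifact and the timing pattern; `Point` and the function names are my own stand-ins, not the real codebase’s API:

```typescript
type Point = { t: number; value: number };

// Return the points a chart cannot plot (NaN, Infinity).
function findBadPoints(points: Point[]): Point[] {
  return points.filter((p) => !Number.isFinite(p.value));
}

// Log non-finite values with a wall-clock timestamp, so a strict
// 30-second cycle shows up as evenly spaced warnings in the console.
function logBadPoints(points: Point[]): void {
  const bad = findBadPoints(points);
  if (bad.length > 0) {
    console.warn(`[${new Date().toISOString()}] ${bad.length} non-finite point(s)`, bad);
  }
}
```

Dropping something like this at the chart’s data-ingestion point gives you timestamps you can line up against the 30-second cycle before you ever open a chat window.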
Step 2: Generate parallel hypotheses with different models
Feed the same clue set (symptoms, console output, relevant code files) to two or more architecturally distinct models. I used Claude and Codex. The point is not to get the same answer twice; it’s to get different answers.
Both models identified the NaN propagation as the core failure, but they diverged hard on where to assign blame. One favored aggressive upstream data sanitization: fix the numbers before they ever reach the chart. The other favored strict encapsulation at the rendering boundary: let the chart defend itself against bad data. Each produced a different TDD plan.
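The two strategies look something like this in code. This is a hedged sketch under my own names (`Sample`, `sanitizeUpstream`, `guardAtRenderer`), not the actual proposals, but it shows why they are genuinely different fixes rather than two phrasings of one fix:

```typescript
type Sample = { t: number; value: number };

// Strategy 1 (upstream sanitization): repair the data before the chart
// ever sees it. Non-finite values become explicit nulls the chart can
// handle deliberately, instead of NaN it chokes on.
function sanitizeUpstream(samples: Sample[]): { t: number; value: number | null }[] {
  return samples.map((s) => ({
    t: s.t,
    value: Number.isFinite(s.value) ? s.value : null,
  }));
}

// Strategy 2 (rendering boundary): the chart defends itself, refusing
// to plot anything that would corrupt an SVG path string.
function guardAtRenderer(samples: Sample[]): Sample[] {
  return samples.filter((s) => Number.isFinite(s.value));
}
```

Note the behavioral difference: strategy 1 keeps the bad point visible as an explicit null, while strategy 2 makes it disappear entirely. That difference matters later.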
This divergence is the signal, not the noise. If both models agree immediately, you might have a straightforward bug. If they disagree on root cause or fix strategy, you’re dealing with something layered, and the disagreement itself maps the territory.
Step 3: Cross-pollinate and force critique
This is the core of the process, and it’s where the labor lives. You become the middleware.
Take Model A’s analysis and TDD plan. Hand it to Model B, not as a prompt to build on, but as an artifact to critique. “Here is another model’s analysis of the same bug. Review it for logical gaps, missed failure modes, and risks.” Then do the reverse: take Model B’s output and give it to Model A.
I brought Gemini and ChatGPT into the loop as diagnostic critics, giving each one the other’s merged analysis to audit. The exchange surfaced something neither model had flagged independently: one of the intermediate merged plans treated the upstream data fix as conditional, applying it only if the renderer hardening alone didn’t resolve the visual symptoms. Gemini caught the trap. Renderer-only hardening could “fix the picture” by skipping invalid data points, but it would silently convert real zeros into data gaps. The chart would look correct while producing a quiet correctness regression underneath.
That’s the kind of insight a single-model session almost never produces. The model that wrote the plan had no reason to doubt its own sequencing. It took a different architecture, reading the same plan with fresh weights, to see the risk.
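The zero-versus-gap trap is concrete enough to sketch. Assume, hypothetically, that the upstream bug was a rate calculation where an idle window produced 0 / 0 (the names here are illustrative, not from the actual codebase):

```typescript
// Hypothetical upstream rate calculation: during an idle window,
// 0 / 0 yields NaN even though the true throughput is 0.
function throughputNaive(bytes: number, seconds: number): number {
  return bytes / seconds;
}

// Renderer-only hardening "fixes the picture" by dropping unplottable
// points, but the idle window now vanishes instead of plotting as zero.
function renderablePoints(values: number[]): number[] {
  return values.filter((v) => Number.isFinite(v));
}

// The upstream fix preserves the semantics: no traffic means zero.
// (A sketch of the idea, not the production-grade edge-case handling.)
function throughputFixed(bytes: number, seconds: number): number {
  return seconds === 0 ? 0 : bytes / seconds;
}
```

With only the renderer guard in place, every idle window becomes a gap in the line, and anyone reading the chart loses the difference between “no traffic” and “no data.”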
When you’re doing this manually, context management matters. Don’t dump an entire chat transcript from one model into another; you’ll blow past what the receiving model can usefully attend to. Compress the state. Extract the analysis, the proposed fix, the key assumptions. Pass a structured artifact, not a conversation log.
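One way to keep yourself honest about compression is to give the handoff artifact a fixed shape. This is one possible structure, in my own field names, rather than any standard:

```typescript
// A minimal shape for the "compressed state" passed between models.
interface DebugArtifact {
  symptoms: string[];    // observable facts only, no theories
  hypothesis: string;    // the model's current causal claim
  proposedFix: string;   // the concrete change it recommends
  assumptions: string[]; // everything the model took on faith
}

// Render the artifact as a compact critique prompt for the next model.
function critiquePrompt(a: DebugArtifact): string {
  return [
    "Another model produced this analysis of the same bug.",
    "Audit it for logical gaps, missed failure modes, and risks.",
    `Symptoms: ${a.symptoms.join("; ")}`,
    `Hypothesis: ${a.hypothesis}`,
    `Proposed fix: ${a.proposedFix}`,
    `Assumptions: ${a.assumptions.join("; ")}`,
  ].join("\n");
}
```

Forcing yourself to fill in the `assumptions` field is half the value: it makes explicit exactly the things the critiquing model should attack.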
Step 4: Resolve conflicts through your strongest structural model
The models will eventually disagree on something specific, not a broad strategy difference, but a concrete implementation detail. During my loop, Gemini pushed hard for a modern, highly abstracted syntactic approach to a particular logic block. It was clean code. It was also wrong for this codebase.
Route the specific disagreement back to whichever model has the deepest context on the repository’s architectural constraints. In my case, that was Claude, which had ingested the most code. Claude reviewed the broader repository and rejected Gemini’s suggestion; the modern syntax would violate the architectural consistency of a legacy environment that relied on different patterns throughout.
I fed Claude’s reasoning back to Gemini. Gemini processed the legacy constraints, agreed, and withdrew the objection. This is not the model being polite. It’s the model receiving information it didn’t have, the codebase’s actual conventions, and updating its position.
The key discipline: don’t just pick the answer you like. Route the conflict back through the loop with enough context for the models to argue on the merits.
Step 5: Execute the unified plan
By this point you should have a single, peer-reviewed remediation plan where the models have converged on the overlapping necessities and explicitly resolved their disagreements. For my bug, the final plan was defense-in-depth: fix the upstream data semantics and harden the rendering boundary and add a backup sanitizer. Three layers, because any single layer left a gap the models had identified.
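A defense-in-depth plan of that shape might look like the following. Everything here is a hedged reconstruction under assumed names (`computeRate`, `sanitize`, `toPath`); the real fix was specific to the codebase:

```typescript
type Sample = { t: number; value: number };

// Layer 1: fix the upstream semantics. An idle window is 0, not NaN.
function computeRate(bytes: number, seconds: number): number {
  return seconds === 0 ? 0 : bytes / seconds;
}

// Layer 2: backup sanitizer in the pipeline. Anything non-finite that
// slips through is carried forward as the last known-good value, so the
// line flattens visibly instead of corrupting.
function sanitize(samples: Sample[]): Sample[] {
  let last = 0;
  return samples.map((s) => {
    const v = Number.isFinite(s.value) ? s.value : last;
    last = v;
    return { t: s.t, value: v };
  });
}

// Layer 3: harden the rendering boundary. No NaN ever reaches an SVG
// path string, even if layers 1 and 2 both fail.
function toPath(samples: Sample[]): string {
  const safe = samples.filter((s) => Number.isFinite(s.t) && Number.isFinite(s.value));
  return safe.map((s, i) => `${i === 0 ? "M" : "L"}${s.t},${s.value}`).join(" ");
}
```

The layers are deliberately redundant: layer 1 fixes correctness, layer 2 contains anything unforeseen, and layer 3 guarantees the rendered output can never show corrupted coordinates.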
Write the tests first; prove the failure exists before you fix it. Implement the fix in the order the plan specifies. Run the suite.
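“Prove the failure exists” can be as small as one assertion against the unfixed code. A sketch, with `buildPathUnfixed` standing in for the real renderer:

```typescript
// Stand-in for the unfixed renderer: maps values straight into an SVG
// path string with no guard, so NaN leaks into the coordinates.
function buildPathUnfixed(values: number[]): string {
  return values.map((v, i) => `${i === 0 ? "M" : "L"}${i},${v}`).join(" ");
}

// The test predicate: a NaN in the input must never reach the path.
// Against the unfixed renderer this returns false, which is the failing
// red test you want before any fix lands.
function pathIsClean(values: number[]): boolean {
  return !buildPathUnfixed(values).includes("NaN");
}
```

Once that test is red, the three-layer fix has a concrete target, and the same assertion turning green is your evidence the fix worked rather than just the chart happening to look right.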
My tests went green on the first run. The NaN errors stopped, the 30-second fracture cycle disappeared, and the patch shipped to staging without incident. The whole process, from picking up the ticket to deployed fix, took less time than I’d spent on previous attempts where I’d tried to solo it with a single chat window.
The human in the loop, for now
I’d love to tell you the human orchestration is the secret sauce. Honestly, if I could automate the copy-paste middleware part of this, I would. Most of what I did was mechanical: compress this output, paste it into that context window, ask for a critique, carry the result back. The models did the analytical heavy lifting.
The one place I actually added reasoning was at the end, when the unified plan produced a concrete code block and I had to evaluate whether it fit the repository’s conventions. That was judgment the models couldn’t fully supply on their own, because they each had partial context of the codebase. Everything else was structured labor, and structured labor is exactly the kind of thing that should eventually be automated.
Until it is, though, the manual loop works. And for hard bugs in unfamiliar territory, it’s the most reliable process I’ve found.
LLM Disclosure: I used AI to find sources that supported my ideas. This is generally frowned upon, and as a fan of the scientific method I need to come clean: I started at my conclusion and worked backward to supporting evidence. I end up doing this more often than not with things I’ve learned through living.