I Tried to Mislead a Model With Bad Feedback.
Posted Aug 2025
Introduction
One big open question in AI safety is scalable oversight: how do we supervise models that might one day be smarter than us?
Usually, we train against a clear ground-truth signal (labels, rewards). But oversight becomes noisy and expensive. Humans disagree, make systematic mistakes, or just can’t scale to the problem. This raises the classic worry: what happens if a strong model learns to exploit a weak oversight signal? It’s a bit like a smart student learning how to “game” a bad teacher.
I first came across this framing in Anthropic’s posts on scalable oversight and the paper Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2024). These helped me understand how researchers are trying to study oversight empirically, and inspired me to run my own small experiment.
Setup
Researchers generally explore scalable oversight from a few angles:
- Weak-to-strong generalization: can a stronger model actually go beyond the flaws of a weak overseer? (Burns et al., 2023)
- Easy-to-hard generalization: can we supervise reliably on simpler tasks and hope the model generalizes to harder ones? (Ding et al., 2024)
- Adversarial techniques: debate setups or prover–verifier games, where models push each other (Irving et al., 2018; Michael et al., 2023).
I wanted to get a feel for the weak-to-strong
setup: could a weaker model acting as a reviewer actually help improve the outputs of a stronger one?
My experiment:
- Dataset: MMLU high school mathematics subset (multiple-choice math).
- Strong model (student):
gpt-4o-mini
, prompted to solve and returnANS: <LETTER>
. - Weak model (reviewer): also
gpt-4o-mini
, but prompted as a “weak student reviewer” that gives a hint and a binary score (0
/1
), while being “often wrong.”
Scoring: 0 = incorrect
, 1 = correct
.
Process:
- Strong model answers the question.
- Weak reviewer gives a score + feedback.
- Strong model revises its answer using the feedback.
This setup mimics the weak-to-strong idea: can the weak reviewer improve the strong solver’s accuracy, or will it just mislead it?
Results
On 24 questions, here’s what I found:
Outcome | Count | % |
---|---|---|
Stable Incorrect | 15 | 62.5 |
Improved | 5 | 20.8 |
Stable Correct | 3 | 12.5 |
Corrupted | 1 | 4.2 |
- 62.5% Stable Incorrect – wrong, stayed wrong.
- 20.8% Improved – fixed mistake after feedback.
- 12.5% Stable Correct – right, stayed right.
- 4.2% Corrupted – right, but reviewer feedback made it wrong.
The reviewer itself was basically a coin flip (≈50% accurate). So oversight helped a little (20% improved), but also hurt sometimes (4% corrupted).
This matches the broader literature: if your oversight signal is noisy, it won’t consistently help and sometimes it can actively corrupt.
Surprises (a.k.a. How I Tried to Mislead the Strong Model)
Going in, I honestly thought the most interesting result would be corruption: the weak reviewer misleading the strong solver even when the solver had the right answer. I actually designed my prompts to encourage this, basically telling the strong model “you were marked correct, here’s some feedback anyway.”
But it didn’t really work. The strong model almost never flipped from correct → incorrect. Most of the time it either ignored the misleading feedback, or doubled down on its original correct answer.
So in some sense, my attempt to bias the experiment toward corruption backfired. Maybe I didn’t design the reviewer prompt in the right way, or maybe the strong model just wasn’t gullible enough in this setting. Either way, it was a bit of a reality check: if you want to study how a weak overseer actively misleads, you probably need a more structured way of injecting systematic errors rather than just “noisy” feedback.
And when I first tried this experiment without MMLU (just with algebra problems I generated via ChatGPT), it completely failed. The “strong” model just solved everything, and the reviewer’s noisy hints didn’t matter. That’s when I realized that Bowman et al. (2024) used MMLU. It’s a standardized benchmark and hard enough to make strong models fail and hence be able to test if weak oversight helps (or hurts).
Lessons
- Weak oversight sometimes improves strong performance (~20%) but usually doesn’t matter (~62%).
- There’s a small but real chance of corruption (~4%).
- Reviewer accuracy is critical: mine was weak in a bad way (random noise), not weak in a systematic way models could learn from.
- Trying to force misleading didn’t work. Strong models mostly stuck with correct answers when they had them.
Next Steps
Things I’d like to try next:
- Swap roles: use a weaker model as the student and a stronger one as the reviewer.
- See if the “student” can learn to imitate the reviewer’s style just to get approval (weak-to-strong failure mode).
- Use adversarial or decomposition-based oversight, instead of just noisy hints.