I Tried to Mislead a Model With Bad Feedback.

Introduction

One big open question in AI safety is scalable oversight: how do we supervise models that might one day be smarter than us?

Usually, we train against a clear ground-truth signal (labels, rewards). But oversight becomes noisy and expensive. Humans disagree, make systematic mistakes, or just can’t scale to the problem. This raises the classic worry: what happens if a strong model learns to exploit a weak oversight signal? It’s a bit like a smart student learning how to “game” a bad teacher.

I first came across this framing in Anthropic’s posts on scalable oversight and the paper Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2024). These helped me understand how researchers are trying to study oversight empirically, and inspired me to run my own small experiment.

Setup

Researchers generally explore scalable oversight from a few angles:

I wanted to get a feel for the weak-to-strong setup: could a weaker model acting as a reviewer actually help improve the outputs of a stronger one?

My experiment:

Scoring: 0 = incorrect, 1 = correct.

Process:

  1. Strong model answers the question.
  2. Weak reviewer gives a score + feedback.
  3. Strong model revises its answer using the feedback.

This setup mimics the weak-to-strong idea: can the weak reviewer improve the strong solver’s accuracy, or will it just mislead it?

Results

On 24 questions, here’s what I found:

Outcome Count %
Stable Incorrect 15 62.5
Improved 5 20.8
Stable Correct 3 12.5
Corrupted 1 4.2
Results overview: weak oversight helped sometimes, and occasionally hurt.

The reviewer itself was basically a coin flip (≈50% accurate). So oversight helped a little (20% improved), but also hurt sometimes (4% corrupted).

This matches the broader literature: if your oversight signal is noisy, it won’t consistently help and sometimes it can actively corrupt.

Surprises (a.k.a. How I Tried to Mislead the Strong Model)

Going in, I honestly thought the most interesting result would be corruption: the weak reviewer misleading the strong solver even when the solver had the right answer. I actually designed my prompts to encourage this, basically telling the strong model “you were marked correct, here’s some feedback anyway.”

But it didn’t really work. The strong model almost never flipped from correct → incorrect. Most of the time it either ignored the misleading feedback, or doubled down on its original correct answer.

So in some sense, my attempt to bias the experiment toward corruption backfired. Maybe I didn’t design the reviewer prompt in the right way, or maybe the strong model just wasn’t gullible enough in this setting. Either way, it was a bit of a reality check: if you want to study how a weak overseer actively misleads, you probably need a more structured way of injecting systematic errors rather than just “noisy” feedback.

And when I first tried this experiment without MMLU (just with algebra problems I generated via ChatGPT), it completely failed. The “strong” model just solved everything, and the reviewer’s noisy hints didn’t matter. That’s when I realized that Bowman et al. (2024) used MMLU. It’s a standardized benchmark and hard enough to make strong models fail and hence be able to test if weak oversight helps (or hurts).

Lessons

Next Steps

Things I’d like to try next:

← Back to Blog