Goal-verification is hard

Asking an agent "does this task advance the goal?" is almost useless. A rationalizing agent (or a hallucinating one) will always answer yes. The Meridian agent could have answered yes to every fictional task it created. The question is too easy to pass.

Why most framings fail:

  • "Does this advance the goal?" → Always yes (motivated reasoning)
  • "Could this theoretically help?" → Always yes (any task can be rationalized)
  • "Is this aligned?" → Always yes (the agent that invented the goal is also the judge)

The root cause: self-evaluation under bias. The agent creating the task is the same agent evaluating the task, with full context of why it wants the task to exist.

The cognitive fix — specificity-forcing:

The only technique that reliably breaks motivated reasoning is demanding specific, falsifiable claims rather than general agreement. Fabrication stays vague: inventing concrete, checkable details is hard, so vagueness is the tell.
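To make the idea concrete, here is a minimal sketch of a specificity-forcing check. Everything in it is a hypothetical illustration, not a real verifier: the vague-phrase list, the scoring heuristics, and the threshold are all assumptions. The point is the shape of the gate: rather than accepting "yes, this advances the goal," it rejects claims that lack checkable specifics (numbers, file-like names, quoted identifiers).

```python
import re

# Hypothetical phrases that signal vague goal-language (illustrative list).
VAGUE_PHRASES = {"advances the goal", "improves", "is important",
                 "could be useful", "aligns with"}

def specificity_score(claim: str) -> int:
    """Count concrete, checkable details in a claim (heuristic)."""
    score = 0
    score += len(re.findall(r"\d+", claim))             # numbers: line refs, metrics
    score += len(re.findall(r"\S+\.\w{1,4}\b", claim))  # file-like tokens, e.g. api.py
    score += len(re.findall(r"`[^`]+`", claim))         # backtick-quoted identifiers
    return score

def accept_claim(claim: str, min_score: int = 2) -> bool:
    """Reject claims that lean on vague goal-language or lack specifics."""
    lowered = claim.lower()
    if any(p in lowered for p in VAGUE_PHRASES):
        return False
    return specificity_score(claim) >= min_score

print(accept_claim("This task advances the goal and improves alignment."))
print(accept_claim(
    "Reduces p95 latency of `fetch_user` (api.py line 214) from 180ms to 90ms."))
```

The vague claim is rejected outright; the second passes only because it names a function, a file, a line number, and before/after metrics that a verifier could actually check. A regex heuristic is obviously gameable on its own; the design point is that the downstream checks (does api.py line 214 exist? did p95 actually drop?) become possible once the claim is specific.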

#ML