Thanks, Khaled, for this excellent post. I agree that holistic judging seems a very strange thing to use LLMs for. Surely it would make much more sense to give them specific tasks to judge, and then stack those up with fail-safes in between?
Thanks, Sam. I really appreciate you reading and sharing your thoughts. I've observed the same thing: when evaluations are small, concrete checks, the failure modes are much easier to see. The trick is combining the results of those checks without missing issues or adding noise, which is what I'm exploring next.