Thanks, Khaled, for this excellent post. I agree that holistic judging seems a very strange thing to use LLMs for. Surely it would make much more sense to give them specific tasks to judge, and then stack those up with fail-safes in between?
Thanks, Sam. I really appreciate you reading and sharing your thoughts. I've observed the same thing: when evaluations are small, concrete checks, the failure modes are much easier to see. The trick is combining the results of those checks without missing issues or adding noise, which is what I'm exploring next.