New AI Benchmark Tests When Models Should 'Know' vs. 'Guess'

Researchers created a contamination-aware, multi-zone benchmark called Know2Guess to evaluate when large language models should answer questions versus abstain. It has 1,200 items across five domains with explicit abstention expectations and contamination-risk metadata.

Know2Guess is designed to separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. Each of its 1,200 items, spread across five domains, carries an explicit abstention expectation and contamination-risk metadata. The benchmark also includes a dual parsing system with an official strict parser.
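To make the idea concrete, here is a minimal Python sketch of how an item with an abstention expectation might be scored through a strict parser. It is an illustration only, not the benchmark's actual code: the field names, the "ANSWER: ..." / "ABSTAIN" response format, the risk labels, and the outcome names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "ABSTAIN"  # hypothetical abstention token, not taken from the paper


@dataclass(frozen=True)
class Item:
    # Hypothetical schema; the real benchmark's fields may differ.
    question: str
    gold_answer: Optional[str]  # None when abstention is expected
    expect_abstain: bool
    contamination_risk: str     # e.g. "low" or "high" (assumed labels)


def strict_parse(response: str) -> Optional[str]:
    """Accept only 'ABSTAIN' or a single line 'ANSWER: <text>'; else None."""
    text = response.strip()
    if text == ABSTAIN:
        return ABSTAIN
    prefix = "ANSWER: "
    if text.startswith(prefix) and "\n" not in text:
        answer = text[len(prefix):].strip()
        return answer or None
    return None


def score(item: Item, response: str) -> str:
    """Classify a response so that guessing and abstaining stay separate."""
    parsed = strict_parse(response)
    if parsed is None:
        return "unparseable"
    if parsed == ABSTAIN:
        return "correct_abstain" if item.expect_abstain else "over_abstain"
    if item.expect_abstain:
        return "unsupported_guess"
    gold = (item.gold_answer or "").lower()
    return "correct_answer" if parsed.lower() == gold else "wrong_answer"


known = Item("Capital of France?", "Paris", False, "high")
unknown = Item("What did I eat for lunch today?", None, True, "low")
print(score(known, "ANSWER: Paris"))      # correct_answer
print(score(unknown, "ANSWER: A salad"))  # unsupported_guess
print(score(unknown, "ABSTAIN"))          # correct_abstain
```

Keeping "over_abstain" and "unsupported_guess" as separate outcomes reflects the article's point that a model should be penalized neither for refusing everything nor for answering everything.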

This matters because a model that guesses when it lacks support for an answer is less reliable than one that says it does not know. Rather than treating every question as answerable, Know2Guess marks the items where abstaining is the expected response. The benchmark measures the transition from answerable knowledge to abstention-expected unknowns, using labels that were frozen when the benchmark was built.

For details, see the Know2Guess paper on arXiv. The contamination-aware design is meant to keep results from being skewed by memorized training data, a critical step toward more trustworthy AI assistants.