New Benchmark Highlights Hallucination and Deflection in Large Vision-Language Models
Researchers introduce a dynamic data curation pipeline to better evaluate how LVLMs handle conflicts between visual and textual evidence. The study underscores the importance of models admitting when they lack knowledge.

A new paper on arXiv addresses critical gaps in evaluating Large Vision-Language Models (LVLMs). The study argues that current benchmarks neither account for conflicts between visual and textual evidence nor test whether models deflect when their knowledge is incomplete. Existing benchmarks also become obsolete quickly: as LVLM training sets expand, models can answer more and more questions without retrieval.
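To make the obsolescence point concrete: once a question becomes answerable from a model's parametric memory alone, it no longer measures retrieval. Below is a minimal sketch of such a re-filtering step, assuming a hypothetical `answer_without_retrieval` callable and a crude substring check; the paper's actual curation criteria are not specified here:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchmarkItem:
    question: str
    reference_answer: str

def drop_parametric_items(
    items: List[BenchmarkItem],
    answer_without_retrieval: Callable[[str], str],  # hypothetical model interface
) -> List[BenchmarkItem]:
    """Keep only items the model cannot already answer from memory.

    If the model reproduces the reference answer with no retrieved
    evidence, the item no longer tests retrieval and is discarded.
    """
    kept = []
    for item in items:
        prediction = answer_without_retrieval(item.question)
        # Substring matching is a crude stand-in for a real answer checker.
        if item.reference_answer.lower() not in prediction.lower():
            kept.append(item)
    return kept
```

Re-running a filter like this as models are retrained is what would keep a dynamically curated benchmark from going stale.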
To address these issues, the researchers propose a dynamic data curation pipeline that keeps the benchmark relevant by continuously refreshing its data as LVLM capabilities evolve. The study also emphasizes that models should generate explicit deflections, such as "Sorry, I cannot answer," when retrieved knowledge is insufficient, an approach intended to improve the reliability and transparency of LVLMs on knowledge-intensive tasks.
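In a retrieval-augmented setting, that deflection behavior amounts to a guard on evidence sufficiency. Here is a minimal sketch, assuming hypothetical per-passage relevance scores, a `generate` callable, and an illustrative threshold, none of which come from the paper:

```python
from typing import Callable, List, Tuple

DEFLECTION = "Sorry, I cannot answer."

def answer_or_deflect(
    question: str,
    retrieved: List[Tuple[str, float]],         # (passage, relevance score)
    generate: Callable[[str, List[str]], str],  # hypothetical LVLM interface
    threshold: float = 0.5,                     # illustrative cutoff, not from the paper
) -> str:
    """Answer from sufficiently relevant evidence, or deflect.

    Passages scoring below the threshold are discarded; if none remain,
    the model deflects rather than risk hallucinating an answer.
    """
    supported = [passage for passage, score in retrieved if score >= threshold]
    if not supported:
        return DEFLECTION
    return generate(question, supported)
```

An evaluation harness could then reward a deflection exactly when the gold evidence is absent, penalizing both hallucinated answers and needless refusals.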
The introduction of this benchmarking method could significantly impact the development of LVLMs. By focusing on deflection and hallucination, researchers can better understand how models handle uncertainty and incomplete information. This could lead to more robust and trustworthy multimodal AI systems. The study also raises questions about the future of benchmarking in AI, particularly how dynamic data curation can be integrated into standard evaluation practices.