
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

1DAMO Academy, Alibaba Group 2Nanyang Technological University
3IHPC, A*STAR
* Equal Contribution; † Corresponding Author

Overview

Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, i.e., discrepancies between the factual multimodal input and the generated textual output, which limits their applicability in many real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our work is distinguished from existing benchmarks by four key features:

  • Essential Multi-Modalities: We analyze hallucinations and evaluate LMMs across the three fundamental modalities: Language, Visual, and Audio.
  • Systematic Hallucination Investigation: Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations.
  • Comprehensive Diagnosis: CMM (the Curse of Multi-Modalities benchmark) defines hallucinations with nuanced subcategories and granularities, enabling comprehensive diagnosis of LMM vulnerabilities across various modalities.
  • Quality in Annotations: All data are newly annotated by humans rather than drawn from any existing video dataset, ensuring diversity and quality.

CMM data compositions.

CMM Leaderboard

🛠️ Perception Accuracy (PA) and Hallucination Resistance (HR) scores are reported.

Models are ranked based on the average score of PA and HR. 🥇🥈🥉 indicate the top-3 models.
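
For readers implementing their own scoring, here is a minimal sketch of PA/HR computation and ranking in Python. It assumes, as a simplification not specified on this page, that each probe is a yes/no question, that PA is accuracy on probes about existent objects/events (ground truth "yes"), and that HR is accuracy on probes about non-existent ones (ground truth "no"); see the paper and GitHub repo for the exact definitions.

```python
# Minimal sketch of PA/HR scoring and leaderboard ranking.
# Assumptions (ours, not from this page): each probe is a yes/no question;
# PA = accuracy on probes whose ground truth is "yes" (existent objects/events),
# HR = accuracy on probes whose ground truth is "no" (non-existent ones).

def pa_hr(records):
    """records: iterable of (ground_truth, prediction) pairs, each 'yes' or 'no'."""
    yes = [(g, p) for g, p in records if g == "yes"]
    no = [(g, p) for g, p in records if g == "no"]
    pa = 100.0 * sum(p == "yes" for _, p in yes) / len(yes)
    hr = 100.0 * sum(p == "no" for _, p in no) / len(no)
    return pa, hr

def rank_models(scores):
    """scores: dict mapping model name -> (PA, HR); rank by the PA/HR average."""
    return sorted(scores, key=lambda m: -(scores[m][0] + scores[m][1]) / 2)

# Example with two entries from the Audio-LLMs leaderboard below:
print(rank_models({"SALMONN": (93.0, 59.0), "GAMA-IT": (94.5, 52.0)}))
# -> ['SALMONN', 'GAMA-IT']
```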

Visual-Audio-LLMs

Each cell reports PA / HR. Column groups: Overall; Spurious Inter-modality Correlations (VL = Visual-Language, AL = Audio-Language, VAL = Visual-Audio-Language); Overreliance on Unimodal Priors (VD = Visual Dominance, AD = Audio Dominance, LD = Language Dominance).

| # | Model | Date | Overall | VL | AL | VAL | VD | AD | LD |
|---|-------|------|---------|----|----|-----|----|----|----|
| 1 | Gemini-1.5-flash 🥇 | 2024-10-04 | 88.4 / 64.2 | 93.5 / 90.0 | 88.5 / 39.5 | 88.5 / 70.5 | 79.0 / 36.5 | 90.5 / 86.5 | 90.5 / 62.0 |
| 2 | Gemini-1.5-pro 🥈 | 2024-10-04 | 87.1 / 58.3 | 91.0 / 90.5 | 94.0 / 14.5 | 86.0 / 67.0 | 82.5 / 34.0 | 90.5 / 82.0 | 78.5 / 61.5 |
| 3 | Reka-Core 🥉 | 2024-10-04 | 63.7 / 80.9 | 87.0 / 94.5 | 25.0 / 76.0 | 76.7 / 85.1 | 35.6 / 69.4 | 80.8 / 82.7 | 75.0 / 76.0 |
| 4 | VideoLLaMA2 | 2024-10-04 | 71.7 / 81.1 | 75.0 / 86.0 | 77.5 / 94.0 | 78.0 / 98.0 | 62.0 / 75.5 | 80.0 / 90.0 | 57.5 / 43.0 |
| 5 | FAVOR | 2024-10-04 | 92.2 / 42.1 | 91.0 / 55.0 | 94.5 / 45.0 | 94.5 / 69.0 | 89.0 / 21.5 | 92.0 / 43.5 | 92.0 / 18.5 |
| 6 | GroundingGPT | 2024-10-04 | 96.6 / 14.3 | 95.5 / 36.5 | 100.0 / 0.0 | 97.5 / 18.0 | 99.5 / 1.0 | 98.5 / 23.5 | 88.5 / 7.0 |

Visual-LLMs

Each cell reports PA / HR. VL = Visual-Language (Spurious Inter-modality Correlations); LD = Language Dominance (Overreliance on Unimodal Priors).

| # | Model | Date | Overall | VL | LD |
|---|-------|------|---------|----|----|
| 1 | GPT-4o 🥇 | 2024-10-04 | 85.25 / 89.75 | 87.50 / 95.50 | 83.00 / 84.00 |
| 2 | LLaVA-OneVision 🥈 | 2024-10-04 | 90.75 / 78.75 | 94.00 / 88.00 | 87.50 / 69.50 |
| 3 | InternLM-XComposer-2.5 🥉 | 2024-10-04 | 96.75 / 59.75 | 99.00 / 73.00 | 94.50 / 46.50 |
| 4 | ShareGPT4Video | 2024-10-04 | 83.50 / 71.75 | 87.50 / 85.50 | 79.50 / 58.00 |
| 5 | PLLaVA | 2024-10-04 | 82.25 / 72.50 | 89.50 / 93.00 | 75.00 / 52.00 |
| 6 | VideoChat2 | 2024-10-04 | 92.50 / 50.25 | 97.00 / 66.00 | 88.00 / 34.50 |
| 7 | CogVLM2-Video | 2024-10-04 | 98.75 / 24.50 | 99.50 / 44.00 | 98.00 / 5.00 |

Audio-LLMs

Each cell reports PA / HR for Audio-Language (Spurious Inter-modality Correlations).

| # | Model | Date | Audio-Language |
|---|-------|------|----------------|
| 1 | SALMONN 🥇 | 2024-10-04 | 93.00 / 59.00 |
| 2 | GAMA-IT 🥈 | 2024-10-04 | 94.50 / 52.00 |
| 3 | Qwen2-Audio 🥉 | 2024-10-04 | 98.50 / 34.50 |
| 4 | Audio-Flamingo | 2024-10-04 | 89.50 / 39.00 |


🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
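
The expected schema of the result files is defined in the GitHub repo. Purely for illustration, a submission might be serialized as below; every field name here is a hypothetical placeholder, not the official CMM format.

```python
import json

# Hypothetical result format -- 'question_id' and 'prediction' are
# placeholder field names; check the GitHub repo for the official schema
# before submitting.
results = [
    {"question_id": "example-0001", "prediction": "yes"},
    {"question_id": "example-0002", "prediction": "no"},
]
with open("cmm_results.json", "w") as f:
    json.dump(results, f, indent=2)
```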

🚨 For more evaluation details, please refer to our GitHub repo.

CMM Benchmark Data

Data Examples

Probing samples constructed for Overreliance on Unimodal Priors.

Probing samples constructed for Spurious Inter-modality Correlations.

Data Statistics

Existent and non-existent object/event frequencies in our probing questions.

Experimental Results

Results on Existing LMMs

BibTeX

@article{leng2024curse,
  title={The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio},
  author={Sicong Leng and Yun Xing and Zesen Cheng and Yang Zhou and Hang Zhang and Xin Li and Deli Zhao and Shijian Lu and Chunyan Miao and Lidong Bing},
  journal={arXiv preprint arXiv:2410.12787},
  year={2024},
  url={https://arxiv.org/abs/2410.12787}
}