MIBench: Evaluating LMMs on Multimodal Interaction


Introduction

Different multimodal scenarios require a model to integrate and use information across modalities in task-specific ways. We refer to these different ways of integrating modalities as “multimodal interaction”. How well a model handles various multimodal interactions largely characterizes its multimodal ability.

In this paper, we introduce MIBench, a comprehensive benchmark for evaluating the multimodal interaction capabilities of Large Multimodal Models (LMMs). It formulates each instance as a ($con_v$, $con_t$, $task$) triplet with contexts from vision and text, so that an LMM must employ the correct form of multimodal interaction to complete the task. MIBench assesses models from three key perspectives: the ability to source information from vision-centric cues, the ability to source information from text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. The benchmark comprises over 10,000 vision-text context pairs spanning 32 distinct tasks.

Evaluation of state-of-the-art LMMs shows that: (1) LMMs' multimodal interaction ability remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by the textual modality when processing visual information; (3) they mostly possess a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect these observations to serve as a reference for developing LMMs with stronger multimodal ability.

Leaderboard

Performance of state-of-the-art LMMs on MIBench and MIBench-mini.

Model	Full: Vision	Full: Text	Full: Synergy	Full: Overall	Mini: Vision	Mini: Text	Mini: Synergy	Mini: Overall
Gemini-3-pro-preview	-	-	-	-	71.70	87.20	81.00	79.97
Gemini-2.5-pro	-	-	-	-	69.40	84.40	76.00	76.60
o3	-	-	-	-	63.60	85.00	72.00	73.53
GPT-5.1	-	-	-	-	64.40	82.80	71.00	72.73
GPT-5.2	-	-	-	-	64.00	80.00	70.00	71.60
Claude-4.5-Sonnet	-	-	-	-	59.60	80.80	72.00	70.80
Qwen3-VL-235B	68.48	73.17	61.40	67.89	68.00	72.00	60.00	66.67
InternVL3.5-241B	63.42	74.78	58.01	65.82	62.20	73.20	61.00	65.47
Qwen2.5-VL-72B	65.15	69.48	56.37	63.87	65.40	67.20	59.00	63.87
GPT-4o	-	-	-	-	57.40	72.20	57.00	62.20
Qwen2.5-VL-32B	63.55	68.09	51.33	61.22	62.80	69.20	50.00	60.67
InternVL3.5-8B	57.52	67.28	51.03	58.98	57.60	67.00	53.00	59.20
Qwen3-VL-8B	64.08	64.61	53.59	60.85	63.20	60.40	52.00	58.53
LLaVA-OV-1.5-8B	59.41	65.38	47.64	57.75	58.00	65.20	48.00	57.07
InternVL3-8B	61.93	65.82	48.05	58.82	59.20	65.20	45.00	56.47
LLaVA-OV-72B	54.61	66.01	50.10	57.33	54.80	68.40	45.00	56.07
InternVL2.5-8B	59.61	65.03	47.43	57.62	58.80	63.20	45.00	55.67
Qwen2-VL-7B	60.29	62.76	44.66	56.09	59.40	63.80	43.00	55.40
Qwen2.5-VL-7B	59.61	58.88	44.97	54.56	55.80	59.20	49.00	54.67
Deepseek-VL2-27B	61.32	60.49	40.76	54.30	58.40	61.20	39.00	52.87
InternVL2-8B	53.29	62.53	45.79	54.23	49.60	59.00	45.00	51.20
LLaVA-OV-7B	55.50	55.54	42.81	51.37	53.00	57.20	43.00	51.07
Qwen2.5-VL-3B	54.67	56.01	37.78	49.65	53.60	53.00	41.00	49.20
Deepseek-VL-7B	54.16	56.60	35.22	48.86	50.60	57.80	37.00	48.47
LLaVA-1.5-7B	48.73	44.67	35.93	43.06	44.60	47.20	32.00	41.27
LLaVA-1.6-7B	47.00	46.72	33.16	42.37	43.60	43.00	35.00	40.53
Emu3-8B	40.35	47.46	30.49	39.74	45.40	43.60	25.00	38.00
Chameleon-30B	33.71	44.26	34.80	37.94	30.00	39.00	45.00	38.00
Chameleon-7B	29.12	39.58	37.06	35.55	24.80	43.00	34.00	33.93
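
As a reading aid for the table, the MIBench-mini Overall values appear consistent with an unweighted mean of the Vision, Text, and Synergy columns. The MIBench (Full) Overall values are not reproduced by that same mean (for example, for Qwen3-VL-235B), so the full benchmark presumably uses a weighting that this page does not state. The sketch below checks this on rows copied from the table; the helper names `macro_average` and `matches_reported` are hypothetical, and this is not the authors' scoring code.

```python
# Rows copied from the leaderboard: (vision, text, synergy, reported overall).
MINI_ROWS = {
    "Gemini-3-pro-preview": (71.70, 87.20, 81.00, 79.97),
    "o3": (63.60, 85.00, 72.00, 73.53),
    "Qwen3-VL-235B": (68.00, 72.00, 60.00, 66.67),
    "Chameleon-30B": (30.00, 39.00, 45.00, 38.00),
}
FULL_ROWS = {
    "Qwen3-VL-235B": (68.48, 73.17, 61.40, 67.89),
}


def macro_average(vision, text, synergy):
    """Unweighted mean of the three interaction scores (an assumption)."""
    return (vision + text + synergy) / 3.0


def matches_reported(row, tol=0.006):
    """True if the unweighted mean agrees with the reported overall
    score up to two-decimal rounding."""
    vision, text, synergy, overall = row
    return abs(macro_average(vision, text, synergy) - overall) <= tol


for model, row in MINI_ROWS.items():
    print(model, "mini overall matches unweighted mean:", matches_reported(row))
for model, row in FULL_ROWS.items():
    print(model, "full overall matches unweighted mean:", matches_reported(row))
```

On the rows above, the check holds for every MIBench-mini row and fails for the MIBench (Full) row.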

MIBench introduces a comprehensive evaluation framework designed to deeply assess the multimodal interaction capabilities of LMMs. It evaluates LMMs on three interaction patterns: Vision-centric, Text-centric, and Synergy, which require models to source information from visual cues, textual cues, or a combination of both modalities, respectively. These patterns are evaluated across three progressive cognitive levels: Recognition, Understanding, and Reasoning.

Within this framework, each evaluation instance is formulated as a ($\textbf{con}_\textbf{v}$, $\textbf{con}_\textbf{t}$, $\textbf{task}$) triplet. Vision-centric and text-centric tasks can be resolved using cues from the centric modality alone. To evaluate the model's ability to selectively use cues from the target modality, we pair each task with various contexts from the other modality, ranging from helpful contexts (e.g., correct guidance, concept visualization) to misleading guidance and unrelated content. For the synergy part, the model is presented with a coupled visual-textual pair carrying complementary cues, which requires effective cross-modal collaboration.
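
One way to picture the triplet format is as a small record type. The sketch below is illustrative only: the class names, field names, file path, and example strings are hypothetical and do not reflect the released data schema. It encodes the three interaction patterns and the kinds of other-modality context described above, and reports which modality a model is expected to draw on.

```python
from dataclasses import dataclass
from enum import Enum


class Interaction(Enum):
    VISION_CENTRIC = "vision-centric"
    TEXT_CENTRIC = "text-centric"
    SYNERGY = "synergy"


class ContextRole(Enum):
    """Role played by the non-centric context (names are assumptions)."""
    HELPFUL = "helpful"              # e.g., correct guidance, concept visualization
    MISLEADING = "misleading"        # misleading guidance
    UNRELATED = "unrelated"          # unrelated content
    COMPLEMENTARY = "complementary"  # synergy: both contexts are needed


@dataclass(frozen=True)
class Instance:
    con_v: str  # visual context (here: a hypothetical image path)
    con_t: str  # textual context
    task: str   # the question or instruction
    interaction: Interaction
    other_context_role: ContextRole


def target_modalities(instance):
    """Modalities whose cues are needed to solve the task."""
    if instance.interaction is Interaction.VISION_CENTRIC:
        return ("vision",)
    if instance.interaction is Interaction.TEXT_CENTRIC:
        return ("text",)
    return ("vision", "text")


# Hypothetical vision-centric counting instance with a misleading description.
example = Instance(
    con_v="images/street.jpg",
    con_t="The photo shows five bicycles.",
    task="How many bicycles are in the image?",
    interaction=Interaction.VISION_CENTRIC,
    other_context_role=ContextRole.MISLEADING,
)
print(target_modalities(example))
```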

To meet these requirements, we first identify suitable tasks based on the proposed taxonomy, and then collect, re-annotate, or synthesize test samples following a carefully designed annotation pipeline. The result is a large-scale benchmark comprising 32 distinct tasks and 2,980 base samples, which are expanded into 10,030 context pairs through diverse contextual variations, enabling a fine-grained analysis of model interaction capabilities.
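
The expansion from base samples to context pairs can be sketched as pairing each base sample with every context variant from the other modality. The dictionary layout and the function name `expand_base_sample` below are assumptions made for illustration, not the authors' pipeline. From the stated totals, each base sample yields about 3.37 context pairs on average.

```python
def expand_base_sample(base):
    """Pair one base sample with each of its other-modality context variants.

    `base` is a hypothetical dict with keys "id", "centric", "task", and
    "other_contexts" (a mapping from variant name to context).
    """
    return [
        {
            "id": f'{base["id"]}/{variant}',
            "centric": base["centric"],
            "other": context,
            "task": base["task"],
        }
        for variant, context in base["other_contexts"].items()
    ]


# Toy base sample with three context variants (all values are made up).
toy = {
    "id": "count-001",
    "centric": "images/street.jpg",
    "task": "How many bicycles are in the image?",
    "other_contexts": {
        "helpful": "Count the two-wheeled vehicles carefully.",
        "misleading": "The photo shows five bicycles.",
        "unrelated": "Sodium reacts vigorously with water.",
    },
}
pairs = expand_base_sample(toy)
print(len(pairs))

# Totals stated in the text.
BASE_SAMPLES = 2980
CONTEXT_PAIRS = 10030
average_pairs_per_base = CONTEXT_PAIRS / BASE_SAMPLES
print(round(average_pairs_per_base, 2))
```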

Sample format of Vision-centric tasks and test case.

Sample format of Text-centric tasks and test case.

Sample format of Synergy tasks and test case.

Overview of the sample annotation pipeline.

More statistics of MIBench and the proportion of sources for collected images.

Impact of version iteration (Left) and parameter scaling (Right) on multimodal interaction capabilities.

Performance of open-source LMMs across three progressive cognitive levels: Recognition (Level 1), Understanding (Level 2), and Reasoning (Level 3).

A comparison of all LMMs' modality source selection consistency evaluated on MIBench-mini, with dot sizes scaled according to model parameter counts.

Impact of parameter scaling on synergy capabilities (on Qwen2.5-VL).

Comparison of LMMs' performance within different series (Left: LLaVA; Right: Qwen-VL).

Left: non-native vs. native LMMs. Right: Performance of SOTA open-source and closed-source LMMs.

LMMs' performance on different task types.

Results Analysis

Building on MIBench, we evaluate the interaction capabilities of current open-source and closed-source models. This analysis yields several key findings:

Current LMMs remain constrained in their ability to perform effective multimodal interaction, despite increases in model scale and data sources.
LMMs demonstrate limited proficiency in selectively extracting cues from the target modality. This issue is particularly severe in vision-centric tasks, as the models often fail to prioritize visual evidence over the textual context.
While LMMs show acceptable performance in basic cross-modal alignment, their effectiveness drops in tasks requiring deep interactive understanding and reasoning, a gap that parameter scaling alone does not close.
Native LMMs struggle with fundamental perception, creating a bottleneck for complex interactions. Non-native LMMs appear to lean on powerful LLM foundations for better synergy but are hindered by a strong text bias that prevents deeper cross-modal collaboration.

By focusing on multimodal interaction, MIBench offers fresh insights into whether current LMMs are truly achieving multimodal intelligence. It also provides direction and inspiration for developing more advanced, genuinely synergistic capabilities.

Case Study

A case where o3 fails in counting when faced with a misleading description of the visual context.

A case where Gemini 2.5 Pro fails in attribute comparison when faced with a misleading description of the visual context.

A case where GPT 5.1 fails in spatial reasoning when faced with a totally irrelevant textual context, indicating its shortcomings in deep visual reasoning.

A case where GPT-4o fails in checking for grammatical errors when faced with a visual context containing related concepts, indicating that the visual context affects the model's understanding of the textual context.

A case where Claude Sonnet 4.5 fails in determining the similarity relation between sentences when faced with misleading visual guidance.

A case where GPT 5.1 fails in solving a textual chemistry problem when faced with an unrelated image.

A case where Gemini 3 Pro Preview fails in basic position alignment, indicating a strong textual bias and limited attention to visual content.

A case where o3 fails in rule execution tasks, indicating its deficiencies in handling frequent complex cross-modal interactions in natural scenes.

A case where o3 fails in solving a business problem.

BibTeX

@article{miao2026mibench,
      title={MIBench: Evaluating LMMs on Multimodal Interaction}, 
      author={Yu Miao and Zequn Yang and Yake Wei and Ziheng Chen and Haotian Ni and Haodong Duan and Kai Chen and Di Hu},
      journal={arXiv preprint arXiv:2603.13427},
      year={2026}
}