MokA: Multimodal Low-Rank Adaptation for MLLMs

Yake Wei1,2,3, Yu Miao1,2,3, Dongzhan Zhou4, Di Hu1,2,3
1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2 Beijing Key Laboratory of Research on Large Models and Intelligent Governance; 3 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; 4 Shanghai Artificial Intelligence Laboratory

Overview

In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs and often neglect the intrinsic differences of multimodal scenarios, which can even prevent the full utilization of all modalities.

We argue that unimodal adaptation and cross-modal adaptation are two essential parts of effective MLLM fine-tuning:

\[
\begin{align}
h &= W_0 \mathbf{x} + \Delta W \mathbf{x} = W_0 \mathbf{x} + \Delta W [\mathbf{x}^{\text{m}_1};\mathbf{x}^{\text{m}_2};\cdots; \mathbf{x}^{\text{m}_n}], \\
&= W_0\mathbf{x} + \underbrace{ [\Delta W_1 \mathbf{x}^{\text{m}_1};\Delta W_2 \mathbf{x}^{\text{m}_2} ;\cdots;\Delta W_n \mathbf{x}^{\text{m}_n} ]}_{\texttt{unimodal adaptation}} + \underbrace{\Delta W_{\texttt{cross}} [\mathbf{x}^{\text{m}_1};\mathbf{x}^{\text{m}_2};\cdots;\mathbf{x}^{\text{m}_n}]}_{\texttt{cross-modal adaptation}},
\end{align}
\]
where $W_0$ is the original pretrained weight, $\Delta W$ is the low-rank adaptation matrix, and $\mathbf{x}^{\text{m}_i}$ is the input of modality $i$. $\Delta W_i$ is the low-rank adaptation matrix for modality $i$, and $\Delta W_{\texttt{cross}}$ is the cross-modal adaptation matrix.
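
As a concrete illustration, the PyTorch sketch below spells out this decomposition for a single linear layer; the shapes, modality names, and random weights are illustrative assumptions rather than the paper's implementation.

```python
import torch

# Illustrative sketch of the decomposition above for one linear layer
# (shapes, modality names, and random weights are assumptions).
d_in, d_out = 64, 64
n_tokens = {"audio": 10, "visual": 20, "text": 30}

W0 = torch.randn(d_out, d_in)                       # frozen pretrained weight W_0
delta_W = {m: 0.01 * torch.randn(d_out, d_in)       # unimodal adaptation Delta W_i
           for m in n_tokens}
delta_W_cross = 0.01 * torch.randn(d_out, d_in)     # cross-modal adaptation Delta W_cross

x = {m: torch.randn(n, d_in) for m, n in n_tokens.items()}   # x^{m_i}
x_cat = torch.cat(list(x.values()), dim=0)                    # [x^{m_1}; ...; x^{m_n}]

h = (x_cat @ W0.T                                                   # W_0 x
     + torch.cat([x[m] @ delta_W[m].T for m in n_tokens], dim=0)    # unimodal adaptation
     + x_cat @ delta_W_cross.T)                                     # cross-modal adaptation
```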

From this perspective, we propose Multimodal Low-Rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information with modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation.

Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text) and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method.

Comparison and Results

Difference between other LoRA methods and MokA

Most prior LoRA methods ignore modality characteristics and directly apply the same low-rank decomposition matrices to all modalities. To better support multimodal adaptation, we argue that both unimodal and cross-modal updates should be considered during fine-tuning.

Therefore, MokA retains the widely adopted low-rank decomposition structure but redefines the roles of matrices $A$ and $B$ to better accommodate multimodal characteristics: modality-specific matrices $A$, a cross-attention module, and a shared multimodal matrix $B$ jointly ensure both unimodal and cross-modal adaptation.
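
The following is a minimal PyTorch sketch of how such a layer could be organized, assuming a LoRA-style zero-initialized $B$, scaled dot-product attention in the low-rank space, and the interaction direction described in the variant discussion below; module names and placement details are our assumptions, not the official MokA code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MokALayerSketch(nn.Module):
    """Hedged sketch of a MokA-style adapter: modality-specific A, a cross-attention
    step between text and non-text tokens in the low-rank space, and a shared B.
    Names, scaling, and the exact attention placement are assumptions."""

    def __init__(self, d_in, d_out, r, modalities=("audio", "visual", "text")):
        super().__init__()
        self.A = nn.ModuleDict({m: nn.Linear(d_in, r, bias=False) for m in modalities})
        self.B = nn.Linear(r, d_out, bias=False)   # shared multimodal matrix B
        nn.init.zeros_(self.B.weight)              # LoRA-style zero init so the update starts at 0
        self.r = r

    def forward(self, x_by_modality):
        # 1) Unimodal adaptation: project each modality with its own low-rank A.
        z = {m: self.A[m](x) for m, x in x_by_modality.items()}

        # 2) Cross-modal adaptation: non-text tokens attend to text tokens
        #    (direction as described in the variant discussion below).
        z_text = z["text"]
        for m in z:
            if m == "text":
                continue
            attn = F.softmax(z[m] @ z_text.T / self.r ** 0.5, dim=-1)
            z[m] = z[m] + attn @ z_text

        # 3) Shared B maps the adapted low-rank features back to the output space.
        z_cat = torch.cat([z[m] for m in x_by_modality], dim=0)
        return self.B(z_cat)
```

In use, the adapter output would be added to the output of the frozen base layer, mirroring $h = W_0\mathbf{x} + \Delta W \mathbf{x}$ above.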

Comparison under the audio-visual-text scenario

Comparison under the visual-text scenario

Comparison under the speech-text scenario

Different variants of MokA

In the original MokA, cross-attention is employed to explicitly strengthen the interaction between text and non-text tokens, thereby facilitating improved cross-modal adaptation. As previously discussed, alternative modules that similarly enhance this interaction can also be considered. In this section, we explore several variants of the cross-modal interaction module.

The cross-attention* variant also adopts a cross-attention mechanism; however, it uses text tokens as queries. Consequently, the updated text tokens integrate information from the relevant non-text tokens—reversing the direction of interaction compared to the original MokA. The naive interaction variant performs a simple, uniform mapping from text tokens to non-text tokens without employing any attention mechanism.
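
For concreteness, the hedged snippets below sketch how the two variants could replace step 2 of the earlier MokA sketch; the function names and the unweighted-mean mapping used for the naive interaction are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention_star(z, r):
    """cross-attention*: text tokens act as queries and integrate non-text information."""
    z_nontext = torch.cat([tokens for m, tokens in z.items() if m != "text"], dim=0)
    attn = F.softmax(z["text"] @ z_nontext.T / r ** 0.5, dim=-1)
    z["text"] = z["text"] + attn @ z_nontext
    return z

def naive_interaction(z):
    """Naive interaction: a uniform, attention-free mapping from text to non-text tokens
    (here an unweighted mean of the text tokens added to every non-text token)."""
    text_mean = z["text"].mean(dim=0, keepdim=True)
    return {m: tokens if m == "text" else tokens + text_mean for m, tokens in z.items()}
```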

Results show that all proposed variants outperform the LoRA baseline, demonstrating that the core idea of explicitly reinforcing cross-modal interaction is beneficial and that its effectiveness is not restricted to one specific module design.

BibTeX

@article{wei2025moka,
  title={MokA: Multimodal Low-Rank Adaptation for MLLMs},
  author={Wei, Yake and Miao, Yu and Zhou, Dongzhan and Hu, Di},
  journal={arXiv preprint arXiv:2506.05191},
  year={2025}
}