In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs and often neglect the intrinsic characteristics of multimodal scenarios, which prevents the full utilization of all modalities.
We argue that unimodal adaptation and cross-modal adaptation are two essential components of effective MLLM fine-tuning.
From this perspective, we propose Multimodal Low-Rank Adaptation (MokA), an efficient fine-tuning strategy that explicitly accounts for multimodal characteristics. MokA compresses unimodal information with modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation.
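To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of a MokA-style adapter built on a frozen linear layer. The module name `MokALinear`, the `modality_ids` routing argument, and the particular attention used for the cross-modal step are illustrative assumptions based on the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MokALinear(nn.Module):
    # Hypothetical MokA-style adapter: modality-specific low-rank A matrices
    # handle unimodal adaptation; an explicit text-to-non-text attention step
    # handles cross-modal adaptation; a shared B projects back up.
    def __init__(self, base_linear: nn.Linear,
                 modalities=("text", "visual", "audio"), rank=8, alpha=16):
        super().__init__()
        self.base = base_linear
        self.base.requires_grad_(False)      # the pretrained projection stays frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features

        # Unimodal adaptation: each modality compresses its own tokens
        # through a modality-specific low-rank matrix A_m.
        self.A = nn.ModuleDict({m: nn.Linear(d_in, rank, bias=False) for m in modalities})

        # Shared up-projection B, zero-initialized so training starts exactly
        # from the frozen base model (standard LoRA practice).
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)
        self.scale = alpha / rank
        self.rank = rank

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_in); modality_ids: (batch, seq) integers, where
        # index 0 corresponds to the first entry in `modalities` (text here).
        h = x.new_zeros(x.shape[0], x.shape[1], self.rank)

        # --- unimodal adaptation: route each token through its own A_m ---
        for idx, proj in enumerate(self.A.values()):
            mask = (modality_ids == idx).unsqueeze(-1).to(x.dtype)
            h = h + proj(x) * mask

        # --- cross-modal adaptation: text tokens attend to non-text tokens ---
        text_mask = (modality_ids == 0).unsqueeze(-1).to(x.dtype)       # (B, L, 1)
        nontext_mask = 1.0 - text_mask
        scores = (h @ h.transpose(1, 2)) / (self.rank ** 0.5)           # (B, L, L)
        scores = scores + torch.log(nontext_mask.transpose(1, 2).clamp_min(1e-9))
        attn = torch.softmax(scores, dim=-1)
        h = h + (attn @ (h * nontext_mask)) * text_mask                 # enrich text tokens only

        return self.base(x) + self.scale * self.B(h)
```

In this sketch, the modality-specific A matrices realize unimodal adaptation, the attention step realizes the explicit cross-modal interaction, and a single shared B keeps the adapter lightweight. Whether B is shared or per-modality, and the exact form of the cross-modal interaction, should be taken from the paper rather than from this illustration.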
Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text) and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements demonstrate the efficacy and versatility of the proposed method.
@article{wei2025moka,
  title={MokA: Multimodal Low-Rank Adaptation for MLLMs},
  author={Wei, Yake and Miao, Yu and Zhou, Dongzhan and Hu, Di},
  journal={arXiv preprint arXiv:2506.05191},
  year={2025}
}