This project targets complex, dynamic audio-visual scenes. By mining contextual information from long video sequences, it explores the spatial and temporal relationships among multiple speakers, proposes a cross-domain matching model to extract consistent representations across the audio and visual modalities, and establishes a dialogue mechanism among speakers, with the ultimate aim of improving speaker tracking and diarization performance.
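As an illustration only, not the project's actual model, the sketch below shows one common way such cross-modal matching is formulated: a symmetric InfoNCE-style contrastive loss that pulls temporally aligned audio and visual embeddings of the same speaker together. The function name, embedding dimensions, and temperature value are all assumptions made for the example.

```python
# Hypothetical sketch (not the project's actual model): a symmetric
# InfoNCE-style contrastive loss for cross-modal audio-visual matching.
import torch
import torch.nn.functional as F

def cross_modal_matching_loss(audio_emb, visual_emb, temperature=0.07):
    """audio_emb, visual_emb: (batch, dim) embeddings of temporally
    aligned audio segments and face tracks; row i of each tensor is
    assumed to come from the same speaker at the same time."""
    a = F.normalize(audio_emb, dim=-1)   # unit-norm audio embeddings
    v = F.normalize(visual_emb, dim=-1)  # unit-norm visual embeddings
    logits = a @ v.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    # Symmetric loss: audio-to-visual and visual-to-audio retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random features standing in for encoder outputs
audio = torch.randn(8, 256)
video = torch.randn(8, 256)
loss = cross_modal_matching_loss(audio, video)
```

Under this formulation, embeddings of the same speaker from the two modalities score high similarity while mismatched pairs are pushed apart, which is one way a consistent audio-visual representation can be learned.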