Knowledge distillation, which transfers knowledge from a large pretrained model to a compact one, has been widely studied across a variety of tasks such as image classification and semantic segmentation. Recently, with the emergence of large-scale video data that carries rich multi-modal information (e.g., vision, audio, and optical flow), cross-modal knowledge distillation has attracted growing attention as a way to transfer knowledge across modalities. However, distilling knowledge from one modality to another is challenging due to the domain gap and semantic gap between modalities. In this project, we propose to bridge these gaps and improve the performance of cross-modal distillation by exploiting the heterogeneous audio-visual correspondence in unconstrained videos.
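For concreteness, the sketch below shows the standard (uni-modal) distillation objective of Hinton et al. (2015), a temperature-scaled KL term between teacher and student predictions blended with ordinary cross-entropy; cross-modal variants replace the teacher and student inputs with different modalities of the same video. The function name and the `temperature` and `alpha` hyperparameters are illustrative assumptions, not this project's proposed loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Standard knowledge-distillation loss (Hinton et al., 2015):
    temperature-softened KL to the teacher plus hard-label cross-entropy."""
    # Soften both distributions with the temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between softened distributions; the T^2 factor keeps
    # gradient magnitudes comparable across temperature settings.
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Hard-label supervision on the student's raw logits.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example usage with random tensors standing in for a batch.
if __name__ == "__main__":
    student = torch.randn(8, 10)              # student logits: 8 samples, 10 classes
    teacher = torch.randn(8, 10)              # teacher logits from a pretrained model
    labels = torch.randint(0, 10, (8,))       # ground-truth class labels
    print(distillation_loss(student, teacher, labels))
```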