This project aims to make machines perceive and understand the world around them through multi-sensory knowledge. We approach this problem from both the application and the theory perspectives. On the application side, we focus on audio-visual scenarios, improving performance with multi-sensory knowledge (e.g., Sounding Object Localization and Audio-Visual Speaker Diarization) and enhancing uni-modal performance with extra multi-modal knowledge (e.g., Cross-Modal Transfer Learning). On the theory side, we explore the learning mechanisms of multi-modal models and balance performance across modalities in multi-modal scenarios.