Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
Multimodal Representation Learning (Ongoing)
Multimodal representation learning has gained increasing importance in real-world multimedia applications. Inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation. For joint model learning, we design a 5-layer neural network and apply supervised pre-training to its first 3 layers for intra-modal regularization.
Action Recognition with Deep Learning (Ongoing)
Multimodal Video Representation Learning for Action Recognition
Video contains rich information, such as appearance, motion, and audio, that helps us understand its content. Recent work has shown that combining appearance (spatial) and motion (temporal) cues can significantly improve human action recognition performance in videos. To further explore multimodal video representations for action recognition, this work proposes a framework for learning multimodal representations of video appearance, motion, and audio data. Our proposed fusion approach achieves 85.1% accuracy when fusing spatial and temporal information on UCF101 (split 1), which is competitive with state-of-the-art methods.
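The description above does not specify how the modalities are fused. As a rough illustration only, the sketch below shows one common baseline, weighted averaging of per-class scores from separate streams (late fusion). The function names, the scores, and the weights are all hypothetical and are not taken from our framework.

```python
def fuse_scores(stream_scores, weights):
    """Weighted average of per-class scores from several streams.

    stream_scores: list of score lists, one per stream (same length each).
    weights: one non-negative weight per stream.
    NOTE: a generic late-fusion baseline, assumed here for illustration.
    """
    if len(stream_scores) != len(weights):
        raise ValueError("need exactly one weight per stream")
    total = sum(weights)
    num_classes = len(stream_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, stream_scores)) / total
        for c in range(num_classes)
    ]


def predict(fused):
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical softmax outputs over three action classes.
spatial = [0.6, 0.3, 0.1]    # appearance stream
temporal = [0.2, 0.7, 0.1]   # motion stream

fused = fuse_scores([spatial, temporal], weights=[1.0, 2.0])
print(fused, predict(fused))
```

In this toy example the motion stream is weighted twice as heavily as the appearance stream, so its preferred class wins after fusion. In practice such weights would be chosen on a validation set.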
Deep Siamese Network for Action Recognition
This project presents a novel approach to video feature embedding using a deep Siamese Neural Network (SNN). Unlike existing feature descriptor-based methods, we propose a metric learning-based approach that trains a deep SNN, built on a two-stream Convolutional Neural Network (CNN), using generated pairs of similar and dissimilar videos. SNN features are learned by minimizing the distance between similar videos and maximizing the distance between dissimilar videos. Our experimental results show that training an SNN is beneficial for discriminative tasks such as human action recognition. Our approach achieves very competitive performance on the open benchmark UCF101 compared to state-of-the-art work.
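The pairwise objective described above (pull similar videos together, push dissimilar videos apart) can be illustrated with a margin-based contrastive loss. This is a minimal sketch of one standard formulation, assumed here for illustration; the exact loss, the margin value, and the embeddings are hypothetical and not taken from the project.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    """Margin-based contrastive loss for a single pair of embeddings.

    Similar pairs are penalized by their squared distance. Dissimilar
    pairs are penalized only while they are closer than `margin`.
    NOTE: assumed formulation; the source does not state the exact loss.
    """
    d = euclidean_distance(emb_a, emb_b)
    if is_similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical 2-D video embeddings at distance 5 from each other.
video_a = [0.0, 0.0]
video_b = [3.0, 4.0]

# Labeled similar: the loss grows with the distance between the pair.
print(contrastive_loss(video_a, video_b, is_similar=True, margin=6.0))
# Labeled dissimilar: the loss is nonzero only inside the margin.
print(contrastive_loss(video_a, video_b, is_similar=False, margin=6.0))
```

Minimizing this loss over many generated pairs shrinks distances within similar pairs and enlarges distances within dissimilar pairs until they exceed the margin, which is what makes the learned features discriminative.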
- C. Wang, H. Yang and C. Meinel, “Exploring Multimodal Video Representation for Action Recognition”, The annual International Joint Conference on Neural Networks (IJCNN 2016) (to appear)
- C. Wang, H. Yang and C. Meinel, “A Deep Semantic Framework for Multimodal Representation Learning”, International Journal of Multimedia Tools and Applications (MTAP, IF: 1.346), DOI: 10.1007/s11042-016-3380-8, Online ISSN: 1573-7721, Print ISSN: 1380-7501, Special Issue: “Representation Learning for Multimedia Data Understanding”, March 2016.
- C. Wang, H. Yang and C. Meinel, “Deep Semantic Mapping for Cross-Modal Retrieval”, the 27th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2015), pp. 234-241, Vietri sul Mare, Italy, November 9-11, 2015.
- C. Wang, H. Yang and C. Meinel, “Visual-Textual Late Semantic Fusion Using Deep Neural Network for Document Categorization”, the 22nd International Conference on Neural Information Processing (ICONIP 2015), pp. 662-670, Istanbul, Turkey, November 9-12, 2015.
- C. Wang, H. Yang and C. Meinel, “Does Multilevel Semantic Representation Improve Text Categorization?”, the 26th International Conference on Database and Expert Systems Applications (DEXA 2015), LNCS, Volume 9261, pp. 319-333, Valencia, Spain, September 1-4, 2015.
- H. Yang, C. Wang, X. Che and C. Meinel, “An Improved System For Real-Time Scene Text Recognition”, ACM International Conference on Multimedia Retrieval (ICMR 2015), Shanghai, June 23-26, 2015.
- C. Wang, H. Yang, X. Che and C. Meinel, “Concept-Based Multimodal Learning for Topic Generation”, the 21st MultiMedia Modelling Conference (MMM 2015), LNCS, Volume 8935, pp. 385-395, Sydney, Australia, January 5-7, 2015.