Title
Building semantic understanding beyond deep learning from sound and vision
Date Issued
01 January 2016
Access level
metadata-only access
Resource Type
conference paper
Author(s)
De Souza F.
Sarkar S.
Universidade Federal de Ouro Preto
Publisher(s)
Institute of Electrical and Electronics Engineers Inc.
Abstract
Deep learning-based models have recently been widely successful at outperforming traditional approaches in several computer vision applications, such as image classification, object recognition, and action recognition. However, those models are not naturally designed to learn structural information that can be important to tasks such as human pose estimation and structured semantic interpretation of video events. In this paper, we demonstrate how to build structured semantic understanding of audio-video events by reasoning over the multiple-label decisions of deep visual models and auditory models, using Grenander's structures to impose semantic consistency. The proposed structured model does not require joint training of the structural semantic dependencies and the deep models. Instead, they are independent components linked by Grenander's structures. Furthermore, we exploit Grenander's structures as a means to facilitate and enrich the model with the fusion of multimodal sensory data, in particular auditory features with visual features. Overall, we observed improvements in the quality of semantic interpretations when using deep models and auditory features in combination with Grenander's structures, reflected in numerical gains of up to 11.5% in precision and 12.3% in recall.
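As a rough illustration of the approach the abstract describes, below is a minimal Python sketch. It assumes a hypothetical label set, hand-set bond compatibilities, and toy per-frame posteriors (none of these come from the paper); the paper's actual inference over Grenander configurations of generators linked by typed bonds is richer. The sketch only shows the two ideas the abstract names: late fusion of independently trained visual and auditory label posteriors (no joint training) and a search for the label sequence most consistent with pairwise semantic bonds.

import numpy as np

# Hypothetical label set for an audio-video event vocabulary (illustrative only).
LABELS = ["open_door", "walk", "sit", "talk"]

# Bond compatibility matrix A[i, j]: how acceptable label j is after label i.
# In Grenander's pattern theory these would derive from a generator space with
# typed bonds and an acceptor function; here they are hand-set toy values.
A = np.array([
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.50, 0.30, 0.10],
    [0.05, 0.10, 0.50, 0.35],
    [0.05, 0.10, 0.35, 0.50],
])

def fuse(visual_probs, audio_probs, w=0.5):
    """Late fusion of posteriors from independently trained visual and
    auditory models; the models stay separate, as in the abstract."""
    p = w * visual_probs + (1.0 - w) * audio_probs
    return p / p.sum(axis=1, keepdims=True)

def most_consistent_sequence(fused):
    """Viterbi-style dynamic program maximizing the product of fused
    per-frame scores and pairwise bond compatibilities: a simple
    stand-in for optimizing over Grenander configurations."""
    T, K = fused.shape
    score = np.log(fused[0] + 1e-12)          # best log-score per label
    back = np.zeros((T, K), dtype=int)        # best predecessor per label
    for t in range(1, T):
        cand = score[:, None] + np.log(A + 1e-12)   # cand[i, j]: i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + np.log(fused[t] + 1e-12)
    seq = [int(score.argmax())]
    for t in range(T - 1, 0, -1):             # backtrack through predecessors
        seq.append(int(back[t][seq[-1]]))
    return [LABELS[k] for k in reversed(seq)]

# Toy per-frame posteriors from the two unimodal models (rows sum to 1).
rng = np.random.default_rng(0)
vis = rng.dirichlet(np.ones(4), size=6)
aud = rng.dirichlet(np.ones(4), size=6)
print(most_consistent_sequence(fuse(vis, aud)))

The design point the sketch mirrors is the decoupling: either unimodal model can be swapped out without retraining anything else, because the semantic-consistency structure only consumes their label posteriors.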
Start page
2097
End page
2102
Volume
0
Language
English
OCDE Knowledge area
Computer science; General education (includes training, pedagogy); Linguistics
Scopus EID
2-s2.0-85019084617
ISBN
9781509048472
ISSN of the container
1051-4651
ISBN of the container
978-1-5090-4847-2
Conference
23rd International Conference on Pattern Recognition (ICPR 2016)
Sponsor(s)
This research was supported in part by NSF grant 1217676. The authors would like to thank the Brazilian National Research Council (CNPq, Grant #234272/2014-7).
Sources of information: Directorio de Producción Científica, Scopus