Katerina Pastra, Yorick Wilks
Multimodal human-to-human interaction requires integration of the contents/meaning of the modalities involved. Artificial Intelligence (AI) multimodal prototypes attempt to go beyond technical integration of modalities to this kind of meaning integration, which allows for coherent, natural, “intelligent” communication with humans. In this paper, we address the issue of vision-language content integration and attempt to shed some light on how, why, and to what extent this type of integration takes place within AI. We present a taxonomy of vision-language integration prototypes that resulted from an extensive survey of the field and that uses integration purposes as the guiding criterion for classifying systems that span many decades and many different AI multimedia-related research areas. We look at the integration resources and mechanisms used in such prototypes and correlate them with theories of integration that emerge indirectly from computational models of the mind. We argue that state-of-the-art vision-language prototypes fail to address core integration challenges automatically, because they rely on human intervention at stages of the integration procedure that are tightly coupled with inherent characteristics of the integrated media. Finally, we present VLEMA, a prototype that attempts to perform vision-language integration with minimal human intervention in these core integration stages.
Keywords: Intelligent User Interfaces, Intelligent Multimedia Systems, Vision-Language Meaning Integration
Citation: Katerina Pastra, Yorick Wilks: Vision-Language Integration in AI: a reality check. In R. López de Mántaras and L. Saitta (eds.): ECAI 2004, Proceedings of the 16th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2004, pp. 937-941.