Development of operational models

Structured Documents

One main difficulty for IR on structured documents is that, because of the computational cost, complex relationships between document parts cannot be handled on large collections. The MRIM research dedicated to structured document retrieval adapts and extends current state-of-the-art theoretical work on language models of atomic documents for information retrieval. More precisely, we formalize the propagation of term occurrences between the parts of XML documents so as to contextualize each part according to the whole document and its structural neighbours. This modifies the index of each document part, acting as a smoothing, while keeping such extensions tractable for large document collections. We evaluated our proposal during the INEX 2009 campaign and obtained the best results on 3 of the 6 official evaluation measures. This work led to one international publication and 3 national publications (2 conferences and one journal).
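The idea of contextualizing a document part through smoothing can be illustrated with a minimal sketch. It assumes a simple linear interpolation of the part's own term distribution with those of its enclosing document and of the collection; the function names and the interpolation weights `lam_doc` and `lam_col` are hypothetical and are not taken from the MRIM model, which also involves structural neighbours.

```python
import math
from collections import Counter


def part_language_model(part_terms, doc_terms, collection_terms,
                        lam_doc=0.3, lam_col=0.2):
    """Return a function giving P(term | part), smoothed with context.

    Assumption: linear interpolation of three maximum-likelihood models
    (part, enclosing document, collection). Weights are illustrative.
    """
    part_tf = Counter(part_terms)
    doc_tf = Counter(doc_terms)
    col_tf = Counter(collection_terms)
    part_len, doc_len, col_len = len(part_terms), len(doc_terms), len(collection_terms)
    lam_part = 1.0 - lam_doc - lam_col

    def prob(term):
        p_part = part_tf[term] / part_len if part_len else 0.0
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = col_tf[term] / col_len if col_len else 0.0
        return lam_part * p_part + lam_doc * p_doc + lam_col * p_col

    return prob


def query_log_likelihood(query_terms, prob):
    """Sum of log P(term | part); terms unseen in the collection are skipped."""
    total = 0.0
    for term in query_terms:
        p = prob(term)
        if p > 0.0:
            total += math.log(p)
    return total


# Usage: a part that never mentions "language" still gets a non-zero
# probability for it, because its enclosing document does.
part = ["xml", "retrieval"]
doc = ["xml", "retrieval", "language", "model"]
collection = doc + ["image", "video", "xml", "index"]
prob = part_language_model(part, doc, collection)
print(prob("language"), prob("xml"))
```

In this sketch the propagated term occurrences only change the probabilities stored for each part, so query-time cost stays the same as for a flat index.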

Multimedia: Semantic Indexing of Video Documents

Searching in image, video and audio collections has its own specificities, among which the “semantic gap” problem is the most challenging. The semantic gap refers to the “distance” between the signal samples (audio samples or pixels) of which raw multimedia documents are made and the concepts and/or relations that make sense to human beings. Concept indexing, or document categorization, is very important for multimedia content-based search. The most common approach is based on supervised learning from labeled examples. Several challenges need to be addressed for efficient and practical multimedia content-based indexing and retrieval:

the level of concept classification performance, still quite low in “wild” conditions (in the 0.2-0.3 range on a scale from 0 to 1);
the generalization capabilities: the performance of the classifiers significantly degrades when they are used in domains different from those on which they were trained;
scalability: classification methods need to remain operational when applied to large numbers of documents, target concepts and content types.

We addressed these challenges using a generalized and sophisticated classification pipeline, working on all important stages: descriptor extraction and aggregation, descriptor optimization, classification for highly imbalanced datasets, fusion of classifiers, and re-ranking using the temporal and conceptual contexts. We also worked on the efficient production of annotations using active learning and active cleaning approaches. We evaluated these approaches in the context of the TRECVid and MediaEval international evaluation campaigns. In the 2013 editions, we ranked second among 26 participating groups in the TRECVid semantic indexing task, and first (resp. second) among 5 (resp. 9) participants in the MediaEval subjective (resp. objective) violence detection task. This work led to five publications in international journals (two of which are to appear), one book chapter, and several publications in national and international conferences.
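Two of the later pipeline stages, fusion of classifiers and re-ranking with the temporal context, can be sketched as follows. This is an illustrative simplification under stated assumptions: fusion is a weighted average of per-shot scores, and temporal re-ranking blends each shot's score with the mean score of its neighbouring shots. The function names and the `window` and `alpha` parameters are hypothetical, not those of the actual system.

```python
def fuse_scores(score_lists, weights=None):
    """Late fusion: weighted average of several classifiers' scores.

    score_lists holds one list of per-shot scores per classifier,
    all in the same shot order.
    """
    n = len(score_lists)
    if weights is None:
        weights = [1.0] * n  # assumption: equal trust in every classifier
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, shot_scores)) / total
        for shot_scores in zip(*score_lists)
    ]


def temporal_rerank(scores, window=1, alpha=0.5):
    """Blend each shot's score with the mean score of its temporal neighbours.

    Shots are assumed to be listed in temporal order within one video.
    """
    reranked = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        neighbours = scores[lo:i] + scores[i + 1:hi]
        context = sum(neighbours) / len(neighbours) if neighbours else s
        reranked.append((1 - alpha) * s + alpha * context)
    return reranked


# Usage: fuse two classifiers for one concept, then smooth over time.
fused = fuse_scores([[0.2, 0.8, 0.1], [0.4, 0.6, 0.3]])
print(temporal_rerank(fused))
```

The intuition behind the second step is that a concept visible in one shot is likely to be visible in adjacent shots, so an isolated low score surrounded by high scores is raised.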

Multimedia: Person Identification in Video Documents

People are among the main elements of interest for users searching in video collections. Therefore, indexing who appears and who speaks, or who is mentioned either in the speech or in the image track, is a major goal for content-based video indexing. Though the problem is similar to general concept indexing, quite specific techniques can be used to maximize performance in this very important practical case. In this domain, we focused on:

written name extraction by developing and improving overlaid text recognition techniques;
unsupervised naming of persons using written names, pronounced names or both;
multimodal fusion for person identification in video documents.

We evaluated these approaches in the context of the REPERE national evaluation campaign, where we were often ranked first among the three participants. This work led to one publication in a national journal, and several in national and international conferences. A tool for overlaid text extraction has been made publicly available.
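The unsupervised naming step listed above can be illustrated with a minimal sketch. It assumes that speaker clusters and overlaid written names are both available as time intervals, and that each cluster simply receives the written name with which it co-occurs longest; this co-occurrence rule and all names in the code are illustrative assumptions, not the group's actual method.

```python
from collections import defaultdict


def overlap(a, b):
    """Duration shared by two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def name_clusters(speech_turns, written_names):
    """Assign to each speaker cluster its most co-occurring written name.

    speech_turns:  list of (cluster_id, start, end)
    written_names: list of (name, start, end) from overlaid text recognition
    Clusters that never co-occur with any name are left out of the result.
    """
    cooccurrence = defaultdict(lambda: defaultdict(float))
    for cluster, start, end in speech_turns:
        for name, n_start, n_end in written_names:
            shared = overlap((start, end), (n_start, n_end))
            if shared > 0:
                cooccurrence[cluster][name] += shared
    return {
        cluster: max(names, key=names.get)
        for cluster, names in cooccurrence.items()
    }


# Usage with made-up timings (seconds).
turns = [("spk1", 0, 10), ("spk2", 10, 20), ("spk1", 20, 30)]
names = [("Alice", 2, 6), ("Bob", 12, 15), ("Alice", 22, 25), ("Bob", 9, 11)]
print(name_clusters(turns, names))
```

No labeled examples of the persons are needed in such a scheme, which is what makes the naming unsupervised; pronounced names could be added as a second source of (name, start, end) intervals.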

Multimedia and Contextual Mobile Information Access: Indexing and Mobile-Device-Compatible Retrieval of Still Images

The MRIM group is also involved in still image indexing and retrieval. We believe that the spatial organization of images is important to take into account during indexing and retrieval, so we developed several approaches that include such spatial elements in image representations for information retrieval: integration of spatial locality in image annotation processes, integration of spatial relationships in vector space, and extensions of language models that integrate graph-based representations of images. The IOTA system was built on this work: it indexes images on a server, and retrieval can be performed either on servers or autonomously on mobile devices. Since 2012, part of this work has been licensed to the Globe-VIP EyeSnap company, dedicated to indexing and retrieval of still images on servers and on mobile devices. At the academic level, this work led to two publications in international journals, one in a national journal, and several in national and international conferences. The applications developed with Globe-VIP led to prototypes tested in a museum.

Semantic and Multilingual Access to Textual Information

Semantic indexing consists in using explicit concepts as index terms instead of the keywords used in traditional IR. Transforming a natural language text into a sequence of concepts leads to specific problems due to the mapping processes from text to concepts. The weighting scheme for conceptual indexing, published at DEXA 2012, proposes a new counting function adapted to concept weighting when a single portion of text can be mapped to several overlapping concepts. This work uses medical knowledge from the UMLS Metathesaurus and is applied to medical test collections from the CLEF initiative. Finally, we have worked on introducing knowledge extracted from Wikipedia into a language model matching function. This conceptual indexing research is strongly linked with the logical modeling of the matching function.

Multilingual access to textual information requires the production of multilingual resources. These resources can be produced efficiently and automatically from parallel or comparable multilingual corpora. A reliable comparability measure for comparable corpora (Bo Li’s PhD thesis) is an important result for the automatic building of large multilingual terminological resources from existing comparable multilingual corpora on the Web.
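To make the notion of a comparability measure concrete, here is a minimal sketch of one possible vocabulary-overlap definition. It assumes a bilingual dictionary is available and scores a corpus pair by the proportion of dictionary-covered words, in both languages, that have at least one translation in the other corpus. This is an illustrative definition only and is not claimed to be the measure defined in the thesis; the function and variable names are hypothetical.

```python
def comparability(source_vocab, target_vocab, dictionary):
    """Share of dictionary-covered words whose translation occurs on the other side.

    dictionary maps each source word to an iterable of target translations.
    Returns a value in [0, 1]; 0.0 if no word is covered by the dictionary.
    """
    source_vocab, target_vocab = set(source_vocab), set(target_vocab)
    forward = {src: set(tgts) for src, tgts in dictionary.items()}

    # Build the reverse dictionary (target word -> source translations).
    reverse = {}
    for src, tgts in forward.items():
        for tgt in tgts:
            reverse.setdefault(tgt, set()).add(src)

    # Only words covered by the dictionary can be checked.
    src_covered = [w for w in source_vocab if w in forward]
    tgt_covered = [w for w in target_vocab if w in reverse]

    src_hits = sum(1 for w in src_covered if forward[w] & target_vocab)
    tgt_hits = sum(1 for w in tgt_covered if reverse[w] & source_vocab)

    total = len(src_covered) + len(tgt_covered)
    return (src_hits + tgt_hits) / total if total else 0.0


# Usage with a toy English-French dictionary.
toy_dictionary = {"dog": {"chien"}, "cat": {"chat"}, "house": {"maison"}}
english = ["dog", "cat", "house", "the"]
french = ["chien", "chat", "voiture"]
print(comparability(english, french, toy_dictionary))
```

A score of this kind can be used to rank or filter candidate corpus pairs before extracting bilingual terminology from them.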