Modeling of IR systems, Semantic indexing, Multimedia Indexing

Modeling of IR systems

Modelling an Information Retrieval System (IRS) consists of building a formal description of one of its modules (analysis, indexing, matching, or ranking). We worked on modelling the matching and ranking activities with logical models. The use of logic for this modelling rests on the following hypothesis: a document is an answer to a query if there exists a chain of logical deductions that starts from the document and ends at the query. This deduction chain can be an uncertain one, i.e. the query is deduced from the document with some probability, and this probability is used for ranking. We proposed a new logical matching model for IR that combines a Boolean lattice with a probability function defined over this lattice. This model allows matching functions to be decomposed into a direct matching function (deduction from the document to the query) and a reverse matching function that evaluates the strength of the deduction from the query to the document. This was the PhD work of Karam Abdulahhad, and part of it is included in a survey published at C-CAM surveys.
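The decomposition into direct and reverse matching can be illustrated with a toy sketch. It is not the lattice-based model of the thesis: documents and queries are reduced to weighted term sets, each deduction strength is approximated by a share of covered term weight, and the linear combination with `alpha` is an assumption made only for this example. The names `coverage` and `match` are hypothetical.

```python
def coverage(source, target):
    """Share of the target's term weight that is also present in the source."""
    total = sum(target.values())
    if total == 0:
        return 0.0
    return sum(w for term, w in target.items() if term in source) / total


def match(document, query, alpha=0.5):
    """Return (direct, reverse, combined) matching scores.

    direct  : how much of the query can be reached from the document
    reverse : how much of the document can be reached from the query
    combined: linear interpolation of both (an assumption of this sketch)
    """
    direct = coverage(document, query)
    reverse = coverage(query, document)
    return direct, reverse, alpha * direct + (1 - alpha) * reverse


# Toy term weights, chosen for illustration only.
doc = {"logic": 2.0, "lattice": 1.0, "retrieval": 1.0}
query = {"logic": 1.0, "retrieval": 1.0, "ranking": 2.0}
direct, reverse, score = match(doc, query)
print(direct, reverse, score)
```

With these toy weights the document covers half of the query weight, while the query covers three quarters of the document weight, so the two directions give different signals that a ranking function can weigh against each other.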

The term mismatch problem, which occurs when query terms fail to appear in documents relevant to the query, is a long-standing problem in information retrieval. However, it is not clear how often term mismatch happens in retrieval, how important it is for retrieval, or how it affects retrieval performance.
An essential component for reducing the probability of term mismatch is the knowledge resource that defines terms and their relationships. In our proposals, we exploited a variety of knowledge resources to produce effective modifications of documents or queries. In particular, we proposed a query expansion approach based on neural language models, which learn term vector representations called distributed neural embeddings. We obtained strong results compared with state-of-the-art approaches on term similarity tasks. This was the PhD work of Mohannad Almasri and was also published at ECIR.
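As a rough illustration of embedding-based query expansion, the sketch below adds to each query term its nearest neighbours under cosine similarity. The three-dimensional vectors are invented for the example (real embeddings are learned from a corpus and have many more dimensions), and `expand_query` is a hypothetical helper, not the published method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query_terms, embeddings, k=2):
    """Append to the query the k closest vocabulary terms of each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are kept but not expanded
        candidates = [c for c in embeddings if c not in expanded]
        candidates.sort(key=lambda c: cosine(embeddings[term], embeddings[c]),
                        reverse=True)
        expanded.extend(candidates[:k])
    return expanded


# Toy embeddings, invented for illustration only.
embeddings = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.85, 0.15, 0.05),
    "vehicle": (0.7, 0.3, 0.1),
    "banana": (0.0, 0.2, 0.9),
}
expanded = expand_query(["car"], embeddings, k=2)
print(expanded)
```

A document that only mentions "automobile" would then match the expanded query even though it never contains the original term "car", which is the mismatch situation described above.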

Semantic Indexing

Semantic Information Retrieval deals with the use of knowledge resources to annotate documents and solve queries. The idea behind this research track lies in the following hypothesis: computed word distribution statistics alone may not be sufficient to find the semantic links that connect a query to its relevant documents in some IR situations. Such situations arise, for example, when high search precision is required, as in specialized domains (e.g. medicine), or when the textual descriptions of documents are too short for reliable term statistics. In these specific situations, the use of explicit knowledge resources can make the difference. Semantic indexing is then the use of a specific knowledge resource (e.g. UMLS, WordNet, etc.) to automatically annotate textual documents with entries of this resource. The matching can then exploit the links in these semantic resources to improve its quality. For example, a painting described only by the text “the last supper” can match a query about “painting and Da Vinci” only if the system has this link encoded in some resource.
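The “last supper” example can be made concrete with a minimal sketch. The label table, the concept identifiers, and the links below are invented placeholders, not entries of UMLS or WordNet, and the substring-based `annotate` function is a deliberately naive stand-in for a real annotation tool.

```python
# Hypothetical knowledge resource: labels mapped to concept identifiers,
# plus links between concepts (e.g. "is a painting", "created by").
LABELS = {
    "the last supper": "C_LastSupper",
    "da vinci": "C_DaVinci",
    "painting": "C_Painting",
}
LINKS = {
    "C_LastSupper": {"C_Painting", "C_DaVinci"},
}


def annotate(text):
    """Return the concepts whose label occurs in the text (naive substring test)."""
    lowered = text.lower()
    return {concept for label, concept in LABELS.items() if label in lowered}


def expand(concepts):
    """Add the concepts directly linked to the given ones."""
    expanded = set(concepts)
    for concept in concepts:
        expanded |= LINKS.get(concept, set())
    return expanded


def matches(document_text, query_text):
    """True if every query concept is reachable from the document's concepts."""
    query_concepts = annotate(query_text)
    return bool(query_concepts) and query_concepts <= expand(annotate(document_text))


print(matches("The Last Supper", "painting and Da Vinci"))
```

Without the call to `expand`, the document is annotated with a single concept and shares nothing with the query, so the match depends entirely on the links stored in the resource.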

Term-based or semantic resources must be large and consistent enough to be correctly exploited by IR systems. Unfortunately, such resources are rare because they are very costly to build manually. Moreover, because most existing resources are designed for human use, they are neither complete nor consistent. This is the case for the UMLS Metathesaurus: despite its very large coverage of single-word and multi-word terms of the medical domain in more than 18 languages, the strong human effort devoted to maintaining its roughly 1 million “concepts”, and its twice-yearly updates, this domain-specific resource is not consistent. For example, the simple ‘is-a’ hierarchical relationship contains some looping paths. The main problem in such a resource is the lack of explicit mathematical or logical semantics. Hence, the first important step in exploiting very large and valuable resources like UMLS is to ‘clean’ them and force their structure to respect some logical language. They can then be partially transformed into an ontology expressed in an explicit logical language. This work was done in the PhD thesis of Demeke Ayele, in which he proposed a Knowledge Acquisition Framework from Unstructured Bio-medical Knowledge Sources. The work is continuing with the PhD of Jibril Frej, which started in September 2017.
This research track is oriented toward the use of machine learning to transform a discrete knowledge set into a continuous multidimensional space (embedding).
An extension of the IR language model using learned term embeddings was presented at CORIA 2018.
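One concrete consistency check of the kind mentioned above for the ‘is-a’ hierarchy is the detection of looping paths. The sketch below is a generic depth-first search, not the cleaning procedure of the thesis; the hierarchy is a hypothetical dictionary mapping each concept to its parents, and the concept names are placeholders rather than UMLS entries.

```python
def find_cycle(is_a):
    """Return one looping 'is-a' path as a list of concepts, or None.

    is_a maps each concept to the list of its direct parents.
    Recursive depth-first search: suitable for small graphs only; a very
    large resource would need an iterative version.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in is_a.get(node, ()):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                # The parent is already on the current path: a loop is closed.
                return path[path.index(parent):] + [parent]
            if parent_state == UNSEEN:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(is_a):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Placeholder concepts: A is-a B, B is-a C, C is-a A (a loop), D is-a A.
hierarchy = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
cycle = find_cycle(hierarchy)
print(cycle)
```

Removing or redirecting one edge of each reported loop is one possible cleaning step before the hierarchy is given a logical interpretation; which edge to drop is a modelling decision that this sketch does not address.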

Multimedia Indexing

Several works were conducted in the context of multimedia indexing.

In the context of the QCompere project, we developed methods for the multimodal recognition of persons in video documents. We also worked on the design of multimodal descriptors for violent scene detection in movies.
We worked on the optimization of visual descriptors, significantly reducing their size while increasing their performance in classification tasks, on methods for improving the efficiency of multi-label classification for large-scale multimedia collections, and on methods for the fusion of a large number of multimodal descriptors.

We worked on the detection of several concepts simultaneously in images or in video shots. We found that just combining detection scores using an appropriate function performs at least as well as directly training detectors for frequent combinations of two or three target concepts.
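A minimal sketch of score combination for a pair of concepts follows. The text does not specify which combination function was used; the geometric mean below is an assumption for illustration, and the shot identifiers and scores are invented.

```python
import math


def combine_scores(scores):
    """Geometric mean of per-concept detection scores in [0, 1]."""
    if not scores:
        raise ValueError("at least one score is required")
    return math.prod(scores) ** (1.0 / len(scores))


def rank_shots(shot_scores, concepts):
    """Rank shots by the combined score of the requested concepts."""
    combined = {
        shot: combine_scores([scores[c] for c in concepts])
        for shot, scores in shot_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Invented single-concept detector outputs for three video shots.
shot_scores = {
    "shot_1": {"beach": 0.9, "person": 0.4},
    "shot_2": {"beach": 0.8, "person": 0.8},
    "shot_3": {"beach": 0.1, "person": 0.9},
}
ranking = rank_shots(shot_scores, ["beach", "person"])
print(ranking)
```

The geometric mean penalizes shots where one of the concepts is almost certainly absent, which is why `shot_3` is ranked last despite its high "person" score. No detector is trained for the pair itself; only the outputs of the single-concept detectors are reused.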
We investigated the use of conceptual and temporal relations for concept detection in video shots. We exploited both the implicit (co-occurrences) and explicit (generic-specific, exclusion) relations between concepts. For the temporal aspect, we exploited the fact that, in videos, when a concept appears (or does not appear) in one shot, it is more likely (or less likely) to appear in the few previous or next shots. We obtained significant performance improvements for both types of relations.
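The temporal relation can be sketched as a re-scoring step in which the score of a concept in a shot is mixed with the average score of the same concept in neighbouring shots. The window size, the mixing weight, and the averaging itself are assumptions of this example, not the method actually evaluated.

```python
def temporal_rescore(scores, window=1, weight=0.5):
    """Mix each shot's score with the mean score of its temporal neighbours.

    scores: detection scores of one concept, in shot order
    window: number of shots considered on each side
    weight: share of the final score taken from the neighbours
    """
    rescored = []
    for i, score in enumerate(scores):
        start = max(0, i - window)
        end = min(len(scores), i + window + 1)
        neighbours = scores[start:i] + scores[i + 1:end]
        if neighbours:
            context = sum(neighbours) / len(neighbours)
            rescored.append((1 - weight) * score + weight * context)
        else:
            rescored.append(score)  # a single isolated shot is left unchanged
    return rescored


# Invented scores: the second shot looks like a missed detection.
scores = [0.9, 0.1, 0.9, 0.0, 0.0]
rescored = temporal_rescore(scores)
print(rescored)
```

The low score of the second shot is raised because the concept is confidently detected just before and after it, while the last shot stays at zero because its only neighbour also scores zero.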
We also worked on active learning for minimizing the annotation effort for person recognition in video documents.

In the work described above, we used deep learning, as it became the most efficient approach for most tasks, but we also conducted specific studies on its use in multimedia indexing. We compared engineered (classical) features with learned ones. We found that using many classical features can be on par with learned ones and that fusing both can perform even better. However, this approach is costly and unlikely to remain advantageous in the future, given the rapid progress of deep learning methods.
We also investigated the use of concept hierarchies for improving concept recognition.
While investigating whether the gain obtained was really due to the use of relations between concepts or to an ensemble learning effect, we discovered, indirectly, that ensemble learning with joint training could significantly improve detection accuracy.

Three PhDs related to multimedia indexing (Nadia Derbas, Abdelkader Hamadi, and Mateusz Budnik) were defended over the period, and one (Anuvabh Dutt) will be defended on December 17th, 2019.


Contact
MRIM
Laboratoire d'Informatique de Grenoble
Bâtiment IMAG
700 avenue Centrale
CS 40700, 38058 Grenoble Cedex 9 - France
Phone: +33 4 57 42 15 48
Group Leader: Georges QUÉNOT
News

Anuvabh Dutt is defending his PhD thesis on December 17th 2019.

On December 2, 2019, an article on algorithmic testing was published on the Binaire blog.

Kodicare (continuous evaluation of web search engines) ANR International Research Project accepted for funding. Cooperation with RSA Vienna and Qwant. 3 years, from 11/2019.

The paper "Quelques pas vers l'Honnêteté et l'Explicabilité de moteurs de recherche sur le Web" by Philippe Mulhem, Lydie du Bousquet, and Sara Lakah received the Best Paper award at the CORIA 2019 conference.

We organized the 40th European Conference on Information Retrieval (ECIR) in March 2018.