SEUPD@CLEF: Team JIHUMING on Enhancing Search Engine Performance with Character N-Grams, Query Expansion, and Named Entity Recognition

Ferro N.
2023

Abstract

Our group proposes a novel search engine for the Longitudinal Evaluation of Model Performance (LongEval) task at CLEF 2023 [1]; it is also the final project for the Search Engines course at the University of Padova. Our system focuses on the short-term and long-term temporal persistence of the systems’ performance, for a collection of both English and French documents. Our approach considers both the English and French versions of the documents, processed with whitespace tokenization, stopword removal, and stemming. We generate character N-grams to identify recurring word structures (such as prefixes or suffixes) repeated across documents. We also use query expansion with synonyms (in English) and Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER) to further refine our system. The similarity function used in our approach is BM25. The system was developed in Java, primarily using the Lucene library. After extensive experiments with these techniques, we selected the five systems that produced the best results in terms of MAP and NDCG scores. We analyzed these five systems by examining their MAP, NDCG, and Rprec scores on the test data. Additionally, we performed a two-way ANOVA on the AP of these systems and compared the systems with each other using the Tukey Honestly Significant Difference (HSD) test. In summary, our analysis indicates that incorporating French queries enhances search results and that larger N-gram sizes improve effectiveness, while our NER approach negatively affects the scores.
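As an illustration of the character N-gram idea mentioned in the abstract, the following is a minimal sketch in plain Java. It is not the team's implementation: the class name `CharNGrams`, the method `charNGrams`, and the choice to return no N-grams for tokens shorter than `n` are assumptions made for this example, and the actual system relies on Lucene rather than hand-written code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, written only to illustrate character N-grams.
public class CharNGrams {

    // Returns every contiguous substring of length n in the token, left to right.
    // Assumption: tokens shorter than n yield an empty list.
    static List<String> charNGrams(String token, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [sea, ear, arc, rch]
        System.out.println(charNGrams("search", 3));
    }
}
```

With larger values of `n`, each N-gram carries more of the original word, which is one way to read the abstract's finding that larger N-gram sizes improved effectiveness.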
CEUR Workshop Proceedings
Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3506616