SEUPD@CLEF: Team JIHUMING on Enhancing Search Engine Performance with Character N-Grams, Query Expansion, and Named Entity Recognition
Ferro N.
2023
Abstract
Our group proposes a novel search engine for the Longitudinal Evaluation of Model Performance (LongEval) task at CLEF 2023 [1]; it also serves as the final project for the Search Engines course at the University of Padova. Our work focuses on the short-term and long-term temporal persistence of system performance on a collection of English and French documents. Our approach considers both the English and French versions of the documents, processed with whitespace tokenization, stopword removal, and stemming. We generate character N-grams to identify word structures (such as prefixes or suffixes) that recur across documents. We also use query expansion with synonyms (in English) and Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER) to further refine our system. The similarity function used in our approach is BM25. Our system was developed in Java and relies primarily on the Lucene library. After extensive experiments with these techniques, we selected the five systems that produced the best results in terms of MAP and NDCG scores. We analyzed these five systems by examining their MAP, NDCG, and Rprec scores on the test data. Additionally, we performed a Two-Way ANOVA on the AP of these systems and compared the systems with each other using the Tukey Honestly Significant Difference (HSD) test. In summary, our analysis indicates that incorporating French queries enhances search results and that larger N-gram sizes improve effectiveness, while our NER approach negatively affects the scores.
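To make the character N-gram idea in the abstract concrete, the following is a minimal sketch in plain Java. It is not the team's implementation: the class name `CharNGrams`, its method names, and the choice to lowercase tokens are assumptions made only for illustration. It splits text on whitespace and emits every contiguous substring of length n from each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative helper (hypothetical name); not the code used in the paper. */
final class CharNGrams {

    /** Returns every contiguous substring of length n of the lowercased token. */
    static List<String> ngrams(String token, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        String t = token.toLowerCase(Locale.ROOT); // assumption: case-insensitive matching
        List<String> grams = new ArrayList<>();
        // Slide a window of width n over the token; tokens shorter than n yield nothing.
        for (int i = 0; i + n <= t.length(); i++) {
            grams.add(t.substring(i, i + n));
        }
        return grams;
    }

    /** Splits text on whitespace and collects the n-grams of each token in order. */
    static List<String> ngramsOfText(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                grams.addAll(ngrams(token, n));
            }
        }
        return grams;
    }
}
```

For example, `CharNGrams.ngrams("search", 3)` yields `sea`, `ear`, `arc`, `rch`, so "search" and "searching" share their first four trigrams even without stemming. In a Lucene pipeline the same effect would more likely come from an analyzer component such as `NGramTokenFilter` than from hand-written code; whether the team did so is not stated in the abstract.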