Information Retrieval (IR) evaluation deeply relies on human-made relevance judgments. To overcome the high costs of the judgment collection process, a potential solution is to utilize LLMs as judges to replace human annotators. However, the validation of LLM-generated judgments is fundamental for informed use. Standard validation approaches typically rely on simple sampling techniques to collect a sample of the LLM-generated judgments and estimate the LLM agreement with the human. In this work, we propose using stratified sampling, a more sophisticated sampling strategy that, by leveraging appropriate stratification features, reduces human involvement in the validation process while still providing statistical guarantees on the human-LLM agreement estimate. Through the analysis of various candidate features, we identify the LLM-generated judgments themselves as the most promising one. Our approach achieves up to an 85% reduction in the required human involvement in the validation process.
Reducing Human Effort to Validate LLM Relevance Judgements via Stratified Sampling
Simone Merlo
;Stefano Marchesin
;Guglielmo Faggioli
;Nicola Ferro
2026
Abstract
Information Retrieval (IR) evaluation deeply relies on human-made relevance judgments. To overcome the high costs of the judgment collection process, a potential solution is to utilize LLMs as judges to replace human annotators. However, the validation of LLM-generated judgments is fundamental for informed use. Standard validation approaches typically rely on simple sampling techniques to collect a sample of the LLM-generated judgments and estimate the LLM agreement with the human. In this work, we propose using stratified sampling, a more sophisticated sampling strategy that, by leveraging appropriate stratification features, reduces human involvement in the validation process while still providing statistical guarantees on the human-LLM agreement estimate. Through the analysis of various candidate features, we identify the LLM-generated judgments themselves as the most promising one. Our approach achieves up to an 85% reduction in the required human involvement in the validation process.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.




