Efficient analysis of overdispersed data using an accurate computation of the Dirichlet multinomial distribution

Migliardi, Mauro; Languasco, Alessandro; Baddar, Sherenaz Al-Haj

doi:10.1109/TPAMI.2024.3489645

Modeling count data with suitable statistical distributions has been instrumental in analyzing the patterns such data convey. However, failing to address critical aspects such as overdispersion jeopardizes the effectiveness of the analysis. In this paper, overdispersed count data are modeled with the Dirichlet Multinomial (DM) distribution, whose parameters are estimated by maximizing the likelihood with a fixed-point iteration algorithm. The estimation is used to compare two procedures that address the well-known computational difficulties of evaluating the DM log-likelihood: the recent Languasco-Migliardi (LM) procedure and the Yu-Shaw (YS) procedure. Experiments were conducted on multiple datasets from different domains, spanning polls, images, and IoT network traffic. In all experiments the LM procedure estimated the DM parameters at the designated level of accuracy, whereas the YS procedure failed to produce sufficiently accurate results (or any results at all) in several of them. Moreover, the LM procedure achieved a speedup over YS ranging from 2-fold to 20-fold.
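
To make the estimation problem concrete, the following is a minimal sketch of maximum-likelihood fitting of DM parameters by fixed-point iteration. It is not the LM or the YS procedure: the log-likelihood is evaluated with the textbook log-gamma formula, which is exactly the kind of evaluation that can lose accuracy through cancellation, and the update is the widely used fixed-point rule attributed to Minka, which is assumed here only for illustration and may differ from the iteration used in the paper. The function names, the starting point, the tolerance, and the sample counts are all hypothetical. The sketch also assumes integer counts and that every category has at least one nonzero count.

```python
import math


def dm_log_likelihood(counts, alpha):
    """Naive DM log-likelihood of a list of count vectors (textbook formula)."""
    a_sum = sum(alpha)
    total = 0.0
    for row in counts:
        n = sum(row)
        # Multinomial coefficient numerator and the Gamma(A) / Gamma(n + A) term.
        total += math.lgamma(n + 1) + math.lgamma(a_sum) - math.lgamma(n + a_sum)
        for x, a in zip(row, alpha):
            total += math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
    return total


def digamma_diff(a, n):
    """psi(a + n) - psi(a) for an integer n >= 0, using the digamma recurrence."""
    return sum(1.0 / (a + j) for j in range(n))


def fit_dm_fixed_point(counts, alpha0=None, tol=1e-10, max_iter=10000):
    """Fixed-point update (assumed, Minka-style) for the DM parameters."""
    k = len(counts[0])
    alpha = list(alpha0) if alpha0 is not None else [1.0] * k
    for _ in range(max_iter):
        a_sum = sum(alpha)
        denom = sum(digamma_diff(a_sum, sum(row)) for row in counts)
        new_alpha = [
            a * sum(digamma_diff(a, row[j]) for row in counts) / denom
            for j, a in enumerate(alpha)
        ]
        # Stop when the largest parameter change falls below the tolerance.
        if max(abs(n - o) for n, o in zip(new_alpha, alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha


# Hypothetical overdispersed counts: 4 observations over 3 categories.
counts = [[10, 0, 2], [1, 8, 3], [0, 2, 11], [6, 5, 1]]
alpha_hat = fit_dm_fixed_point(counts)
print(alpha_hat, dm_log_likelihood(counts, alpha_hat))
```

Writing the digamma differences as finite sums keeps the sketch free of external dependencies, but its cost grows with the size of the counts, so it is only suitable for small illustrative inputs.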