Hubs are a few points that frequently appear in the k-nearest neighbors (kNN) of many other points in a high-dimensional data set. The hubs' effects, called the hubness phenomenon, degrade the performance of kNN based models in high dimensions. We present SamHub, a simple sampling approach to efficiently identify hubs with theoretical guarantees. Apart from previous works based on approximate kNN indexes, SamHub is generic and applicable to any distance measure with negligible additional memory footprint. Empirically, by sampling only 10% of points, SamHub runs significantly faster and offers higher accuracy than existing hub detection methods on many real-world data sets with dot product, L1, L2, and dynamic time warping distances. Our ablation studies of SamHub on improving kNN-based classification show potential for other high-dimensional data analysis tasks.

On Finding Hubs in High Dimensions with Sampling

Silvestri F.;
2025

Abstract

Hubs are a few points that frequently appear in the k-nearest neighbors (kNN) of many other points in a high-dimensional data set. The hubs' effects, called the hubness phenomenon, degrade the performance of kNN based models in high dimensions. We present SamHub, a simple sampling approach to efficiently identify hubs with theoretical guarantees. Apart from previous works based on approximate kNN indexes, SamHub is generic and applicable to any distance measure with negligible additional memory footprint. Empirically, by sampling only 10% of points, SamHub runs significantly faster and offers higher accuracy than existing hub detection methods on many real-world data sets with dot product, L1, L2, and dynamic time warping distances. Our ablation studies of SamHub on improving kNN-based classification show potential for other high-dimensional data analysis tasks.
2025
Proc. 39th Annual AAAI Conference on Artificial Intelligence
Annual AAAI Conference on Artificial Intelligence
   Big-Mobility
   Uni-Impresa

   AHeAD
   MUR PRIN
   20174LF3T8

   Marsden Fund
   MFP-UOA2226
File in questo prodotto:
File Dimensione Formato  
33261-Article Text-37329-1-2-20250410 (3).pdf

accesso aperto

Tipologia: Published (Publisher's Version of Record)
Licenza: Accesso libero
Dimensione 540.35 kB
Formato Adobe PDF
540.35 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3553501
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
  • OpenAlex 1
social impact