Frequent subgraph mining is a fundamental task in the analysis of collections of graphs. While several exact approaches have been proposed, it remains computationally challenging on large graph datasets due to its inherent link to the subgraph isomorphism problem and the huge number of candidate patterns even for fairly small subgraphs.In this work, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solution of the frequent subgraph mining problem. We also show that our bounds can be used for true frequent subgraph mining, which requires to identify subgraphs generated with probability above a given threshold from an unknown generative process using samples from such process. Our extensive experimental evaluation on real datasets shows that our bounds lead to efficiently computable, high-quality approximations for both applications.

VC-dimension and Rademacher Averages of Subgraphs, with Applications to Graph Mining

Vandin F.
2023

Abstract

Frequent subgraph mining is a fundamental task in the analysis of collections of graphs. While several exact approaches have been proposed, it remains computationally challenging on large graph datasets due to its inherent link to the subgraph isomorphism problem and the huge number of candidate patterns even for fairly small subgraphs.In this work, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solution of the frequent subgraph mining problem. We also show that our bounds can be used for true frequent subgraph mining, which requires to identify subgraphs generated with probability above a given threshold from an unknown generative process using samples from such process. Our extensive experimental evaluation on real datasets shows that our bounds lead to efficiently computable, high-quality approximations for both applications.
2023
Proceedings of the International Conference on Data Engineering
979-8-3503-2227-9
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11577/3496177
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact