Fast and Succinct Compression of k-mer Sets with Plain Text Representation of Colored de Bruijn Graphs

Rossignolo, Enrico; Comin, Matteo
2026

Abstract

A fundamental operation in computational genomics is the reduction of input sequences into their constituent k-mers. Designing space-efficient ways to represent a k-mer collection is essential to improve the scalability of bioinformatics analyses. A widely used approach converts the k-mer set into a de Bruijn graph and then produces a compact plain text representation by identifying a minimum path cover. In this article, we present USTAR-CR, a novel algorithm for compressing multiple k-mer sets. USTAR-CR leverages node connectivity principles in the colored de Bruijn graph to obtain a more compact plain text representation, combined with an efficient encoding of k-mer colors. We tested USTAR-CR on real read datasets and compared it with the state-of-the-art GGCAT. USTAR-CR achieved better compression while requiring less memory and running significantly faster (up to 51x). Availability: https://github.com/enricorox/USTAR-CR
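
As a rough illustration of the general idea mentioned in the abstract (reducing a sequence to k-mers and then writing them back as a few longer strings that cover the de Bruijn graph), the following minimal Python sketch may help. It is not the USTAR-CR algorithm: it uses a simple greedy extension rather than connectivity-based heuristics, it does not guarantee a minimum path cover, and it ignores reverse complements and colors. The function names `kmers` and `greedy_path_cover` are hypothetical and chosen only for this example.

```python
def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_path_cover(kmer_set, alphabet="ACGT"):
    """Greedily glue k-mers that overlap by k-1 characters into longer strings.

    Every input k-mer ends up in exactly one output string, so the strings
    form a path cover of the de Bruijn graph (not necessarily a minimum one).
    Simplification: reverse complements are not considered.
    """
    unused = set(kmer_set)
    if not unused:
        return []
    k = len(next(iter(unused)))
    paths = []
    while unused:
        path = min(unused)  # deterministic choice of the starting k-mer
        unused.remove(path)
        # Extend to the right while an unused successor k-mer exists.
        while True:
            suffix = path[len(path) - k + 1:]
            nxt = next((suffix + c for c in alphabet if suffix + c in unused), None)
            if nxt is None:
                break
            unused.remove(nxt)
            path += nxt[-1]
        # Extend to the left while an unused predecessor k-mer exists.
        while True:
            prefix = path[:k - 1]
            prev = next((c + prefix for c in alphabet if c + prefix in unused), None)
            if prev is None:
                break
            unused.remove(prev)
            path = prev[0] + path
        paths.append(path)
    return paths


ks = kmers("ACGTACGA", 3)
paths = greedy_path_cover(ks)
print(sorted(ks))
print(paths)
```

In this toy input, the five distinct 3-mers would take 15 characters if listed one by one, while the single covering string takes 7. Tools in this area aim to shrink that plain text further on real data and, for multiple k-mer sets, to also store which dataset (color) each k-mer belongs to.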
Lecture Notes in Computer Science
13th International Conference on Computational Advances in Bio and Medical Sciences, ICCABS 2025
9783032024886

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3582099