Indexing k-mers in linear-space for quality value compression

Shibuya, Yoshihiro; Comin, M.
2019

Abstract

Many bioinformatics tools rely heavily on k-mer dictionaries to describe the composition of sequences and to enable faster reference-free algorithms and look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring large amounts of storage to hold each k-mer. The problem is generally worsened by the need for an index to support fast queries. In this work we discuss how to build an indexed linear reference containing a set of input k-mers, and how to apply it to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which makes them difficult to compress. Here, we present an application that improves the compressibility of quality values while preserving the information needed for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: the software is freely available at https://github.com/yhhshb/yalff.
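The idea of using a k-mer dictionary to make quality values more compressible can be illustrated with a minimal sketch. It is not the implementation in the linked repository. It assumes a plain Python `set` as the dictionary, which is exactly the naive, memory-hungry structure the abstract argues against, and it assumes a simple smoothing rule: every base covered by a dictionary k-mer has its quality replaced by one constant symbol. The function names, the rule, and the constant `"I"` are hypothetical, and reverse complements are ignored.

```python
def kmers(seq, k):
    """Yield (start, k-mer) for every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def smooth_qualities(seq, qual, dictionary, k, high="I"):
    """Return a copy of qual where bases covered by a dictionary k-mer
    get the constant symbol `high` (assumed rule, for illustration only).

    `dictionary` is a plain set of strings here; the paper instead
    describes an index built over a linear reference.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    covered = [False] * len(seq)
    for start, kmer in kmers(seq, k):
        if kmer in dictionary:
            for j in range(start, start + k):
                covered[j] = True
    # Untouched positions keep their original quality symbol.
    return "".join(high if c else q for c, q in zip(covered, qual))


# Toy read: "ACG" occurs at positions 0 and 4, so positions 0-2 and 4-6
# are smoothed while positions 3 and 7 keep their original quality.
smoothed = smooth_qualities("ACGTACGT", "!#%'!#%'", {"ACG"}, k=3)
print(smoothed)
```

Replacing varied symbols with long runs of a single symbol lowers the entropy of the quality string, which is what makes it easier for a general-purpose compressor to shrink.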
BIOINFORMATICS 2019 - 10th International Conference on Bioinformatics Models, Methods and Algorithms, Proceedings; Part of 12th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2019
10th International Conference on Bioinformatics Models, Methods and Algorithms, BIOINFORMATICS 2019 - Part of 12th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2019

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3307237