
Boosting Metagenomic Classification with Reads Overlap Graphs

Cavattoni M.; Comin M.
2021

Abstract

Current technologies allow microbial communities to be sequenced directly from the environment, without prior culturing. A major problem in analyzing a microbial sample is taxonomically annotating its reads in order to identify the species it contains. Most currently available methods classify reads using a set of reference genomes and their k-mers. These methods achieve near-perfect precision, but their sensitivity (the number of reads actually classified) is often poor. One reason is that the reads in a sample can differ substantially from the corresponding reference genomes; viral genomes, for example, are usually highly mutated. To address this issue, we propose ClassGraph, a new taxonomic classification method that uses the reads overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets against several taxonomic classification tools; the results show improved sensitivity and F-measure while preserving high precision. ClassGraph improves classification accuracy especially in difficult cases, such as viral and real datasets, where traditional tools fail to classify many reads. Availability: https://github.com/CominLab/ClassGraph
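The idea of refining an existing classification by propagating labels over a reads overlap graph can be illustrated with a minimal sketch. This is not the ClassGraph implementation: the function name `propagate_labels`, the use of overlap length as edge weight, the synchronous update rule, and the tie-breaking are all assumptions made for illustration. Reads already labeled by an upstream classifier keep their labels, and each unlabeled read takes the label with the highest total overlap weight among its labeled neighbors.

```python
from collections import Counter


def propagate_labels(edges, initial_labels, max_iters=10):
    """Illustrative label propagation on a reads overlap graph.

    edges: iterable of (read_a, read_b, weight) tuples; weight is assumed
           to be an overlap length (hypothetical choice).
    initial_labels: dict mapping read id -> taxon, as produced by some
                    upstream classifier for the reads it could classify.
    Returns a dict with the initial labels plus any propagated ones.
    """
    # Build an undirected adjacency list.
    neighbors = {}
    for a, b, w in edges:
        neighbors.setdefault(a, []).append((b, w))
        neighbors.setdefault(b, []).append((a, w))

    labels = dict(initial_labels)
    for _ in range(max_iters):
        updates = {}
        for read in sorted(neighbors):
            if read in labels:
                continue  # already classified reads are left untouched
            votes = Counter()
            for other, w in neighbors[read]:
                if other in labels:
                    votes[labels[other]] += w
            if votes:
                # Highest total weight wins; ties go to the smallest label.
                best = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
                updates[read] = best
        if not updates:
            break  # nothing left to propagate
        labels.update(updates)  # synchronous update after each round
    return labels


# Toy example: r1 and r5 were classified upstream, r2-r4 were not.
edges = [("r1", "r2", 50), ("r2", "r3", 40), ("r3", "r4", 10), ("r4", "r5", 60)]
initial = {"r1": "taxA", "r5": "taxB"}
result = propagate_labels(edges, initial)
```

In this toy graph, r2 and r4 are labeled in the first round from their classified neighbors, and r3 is labeled in the second round by the heavier of its two overlaps. Reads with no path to a classified read stay unclassified, which is one way such a scheme can raise sensitivity without guessing labels for isolated reads.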
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
978-3-030-91414-1
978-3-030-91415-8

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3411066