Boosting Metagenomic Classification with Reads Overlap Graphs
Cavattoni M.; Comin M.
2021
Abstract
Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems in analyzing a microbial sample is taxonomically annotating its reads to identify the species it contains. Most currently available methods classify reads using a set of reference genomes and their k-mers. While these methods have reached near-perfect precision, their sensitivity (the actual number of classified reads) is often poor. One reason is that the reads in a sample can differ substantially from the corresponding reference genomes; for example, viral genomes are usually highly mutated. To address this issue, in this paper we propose ClassGraph, a new taxonomic classification method that makes use of the reads overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets against several taxonomic classification tools, and the results showed improved sensitivity and F-measure while preserving high precision. ClassGraph improves classification accuracy especially in difficult cases, such as viral and real datasets, where traditional tools are unable to classify many reads. Availability: https://github.com/CominLab/ClassGraph
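To make the idea of refining an existing classification through the reads overlap graph more concrete, the following is a minimal sketch of one possible label propagation scheme. It is not taken from the ClassGraph code base: the function name `propagate_labels`, the edge format `(read_a, read_b, overlap_weight)`, the weighted majority vote, the synchronous update rounds, and the alphabetical tie-breaking are all illustrative assumptions. Reads already labeled by an upstream tool keep their label, and unlabeled reads adopt the label with the largest total overlap weight among their labeled neighbors.

```python
from collections import defaultdict


def propagate_labels(edges, labels, max_iters=10):
    """Spread taxon labels over a read overlap graph (illustrative sketch).

    edges  -- iterable of (read_a, read_b, overlap_weight) tuples (assumed format)
    labels -- dict mapping read id -> taxon for reads classified by an upstream tool
    Returns a dict with the original labels plus any propagated ones.
    """
    # Build an undirected weighted adjacency list.
    neighbours = defaultdict(list)
    for a, b, w in edges:
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))

    result = dict(labels)  # pre-classified reads are never overwritten
    for _ in range(max_iters):
        updates = {}
        for read in neighbours:
            if read in result:
                continue
            # Sum overlap weights per taxon over already-labeled neighbours.
            votes = defaultdict(float)
            for other, w in neighbours[read]:
                if other in result:
                    votes[result[other]] += w
            if votes:
                # Sorting first makes ties resolve to the alphabetically first taxon.
                updates[read] = max(sorted(votes), key=votes.get)
        if not updates:
            break  # nothing new was labeled in this round
        result.update(updates)  # synchronous update at the end of the round
    return result


# Hypothetical toy graph: r1 and r5 were classified by an existing tool.
edges = [
    ("r1", "r2", 50),
    ("r2", "r3", 40),
    ("r3", "r4", 30),
    ("r4", "r5", 60),
    ("r6", "r7", 20),  # component with no labeled read
]
labels = {"r1": "taxon_A", "r5": "taxon_B"}
result = propagate_labels(edges, labels)
```

In this toy graph, reads connected (directly or transitively) to a classified read receive a label, which is how sensitivity can rise, whereas `r6` and `r7` remain unclassified because their component contains no labeled read.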