CLsquared: A Cleaning and Clustering Tool for Viral Genomic Data

Mazzotti, G.; Bado, M.; Lavezzo, E.; Toppo, S.

doi:10.2174/0115748936416627250905170048

Introduction: During the COVID-19 pandemic, millions of viral genomic sequences were produced and deposited in public databanks. This unprecedented volume of data introduced inaccuracies and errors requiring effective management to ensure reliable scientific outcomes. Despite this, no bioinformatics tools have been developed specifically to comprehensively filter viral genomic datasets.

Methods: To address this need, we developed CLsquared, a tool suite implemented in Python3 and Bash for the selection of high-quality viral sequences. CLsquared flags sequences exhibiting unverified mutation patterns or metadata. It offers fully customizable filtering parameters and is adaptable to both public and private datasets. The tool supports multiprocessing, significantly reducing runtime on multi-core systems.

Results: CLsquared detects ambiguous, biologically implausible, and underrepresented mutation sets. Its modular architecture ensures efficient processing of large-scale datasets, optimizing both speed and memory usage.

Discussion: By systematically addressing sequencing and annotation errors, CLsquared fills a critical gap in current viral bioinformatics workflows. Its flexible and scalable design supports diverse research applications, improving data quality and reproducibility.

Conclusion: CLsquared is a robust resource for researchers working with large volumes of viral sequence data. It is freely available on GitHub (https://github.com/giorgia-m-95/CLsquared-multiprocessing and https://github.com/giorgia-m-95/CLsquared-base) and Docker Hub (giorgiam95/clsquared_parallel and giorgiam95/clsquared_base).
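
As an illustration of the general idea of parameterized, parallel sequence filtering mentioned in the Methods, the following is a minimal Python sketch. It is not CLsquared's code and does not reproduce its criteria: the function names, the thresholds (`max_ambiguous`, `min_length`), and the rule of counting any non-ACGT character as ambiguous are all assumptions made for this example.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Assumption for this sketch: any character other than A, C, G, T is "ambiguous".
UNAMBIGUOUS = set("ACGT")


def ambiguous_fraction(sequence: str) -> float:
    """Return the fraction of characters that are not A, C, G, or T."""
    if not sequence:
        return 1.0  # treat an empty sequence as fully ambiguous
    ambiguous = sum(1 for base in sequence.upper() if base not in UNAMBIGUOUS)
    return ambiguous / len(sequence)


def passes_filter(record, max_ambiguous=0.05, min_length=10):
    """Check one (id, sequence) record against the hypothetical thresholds."""
    seq_id, sequence = record
    ok = (len(sequence) >= min_length
          and ambiguous_fraction(sequence) <= max_ambiguous)
    return seq_id, ok


def filter_records(records, max_ambiguous=0.05, min_length=10, workers=1):
    """Return the ids of records that pass; use a process pool if workers > 1."""
    check = partial(passes_filter,
                    max_ambiguous=max_ambiguous, min_length=min_length)
    if workers <= 1:
        results = [check(r) for r in records]  # serial path
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(check, records))  # order is preserved
    return [seq_id for seq_id, ok in results if ok]


if __name__ == "__main__":
    demo = [
        ("seq1", "ACGTACGTACGTACGTACGT"),  # clean
        ("seq2", "ACGTNNNNNNACGTACGTAC"),  # 30% ambiguous
        ("seq3", "ACGT"),                  # too short
    ]
    print(filter_records(demo, workers=2))
```

Each record is checked independently of the others, which is what makes this kind of filter easy to spread across cores; the serial path is kept so the same function can run on small inputs without starting worker processes.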