CLsquared: A Cleaning and Clustering Tool for Viral Genomic Data

Lavezzo E. (Methodology); Toppo S. (Supervision)
2025

Abstract

Introduction: During the COVID-19 pandemic, millions of viral genomic sequences were produced and deposited in public databanks. This unprecedented volume of data introduced inaccuracies and errors requiring effective management to ensure reliable scientific outcomes. Despite this, no bioinformatics tools have been developed specifically to comprehensively filter viral genomic datasets.

Methods: To address this need, we developed CLsquared, a tool suite implemented in Python3 and Bash for the selection of high-quality viral sequences. CLsquared flags sequences exhibiting unverified mutation patterns or metadata. It offers fully customizable filtering parameters and is adaptable to both public and private datasets. The tool supports multiprocessing, significantly reducing runtime on multi-core systems.

Results: CLsquared detects ambiguous, biologically implausible, and underrepresented mutation sets. Its modular architecture ensures efficient processing of large-scale datasets, optimizing both speed and memory usage.

Discussion: By systematically addressing sequencing and annotation errors, CLsquared fills a critical gap in current viral bioinformatics workflows. Its flexible and scalable design supports diverse research applications, improving data quality and reproducibility.

Conclusion: CLsquared is a robust resource for researchers working with large volumes of viral sequence data. It is freely available on GitHub (https://github.com/giorgia-m-95/CLsquared-multiprocessing and https://github.com/giorgia-m-95/CLsquared-base) and Docker Hub (giorgiam95/clsquared_parallel and giorgiam95/clsquared_base).
   Genomic epidemiology and phylogenesis of SARS-CoV-2 in Italy
   TOPP_PRIN22_01
   MIUR
   145000

   Next Generation EU, Mission 4, Component 2
   European Union

   Next Generation EU, Mission 1, Component 8
   European Union
File in this record:
BMS-CBIO-2025-197.pdf (open access)
Type: Published (Publisher's Version of Record)
License: Creative Commons
Size: 2.21 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3573267