Provenance-driven nanopublications: representing source lineage and trust networks for multi-source assertions

Menotti, L.; Marchesin, S.; Giachelle, F.; Silvello, G.

doi:10.1007/s00799-025-00431-x

Nanopublishing is a paradigm enabling the representation of scientific claims in a distinctive, identifiable, citable, and reusable format, i.e., as a named graph. This approach can be applied to sentences extracted from scientific publications or to triples within a Knowledge Base (KB). This way, one can track the provenance of assertions derived from a specific publication or database. However, nanopublications do not natively support multi-source scientific claims generated by aggregating different bodies of knowledge. This work extends the nanopublication model with knowledge provenance, capturing provenance information for assertions derived by an aggregation algorithm or a truth discovery process, e.g., an information extraction system aggregating several sources of knowledge to populate a KB. In these cases, provenance information cannot be attributed to a single source; it results from an ensemble of evidence that can comprise supporting and conflicting pieces of evidence and truth values. Knowledge provenance is represented as a named graph following the PROV-K ontology, developed for this purpose. To show how knowledge provenance applies to a real-world scenario, we serialized gene expression-cancer associations generated by the Collaborative Oriented Relation Extraction (CORE) System. To demonstrate the value of trust relationships, we present a use case leveraging an existing scientific KB to construct a trust network employing three Large Language Model (LLM) agents. We analyzed the ability of LLMs to evaluate trustworthiness, exploiting techniques from KB accuracy estimation. We published 197,511 assertions generated by the CORE system in the form of extended nanopublications with knowledge provenance. PROV-K also defines trust relationships between agents or between an agent and a proposition. Starting from these assertions, we leveraged external agents, namely multiple LLMs, to assess their trusted truth value.
Based on these values, we defined trust relationships between the agents and the facts, yielding an exemplar trust network comprising over 45,000 facts and four agents. The knowledge provenance graph makes it possible to track the provenance of each piece of evidence that supports or refutes an assertion. To capture the semantics of this new graph, we define the PROV-K ontology, designed to represent provenance information for multi-source assertions. The two use cases serve as templates showing how to serialize extended nanopublications and showcase the capabilities of trust relationships.
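
As a rough illustration of the idea of an assertion backed by an ensemble of supporting and conflicting evidence, with trust links from agents to the assertion, the following minimal Python sketch may help. It is not the PROV-K ontology or the CORE system: every class, field, and identifier here (Evidence, ExtendedNanopub, the example gene and source names) is hypothetical, and the majority vote is a stand-in for whatever aggregation or truth discovery method is actually used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence from a single source (hypothetical structure)."""
    source: str     # identifier of the originating publication or database
    supports: bool  # True if it supports the assertion, False if it conflicts


@dataclass
class ExtendedNanopub:
    """Toy stand-in for a nanopublication extended with knowledge provenance."""
    assertion: Tuple[str, str, str]  # (subject, predicate, object)
    evidence: List[Evidence] = field(default_factory=list)
    # Trust links: agent name -> truth value that agent assigns to the assertion.
    trust: Dict[str, bool] = field(default_factory=dict)

    def supporting(self) -> List[Evidence]:
        return [e for e in self.evidence if e.supports]

    def conflicting(self) -> List[Evidence]:
        return [e for e in self.evidence if not e.supports]

    def aggregated_truth(self) -> bool:
        # Assumption: simple majority vote, used only for illustration.
        return len(self.supporting()) > len(self.conflicting())


# Hypothetical example data; the identifiers are placeholders, not real records.
nanopub = ExtendedNanopub(
    assertion=("gene:EXAMPLE", "expressionAssociatedWith", "disease:EXAMPLE"),
    evidence=[
        Evidence("source:A", supports=True),
        Evidence("source:B", supports=True),
        Evidence("source:C", supports=False),
    ],
)
nanopub.trust["agent:llm-1"] = True
nanopub.trust["agent:llm-2"] = False

print(len(nanopub.supporting()), len(nanopub.conflicting()))
print(nanopub.aggregated_truth())
```

The point of the sketch is only that the truth value of the assertion is traced back to every contributing source, and that each agent's judgement is recorded as a separate trust link rather than overwriting the aggregated value.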