Exploring Manually Curated Annotations of Intrinsically Disordered Proteins with DisProt

DisProt is the major repository of manually curated data for intrinsically disordered proteins collected from the literature. Although lacking a stable tertiary structure under physiological conditions, intrinsically disordered proteins carry out a plethora of biological functions, some of them directly arising from their flexible nature. A growing number of scientific studies have been published during the last few decades in an effort to shed light on their unstructured state, their binding modes, and their functions. DisProt makes use of a team of expert biocurators to provide up-to-date annotations of intrinsically disordered proteins from the literature, making them available to the scientific community. Here we present a comprehensive description of how to use DisProt in different contexts and provide a detailed explanation of how to explore and interpret manually curated annotations of intrinsically disordered proteins. We describe how to search DisProt annotations, using both the web interface and the API for programmatic access. Finally, we explain how to visualize and interpret a DisProt entry, p53, a widely studied protein characterized by the presence of unstructured N-terminal and C-terminal regions. © 2020 Wiley Periodicals LLC.


INTRODUCTION
Intrinsically disordered proteins (IDPs) are characterized by the presence of unstructured and highly flexible segments, termed "intrinsically disordered regions" (IDRs), which lack a stable tertiary structure. IDRs can be detected by several biophysical and biochemical methods, among which X-ray crystallography and NMR spectroscopy are the most commonly used (Tompa, 2009; van der Lee et al., 2014). In X-ray crystal structures, regions of missing electron density correspond to atoms that fail to scatter X-rays coherently, which denotes their structural flexibility (Mészáros, Tompa, Simon, & Dosztányi, 2007; Uversky & Dunker, 2010). NMR spectroscopy is also widely used to assess the presence of unstructured protein segments, and it can recognize disordered regions that are visible in crystal structures only because of the formation of crystal contacts (Kobe et al., 2008; Mizutani et al., 2008). Several additional methods can assess the presence of intrinsic disorder in a protein, such as circular dichroism, sensitivity to proteolysis, and small-angle X-ray scattering (Uversky & Dunker, 2010). Moreover, several prediction methods have been developed over the years to characterize intrinsically disordered regions. These prediction tools are frequently based on the amino acid composition and sequence of a protein, or on environmental factors, such as pH and redox potential (Mészáros, Erdos, & Dosztányi, 2018).
Intrinsically disordered proteins can also exist as partially structured folding intermediates, pre-molten globules and molten globules, that exhibit a higher degree of secondary structure than random coils while being less compact than native structures (Ptitsyn, 1995; Ptitsyn, Bychkova, & Uversky, 1995; van der Lee et al., 2014). IDPs can play a crucial role in several biological processes, such as membrane localization and interaction with protein chaperones, to name a few (Uversky & Dunker, 2010). A main feature of IDPs is their peculiar mode of interaction. Interaction surfaces of IDPs are characterized by a unique set of physicochemical properties, e.g., a higher percentage of hydrophobic residues compared to the rest of the IDR (Jones & Thornton, 1996; Mészáros et al., 2007). They exhibit a larger exposed interaction area per residue, even in their folded state induced by binding, that they use to contact their physiological partners (Mészáros et al., 2007). The lack of structure in IDR segments in their unbound state provides a multiplicity of advantages due to their largely extended conformation, such as (1) the possibility for a single IDR to be involved in interactions with multiple structurally different partners, (2) several structured partners being able to bind to a single region, (3) an increased speed of interaction due to their ability to explore the interaction space, and (4) a reduced binding strength that allows for transient interactions (Mészáros et al., 2007, 2009; Uversky & Dunker, 2010). IDRs can undergo a disorder-to-order transition upon binding of a partner, enabling them to play a central role as protein hubs (Mészáros et al., 2007, 2011), as in the case of p53 (DisProt identifier: DP00086) and α-synuclein (DisProt identifier: DP00070).
Finally, IDPs can also be involved in the regulation of several biological processes, interacting with different types of binding partners such as proteins, nucleic acids, lipids and small molecules, therefore acting as molecular recognition effectors and assemblers (Cumberworth, Lamour, Babu, & Gsponer, 2013; Tompa, 2002, 2005; van der Lee et al., 2014). Strikingly, some of the best characterized and most crucial functions of IDPs arise from their flexible nature: they can be flexible linkers connecting structured domains of a protein, or they can act as entropic clocks, bristles, and springs due to their entropic features (Tompa, 2009; Uversky & Dunker, 2010).
DisProt is a service of the Italian node of ELIXIR, the European infrastructure for biological data, and a key resource for the recently established ELIXIR IDP user community (Davey et al., 2019). It is also the largest repository of manually curated annotations of intrinsically disordered proteins (IDPs) collected from the literature (Hatos et al., 2020; Piovesan et al., 2017). A team of expert DisProt curators looks for new data on IDPs/IDRs from relevant publications and annotates them through a dedicated curation interface by means of intrinsic disorder-related annotation terms that are codified into the IDP ontology. The IDP ontology (https://disprot.org/about) includes four main branches, corresponding to the four disorder aspects annotated in DisProt: structural state, structural transition, interaction partner, and disorder function. A DisProt entry corresponds to a protein isoform, and unambiguously maps to a UniProt entry. DisProt annotations describe local properties of the protein sequence (e.g., intrinsically disordered regions) which are always supported by experimental evidence taken from the literature. Each DisProt annotation is uniquely identified by the DisProt entry accession number followed by a suffix starting with a lowercase letter r (for example, DP00086r003).
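The identifier convention described above can be handled programmatically. The following is a minimal Python sketch, not part of DisProt itself: the function name `parse_disprot_id` is hypothetical, and the pattern ("DP" plus digits, optionally followed by "r" plus digits) is an assumption inferred only from the examples given in the text (DP00086, DP00086r003).

```python
import re

# Pattern inferred from the examples in the text (an assumption, not an
# official specification): "DP" + digits, optionally "r" + digits.
_DISPROT_ID = re.compile(r"(DP\d+)(?:r(\d+))?")


def parse_disprot_id(identifier):
    """Return (entry accession, region number or None) for a DisProt identifier."""
    match = _DISPROT_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a DisProt identifier: {identifier!r}")
    entry, region = match.groups()
    return entry, (int(region) if region is not None else None)
```

For example, `parse_disprot_id("DP00086r003")` separates the entry accession `DP00086` from region number 3, whereas an entry accession alone yields `None` as the region.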
In this article, we provide detailed protocols explaining how to perform a search in DisProt (Basic Protocol 1), visualize and interpret annotations of a DisProt entry (Basic Protocol 2), and submit new evidence of intrinsic disorder to DisProt (Basic Protocol 3). We also describe the downloading options in DisProt (Support Protocol 1) and programmatic access with the DisProt REST API (Support Protocol 2).

PERFORMING A SEARCH IN DISPROT
DisProt is freely accessible at the URL https://disprot.org/. This protocol describes how to search entries and retrieve information in DisProt. From the home page, users can also navigate the DisProt blog (https://disprot.org/blog) to read posts describing our updates or explore the DisProt Twitter account (https://twitter.com/disprot_db) (Fig. 1).

Necessary Resources
Hardware
While DisProt works best on laptop or desktop computers, it is also easily accessible from smartphones and tablets. An active and stable internet connection is required.

Input data
Free text search against the database

Current Protocols in Bioinformatics
Performing a text search
1a. Open a web browser and connect to DisProt at https://disprot.org/.
2a. Searches in DisProt can be performed either using the "Search" boxes on the top-right and top-middle of the DisProt home page or by clicking on the "Browse" button available on the top-left of the home page.
i. Users can perform a search using the "Search" boxes on the top-right or top-middle of the DisProt home page to look for protein entries or entries referencing a specific publication (Fig. 2).
Users can look for a specific protein, e.g., Alpha-synuclein from Homo sapiens, by submitting the protein name, e.g., Alpha-synuclein, or its corresponding UniProtKB accession number, P37840. They will be redirected to the corresponding DisProt entry, in this case DisProt identifier DP00070.
Users may also want to look for a specific publication. In this case, enter the corresponding PubMed identifier (PMID) of the publication in the search box. All entries that have at least one piece of evidence referencing that publication will be displayed.
ii. Alternatively, it is possible to perform an advanced search by clicking on the "Browse" button, available on the top-left of the home page (Fig. 3). Users will be redirected to an advanced search page, where they can refine their search and look for a single term or a combination of terms, e.g., a protein name and an organism.
3a. Select "Text search" on the top-left side of the Browse page, then select a term from the drop-down menu.
Users can look for the following aspects:
i. A specific protein: select "Protein name," e.g., "Alpha-synuclein," or "UniProt ACC," e.g., P37840.
ii. A set of proteins from a specific organism: choose an "Organism," e.g., "Gallus gallus," the "Taxon," or "NCBI Taxon ID."
iii. UniProt Reference Clusters (UniRef): UniRef databases cluster UniProtKB sequences by gathering together proteins based on their sequence similarity (Suzek et al., 2015). Terms available are "UniRef50," "UniRef90," and "UniRef100" (clustering the sequences at 50%, 90%, and 100% identity, respectively).
iv. Entries from a specific curator: select the "Curator name" term and start typing the name you are looking for.
v. A specific reference: users can look for a specific PMID, e.g., 8632448, by selecting the "Reference ID" term, or for the title of the corresponding publication, e.g., "Alternative arrangements of the protein chain are possible for the adenovirus single-stranded DNA binding protein," by selecting the "Reference name" term.
vi. A specific term from our ontology: select either "Disorder Ontology ID" or "Disorder Ontology name." Users who wish to gain better insight into the terms of our ontology and read their descriptions can refer to the Disorder Ontology description available from the URL https://disprot.org/about.
vii. It is also possible to perform a "free text" search by selecting the corresponding term in the drop-down menu.
4a. It is possible to customize the table columns to visualize more details of an entry in the displayed results. Default columns include "DisProt ID," "UniProt ACC," "protein name," "organism," and "disorder content." We suggest adding at least the "annotated terms" column to see which disorder aspects are available for each entry.
5a. Download the search results using the "Download selected" button at the top-left of the Browse page. File formats available for download are JSON, TSV, or FASTA. Users can also choose to include ambiguous and/or obsolete entries by selecting the corresponding buttons above "Download selected."
Performing a sequence similarity search
1b. Open a web browser and connect to DisProt at https:
2b. Click on the "Browse" button on the top-left side of the home page (Fig. 4) to be redirected to the advanced search page.
3b. Select "BLAST" on the top-left side of the Browse page to perform a BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990) sequence similarity search against DisProt entries.
4b. Insert a protein sequence in the corresponding box and click on "Submit."
5b. DisProt entries that match the query will be displayed in the results. Available columns are "DisProt ID," "UniProt ACC," "protein name," "organism," and "disorder content" along with "Bit-score," "E-value," "Identity," and "Coverage."
Quaglia et al.

PROGRAMMATIC ACCESS WITH DISPROT REST API
DisProt can be accessed programmatically via REST API to retrieve a single entry (or region) and to perform large-scale database searches. All API endpoints are available from the URL https://disprot.org/api/{endpoint_name}. In this support protocol, we introduce three different endpoints: the first one can be used to retrieve a single entry, the other two to search entries in the database. Please refer to https://disprot.org/help#api for all the tables describing identifiers, query parameters, and input/output formats mentioned in this support protocol.

Necessary Resources
Hardware
Laptop or desktop computer. An active and stable internet connection is required.

Input data
No input data are required.
1. Get a single entity.
Users can retrieve a single entity, i.e., a protein entry or one of its manually curated regions, by using its corresponding identifier. The following syntax must be used to retrieve a single entity from DisProt: disprot.org/api/{identifier}, where the "identifier" must be a valid DisProt ID, DisProt region ID, or a UniProt accession. The query is customizable with various parameters, e.g., file format and release. Here we provide two pieces of code to retrieve a single entry in JSON format (Sample code 1) and in FASTA format (Sample code 2). In Sample code 2, the API version of DisProt is also specified.
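As an illustration of the syntax above, the following minimal Python sketch builds the URL for a single entity and, optionally, downloads it using only the standard library. It is not the authors' sample code: the helper names `entity_url` and `fetch_entity` are hypothetical, the lowercase value `fasta` for the "format" parameter is an assumption, and any other query parameter name should be checked against https://disprot.org/help#api before use.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://disprot.org/api"


def entity_url(identifier, **params):
    """Build the URL for one entry or region; params become the query string."""
    url = f"{API_ROOT}/{urllib.parse.quote(identifier)}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def fetch_entity(identifier, api_version=None, **params):
    """Download one entity and return the response body as text.

    Requires network access. The "accept-version" header is the one named in
    the protocol text; the value to pass is left to the caller.
    """
    headers = {"accept-version": api_version} if api_version else {}
    request = urllib.request.Request(entity_url(identifier, **params), headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `entity_url("DP00086")` gives the default (JSON) endpoint for the p53 entry, and `entity_url("DP00086", format="fasta")` requests the FASTA output under the assumption stated above. The body returned by `fetch_entity` can be passed to `json.loads` when the JSON format is requested.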

Get results.
DisProt currently provides three output formats: JSON (default), FASTA, and TSV. Due to the inherent limitations of the FASTA and TSV file formats, the JSON format renders the most comprehensive description of intrinsic disorder. The TSV and FASTA files provide details about regions or different types of consensus.

Searching entries in DisProt database
1. Perform a text search.
DisProt provides an extensively customizable search engine. It is possible to perform a free text search or formulate complex queries against combined fields, e.g., organism and UniRef50. The search query is sent to https://disprot.org/api/search with URL parameters. Note that whitespace and other special characters must be percent-encoded; a space, for example, is replaced with "%20." Multiple search fields can be combined in the same query by joining them with an AND operator ("&" symbol), e.g., "https://disprot.org/api/search?organism=homo%20sapiens&name=kinase" returns all the human proteins with "kinase" in the protein name. Given that some fields are interpreted as regular expressions, it is also possible to use the OR operator ("|" symbol). This is the case for the following query: "https://disprot.org/api/search?organism=homo%20sapiens|mus%20musculus," which returns both human and mouse entries. The user can choose to customize the output format. Currently available output formats are JSON, FASTA, and TSV. By default, the endpoint returns the results in JSON; however, users can select another format using the "format" field in the parameters or headers. It is possible to use an older version of the API for legacy reasons by specifying "accept-version" in the header of a request. By default, the server responds with the latest version of the API.
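The encoding rules above can be sketched in Python as follows. This is an illustrative helper, not part of the DisProt API: the function name `search_url` is hypothetical, and only the field names that appear in the example queries above (`organism`, `name`) are used. Spaces are percent-encoded as "%20" and the "|" operator is left unescaped, as in the example queries.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://disprot.org/api/search"


def search_url(**fields):
    """Build a search URL; fields are joined with "&" (AND).

    quote_via=quote encodes spaces as %20 rather than "+", and safe="|"
    keeps the OR operator readable inside a field value.
    """
    query = urlencode(fields, quote_via=quote, safe="|")
    return f"{SEARCH_URL}?{query}"


# Reproduces the two example queries given in the text.
print(search_url(organism="homo sapiens", name="kinase"))
print(search_url(organism="homo sapiens|mus musculus"))
```

The resulting URL can be opened with any HTTP client. Some clients may additionally escape the "|" character as "%7C", which represents the same character.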

Get results.
DisProt returns an object with "data" and "size" fields. "Data" contains a list of entries, and these entry objects are the same as those described in the previous section. "Size" corresponds to the number of matched entries. Note that when the pagination parameters are provided, only the data field is affected, whereas the size field always refers to the full query result.
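The distinction between the "data" and "size" fields matters when results are paginated. The sketch below works on an already decoded response and makes no network call. The function name `page_summary` is hypothetical, the toy response contains empty placeholder entries rather than real DisProt records, and the names of the pagination parameters themselves are not assumed here (see https://disprot.org/help#api).

```python
import math


def page_summary(result, page_size):
    """Summarize one page of a search response.

    `result` is the decoded JSON object with "data" (entries on this page)
    and "size" (number of entries matched by the full query).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    returned = len(result["data"])
    total = result["size"]
    return {
        "returned": returned,
        "total": total,
        "pages": math.ceil(total / page_size),
    }


# Toy response: two placeholder entries on this page, five matches overall.
toy_result = {"data": [{}, {}], "size": 5}
print(page_summary(toy_result, page_size=2))
```

Because "size" always refers to the full query, the number of pages can be computed from the first response without downloading the remaining ones.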

Performing a sequence similarity search.
Users can also perform a BLAST sequence similarity search against the database with a POST request to https://disprot.org/api/blast. The output formats are the same as those available for the text search described above, i.e., JSON (by default), TSV, or FASTA. In addition, DisProt returns the corresponding "Bit-score," "E-value," "Identity," and "Coverage" as provided by BLAST.
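A POST request of this kind can be prepared with the Python standard library as sketched below. The text does not specify the request body, so both the JSON encoding and the payload field name `sequence` are assumptions that must be checked against https://disprot.org/help#api. The function name `blast_request` and the short example sequence are hypothetical. The request object is only built here, not sent.

```python
import json
import urllib.request

BLAST_URL = "https://disprot.org/api/blast"


def blast_request(sequence, field="sequence"):
    """Prepare (but do not send) a POST request for a BLAST search.

    ASSUMPTION: the body is JSON and the sequence is passed under `field`;
    neither detail is confirmed by the protocol text.
    """
    payload = json.dumps({field: sequence}).encode("utf-8")
    return urllib.request.Request(
        BLAST_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = blast_request("MEEPQSDPSV")  # arbitrary illustrative sequence
print(request.get_method(), request.full_url)
```

Once the payload format has been verified, the request can be sent with `urllib.request.urlopen(request)` and the response body decoded as for the text search.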

VISUALIZING AND INTERPRETING DISPROT ENTRIES: THE p53 USE CASE
Here we present a use case, human p53 (DisProt entry: DP00086), to explain how to visualize and interpret a DisProt entry page and its annotations. The human p53 entry, also shown in the home page examples, has been recently updated (DisProt release 2020_06) with more than 40 new annotations coming from 15 scientific articles. p53, one of the best characterized IDPs, is a tumor suppressor playing a crucial role in several cell functions, such as apoptosis and regulation of DNA repair (Tompa, 2009). p53 is a hub protein, involved in protein-protein interactions with a large number of partners (Uversky & Dunker, 2010). p53 is characterized by the presence of four domains; two of them, the N-terminal transactivation domain (TAD) and the C-terminal tetramerization and regulatory domain, are unstructured as determined by various methods such as NMR, circular dichroism, and SAXS (Tompa, 2009; Uversky & Dunker, 2010). Several experimental studies carried out in the last two decades have shed light on protein complexes involving p53. Specific short segments of the TAD and C-terminal domains of p53 undergo folding-upon-binding events with their partners and are sufficient for interaction-mediated functions (Mészáros et al., 2007). DisProt entries are annotated by a team of expert curators who aim at collecting all experimental evidence related to disorder available from a publication. In DisProt, an entry corresponds to a protein isoform, and each IDR annotation is a piece of evidence about its flexible nature or function. The minimal information required to annotate a region in DisProt includes a reference to the publication (PMID or DOI), the boundaries of the region (start and end position on the amino acid sequence), the experimental method, and the type of information, i.e., an IDP ontology term (structural state, structural transition, interaction partner, or disorder function).
In order to support annotations, when possible, curators report authors' statements as snippets of text from the corresponding publication. Finally, a selected team of reviewers carefully checks all annotations to ensure a high-quality standard. Each entry page consists of two main sections. The first provides information about the protein, and includes a feature viewer to visualize DisProt region annotations on the sequence. The second section lists all annotations in a tabular format.

Necessary Resources
Hardware
While DisProt works best on laptop or desktop computers, it is also easily accessible from smartphones and tablets. An active and stable internet connection is required.
3. Users can select the release they want to visualize from the "Release" drop-down menu on the top-right of the entry page. All the annotations described in this example correspond to the 2020_06 (latest) release of DisProt.
4. To show/hide ambiguous and/or obsolete regions of an entry, please check/uncheck their corresponding boxes on the top-right of the entry page.
5. The feature viewer, which can be expanded and collapsed, allows users to visualize region annotations on the sequence. By default, two tracks are shown, the first showing DisProt annotations and the other including domain data as defined by Pfam (El-Gebali et al., 2019), which provides conserved domain families, and Gene3D (Lewis et al., 2018), which provides globular domains. It is possible to expand the feature viewer to visualize the sub tracks and each disorder evidence annotated for a specific functional or structural aspect. By hovering over each region on the sequence viewer, a tool tip provides additional information such as annotated terms, identifiers, cross-references, the name of the curator who annotated the region, the experimental method, and the reference supporting that annotation.
6. Users can open ("toggle") the sequence viewer, which dynamically highlights amino acids of the selected IDR directly on the protein sequence.
7. It is also possible to select a subset of annotations using the "Filter" box under the sequence viewer.
The bottom section of the entry page lists all DisProt annotations. The N-terminal tail of p53 consists of a transactivation domain (TAD), described in the annotation DP00086r024 (Fig. 7), spanning from residues 1 to 93 of the protein sequence. The transactivation domain is composed of two subdomains, TAD I and TAD II, and was determined to be unstructured by Fersht and colleagues (Wells et al., 2008).
The TAD II subdomain is involved in the interaction with the pleckstrin homology (PH) domain from human TFIIH basal transcription factor complex p62 (Okuda & Nishimura, 2014). Binding of the TFIIH PH domain induces a disorder-to-order transition in p53. This interaction plays a crucial role in increasing the affinity of p53 for the transcriptional machinery, and might regulate its selectivity for the expression of various genes (Okuda & Nishimura, 2014), therefore supporting the function of p53 as a molecular recognition effector. These examples of a p53 interaction (Fig. 8), its transition (Fig. 9), and the function associated with this binding (Fig. 10) are shown in the region annotations available from the p53 entry in DisProt.

PROVIDING FEEDBACK AND SUBMITTING NEW INTRINSIC DISORDER-RELATED DATA
Feedback on site experience and on technical and/or data issues can be submitted using the DisProt feedback form (https://disprot.org/feedback). On the Feedback page, two tabs are available: "Leave a comment" (Fig. 11) and "Submit a new annotation" (Fig. 12). The first tab allows users to submit general feedback, the second to submit a new annotation. DisProt entries are annotated by a team of expert curators, and carefully reviewed by a small team of reviewers. However, the submission of new literature annotations by knowledgeable users is encouraged. Each submitted piece of evidence is reviewed by a team of reviewers and made available in the next release of the database.

Necessary Resources
Hardware
While DisProt works best on laptop or desktop computers, it is also easily accessible from smartphones and tablets. An active and stable internet connection is required.

Input data
No input data are required.

Figure 11
Feedback page: leave a comment. Users can provide feedback on site experience, bugs or issues with data.

Figure 12
Feedback page: curation mode. Users can submit new pieces of evidence of disorder from literature to the DisProt curators' team.
2. Click on the "Feedback" button on the top-right of the DisProt bar.
3. Provide your contact information, name and e-mail address, in the corresponding boxes.
Submitting feedback
4a. Select the "Leave a comment" tab.
5a. Add a subject for your message in the dedicated field, e.g., "technical issue."
6a. Use the "Message" box to add a detailed comment or feedback. The message must be at least 15 characters long.
7a. Click on the green "Send" button to send your feedback to the DisProt team.
Submitting new evidence of intrinsic disorder from the literature
4b. Select the "Submit a new annotation" tab.
5b. Provide an identifier of the protein you want to annotate, using the "DisProt ID" or the "UniProtKB ACC" boxes.
If the protein is already in DisProt, please provide the DisProt ID, e.g., DP00003.
If the protein is not yet annotated in DisProt, please provide its corresponding UniProt identifier, e.g., P03265.
6b. Provide a reference for the publication describing new evidence of intrinsic disorder for the protein of interest using the "Reference" box. The provided reference must be a valid PubMed ID or DOI of the publication.
7b. In the "Experimental method" box, add the method used to assess the presence of intrinsic disorder in the publication, e.g., NMR, circular dichroism, or small angle X-ray scattering.
8b. Add details about the intrinsically disordered region described in the publication. Users can add more than one intrinsically disordered region described in the publication by clicking on the "Add new region" button. To remove a region, click on the "Remove this region" button.
i. Provide the boundaries of the intrinsically disordered region: add the start position in the "Start" box and the end position in the "End" box. Region boundaries must correspond to those specified in the publication.
ii. In the "Statement" box, add a sentence from the publication that describes the intrinsically disordered region.
9. Click on the green "Send" button to submit your annotation to the DisProt team.

GUIDELINES FOR UNDERSTANDING RESULTS
In DisProt, a team of expert biocurators manually curates experimental intrinsic disorder data from peer-reviewed publications. Each DisProt entry corresponds to a UniProt entry, i.e., the canonical sequence or one of its isoforms. An entry consists of a set of manually curated intrinsically disordered regions; each of them is a piece of evidence, together with all the information about its flexible nature. The minimal information included in a piece of evidence is the reference (PMID or DOI) to a scientific publication, an experimental method associated with the IDR, the start and end positions of the region, and a disorder aspect associated with the IDR. Four possible disorder aspects can be annotated in DisProt, covering the main features of an IDR: the structural state, the structural transition, its interaction partner, and the disorder function. Each of the aforementioned branches consists of parent terms and their child terms, e.g., in the "disorder function" branch the parent term entropic chain incorporates six child terms: flexible linker/spacer, entropic bristle, entropic spring, entropic clock, structural mortar, and self-transport through channel. Curators also add statements, i.e., sentences from the publication that support the disordered nature of the region or one of its aspects, to provide users with an exhaustive description of each protein region. A standardized curation effort is one of the main goals of DisProt: in line with this, DisProt curators benefit from a detailed curation manual describing all the rules for annotating in DisProt and every aspect related to the curation process, along with a dedicated ontology of intrinsic disorder-related terms.
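The minimal information listed above can be modeled as a small record with basic consistency checks. The following Python sketch is only an illustration of that description and does not reflect the actual DisProt data schema: the class name `Evidence`, its field names, and the placeholder values in the usage line are all hypothetical.

```python
from dataclasses import dataclass

# The four disorder aspects named in the text.
ASPECTS = (
    "structural state",
    "structural transition",
    "interaction partner",
    "disorder function",
)


@dataclass(frozen=True)
class Evidence:
    """Minimal description of one curated region (hypothetical field names)."""

    reference: str  # PMID or DOI of the publication
    method: str     # experimental method, e.g., "NMR"
    start: int      # first residue of the region (1-based)
    end: int        # last residue of the region (inclusive)
    aspect: str     # one of ASPECTS

    def __post_init__(self):
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown disorder aspect: {self.aspect!r}")
        if not 1 <= self.start <= self.end:
            raise ValueError("region boundaries must satisfy 1 <= start <= end")

    @property
    def length(self):
        """Number of residues covered by the region."""
        return self.end - self.start + 1


# Placeholder values for illustration only, not a real DisProt record.
example = Evidence("PMID-PLACEHOLDER", "NMR", 1, 93, "structural state")
print(example.length)
```

Validating the aspect and the boundaries at construction time mirrors the requirement that every annotation carries all four pieces of minimal information before it is reviewed.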