About / Help
Bergeron D, Paraqindes H, Fafard-Couture É, Deschamps-Francoeur G, Faucher-Giguère L, Bouchard-Bourelle P, Abou Elela S, Catez F, Marcel V, Scott MS. (2023) snoDB 2.0: an enhanced interactive database, specializing in human snoRNAs. Nucleic Acids Res. 51(D1):D291-D296.
In order to remove redundancy in snoRNAs, we merged snoRNA entries from all databases using genomic location. In case of overlap, we prioritize genomic location based on the following order: Ensembl, RefSeq, snoRNA Atlas, snOPY and new snoRNAs from the literature.
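The merging rule above can be illustrated with a minimal Python sketch. It is not the snoDB pipeline: the `Entry` record, the `merge_entries` helper, the 1-based inclusive coordinates, the requirement that overlapping entries share a strand, and the "literature" source label are all assumptions made for illustration. Only the priority order comes from the text.

```python
from dataclasses import dataclass

# Priority order stated in the text (lower index = higher priority).
# The label "literature" is a hypothetical name for the last category.
PRIORITY = ["Ensembl", "RefSeq", "snoRNA Atlas", "snOPY", "literature"]


@dataclass(frozen=True)
class Entry:
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int
    strand: str
    source: str


def overlaps(a: Entry, b: Entry) -> bool:
    """True if both entries share chromosome and strand and their ranges intersect."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def merge_entries(entries):
    """Keep one entry per overlapping location, preferring the highest-priority source."""
    kept = []
    # Visit entries from highest to lowest priority, so an entry is kept only
    # if no higher-priority entry already occupies its location.
    for e in sorted(entries, key=lambda e: PRIORITY.index(e.source)):
        if not any(overlaps(e, k) for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: (e.chrom, e.start))


entries = [
    Entry("chr1", 100, 200, "+", "RefSeq"),
    Entry("chr1", 150, 210, "+", "Ensembl"),   # overlaps the RefSeq entry
    Entry("chr1", 500, 600, "+", "snOPY"),
    Entry("chr1", 150, 210, "-", "snOPY"),     # opposite strand, kept separately
]
merged = merge_entries(entries)
```

In this toy input, the RefSeq entry is dropped because an overlapping Ensembl entry exists, while the entry on the opposite strand is treated as a distinct locus.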
The conservation data come from snoRNA Atlas and we also provide the PhastCons conservation score for 100 vertebrates (obtained from the UCSC genome browser) for each snoRNA.
snoRNA motif sequence and position were taken from snoRNA Atlas when available. Otherwise, they were predicted using a custom algorithm. Briefly, for boxes C and D, we looked for a perfect RUGAUGA at the 5’ end (in the first 25 nt, or 30 nt if the snoRNA is 80 nt or more) and a perfect CUGA at the 3’ end (in the last 20 nt), respectively. If nothing was found, we looked for degenerate sequences (in the same end regions), with a Hamming distance of up to 3 for box C and up to 2 for box D. For boxes C’ and D’, we looked for the least degenerate D’-C’ pairs (minimal sum of Hamming distances from the consensus motifs), separated by at least 2 nt and lying as far apart from each other as possible. For ACA boxes, we looked for a perfect ACA motif in the last 10 nucleotides of the snoRNA sequence. For H boxes, we folded each H/ACA box snoRNA using RNAfold (ViennaRNA) and looked for ANANNA motifs in an unpaired region between hairpin regions. For all boxes, if nothing was found, nothing is displayed in the detailed view.
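As an illustration of the box C and box D search described above, here is a minimal Python sketch. It is not the custom algorithm itself: the function names are hypothetical, and the choice of the leftmost window when several windows are equally close to the consensus is an assumption the text does not specify. Because a perfect match has a Hamming distance of 0, taking the closest window within the allowed distance covers both the perfect and the degenerate search.

```python
def hamming_to_consensus(window: str, consensus: str) -> int:
    """Count mismatches; 'R' in the consensus matches A or G."""
    return sum(
        1 for base, ref in zip(window, consensus)
        if not (base == ref or (ref == "R" and base in "AG"))
    )


def best_match(region: str, consensus: str, max_dist: int):
    """Return (offset, motif, distance) for the closest window, or None.

    Ties are resolved in favour of the leftmost window (an assumption).
    """
    best = None
    k = len(consensus)
    for i in range(len(region) - k + 1):
        window = region[i:i + k]
        d = hamming_to_consensus(window, consensus)
        if d <= max_dist and (best is None or d < best[2]):
            best = (i, window, d)
    return best


def find_box_c_and_d(seq: str):
    """Search box C near the 5' end and box D near the 3' end of a snoRNA."""
    seq = seq.upper().replace("T", "U")
    c_len = 30 if len(seq) >= 80 else 25      # 5' search region from the text
    three_prime = seq[-20:]                   # 3' search region from the text
    offset_3p = len(seq) - len(three_prime)
    box_c = best_match(seq[:c_len], "RUGAUGA", 3)
    box_d = best_match(three_prime, "CUGA", 2)
    if box_d is not None:
        # Report the box D offset relative to the full sequence (0-based).
        box_d = (box_d[0] + offset_3p, box_d[1], box_d[2])
    return box_c, box_d


# Toy sequence: box C at offset 2, box D at offset 50.
toy = "GCAUGAUGAC" + "A" * 40 + "CUGAGC"
box_c, box_d = find_box_c_and_d(toy)
```

When no window is within the allowed distance, the helper returns `None`, which mirrors the statement that nothing is displayed when nothing is found.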
*Only guide regions with 2 or more sources are highlighted in the detailed view page.
Host genes were determined using a custom algorithm looking for genes overlapping a specific snoRNA in the Ensembl (V104) annotation (complemented with snoRupdate). If more than one gene was found, manual curation was used to remove pseudogenes or long genes resulting from read-through.
Host gene function was determined using the Gene Ontology (GO) resource. Since a gene may have many different functions, our goal was to provide a more generalized function for each host gene, if possible, to allow clustering of host genes based on their function. Manual curation was used with GO to extract gene functions.
Canonical snoRNA interactions (with rRNA or snRNA) were taken from multiple resources (snoRNABase, snoRNA Atlas, Krogh N. et al., Kehr S. et al.). However, many more recently annotated copies of snoRNAs exist for which no attempt at guide region identification has been made. As a consequence, in the previous version of snoDB, for pairs of snoRNAs of the same family and with the same guide region, one could be labeled as guiding a position in rRNA while the other could be considered to be an orphan. To address this problem, for each modified rRNA position supported by at least two different sources, we identified all box C/D snoRNAs with at least 8 bp interactions with at most 1 sub-optimal (G-U) interaction that were not yet annotated as guiding this position. We used the term “snoDB predicted” in the detailed view to tag these.
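The pairing criterion above (at least 8 bp, at most one G-U pair) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper `is_candidate_duplex` is hypothetical, it treats the duplex as one contiguous stretch with no mismatches or bulges, and it expects the two segments to be pre-aligned and of equal length. How snoDB locates the guide region relative to the boxes is not covered here.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def is_candidate_duplex(guide: str, target: str,
                        min_bp: int = 8, max_gu: int = 1) -> bool:
    """Check whether guide and target form an acceptable antiparallel duplex.

    Both sequences are written 5'->3'. Every position must pair, either as a
    Watson-Crick pair or as a G-U wobble, and at most max_gu wobbles are allowed.
    """
    guide = guide.upper().replace("T", "U")
    target = target.upper().replace("T", "U")
    if len(guide) != len(target) or len(guide) < min_bp:
        return False
    gu_pairs = 0
    # Antiparallel pairing: guide position i faces target position (n - 1 - i).
    for g, t in zip(guide, reversed(target)):
        if (g, t) in WATSON_CRICK:
            continue
        if (g, t) in WOBBLE:
            gu_pairs += 1
        else:
            return False          # a mismatch breaks the contiguous duplex
    return gu_pairs <= max_gu


perfect = is_candidate_duplex("ACGUACGU", "ACGUACGU")     # 8 Watson-Crick pairs
one_wobble = is_candidate_duplex("GCGUACGU", "ACGUACGU")  # 7 WC pairs + 1 G-U
two_wobbles = is_candidate_duplex("GCGUGCGU", "ACGUACGU") # 2 G-U pairs
```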
Non-canonical snoRNA-RNA interactions were taken from the RISE database.
snoRNA copies are based on the RFAM classification. All snoRNAs having the same RFAM id are considered to be copies of the same snoRNA.
snoRNA-protein interactions were extracted from the eCLIP data of 150 RNA binding proteins from the Encyclopedia of DNA Elements (ENCODE) Consortium. Interactions with a p-value higher than 0.001 (that is, a -log10(p-value) below 3) in the BED narrowPeak files of each of the 2 replicates were filtered out, and only the remaining windows that overlap in both replicates were kept.
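The filtering step above could look roughly like the following Python sketch. It assumes the standard narrowPeak layout, in which column 8 holds -log10(p-value) and coordinates are 0-based and half-open. The helpers `parse_peaks` and `reproducible_windows` are hypothetical names, and returning the intersection of overlapping peaks (rather than the full windows) is an assumption, since the text does not say which was kept.

```python
def parse_peaks(lines, min_neg_log10_p=3.0):
    """Parse narrowPeak lines and keep peaks with p-value <= 0.001."""
    peaks = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        # Column 8 (index 7) is -log10(p-value) in the narrowPeak format.
        if float(fields[7]) >= min_neg_log10_p:
            peaks.append((chrom, start, end, strand))
    return peaks


def reproducible_windows(rep1, rep2):
    """Return the intersections of peaks that overlap in both replicates."""
    shared = set()
    for c1, s1, e1, st1 in rep1:
        for c2, s2, e2, st2 in rep2:
            # Half-open intervals overlap when each starts before the other ends.
            if c1 == c2 and st1 == st2 and s1 < e2 and s2 < e1:
                shared.add((c1, max(s1, s2), min(e1, e2), st1))
    return sorted(shared)


# Toy replicates (made-up values, not ENCODE data).
rep1_lines = [
    "chr1\t100\t200\tp1\t1000\t+\t4.2\t5.0\t-1\t-1",
    "chr1\t300\t400\tp2\t1000\t+\t1.0\t2.0\t-1\t-1",   # fails the p-value cut
]
rep2_lines = [
    "chr1\t150\t250\tq1\t1000\t+\t3.9\t3.0\t-1\t-1",
    "chr1\t300\t400\tq2\t1000\t+\t3.9\t6.0\t-1\t-1",
]
kept = reproducible_windows(parse_peaks(rep1_lines), parse_peaks(rep2_lines))
```

The pairwise comparison is quadratic and only suitable for small inputs; an interval tree or a tool such as bedtools would be a more practical choice for full eCLIP datasets.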
Raw data for snoRNA and host abundance come from TGIRT-Seq datasets:
Gene Expression Omnibus (GEO):
- GSE126797 for Ovary, Breast and Prostate
- GSE157846 for Testis, Skeletal Muscle, Liver and Brain
- GSE99065 for SKOV
- GSE209924 for HCT116, MCF7, PC3 and TOV112D
- SRX1426160 for Universal Human RNA (UHR)
- SRX1426193 for Human Brain Reference (HBR)
Details on how the raw data was processed are presented in the Experiment Details section.
As mentioned above, canonical snoRNA-rRNA interactions were taken from multiple resources (snoRNABase, snoRNA Atlas, Krogh N. et al., Kehr S. et al.). Modified positions, as well as their guide snoRNAs, were extracted from these resources.
A status (validated or predicted) was attributed to each position based on validation data from several studies (see the rRNA modification levels section below).
Several versions of rRNA were used in the literature. snoDB defaults to the snoRNABase rRNA versions which are the following:
Other slightly different versions for 18S are NR_003286 and NR_145820.1. For the 28S, another popular version is the Human reference rRNA (NR_003287.4).
Information on modification levels (from 0 (not modified) to 1 (fully modified)) was gathered from multiple studies using different techniques:
- Marcel et al. (2020) - (RiboMethSeq)
- Motorin et al. (2021) - (RiboMethSeq)
- Taoka et al. (2018) - (SILNAS - Mass spectrometry)
- Marchand et al. (2020) - (HydraPsiSeq)
To integrate all the snoRNAs in snoDB into an up-to-date annotation, the tool snoRupdate (built in C++) was created. snoRupdate is simple to install and to use, and is compatible with both Ensembl and RefSeq GTF annotation files.
The snoDB project was started by Darren Mathurin-St-Pierre, and was further improved and put online by Philia Bouchard-Bourelle. Version 2.0 was created by Danny Bergeron.
For inquiries, comments or suggestions, contact Danny at: danny.bergeron@ushebrooke.ca.
The principal investigator, Michelle Scott, and the project team can also be reached at Michelle.Scott@USherbrooke.ca.