Overview
Sherbrooke Alternative Protein Feature IdentificatoR (SAPFIR) seeks to understand how alternative splicing and alternative transcription initiation and termination change the localization or function of a gene product by regulating which localization signals, functional features and other important protein features are encoded in the mature mRNA.
Single Gene Annotation
The Single Gene Annotation function of the SAPFIR tool visualizes the position of functional features within a gene.
The search parameters include:
- A single gene, specified either by its HGNC (human) or MGI (mouse) gene symbol or by its ENSEMBL Gene ID.
- Please note that human and mouse gene symbols do not follow the same convention, e.g., RBFOX2 (human) vs Rbfox2 (mouse), and are not always the same, e.g., QKI (human) vs Qk (mouse). Gene symbols are also updated from time to time, so the ENSEMBL Gene ID is preferred.
- The species, choosing from human or mouse.
- The prediction tool(s) used by InterProScan to predict the features.
- Please note that each tool is designed to predict a different set of features of the protein sequence. A detailed description of the tools can be found on the InterPro website. Choosing multiple tools may produce redundant results if the tools are designed to predict similar features.
- The CDS length ratio threshold.
- This is used to exclude transcripts with a short CDS. The default value of 0.25 is recommended.
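As an illustration of the CDS length ratio threshold, here is a minimal Python sketch. It assumes that the ratio is each transcript's CDS length divided by the longest CDS of the gene, and that a transcript is kept when this ratio is strictly above the threshold. The function name, transcript IDs and lengths are hypothetical, and SAPFIR's own implementation may differ.

```python
def filter_by_cds_ratio(cds_lengths, threshold=0.25):
    """Return IDs of transcripts whose CDS is longer than threshold * longest CDS.

    cds_lengths maps a transcript ID to its CDS length in nucleotides
    (0 for a non-coding transcript). The strict comparison is an assumption.
    """
    coding = {tid: length for tid, length in cds_lengths.items() if length > 0}
    if not coding:
        return []
    longest = max(coding.values())
    return [tid for tid, length in coding.items() if length / longest > threshold]


# Hypothetical transcripts: T3 has a short CDS, T4 is non-coding.
example = {"T1": 1200, "T2": 900, "T3": 250, "T4": 0}
kept = filter_by_cds_ratio(example)
print(kept)  # ['T1', 'T2']
```

With the default of 0.25, T3 is dropped because 250 / 1200 is about 0.21, so its missing features no longer make features of T1 and T2 look alternative.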
The result page consists of two downloadable tables and a graph.
The first table lists the features predicted by InterProScan. The table contains the following columns:
- Prediction tool and prediction signature;
- InterPro Accession and description of the predicted feature, if available;
- ENSEMBL ID of the transcript in which the feature is predicted;
- Genomic region corresponding to the predicted feature (might include introns).
Major isoforms according to the APPRIS database are marked in the table above by * (if they are tagged as "PRINCIPAL:1" in APPRIS) or by ** (if they are tagged as "PRINCIPAL:2" or higher, or as "ALTERNATIVE"). Transcripts without * or ** are minor isoforms. Please visit the APPRIS website for more information on their scoring system. In summary, when the process of selecting the major isoform identifies only one peptide candidate, all transcripts coding for this peptide are tagged as "PRINCIPAL:1". Multiple transcripts of a single gene can carry this tag if they have identical CDS and differ only in their untranslated regions. However, when the process identifies multiple candidates, they are tagged as "PRINCIPAL:2" to "PRINCIPAL:5", "ALTERNATIVE:1" or "ALTERNATIVE:2". Untagged isoforms are considered minor isoforms.

The second table indicates whether each feature is predicted to be present in all transcripts. It contains the following columns:
- Prediction tool and prediction signature;
- InterPro Accession and description of the predicted feature, if available;
- Whether the feature is constitutive or alternative according to three standards for selecting transcripts:
- (1) All coding transcripts;
- (2) Transcripts whose CDS is longer than a given ratio of the longest CDS; this ratio is set by the CDS length ratio threshold;
- Transcripts with a short CDS have few predicted features, which makes features predicted in other transcripts appear alternative. By default, the threshold is set to 0.25. This value does not affect the first or third standard.
- (3) Transcripts whose CDS overlaps with the feature.
- This standard is most useful for alternative splicing, where an alternative region is flanked by constitutive regions.
To suit the needs of different studies, whether a feature is alternative is assessed according to three independent standards for selecting transcripts.

The motivation behind this choice is that, according to Ensembl annotations, a gene often has many non-coding transcripts and short transcripts (transcripts 4 and 5 in the following illustration), which makes most if not all predicted features alternative. Thus, limiting the transcripts to the coding ones or to those with a relatively long CDS produces more meaningful results.
Another consideration is the difference between alternative splicing, alternative transcription start sites (ATSS), and alternative transcription termination sites (ATTS). Although ATSS and ATTS involve different mechanisms and regulation than alternative splicing, they all contribute to the diversity of transcripts and proteins produced by a single gene. Indeed, many differential splicing tools report changes in ATSS and ATTS. The third standard, Overlap CDS, makes the result more relevant for users particularly interested in splicing.
In this case, the selection of transcripts varies for each feature. We define overlapping to occur when the extremities of the feature in question are within the extremities of the CDS of a (coding) transcript. Hence, in the following illustration, transcripts 1 and 2 overlap with feature 1, while transcripts 1, 2 and 3 overlap with feature 4. Thus, feature 1 is constitutive and feature 4 is alternative.
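The overlap standard can be sketched in Python as follows. This is only an illustration under stated assumptions: features and CDS are reduced to (start, end) genomic intervals with inclusive bounds, and a feature is called constitutive when it is predicted in every transcript whose CDS contains it. The function names, transcript IDs and coordinates are hypothetical and do not reproduce the illustration.

```python
def overlaps(feature, cds):
    """True when both extremities of the feature lie within the CDS extremities."""
    f_start, f_end = feature
    c_start, c_end = cds
    return c_start <= f_start and f_end <= c_end


def classify_by_overlap(feature, present_in, cds_by_transcript):
    """Classify a feature using only transcripts whose CDS overlaps it.

    present_in is the set of transcript IDs in which the feature is predicted.
    Returns None when no CDS overlaps the feature.
    """
    selected = [t for t, cds in cds_by_transcript.items() if overlaps(feature, cds)]
    if not selected:
        return None
    if all(t in present_in for t in selected):
        return "constitutive"
    return "alternative"


# Hypothetical CDS intervals: T3 starts downstream of T1 and T2.
cds_by_transcript = {"T1": (100, 900), "T2": (100, 900), "T3": (500, 900)}

# Feature A lies upstream of the T3 CDS, so only T1 and T2 are selected.
print(classify_by_overlap((150, 300), {"T1", "T2"}, cds_by_transcript))  # constitutive
# Feature B lies within all three CDS but is predicted only in T1 and T2.
print(classify_by_overlap((600, 800), {"T1", "T2"}, cds_by_transcript))  # alternative
```

Because the selection is made per feature, a transcript with an alternative start site does not turn an upstream feature into an alternative one.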

Finally, the graph represents the gene structure and the predicted features:
- A legend explains the elements of the plot.
- The plot is oriented in increasing genomic coordinates. For genes on the positive strand, the left end corresponds to the 5' and the right end corresponds to the 3'. For genes on the negative strand, the reverse is true.
- The ENSEMBL gene ID is indicated in the upper left corner of the graph.
- Each line represents one transcript, with its ENSEMBL transcript ID on the left, the exons shown as blocks and introns shown as lines.
- In each transcript, the region corresponding to the CDS is colored in blue. A transcript with only white blocks is non-coding.
- The predicted features are represented as thin rectangles, with color-coded labels indicating their InterPro Accession. Labels of the same color indicate the same feature within and across transcripts. Similar colors may appear when the number of different features is high.

Enrichment Analysis
The goal of the Enrichment Analysis is to help understand how changes in the splicing profile affect protein function in the cell. This function compares the frequency of InterProScan-predicted protein features found in two lists of genomic regions, referred to as "target" and "background". A typical input is a list of alternatively spliced exons or junctions identified by an RNA-seq experiment as the "target", and a list of exons or junctions that are expressed but not alternatively spliced in the same experiment as the "background", although any regions meaningful to the user can be used.
The expected input includes:
- The two lists of genomic regions, provided in .bed format with the Ensembl ID in the 4th column. The Ensembl ID is required to ensure correct mapping of genomic regions to genes in the database.
- The species, choosing from human or mouse.
- The prediction tool(s) used by InterProScan to predict the protein features.
Only the first 6 columns of the .bed input are considered, and among these the score column is ignored. Any additional columns are discarded during data processing.
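The column handling described above can be sketched in Python as follows. This is a minimal illustration, assuming tab-delimited lines in the standard BED column order (chromosome, start, end, name, score, strand). The function name and the example line, including its Ensembl ID, are hypothetical.

```python
def parse_bed_line(line):
    """Parse one BED line, keeping at most the first 6 columns.

    The 4th column is read as the Ensembl ID, the 5th column (score) is
    skipped, and any column after the 6th is discarded.
    """
    fields = line.rstrip("\n").split("\t")[:6]
    if len(fields) < 4:
        raise ValueError("expected at least 4 columns, with the Ensembl ID in the 4th")
    region = {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "ensembl_id": fields[3],
    }
    if len(fields) == 6:
        region["strand"] = fields[5]
    return region


# Hypothetical entry with a 7th column that is discarded.
region = parse_bed_line("chr1\t1000\t1200\tENSG00000000001\t0\t+\textra\n")
print(region)
```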
Example inputs drawn from previously published human and mouse data are provided.
Please note that each tool is designed to predict a different set of features of the protein sequence. A detailed description of the tools can be found on the InterPro website. Choosing multiple tools may produce redundant results if the tools are designed to predict similar features.
The result page consists of a summary of the enrichment analysis, a table of comparison and enrichment, and functional feature annotations for the target and background lists.
- The table of comparison and enrichment. For convenience, only the head of the table is shown on the result page; the full table can be downloaded via the link underneath. This table contains the following columns:
- InterPro Accession of predicted features (if available), the tool used for prediction and its signature for the feature;
- Number of target regions containing the feature;
- Number of background regions containing the feature;
- The p-value of the chi-square test, where the null hypothesis is that the target list and the background list have the same frequency of exons or junctions containing the feature;
- The q-value, which is the Benjamini-Hochberg adjusted p-value;
- The ratio of the frequency in the target list to the frequency in the background list. To avoid division by zero errors, target regions are included in the background for this calculation.
- Annotated target and background regions as downloadable bed files.
- The first columns are the same as the input;
- The following columns contain a comma-separated list of functional features overlapping the entry, tools used for the prediction and prediction signature, or a single dot if such overlap is not found.
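To make the statistics in the comparison table concrete, here is a hedged Python sketch using only the standard library. It assumes a 2x2 contingency table (target or background, with or without the feature), a chi-square test with one degree of freedom and no continuity correction, and a ratio whose denominator pools target and background regions. Function names and counts are hypothetical, and the exact procedure used by SAPFIR may differ.

```python
import math


def chi_square_p(target_hits, target_total, background_hits, background_total):
    """P-value of a 2x2 chi-square test (1 degree of freedom, no correction)."""
    a, b = target_hits, target_total - target_hits
    c, d = background_hits, background_total - background_hits
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence against the null
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def frequency_ratio(target_hits, target_total, background_hits, background_total):
    """Target frequency over the frequency in background plus target regions."""
    pooled = (target_hits + background_hits) / (target_total + background_total)
    if pooled == 0:
        return float("nan")  # feature absent from both lists
    return (target_hits / target_total) / pooled


# Hypothetical counts: 30 of 100 target regions, 10 of 100 background regions.
p = chi_square_p(30, 100, 10, 100)
q = benjamini_hochberg([p, 0.04, 0.03])
print(p, q, frequency_ratio(30, 100, 10, 100))
```

Pooling the target into the denominator keeps the ratio finite even when no background region contains the feature, which matches the stated reason for including target regions in the background.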
This result page will be available for 48 hours after completion.
Contact Us
SAPFIR is developed by the research groups of Michelle Scott, Ph.D. and Sherif Abou Elela, Ph.D.
SAPFIR is managed by Delong Zhou. Comments, questions or suggestions can be communicated via e-mail: delong(dot)zhou(at)usherbrooke(dot)ca; please include "SAPFIR" in the subject.