Documentation
- Overview
- Graphs are abstract representations of relationships between entities.
A set of rules converts those relationships into nodes and edges.
GraPPLE (originally: The RNA sorting hat) leverages the information
available from graph properties and the predictive power of support
vector machines (SVMs) to classify potential RNA sequences as functional
or non-functional, and to assign them to one of 46 Rfam families. In
short, GraPPLE predicts the potential function of an ncRNA sequence.
- Usage
- Either enter your sequence into the textbox or upload a sequence file.
Multiline raw sequences are accepted as well as multiple sequence FASTA
files. This tool is also available as a web service.
- Results
- The results tell you whether we predict your sequence to be functional
or not, which Rfam family it is predicted to belong to, and the p-score
associated with the predictions. We also tell you how confident we are
that the results are correct (high/medium/low). The p-scores are values
in the range [0,1], and the sequence gets one p-score for each family.
The p-scores sum to 1. We show the p-score for the classified family. Our
confidence is based on the Shannon entropy of the p-scores: an entropy
less than 1.5 is considered high, between 1.5 and 2.0 medium, and greater
than 2.0 low.
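The confidence rule above can be sketched in a few lines of Python. This is a minimal illustration, not the GraPPLE implementation: the function names are hypothetical, and the logarithm base is an assumption (base 2 is used here) because the text does not state which base the thresholds refer to.

```python
import math

def shannon_entropy(p_scores, base=2.0):
    """Shannon entropy of a list of p-scores that sum to 1.

    The base is an assumption; the documentation does not specify it.
    Zero p-scores contribute nothing to the sum.
    """
    return -sum(p * math.log(p, base) for p in p_scores if p > 0)

def confidence(p_scores, base=2.0):
    """Map the entropy of the p-scores to high/medium/low."""
    h = shannon_entropy(p_scores, base)
    if h < 1.5:
        return "high"
    if h <= 2.0:
        return "medium"
    return "low"
```

A prediction where one family takes nearly all of the probability mass has an entropy close to 0 and is rated high; p-scores spread evenly over many families give a large entropy and are rated low.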
- Converting RNA sequences to graph properties
- There are two steps involved.
- Fold the RNA. This is achieved using RNAfold, available in the
ViennaRNA package, which was developed by Ivo Hofacker's group in
Vienna, Austria.
- Convert the structure into a graph and calculate the graph
properties. This is done using various algorithms developed in-house
using the R statistics package.
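As a rough illustration of the second step, the sketch below turns a dot-bracket structure string (the notation RNAfold prints) into a graph and computes one simple property, the degree of each node. The graph model is an assumption made for this example: one node per nucleotide, with edges for backbone neighbours and base pairs. The in-house R algorithms and the graph representation GraPPLE actually uses are not described here and may differ. Both function names are hypothetical.

```python
def dot_bracket_to_edges(structure):
    """Return a set of (i, j) edges with i < j over nucleotide positions.

    Assumed graph model: backbone edges between neighbours plus one
    edge per base pair. Pseudoknot brackets are not handled.
    """
    edges = set()
    open_positions = []
    for i, symbol in enumerate(structure):
        if i > 0:
            edges.add((i - 1, i))  # backbone edge to the previous nucleotide
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: extra ')'")
            edges.add((open_positions.pop(), i))  # base-pair edge
    if open_positions:
        raise ValueError("unbalanced structure: extra '('")
    return edges

def node_degrees(length, edges):
    """Degree of every node, a simple example of a graph property."""
    degrees = [0] * length
    for i, j in edges:
        degrees[i] += 1
        degrees[j] += 1
    return degrees
```

For a small hairpin such as `((...))`, paired nucleotides inside the stem end up with degree 3 and unpaired loop nucleotides with degree 2.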
- Creating the SVMs
- The SVMs are created using libsvm, a set of SVM tools. To achieve high
accuracy we take the following steps.
- Scale the datasets. The values for each feature
(graph property) are scaled between -1 and 1. This prevents features
with greater numerical ranges from dominating those with smaller
ranges.
- Train using an RBF kernel. The radial basis function kernel can
capture non-linear relationships between the Rfam families and the
graph properties.
- These steps are recommended in "A Practical Guide to Support Vector
Classification", available from the libsvm package.
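The two steps above can be sketched in plain Python. This is an illustration of the general techniques only (min-max scaling to [-1, 1] and the standard RBF kernel, exp(-gamma * ||x - y||^2)), not the libsvm code itself; the function names and the handling of constant features are assumptions made for this example.

```python
import math

def scale_column(values, lower=-1.0, upper=1.0):
    """Linearly rescale one feature (graph property) to [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to the midpoint.
        return [(lower + upper) / 2.0 for _ in values]
    span = upper - lower
    return [lower + span * (v - lo) / (hi - lo) for v in values]

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Scaling each feature separately means a property measured in the hundreds and one measured in fractions contribute on the same footing to the squared distance inside the kernel. The same minimum and maximum found on the training data should be reused when scaling new sequences.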
- Data used in the SVMs
- The work done here is covered by our paper, "ncRNA prediction and
classification using graph properties". Please refer to the paper
for a more in-depth description of our experimental approach. Very
briefly, the steps are:
- Download Rfam.
- Create a non-functional set. We shuffle all the sequences using
uShuffle, which preserves di-nucleotide frequencies while shuffling.
- Create super-families. For families with fewer than 200 sequences,
we try to group them into super-families using Rfam annotation. This
creates the super-families CD-BOX, HACA-BOX, IRES, MIRNA, RIBOSWITCH
and RIBOZYME.
- Train the SVMs. For the SVM which classifies functional vs.
non-functional, we select 400 random sequences from the Rfam sequences
and 400 from the shuffled sequences. For Rfam family classification we
select 200 random sequences from each family/super-family.
- Cheat using BLAST. Although the accuracies of the
SVMs and BLAST alone are already very high, we found that by
combining the two we were able to further improve the accuracy of the
method. By using both BLAST and SVMs, we cover families with high
conservation and families with low conservation but similar
structures.
- The training sets
- Functional vs. non-functional: 400 functional and 400
non-functional sequences. SVM method.
- Rfam family classification: 200 sequences from each of 46
families. Combined SVM/BLAST method.
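Balanced training sets like the ones listed above could be drawn as in the following sketch. It only illustrates sampling a fixed number of sequences per group without replacement; the function name, the dictionary layout and the `seed` parameter are assumptions for this example, and the selection procedure actually used is described in the paper.

```python
import random

def sample_training_set(groups, per_group, seed=None):
    """Draw `per_group` random sequences from every labelled group.

    `groups` maps a label (e.g. a family name, or "functional" and
    "non-functional") to a list of sequences.
    """
    rng = random.Random(seed)
    labelled = []
    for label, sequences in groups.items():
        if len(sequences) < per_group:
            raise ValueError("group %r has fewer than %d sequences"
                             % (label, per_group))
        # Sample without replacement so no sequence is used twice.
        for sequence in rng.sample(sequences, per_group):
            labelled.append((label, sequence))
    return labelled
```

Calling it with `per_group=400` on the functional and shuffled sets, or with `per_group=200` on the families and super-families, gives every class the same weight during training.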
- Program versions
- libsvm-2.88
- WUBlast 2.0
- Vienna RNA Package 1.7.2
- uShuffle
- Programmatic Access
- GraPPLE is accessible using the ZSI Python library. A sample client
file is available here. Sequences are passed as a list of
(Accession, Sequence, Structure) tuples. If either the accession or the
structure is not known, an empty string will suffice.
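The input format described above can be prepared from FASTA text as in the sketch below. It only builds the list of (Accession, Sequence, Structure) tuples and does not call the web service; the service's method names are not given here, so refer to the sample client for the actual call. The function name is hypothetical, and taking the first word of the header as the accession is an assumption.

```python
def fasta_to_tuples(text):
    """Convert raw or FASTA text to (accession, sequence, structure) tuples.

    The structure is left as an empty string because it is not known.
    A raw sequence without a header gets an empty accession.
    """
    records = []
    accession, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                records.append((accession, "".join(chunks), ""))
            header = line[1:].split()
            accession = header[0] if header else ""
            chunks = []
        else:
            if accession is None:
                accession = ""  # raw sequence with no FASTA header
            chunks.append(line)
    if accession is not None:
        records.append((accession, "".join(chunks), ""))
    return records
```

Multiline sequences are joined into a single string, so a multiple-sequence FASTA file yields one tuple per record.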