Max-Planck Society Max-Planck Molecular Plant Physiology

GraPPLE

Graph Property based Predictor and Likelihood Estimator

Liam Childs, Zoran Nikoloski, Patrick May, Dirk Walther

Documentation

Overview
Graphs are abstract representations of relationships between entities. A set of rules converts the entities and their relationships into nodes and edges. GraPPLE (originally: The RNA sorting hat) leverages the information available from graph properties and the predictive power of support vector machines (SVMs) to classify potential RNA sequences as functional or non-functional and to assign them to one of 46 Rfam families. In short, GraPPLE predicts the potential function of an ncRNA sequence.
Usage
Either enter your sequence into the text box or upload a sequence file. Multiline raw sequences are accepted, as are multi-sequence FASTA files. This tool is also available as a web service.
Results
The results tell you whether we predict your sequence to be functional or not, which Rfam family it is predicted to belong to, and the p-score associated with each prediction. We also tell you how confident we are that the results are correct (high/medium/low). The p-scores are values in the range [0,1]. The sequence gets one p-score per family, and the p-scores sum to 1. We show the p-score for the predicted family. Our confidence is based on the Shannon entropy of the p-scores: confidence is high when the entropy is less than 1.5, medium when it is between 1.5 and 2.0, and low when it is greater than 2.0.
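The confidence rule can be sketched in a few lines of Python. This is a minimal illustration, not the code running on the server. The logarithm base is not stated in this documentation, so base 2 (entropy in bits) is an assumption here, and the function names are made up for the example.

```python
import math

def shannon_entropy(p_scores, base=2.0):
    """Shannon entropy of a list of p-scores that sum to 1.

    The base is an assumption: this documentation does not say
    whether the entropy is measured in bits or nats.
    """
    # Zero probabilities contribute nothing to the entropy.
    return -sum(p * math.log(p, base) for p in p_scores if p > 0)

def confidence(p_scores, base=2.0):
    """Map the entropy of the p-scores to high/medium/low."""
    h = shannon_entropy(p_scores, base)
    if h < 1.5:
        return "high"
    if h <= 2.0:
        return "medium"
    return "low"
```

A sequence whose p-scores concentrate on one family has low entropy and therefore high confidence. A sequence whose p-scores are spread evenly over all 46 families has the maximum possible entropy and low confidence.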
Converting RNA sequences to graph properties
Two steps are involved.
  1. Fold the RNA. This is achieved using RNAfold, available in the ViennaRNA package, which was developed by Ivo Hofacker's group in Vienna, Austria.
  2. Convert the structure into a graph and calculate the graph properties. This is done using various algorithms developed in-house using the R statistics package.
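The second step can be illustrated with a small sketch. It assumes the folded structure is given in the dot-bracket notation that RNAfold prints, and it assumes one simple graph representation: every nucleotide is a node, consecutive nucleotides are joined by a backbone edge, and every base pair adds an edge. The graph representation and the properties actually used by GraPPLE are not described here, and the in-house code is written in R, so the Python below (including its function names and the degree-based properties) is only an illustration.

```python
def dot_bracket_to_edges(structure):
    """Return (backbone_edges, pair_edges) for a dot-bracket string."""
    # Backbone: each nucleotide is linked to its successor.
    backbone = [(i, i + 1) for i in range(len(structure) - 1)]
    pairs = []
    stack = []  # positions of '(' that are still waiting for a partner
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')' at position %d" % i)
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unmatched '(' at position %d" % stack[-1])
    return backbone, pairs

def degree_properties(structure):
    """A few simple, illustrative graph properties based on node degree."""
    backbone, pairs = dot_bracket_to_edges(structure)
    n = len(structure)
    degree = [0] * n
    for a, b in backbone + pairs:
        degree[a] += 1
        degree[b] += 1
    return {
        "nodes": n,
        "edges": len(backbone) + len(pairs),
        "mean_degree": sum(degree) / n if n else 0.0,
        "max_degree": max(degree, default=0),
    }
```

For the hairpin "((...))" this yields 7 nodes and 8 edges (6 backbone edges plus 2 base pairs).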
Creating the SVMs
The SVMs are created using libsvm, a set of SVM tools. To achieve high accuracies we take the following steps.
  1. Scale the datasets. The values for each feature (graph property) are scaled between -1 and 1. This prevents features with greater numerical ranges from dominating those with smaller ranges.
  2. Train using RBF kernel. The radial basis function kernel is able to find non-linear relationships between the Rfam families and the graph properties.
These steps are recommended in "A Practical Guide to Support Vector Classification", available from the libsvm package.
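The two steps above can be sketched in plain Python. This is not the libsvm code itself, only an illustration of what scaling to [-1, 1] and the RBF kernel compute. The function names are made up for the example, and mapping a constant feature to 0.0 is an assumption of this sketch.

```python
import math

def fit_ranges(rows):
    """Record the (min, max) of every feature column in the training data."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature value into [lower, upper]."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            # Constant feature: no range to scale by (assumption: use 0.0).
            scaled.append(0.0)
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The ranges should be computed on the training data only and then reused unchanged for any sequence that is classified later, so that training and prediction see features on the same scale.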
Data used in the SVMs
The work done here is covered by our paper, "ncRNA prediction and classification using graph properties". Please refer to the paper for a more in-depth description of our experimental approach. Very briefly:
  1. Download Rfam.
  2. Create non-functional set. Here we shuffle all the sequences using uShuffle, which allows us to preserve the dinucleotide composition whilst shuffling.
  3. Create super-families. For families which are smaller than 200 sequences, we try to group them into super-families using Rfam annotation. This creates the super-families CD-BOX, HACA-BOX, IRES, MIRNA, RIBOSWITCH and RIBOZYME.
  4. Train the SVMs. For the SVM which classifies functional vs. non-functional, we select 400 random sequences from the Rfam sequences and 400 from the shuffled sequences. For Rfam family classification we select 200 random sequences from each family/super-family.
  5. Cheat using BLAST. Although the accuracies of the SVMs and BLAST alone are already very high, we found that by combining the two we were able to further improve the accuracy of the method. By using both BLAST and SVMs, we cover families with high conservation and families with low conservation but similar structures.
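Steps 3 and 4 can be pictured with the following sketch. It is an illustration under assumptions, not the pipeline used for the paper: the mapping from family to super-family (`superfamily_of`) is a hypothetical input standing in for the Rfam annotation, the function names are made up, and dropping classes that remain too small after grouping is an assumption of this sketch.

```python
import random

def group_small_families(sequences_by_family, superfamily_of, min_size=200):
    """Merge families smaller than min_size into their super-family.

    superfamily_of is a hypothetical dict {family: super-family name}.
    Small families without an entry keep their own name.
    """
    grouped = {}
    for family, seqs in sequences_by_family.items():
        if len(seqs) < min_size and family in superfamily_of:
            name = superfamily_of[family]
        else:
            name = family
        grouped.setdefault(name, []).extend(seqs)
    return grouped

def sample_per_class(grouped, per_class=200, seed=0):
    """Draw per_class random sequences from every class that is large enough."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        name: rng.sample(seqs, per_class)
        for name, seqs in grouped.items()
        if len(seqs) >= per_class  # assumption: smaller classes are left out
    }
```

Sampling the same number of sequences from every class keeps the training set balanced, so that no family dominates the SVM simply because it has more members in Rfam.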
The training sets
Functional vs. non-functional: 400 functional and 400 non-functional sequences (SVM method).
Rfam family classification: 200 sequences from each of 46 families (combined SVM/BLAST method).
Program versions
libsvm-2.88
WUBlast 2.0
Vienna RNA Package 1.7.2
uShuffle
Programmatic Access
GraPPLE is accessible using the ZSI Python library. A sample client file is available here. Sequences are passed as a list of (Accession, Sequence, Structure) tuples. If either the accession or the structure is not known, an empty string will suffice.
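As an illustration of the input format, the sketch below turns FASTA text (or a raw multiline sequence) into the list of (Accession, Sequence, Structure) tuples. It does not contact the web service: the ZSI client calls are left to the sample client file, since the service interface is not described here. The function name is made up for the example, and taking the first word of the FASTA header as the accession is an assumption.

```python
def fasta_to_tuples(text):
    """Build a list of (accession, sequence, structure) tuples.

    The structure is left as an empty string because it is not known.
    A raw sequence without a FASTA header gets an empty accession.
    """
    records = []
    accession = ""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if it has a sequence.
            if chunks:
                records.append((accession, "".join(chunks), ""))
            header = line[1:].split()
            accession = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if chunks:
        records.append((accession, "".join(chunks), ""))
    return records
```

The resulting list can then be handed to the client in place of hand-written tuples.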