Saturday, January 9, 2016

An Improved Approach for the Classification of Precursor microRNA Using an Artificial Neural Network

   

                       TAHLIL AHMED CHOWDHURY

                                         &

                            MD. NAZMUL ALAM

  

                               ABSTRACT

MicroRNAs (miRNAs) are a group of short (~22 nt) non-coding RNAs that play important regulatory roles. miRNAs are transcribed as long primary miRNAs, which are processed into 60–70 nt miRNA precursors (pre-miRNAs) by the nuclear RNase III enzyme Drosha. The pre-miRNAs are then cleaved into ~22 nt mature miRNAs. miRNA precursors are characterized by their hairpin structures; however, a large number of similar hairpins can be folded in many genomes.

There are many computational approaches for classifying pseudo hairpins and pre-miRNAs. Support vector machines, Gaussian/kernel density estimators, and hidden Markov models are commonly used as prediction models. Nowadays another computational model, the artificial neural network (ANN), a well-known supervised machine-learning approach for pattern recognition, has been used extensively in this prediction field. In this study we describe an approach that uses an artificial neural network for improved classification of pseudo hairpins and pre-miRNAs.


                  Artificial Neural Network Architecture

A feedforward network is used as the ANN architecture. It has three layers: two hidden layers and one output layer. (It also has an input layer, but by convention the input layer is not counted.) We used 20 neurons in each hidden layer. The input layer has one neuron per feature used in our system, and the output layer has a single neuron. Each layer is fully connected to its adjacent layers and to no others.


                             Input And output 

We used 21 features as input to our network. These 21 features are scaled to values between 0 and 1. The output of the network is a binary value, either 0 or 1: 1 means the input is a pre-miRNA sequence, and 0 means the input sequence is a pseudo hairpin.

      Training Algorithm, Transfer, Learning Function

Determining which training algorithm, transfer function, and learning function to use is a difficult task. MATLAB provides 19 training algorithms, 13 transfer functions, and 15 learning functions; testing all 19*13*15 = 3705 combinations is impractical with our current processing capability, so we tested a subset of them. In our system we used the quasi-Newton BFGS algorithm (trainbfg) as the training algorithm, the logsig transfer function for the first hidden layer and purelin for the second, and learnlv1 as the learning function.


                            Weight Initialization

Random weight initialization gives different outputs for the same configuration and input on each test run, so we needed a deterministic initialization method. We used the following equation, taken from the miRANN system, which initializes the weights of each layer to values within roughly [-0.05, 0.05]. Under this equation, neighboring connections have very small differences in their weights.

W(i,j) = -0.05 + (i*c + j) * (0.1 / (r*c))

where W(i,j) is the weight of the connection from the j'th input to the i'th neuron of a layer, r is the total number of neurons in that layer, and c is the total number of inputs to the layer.


                    MATLAB Neural Network Toolbox

There are many standard network implementations available for training, but for our work we used the MATLAB Neural Network Toolbox. It is very efficient, and all the complex functions can be accessed easily.


                          Training The Network

To train the network, we first have to load the input and corresponding output sets, since this is supervised learning. We use the load function for this, like so:

f257_train_input=transpose(load('E:\MIRNA\NegetiveSet\TrainSet\Input_data'));

Then we define some parameters: the transfer functions, training function, learning function, hidden layer sizes (the number of neurons in each hidden layer), the maximum number of iterations, and the validation set size.

transfer_func = {'logsig' 'purelin'};
train_func = 'trainbfg';
learn_func = 'learnlv1';
layer_size = [20 20];
max_epoc = 1;                % number of iterations per call to train
max_validation = 10;

Here max_epoc is 1 because we train the network repeatedly by hand: max_epoc = 1 means each call automatically runs only one iteration. We did this because after each iteration we check whether the network's performance is good enough. trainbfg is the training function for the network, learnlv1 is the learning function, and logsig and purelin are the transfer functions of the first and second hidden layers respectively.



The function newff creates a feedforward backpropagation network. It requires three arguments: the training input, the training output, and the layer sizes. We can pass additional parameters to control exactly how it runs, namely the transfer functions, training function, and learning function.

net = newff( f257_train_input, f257_train_output, layer_size, transfer_func, train_func, learn_func);

Here net is the network object returned by newff. f257_train_input is a set of sequences of both pseudo hairpins and pre-miRNAs, obtained from miRBase release 17, and f257_train_output is the corresponding output for each input sequence, as this is supervised learning. This command creates the network object net and also initializes its weights and biases, so the network is ready for training. To re-initialize the weights, or to perform our custom initialization, we used our equation to generate the values and assigned them to net.IW, net.LW, and net.b.



[riw ciw] = size(net.IW);
[rlw clw] = size(net.LW);
[rb cb]   = size(net.b);

for i = 1:riw
    for j = 1:ciw
        [r c] = size(net.IW{i,j});        % r neurons, c inputs
        for m = 1:r
            for n = 1:c
                net.IW{i,j}(m,n) = -0.05 + (m*c + n) * (0.1 / (r*c));
            end
        end
    end
end

The code above initializes the weights of the connections from the input layer, which here feed only the first hidden layer.

for i = 1:rlw
    for j = 1:clw
        [r c] = size(net.LW{i,j});        % layer-to-layer weight matrix
        for m = 1:r
            for n = 1:c
                net.LW{i,j}(m,n) = -0.05 + (m*c + n) * (0.1 / (r*c));
            end
        end
    end
end

The code above initializes the weights of the connections between layers: here, from the first hidden layer to the second, and from the second hidden layer to the output layer.

for i = 1:rb
    for j = 1:cb
        [r c] = size(net.b{i,j});         % bias vector of layer i
        for m = 1:r
            for n = 1:c
                net.b{i,j}(m,n) = -0.05 + (m*c + n) * (0.1 / (r*c));
            end
        end
    end
end



The code above initializes the biases of all layers. The network is now created and initialized, and we can train it with the train function. We call train iteratively, and after each iteration we measure the performance of the network with a confusion matrix.

[net, train_rec, train_output, train_error] = train (net, f257_train_input, f257_train_output);

The train function returns the network object, the training record, the training output, and the training error. The training output contains the network's current predictions for the input sequences. We obtain the confusion matrix from the confusion function, which takes the actual outputs and the predicted outputs as input and returns a 2×2 matrix, accessed below by linear index cm(1)..cm(4). The first element is the number of sequences the network correctly predicted as pseudo hairpins. The second element is the number of pseudo hairpins the network incorrectly predicted as pre-miRNAs. The third element is the number of pre-miRNAs the network incorrectly predicted as pseudo hairpins. The last element is the number of sequences the network correctly predicted as pre-miRNAs.

[c, cm, ind, per] = confusion (f257_train_output, train_output);

Here cm is the confusion matrix returned by the confusion function. We define the performance of the network's training by its accuracy, i.e. the rate of correct predictions over both pre-miRNAs and pseudo hairpins.

train_perf = ((cm(1)+cm(4))/sum(sum(cm)))*100;

train_perf gives the accuracy as a percentage. The equation for accuracy is discussed in detail in earlier chapters of this report.

Now that the network is trained, we need to test it to see how it performs on unseen data. We use the sim function to simulate the network on the test set.

test_output = sim(net, f257_test_input);

f257_test_input contains the test input sequences, and test_output holds the network's prediction for each of them. Just as in training, we obtain the confusion matrix by calling the confusion function.

[c, cm, ind, per] = confusion(f257_test_output, test_output);

test_perf = ((cm(1)+cm(4))/sum(sum(cm)))*100;

From the confusion matrix we get the accuracy of the network. If the accuracy meets our target, we stop iterating and save the current state of the network for future predictions.

                     Result And Discussion

To measure the performance of our system we used the confusion matrix. Four terms are associated with it:

True positive (TP) = pre-miRNA correctly predicted as pre-miRNA.

True negative (TN) = pseudo hairpin correctly predicted as pseudo hairpin.

False positive (FP) = pseudo hairpin incorrectly predicted as pre-miRNA.

False negative (FN) = pre-miRNA incorrectly predicted as pseudo hairpin.

The total prediction accuracy (ACC), sensitivity (SE), and specificity (SP) of the prediction system are given by:

ACC = ( (TP+TN) / (TP+TN+FP+FN) ) *100

SP = (TN / (TN+FP) )*100

SE = (TP / (TP+FN) )*100

                          Human Performance

In our system for human data set,

True Positive prediction (TP) = 698

True Negative prediction (TN) = 693

False Positive prediction (FP) = 2

False Negative prediction (FN) = 7

Sensitivity ( SE ) = 99.01%

Specificity ( SP ) = 99.71%

Accuracy (ACC) = 99.357%



                                 Prediction (from ANN)
                               Negative        Positive

    Target        Negative        693               2
    (Actual)      Positive          7             698

                          TABLE: Confusion Matrix


      Receiver Operating Characteristic (ROC) Graph


The receiver operating characteristic (ROC) curve is a plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity).



Figure: Receiver Operating Characteristic (ROC) curve








                      Cross-species Performance

We evaluated 20 species, including human, for cross-species testing. We trained with the human dataset and tested with the corresponding species' positive dataset; for the negative set we used the human dataset. We used miRBase release 17 for the positive sets. Our cross-species prediction accuracy is 91.33%, which shows that there is a strong resemblance between those species' miRNAs and human miRNAs.





Species                        Total miRNA    Found    Not Found     % Found
Homo sapiens                          1424     1415            9   99.357143
Bos taurus                             662      648           14   97.870778
Pongo pygmaeus                         581      563           18   96.955504
Pan troglodytes                        600      557           43   92.763664
Oryza sativa                           491      394           97   80.268682
Mus musculus                           720      699           21   97.112676
Gallus gallus                          499      447           52   89.658048
Macaca mulatta                         479      449           30   93.808312
Canis familiaris                       323      281           42   86.999022
Taeniopygia guttata                    220      206           14   93.804348
Schmidtea mediterranea                 148      139            9   93.750000
Ciona intestinalis                     331      259           72   78.176528
Physcomitrella patens                  229      217           12   94.940797
Bombyx mori                            487      475           12   97.556866
Arabidopsis thaliana                   232      196           36   84.334764
Danio rerio                            358      354            4   98.865784
Rattus norvegicus                      408      307          101   75.180505
Drosophila pseudoobscura               211      203            8   96.373626
Medicago truncatula                    375      302           73   80.651163
Herpes simplex virus 2                  18       17            1   98.189415
Total                                 8796     8128          668   91.330838

Table: Cross Species Performance





                             Conclusion

In this work we tried to improve on the earlier miRANN system. miRANN has several shortcomings: its cross-species performance was much lower, it lacked a solid basis for choosing its feature set, the huge feature set biased the system, and with 257 features its computation time was very long. We use genetic-algorithm-based feature selection to build our feature set; as a result we obtain very good cross-species performance and a system that is not biased by its feature set. Our ongoing work is a whole-genome-wide search for miRNAs with the current system.



