Application of a simple log likelihood ratio approximant to protein sequence classification
László Kaján1,3*, Attila Kertész-Farkas2, Dino Franklin1, Nelly Ivanova1, András Kocsor2** and Sándor Pongor1**
1Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, Padriciano 99, I-34012
2Research Group on Artificial Intelligence of the
3Bioinformatics Group, Biological Research Centre, Hungarian
Supplementary data for this article
Motivation: Likelihood ratio approximants have been widely used for model comparison in statistics. The present study was undertaken to explore their utility in protein sequence classification and sequence similarity searching.
Results: We used a simple log likelihood ratio approximant (LRA) based on the maximal similarity (or minimal distance) scores of the two top-ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel, amino acid composition vector distance and compression-based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
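The idea of scoring a query by its two top-ranking classes can be illustrated with a minimal Python sketch. The exact functional form used in the article is not given on this page, so the formula below (the logarithm of the ratio between the best and second-best per-class maximal similarity) is an assumption made for illustration only; the function name `lra_score` and the `eps` guard are hypothetical, and the similarity scores are assumed to be positive.

```python
import math
from collections import defaultdict


def lra_score(similarities, labels, eps=1e-12):
    """Return (top_class, log ratio of the two best per-class maxima).

    similarities: positive similarity scores of one query against the
                  training sequences (assumed form, not the authors' code).
    labels:       class label of each training sequence, same order.
    """
    # Maximal similarity of the query to each class.
    best = defaultdict(lambda: float("-inf"))
    for score, label in zip(similarities, labels):
        best[label] = max(best[label], score)
    if len(best) < 2:
        raise ValueError("at least two classes are needed")

    # Rank classes by their maximal similarity, best first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    (top_class, s1), (_, s2) = ranked[0], ranked[1]

    # Assumed approximant: log of the ratio of the two top class scores.
    return top_class, math.log((s1 + eps) / (s2 + eps))


top, score = lra_score([50.0, 20.0, 30.0, 10.0], ["a", "a", "b", "c"])
```

For a distance-based method, the same sketch would apply with minimal distances and the ratio inverted; this too is an assumption rather than a description of the published procedure.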
Contact: kocsor@inf.u-szeged.hu, pongor@icgeb.org
Appendix
From similarity to probability functions [pdf]
1) DATASETS
Dataset 1 (SCOP subset)
See http://www1.cs.columbia.edu/compbio/svm-pairwise/
Dataset 2 (3PGK sequences)
The sequences were from Prof. D. Pearl; the taxonomic classification was taken from Pollack, J.D., Li, Q. and
-- 3PGK.fasta: concatenated FASTA file of 131 sequences
-- 3PGK_cast_matrix, a tab-delimited table specifying the positive and negative training and test sets for each family. Each row is one sequence, and each column is one family. (1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test).
Dataset 3 (COG sequences)
Sequences were taken from the COG database (http://www.ncbi.nlm.nih.gov/COG/). We used 17973 sequences from 117 COG groups, each with at least 8 eukaryotic sequences (positive test group: sequences from the S. cerevisiae, S. pombe and E. cuniculi genomes) and 16 additional prokaryotic sequences (positive training group).
-- COG.fasta: concatenated FASTA file
-- COG cast matrix, a tab-delimited table specifying the positive and negative training and test sets for each family. Each row is one sequence, and each column is one family. (1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test).
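The cast matrix layout described for Datasets 2 and 3 (one row per sequence, one column per family, codes 1 to 4) can be read with a short Python sketch such as the one below. It assumes that the first row is a header of family names and that the first column holds the sequence identifier; neither detail is stated on this page, so the function `read_cast_matrix` and the example content are hypothetical and may need adjusting to the actual files.

```python
import csv
import io

# Meaning of the codes, as given in the dataset description.
CODES = {
    "1": "positive_train",
    "2": "negative_train",
    "3": "positive_test",
    "4": "negative_test",
}


def read_cast_matrix(handle):
    """Map each family to its four lists of sequence identifiers."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)          # assumed: header row with family names
    families = header[1:]          # assumed: first column is the sequence id
    splits = {fam: {role: [] for role in CODES.values()} for fam in families}
    for row in reader:
        if not row:
            continue               # skip blank lines
        seq_id, cells = row[0], row[1:]
        for fam, code in zip(families, cells):
            role = CODES.get(code.strip())
            if role is not None:   # ignore cells with any other value
                splits[fam][role].append(seq_id)
    return splits


# Tiny made-up table, used only to show the expected shape of the input.
example = "seq\tfamA\tfamB\nP1\t1\t2\nP2\t3\t4\nP3\t2\t1\n"
splits = read_cast_matrix(io.StringIO(example))
```

With a real file, the same function would be called on an open file handle, for example `open("3PGK_cast_matrix", newline="")`.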
2) RESULTS
-- BLAST 1NN results for the 3 datasets: Excel file with tables. Rows: groups; columns: AUC values obtained by ROC analysis.
-- Beta-dependence of the AUC value in the 3 datasets: Excel file.
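The AUC values in these tables come from ROC analysis. As a reminder of what that number measures, the sketch below computes the area under the ROC curve for one group as the fraction of (positive, negative) test pairs in which the positive sequence scores higher, with ties counted as one half. It is a generic illustration, not the script used to produce the Excel files; the function name `roc_auc` and the example scores are made up.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve from per-sequence scores.

    scores:      one score per test sequence (higher = more likely positive).
    is_positive: True for positive test sequences, False for negative ones.
    """
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")

    # Count positive/negative pairs that are ranked correctly.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5    # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


auc = roc_auc([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

The pairwise loop is quadratic in the number of test sequences, which is acceptable for a sketch; a rank-based formulation would be preferable for large groups.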
*Present address: BioInfoBank Institute, 60-744