Deposited materials

Application of a simple log likelihood ratio approximant to protein sequence classification

László Kaján¹,³*, Attila Kertész-Farkas², Dino Franklin¹, Nelly Ivanova¹, András Kocsor²** and Sándor Pongor¹**

¹Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, Padriciano 99, I-34012 Trieste, Italy,

²Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary

³Bioinformatics Group, Biological Research Centre, Hungarian Academy of Sciences, Temesvári krt. 62, H-6701 Szeged, Hungary

Supplementary data for this article

Abstract

Motivation: Likelihood ratio approximants have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility in the classification of protein sequences and sequence similarity searching.

Results: We used a simple log likelihood ratio approximant (LRA) based on the maximal similarity (or minimal distance) scores of the two top ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel, amino acid composition vector-distance and compression based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
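The idea of scoring a query by the two top-ranking classes can be sketched as follows. This is only an illustration under stated assumptions, not the exact formula of the article: it assumes strictly positive similarity scores, takes the maximal similarity of the query to each class, and returns the logarithm of the ratio between the best and the second-best class maximum. The function name `lra_score` and the toy scores are hypothetical.

```python
import math

def lra_score(similarities_by_class):
    """Return (best_class, log ratio of the two largest per-class maxima).

    similarities_by_class: {class_label: [similarity of the query to each member]}
    Assumption: all similarities are strictly positive and at least two classes exist.
    """
    # Maximal similarity of the query to each class.
    best = {label: max(scores) for label, scores in similarities_by_class.items()}
    # Rank classes by their maximal similarity, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return top_label, math.log(top_score / runner_up_score)

# Toy scores, made up for illustration only.
label, score = lra_score({"A": [10.0, 40.0], "B": [20.0, 5.0], "C": [8.0]})
print(label, score)
```

For distance-based measures the same sketch would need the minimal distance per class, or a prior conversion of distances into similarities.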

Contact: kocsor@inf.u-szeged.hu, pongor@icgeb.org

Appendix

From similarity to probability functions [pdf]

1) DATASETS

Dataset 1 (SCOP subset)

See http://www1.cs.columbia.edu/compbio/svm-pairwise/

Dataset 2 (3PGK sequences)

The sequences were obtained from Prof. D. Pearl; the taxonomic classification was taken from Pollack, J.D., Li, Q. and Pearl, D.K. (2005) Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: insights by Bayesian analyses, Mol Phylogenet Evol, 35, 420-430.

-- 3PGK.fasta, a concatenated FASTA file of 131 sequences.

-- 3PGK_cast_matrix, a tab-delimited table specifying the positive and negative training and test sets for each family. Each row is one sequence, and each column is one family. (1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test).
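A cast matrix of this kind could be read into per-family train and test lists roughly as shown below. The sketch assumes, beyond what is stated above, that the first row is a header of family names and that the first column of each row holds the sequence identifier; the function name `read_cast_matrix` and the toy identifiers are hypothetical.

```python
import csv
import io

# Codes as documented for the cast matrix.
ROLES = {1: "positive_train", 2: "negative_train",
         3: "positive_test", 4: "negative_test"}

def read_cast_matrix(handle):
    """Return {family: {role: [sequence_id, ...]}} from a tab-delimited table.

    Assumption (not stated in the source): the first row lists family names
    and the first column of every other row is the sequence identifier.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    families = header[1:]
    splits = {fam: {role: [] for role in ROLES.values()} for fam in families}
    for row in reader:
        if not row:  # skip blank lines
            continue
        seq_id, codes = row[0], row[1:]
        for fam, code in zip(families, codes):
            splits[fam][ROLES[int(code)]].append(seq_id)
    return splits

# Toy table with made-up identifiers and family names.
toy = "id\tfamA\tfamB\nseq1\t1\t2\nseq2\t3\t4\nseq3\t2\t1\n"
splits = read_cast_matrix(io.StringIO(toy))
print(splits["famA"])
```

With a real file, the same function would be called on an open file handle instead of the in-memory toy table.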

Dataset 3 (COG sequences)

Sequences were taken from the COG database (http://www.ncbi.nlm.nih.gov/COG/). We used 17973 sequences from 117 COG groups with at least 8 eukaryotic sequences (positive test group, sequences of the S. cerevisiae, S. pombe and E. cuniculi genomes) and 16 additional prokaryotic sequences (positive training group).

-- COG.fasta, a concatenated FASTA file.

-- COG cast matrix, a tab-delimited table specifying the positive and negative training and test sets for each family. Each row is one sequence, and each column is one family. (1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test).

2) RESULTS

-- BLAST 1NN results for the 3 datasets: Excel file with tables. Rows: groups; columns: AUC values obtained by ROC analysis.

-- Beta-dependence of the AUC value in the 3 datasets: Excel file.
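For readers who want to recompute AUC values like those tabulated here, the area under the ROC curve can be obtained directly from the scores of the positive and negative test sequences. The sketch below uses the standard rank interpretation (the fraction of positive/negative pairs in which the positive scores higher, with ties counted as one half). It is a generic illustration, not the authors' own script; the function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties contribute 1/2. Higher scores are assumed to mean stronger membership.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy scores, made up for illustration only.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of test sequences; for large groups a rank-based implementation would be preferable.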

*Present address: BioInfoBank Institute, 60-744 Poznan, Poland. Email: kajla@bioinfo.pl

**To whom correspondence should be addressed.