Supplementary data for
Application of compression-based distance measures to protein sequence classification: a methodological study
András Kocsor1*, Attila Kertész-Farkas1, László Kaján2 and Sándor Pongor2,3
1Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary
2Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, Padriciano 99, I-34012 Trieste, Italy
3Bioinformatics Group, Biological Research Centre, Hungarian Academy of Sciences, Temesvári krt. 62, H-6701 Szeged, Hungary
Abstract
Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences.
Results: We constructed Compression-based Distance Measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour (1NN) or support vector machine (SVM) classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, and low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two Hidden Markov Model-based algorithms.
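To make the idea of a compression-based distance concrete, the following is a minimal sketch rather than the implementation used in the study. It assumes the widely used normalized compression distance form, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and substitutes Python's zlib (a DEFLATE, Lempel-Ziv-family compressor) for the Lempel-Ziv and PPMZ compressors named above. The function names and the synthetic sequences are hypothetical and serve only as illustration.

```python
import random
import zlib

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence (stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def compression_distance(x: str, y: str) -> float:
    """Normalized compression distance between two sequences (assumed form)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify_1nn(query: str, labelled: list[tuple[str, str]]) -> str:
    """Return the label of the labelled sequence closest to the query."""
    _, label = min(labelled, key=lambda item: compression_distance(query, item[0]))
    return label


# Toy demonstration on synthetic sequences (not data from the study).
rng = random.Random(0)
family_a = "".join(rng.choices(AMINO_ACIDS, k=80))
family_b = "".join(rng.choices(AMINO_ACIDS, k=80))
# A query derived from family_a by two point substitutions.
query = family_a[:10] + "G" + family_a[11:40] + "P" + family_a[41:]
print(classify_1nn(query, [(family_a, "A"), (family_b, "B")]))
```

Because the compressed size includes a fixed header and depends on how much text the compressor has seen, distances computed this way vary with sequence length, which is in line with the length dependence reported above.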
Databases
Downloadable databases:
Dataset I. (A test database used by William Noble and associates)
Dataset III. (Native and rearranged C1S sequences in FASTA format)
Dataset IV. (High- and low-complexity segments (length between 20 and 1000 amino acids) of the human proteome; zipped file)
Figures
Dependence of the classification performance on sequence length
SVM
1NN
Dependence of the various distances on sequence length in Dataset I (SCOP subset)
Self-similarity
1-sequence to random
Dependence of the various distances on sequence length in Dataset IV (high- and low-complexity segments from the human proteome, taken from the KOG database)
1-sequence to random