Supplementary data for

 

Kalman filtering for disease-state estimation from microarray data

 

Abstract

 

Motivation:

In this paper we propose using the Kalman filter as a pre-processing step in microarray-based molecular diagnosis. Incorporating the expression covariance between genes is important in such classification problems, since it represents the functional relationships that govern tissue state. Failing to account for this covariance may result in biologically implausible class prediction models. Here we show that employing the Kalman filter to remove noise, while retaining meaningful covariance and thus allowing the underlying biological state to be estimated from microarray measurements, yields linearly separable data suitable for most classification algorithms.

Results:

We demonstrate the utility and performance of the Kalman filter as a robust disease-state estimator on publicly available binary and multiclass microarray datasets, in combination with the most widely used classification methods to date. Moreover, using popular graphical representation schemes, we show that the filtered datasets are also easier to visualize.

Contact:

kelli@nucleus.szbk.u-szeged.hu

 

Code (in Matlab)

 

Contents.m

Readdotnames.m

Readdotdata.m

Trainparam.m

Filterdataset.m

 

Download source Matlab code (.zip)

 

Datasets

            A short description of the datasets used is given in the Excel file [datasets.xsl].

 

Results

 

            SVM results

                        Only Accuracy and ROC scores

                        All results

 

            ANN results

                        Only Accuracy and ROC scores

                        All results

 

            1NN results

                        Only Accuracy and ROC scores

                        All results

 

            RF results

                        Only Accuracy and ROC scores

                        All results

           

Most performance measures (e.g. ROC, specificity, recall, true positive rate) are defined for two-class classification problems, so for a multi-class dataset the scores are calculated for each class. To measure performance on a whole dataset we computed the average of these per-class scores; in the result files, ` - mean` denotes this average. The training time and the testing time for a whole dataset are both given in seconds.
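The per-class averaging described above can be illustrated with a minimal sketch. It is written in Python rather than Matlab and is not taken from the downloadable source code. It assumes that each class is scored one-vs-rest and that the per-class scores are averaged with equal weight; the function names and the toy labels are hypothetical.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (recall, specificity)}, scoring each class one-vs-rest.

    Assumption: the class of interest is the positive class and all other
    classes together form the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        scores[c] = (recall, specificity)
    return scores


def mean_scores(scores):
    """Unweighted average of the per-class scores (the ' - mean' value)."""
    n = len(scores)
    mean_recall = sum(r for r, _ in scores.values()) / n
    mean_specificity = sum(s for _, s in scores.values()) / n
    return mean_recall, mean_specificity


# Toy three-class example; labels and predictions are made up for illustration.
y_true = ["ALL", "ALL", "AML", "AML", "MLL", "MLL"]
y_pred = ["ALL", "AML", "AML", "AML", "MLL", "ALL"]

scores = per_class_scores(y_true, y_pred)
recall_mean, specificity_mean = mean_scores(scores)
print(scores)
print(recall_mean, specificity_mean)
```

An unweighted average gives every class the same influence regardless of its size; whether the reported means are weighted by class size is not stated on this page.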

 

A short description of the performance measures employed can be found, for example, in

 

 

            t-test results

                        The significance tests for all performance measures are also downloadable as an Excel file [t-test.xls].

 

Names of selected features

 

            The names of the features selected by RFE can also be downloaded as a list in [features.zip].

 

 

Figures

            figures for ALL-AML dataset (Golub et al., 1999)

            figures for Breast Cancer (BC) dataset (van 't Veer et al., 2002)

            figures for Leukaemia dataset (Yeoh et al., 2002)

            figures for Lung Cancer (LC) dataset (Gordon et al., 2002)

            figures for MLL dataset (Armstrong et al., 2002)

            figures for SRBCT dataset (Khan et al., 2001)

            figures for Tumours (Various Tumour Types, VTT) dataset (Ramaswamy et al., 2001)

 

 

figures for ALL-AML dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (7129) features

figures for Breast Cancer (BC) dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (24188) features

figures for Leukaemia dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (10342) features

figures for Lung Cancer (LC) dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (12533) features

figures for MLL dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (12582) features

figures for SRBCT dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (2308) features

figures for Tumours (Various Tumour Types, VTT) dataset

2 features

3 features

5 features

7 features

10 features

15 features

20 features

30 features

50 features

75 features

100 features

all (16063) features