1. Title: MUSK "Clean2" database

2. Sources:
   (a) Creators:  AI Group at Arris Pharmaceutical Corporation
        contact:  David Chapman or Ajay Jain
                  Arris Pharmaceutical Corporation
                  385 Oyster Point Blvd.
                  South San Francisco, CA 94080
                  415-737-8600
                  zvona@arris.com, jain@arris.com
   (b) Donor:     Tom Dietterich
                  Department of Computer Science
                  Oregon State University
                  Corvallis, OR 97331
                  503-737-5559
                  tgd@cs.orst.edu
   (c) Date received: September 12, 1994

3. Past Usage:

   (a) Dietterich, T. G., Jain, A., Lathrop, R., Lozano-Perez, T. (1994).
       A comparison of dynamic reposing and tangent distance for drug
       activity prediction.  Advances in Neural Information Processing
       Systems, 6.  San Mateo, CA: Morgan Kaufmann.  216--223.

       The clean2 dataset included here is derived from the starting
       poses employed in this paper.  The paper reports the following
       results:

       Algorithm:                                 20-fold XVAL:
       1-nearest neighbor (euclidean distance)    75%
       neural network (standard poses)            75%
       1-nearest neighbor (tangent distance)      79%
       neural network (dynamic reposing)          91%

       The tangent distance and dynamic reposing techniques both
       require computation of the molecular surface, which cannot be
       done using the feature vectors included in this data set.

   (b) Jain, A. N., Dietterich, T. G., Lathrop, R. H., 
       Chapman, D., Critchlow, R. E., Bauer, B. E., Webster, T. A.,
       Lozano-Perez, T.  Compass: A shape-based machine learning tool for
       drug design.  Accepted for publication in Computer-Aided
       Molecular Design. 

       This paper describes the dynamic reposing technique in more
       detail and reports the same result for dynamic reposing as
       above.  The paper also gives a complete description of each of
       the 102 molecules in the data set.

   (c) Dietterich, T. G., Lathrop, R. H., Lozano-Perez, T. (submitted)
       Solving the multiple-instance problem with axis-parallel rectangles.
       Submitted to Artificial Intelligence.

       This paper describes a family of axis-parallel rectangle
       algorithms and compares various approaches to the multiple
       instance problem.  It includes the following table:

        Algorithm             TP FN FP TN errs %correct [CI]
        iterated-discrim APR  30  9  2 61  11  89.2 [83.2--95.2]
        GFS elim-kde APR      32  7 13 50  20  80.4 [72.7--88.1]
        GFS elim-count APR    31  8 17 46  25  75.5 [67.1--83.8]
        all-positive APR      34  5 23 40  28  72.6 [63.9--81.2]
        backpropagation       16 23 10 53  33  67.7 [58.6--76.7]
        GFS all-positive APR  37  2 32 31  34  66.7 [57.5--75.8]
        most frequent class    0 39  0 63  39  61.8 [52.3--71.2]
        C4.5 (pruned)         32  7 35 28  42  58.8 [49.3--68.4]
        
        key: TP = true positives
             FN = false negatives
             FP = false positives
             TN = true negatives
             errs = errors = FN+FP
             %correct = percentage of molecules classified correctly
             under 10-fold cross-validation.
             CI = 95% confidence interval on proportion of correct
             predictions.
             For explanations of the various algorithms, see the
             paper. 
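
        The derived columns in the table can be recomputed from the
        four counts.  The sketch below assumes the confidence interval
        is the usual normal approximation for a proportion (z = 1.96);
        the paper's exact method is not stated here, but this
        assumption reproduces the first row of the table.  The function
        name is illustrative only.

```python
import math


def accuracy_with_ci(tp, fn, fp, tn, z=1.96):
    """Return (errs, %correct, CI low, CI high) for one table row.

    Assumes a normal-approximation interval on the proportion of
    correct predictions; percentages are rounded to one decimal.
    """
    n = tp + fn + fp + tn            # number of molecules evaluated
    errs = fn + fp                   # errs = FN + FP, as in the key
    p = (tp + tn) / n                # proportion classified correctly
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (
        errs,
        round(100 * p, 1),
        round(100 * (p - half_width), 1),
        round(100 * (p + half_width), 1),
    )


# First row of the table: iterated-discrim APR.
print(accuracy_with_ci(tp=30, fn=9, fp=2, tn=61))
# prints (11, 89.2, 83.2, 95.2)
```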

        C4.5 and backpropagation were applied ignoring the multiple
        instance problem (see section 4 below) during training, but
        obeying it during testing.

        This paper also gives more details on the construction of the
        data set. 

4. Relevant Information:
   This dataset describes a set of 102 molecules of which 39 are judged
   by human experts to be musks and the remaining 63 molecules are
   judged to be non-musks.  The goal is to learn to predict whether
   new molecules will be musks or non-musks.  However, the 166 features
   that describe these molecules depend upon the exact shape, or
   conformation, of the molecule.  Because bonds can rotate, a single
   molecule can adopt many different shapes.  To generate this data
   set, all low-energy conformations of each molecule were
   enumerated, yielding 6,598 conformations in total.  A feature
   vector was then extracted to describe each conformation.

   This many-to-one relationship between feature vectors and molecules
   is called the "multiple instance problem".  When learning a
   classifier for this data, the classifier should classify a molecule
   as "musk" if ANY of its conformations is classified as a musk.  A
   molecule should be classified as "non-musk" if NONE of its
   conformations is classified as a musk.
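
   The ANY/NONE rule above can be sketched as a small aggregation
   step applied after a per-conformation classifier has run.  This is
   a minimal illustration, not code from the cited papers; the
   function name and the example predictions are hypothetical.

```python
from collections import defaultdict


def classify_molecules(molecule_names, conformation_preds):
    """Combine per-conformation predictions into per-molecule labels.

    A molecule is labelled 1 (musk) if ANY of its conformations is
    predicted 1, and 0 (non-musk) if NONE of them is.
    """
    by_molecule = defaultdict(list)
    for name, pred in zip(molecule_names, conformation_preds):
        by_molecule[name].append(pred)
    return {
        name: int(any(p == 1 for p in preds))
        for name, preds in by_molecule.items()
    }


# Hypothetical predictions for five conformations of two molecules.
names = ["MUSK-188", "MUSK-188", "MUSK-188", "NON-MUSK-jp13", "NON-MUSK-jp13"]
preds = [0, 1, 0, 0, 0]
print(classify_molecules(names, preds))
# prints {'MUSK-188': 1, 'NON-MUSK-jp13': 0}
```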

5. Number of Instances: 6,598

6. Number of Attributes: 168 (molecule_name, conformation_name, and
   f1 through f166) plus the class.

7. For Each Attribute:
   
   Attribute:           Description:
   molecule_name:       Symbolic name of each molecule.  Musks have names such
                        as MUSK-188.  Non-musks have names such as
                        NON-MUSK-jp13.
   conformation_name:   Symbolic name of each conformation.  These
                        have the format MOL_ISO+CONF, where MOL is the
                        molecule number, ISO is the stereoisomer
                        number (usually 1), and CONF is the
                        conformation number. 
   f1 through f162:     These are "distance features" along rays (see
                        paper cited above).  The distances are
                        measured in hundredths of Angstroms.  The
                        distances may be negative or positive, since
                        they are actually measured relative to an
                        origin placed along each ray.  The origin was
                        defined by a "consensus musk" surface that is
                        no longer used.  Hence, any experiments with
                        the data should treat these feature values as
                        lying on an arbitrary continuous scale.  In
                        particular, the algorithm should not make any
                        use of the zero point or the sign of each
                        feature value. 
   f163:                This is the distance of the oxygen atom in the
                        molecule to a designated point in 3-space.
                        This is also called OXY-DIS.
   f164:                OXY-X: X-displacement from the designated
                        point.
   f165:                OXY-Y: Y-displacement from the designated
                        point.
   f166:                OXY-Z: Z-displacement from the designated
                        point. 
   class:               0 => non-musk, 1 => musk

   Please note that the molecule_name and conformation_name attributes
   should not be used to predict the class.
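
   The attribute layout above can be written down as a column list,
   with the two name attributes kept apart from the 166 numeric
   features.  The sketch below assumes each record has already been
   split into 169 string fields in the order listed; the file's
   delimiter is not specified in this document.  The helper names and
   the example conformation name "211_1+1" are illustrative
   assumptions.

```python
# Column order assumed to follow the attribute list in section 7.
COLUMNS = (
    ["molecule_name", "conformation_name"]
    + [f"f{i}" for i in range(1, 167)]   # f1 .. f166
    + ["class"]
)
FEATURES = COLUMNS[2:-1]                 # excludes both name attributes


def parse_conformation_name(name):
    """Split 'MOL_ISO+CONF' into (MOL, ISO, CONF), all kept as strings."""
    rest, conf = name.rsplit("+", 1)     # raises ValueError if '+' is absent
    mol, iso = rest.rsplit("_", 1)       # raises ValueError if '_' is absent
    return mol, iso, conf


def split_record(fields):
    """Return (molecule_name, feature values as floats, class label)."""
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    features = [float(v) for v in fields[2:-1]]
    return fields[0], features, int(float(fields[-1]))


print(len(COLUMNS), len(FEATURES))          # prints 169 166
print(parse_conformation_name("211_1+1"))   # prints ('211', '1', '1')
```

   Keeping the names out of FEATURES follows the note above that they
   should not be used to predict the class; molecule_name is still
   needed to group conformations by molecule.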

8. Missing Attribute Values: none.

9. Class Distribution (counted per molecule, not per conformation):
   Musks:     39
   Non-musks: 63
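
   Because the distribution above counts molecules while the data
   contain one row per conformation, a per-molecule tally has to
   deduplicate on molecule_name first.  A minimal sketch, assuming
   every conformation of a molecule carries that molecule's class
   label; the function name and example rows are hypothetical.

```python
from collections import Counter


def molecule_class_counts(rows):
    """Count (musks, non_musks) over distinct molecules.

    rows: iterable of (molecule_name, class_label) pairs, one pair
    per conformation, with class_label 1 for musk and 0 for non-musk.
    """
    label_of = {}
    for name, label in rows:
        label_of[name] = label           # one entry per distinct molecule
    counts = Counter(label_of.values())
    return counts[1], counts[0]


# Hypothetical rows: three molecules, five conformations.
rows = [
    ("MUSK-188", 1), ("MUSK-188", 1),
    ("NON-MUSK-jp13", 0), ("NON-MUSK-jp13", 0),
    ("MUSK-211", 1),
]
print(molecule_class_counts(rows))
# prints (2, 1)
```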