THE SECOND DATA GENERATION PROGRAM - DGP/2

Powell Benedict
University of Illinois at Urbana
Inductive Learning Group
Beckman Institute
Urbana, IL 61801
tel: (217) 244-1620
E-mail: benedict@cs.uiuc.edu

/*************************************************************************/
/*************************************************************************/
/*                                                                       */
/*                   DATA GENERATION PROGRAM/2  V1.0                     */
/*                                                                       */
/* Copyright: Powell Benedict, Larry Rendell, and the University of      */
/*            Illinois, 1990                                             */
/*                                                                       */
/* Author   : Powell Benedict                                            */
/*            Inductive Learning Group                                   */
/*            Beckman Institute for Advanced Technology and Sciences     */
/*            University of Illinois at Urbana                           */
/*                                                                       */
/* This program and any results obtained may be freely used for any      */
/* purpose provided the above author's reference is appropriately made.  */
/* If this program is modified then that should be stated in the         */
/* reference.                                                            */
/*                                                                       */
/* Please send comments, suggestions, bug reports, and requests for      */
/* updates to:                                                           */
/*            Powell Benedict                                            */
/*            benedict@cs.uiuc.edu                                       */
/*                                                                       */
/* This program was written and compiled under Borland Turbo C++. It does*/
/* not, however, use any C++ or Borland specific code and should compile */
/* under any standard C, with the possible exception of                  */
/* READ_INPUT_PARAMETERS (see comments in that function).                */
/*************************************************************************/
/*************************************************************************/

1. INTRODUCTION.

DGP/2 is a generation program for constructing synthetic application
domains for inductive learning systems. DGP/2 was motivated by the
relative scarcity of natural application domains, which tends to restrict
experimentation in the development and improvement of inductive learning
systems and techniques such as constructive induction and bias
optimization. By generating data synthetically we can test experimental
techniques and assumptions on thousands of application domains covering a
wide range of problem space.

Quite naturally, the first question to ask is whether synthetic data can
provide a meaningful simulation of the real world. We address this concern
with four questions:

1. What measurable features of problem space are most useful to
   synthesize (i.e., for predicting the performance of induction
   algorithms)?
2. To the extent that some of these features are not measurable without
   knowledge of the target concept, how can we estimate them?
3. How much training is actually needed? In other words, how thoroughly
   must the synthetic problem space be explored to significantly improve
   performance on natural induction problems?
4. What assumptions in DGP/2 are valid for natural problems?

In order to begin to answer these questions, several empirical studies
have been performed making use of DGP/2 (and its predecessor DGP) which
demonstrate the usefulness of synthetic data (see Rendell et al., 1988;
Rendell & Cho, 1989; Benedict, 1990).

This write-up begins by defining terms to provide a common understanding,
discusses previous work in which DGP/2 has been used, and gives an account
of the program, including parameter definitions, program operation, a
user's guide, and finally a general discussion. Readers who are interested
in a more detailed discussion of the experimental work DGP/2 has been used
in, as well as the motivation for DGP/2, are encouraged to read the
aforementioned papers, which can be obtained by sending me e-mail.

2. DEFINITIONS OF TERMS USED IN THIS WRITE-UP.
In this section we define terms associated with attribute-based concept
learning (which we will simply call inductive learning), and present an
overview of the strategy.

2.1 Instances, features, classes, target concepts.

The starting point for induction algorithms is a space of instances of an
application domain. For our purposes, each instance is simply a point in
an N-dimensional space. These dimensions represent features of the problem
space. Instances are often referred to as events, points, or just data.
Each instance is labeled with a class value, which we often take to be
boolean. The goal is to produce a decision function, characterized as
rules, which describes the target concept by generalizing the set of
instances. Concepts, then, are members of the algebra of subsets defined
over the instances in the space.

2.2 Problem spaces and application domains.

A problem space represents a generalization of the features of individual
application domains, where the generalized features are the dimensions of
the problem space. Application domains are specific instantiations of
points in the problem space. For example, consider a problem space having
'number of instances' and 'number of features' as its dimensions. Each
point, (x,y), in this space corresponds to all those application domains
having x instances and y features. Note that a single point in problem
space can map to multiple application domains. A given application domain
(e.g., diagnosing soybean diseases) represents a single problem for an
inductive learning system.

2.3 Inductive concept formation, training and testing sets.

Inductive concept formation is, trivially, the process of partitioning a
set of instances into subsets whose compact descriptions represent the
desired target concepts. This process permits the classification of unseen
data based on the features of the observed data. As the induction
proceeds, candidate class descriptions, called candidate hypotheses, are
constructed and evaluated in terms of their ability to accurately describe
the target concept. Induction stops when a hypothesis is found which
describes the target concept, or when no suitable hypothesis can be found.
We may view this form of induction as a search through hypothesis space
for a hypothesis which accurately describes the target concept.

The data presented to the learning algorithm for concept formation is
called the training set. The training set consists of instances with
associated class values. In the boolean case the class values are said to
be positive or negative, indicating whether or not they are examples of
the target concept. The induction generalizes the positive instances into
a class description, expressed as rules, which excludes all or almost all
negative instances. Accuracy of the resultant class description is
computed with the testing set. The testing set, of identical format to the
training set, is passed through the rules, measuring the number of false
positives, Fp (i.e., negative instances classified as positive), and false
negatives, Fn (i.e., positive instances classified as negative). The
accuracy, a, is defined as:

    a = 1 - (Fp + Fn) / T ,

where T is the total number of instances in the testing set. The error
rate, e, is defined as: e = 1 - a.

We construct the training and testing sets by partitioning the initial
sample of instances into two sets. Several strategies exist for selecting
the partition. Since our intent is to have the composition of the training
and testing sets reflect the distribution of positive and negative
instances in the initial sample, we construct a partition which ensures
that the ratio of positive to negative instances is preserved.
Furthermore, to determine a statistically reliable accuracy, we partition
(randomly selecting positive and negative instances in the correct ratio),
train, and test several times, and average the individual accuracies.
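The short sketch below illustrates these two computations in C. It is not
part of DGP/2 (the program only generates data); the function names and
the 2/3 training fraction are illustrative choices, not taken from the
program.

#include <stdio.h>

/* Accuracy as defined above: a = 1 - (Fp + Fn) / T. */
double accuracy(int false_pos, int false_neg, int total)
{
    return 1.0 - (double)(false_pos + false_neg) / (double)total;
}

/* Stratified partition sizes: give the training set the same
   positive/negative ratio as the initial sample.  The training
   fraction (here 2/3) is illustrative; nothing in DGP/2 fixes it. */
void split_counts(int num_pos, int num_neg, double train_frac,
                  int *train_pos, int *train_neg)
{
    *train_pos = (int)(train_frac * num_pos + 0.5);
    *train_neg = (int)(train_frac * num_neg + 0.5);
}

int main(void)
{
    int tp, tn;

    split_counts(67, 33, 2.0 / 3.0, &tp, &tn);
    printf("training set: %d positive, %d negative\n", tp, tn);
    printf("accuracy = %.3f\n", accuracy(3, 2, 100));  /* prints 0.950 */
    return 0;
}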
3. PREVIOUS WORK USING DGP/2.

My primary reason for developing DGP/2 is to study bias optimization in
inductive learning systems. Mitchell (1980) first defined inductive bias
as anything which causes an inductive algorithm to choose one concept over
another aside from strict consistency with the training set. Mitchell
states that all tractable inductive algorithms must be biased because the
number of possible concepts is too large to be searched directly. For
example, we may reduce the hypothesis space by only allowing certain
classes of hypotheses to be investigated. Additionally, we may reduce the
feature set (e.g., through abstraction without loss of information), which
in turn reduces the complexity of the hypothesis space. Inductive bias can
be expressed as feature abstraction, restrictions on the language of
concept expression, search operators, feature construction operators, etc.

Typically, biases are implicit in the design and implementation of the
induction algorithm. Such systems are said to represent a fixed bias.
Biases, however, necessarily restrict the applicability of the induction
system by reducing the domains over which the algorithm is effective. For
example, if we limit the language of concept expression to conjunctive
terms we could never express disjunctive concepts, which in turn reduces
the number of domains for which this bias is effective. Fixed bias systems
consequently restrict the domain-independent applicability of the
induction algorithm.

Utgoff (Utgoff & Mitchell, 1982; Utgoff, 1986) states that learning
algorithms can achieve better performance if they dynamically alter their
biases. Such systems are said to represent dynamic bias. His novel STABB
(Search Towards a Better Bias) led to significant research in bias
management. Further work, such as Russell & Grosof (1990) and Rendell et
al. (1987), emphasizes that algorithms can only alter their biases
dynamically if these biases are made explicit and the algorithm has a
built-in control structure for intelligently selecting them.

The focus of this work is learning efficient bias selection rules from
synthetic data for dynamic bias systems. We may dynamically select biases
by using a mapping between application domains and biases found to be
effective for those domains. In order to create this mapping, we must
define an efficient characterization of the problem space, and biases must
be identified which are effective for application domains in this space. A
new application domain is then located in problem space and the
appropriate biases selected. The Variable Bias Management System (VBMS)
(Rendell, 1989) does just this. VBMS learns this mapping by applying
multiple bias sets to natural application domains and measuring subsequent
classification rule accuracies. VBMS characterizes the problem space by
the number of training instances and the number of features in the data.
Three learning systems, PLS (Rendell, 1983), AQ15 (Michalski et al.,
1986), and ASSISTANT (Kononenko et al., 1984), represent the bias sets.
VBMS constructs its mapping by measuring the performance of each of the
three learning systems on the data sets. The results given for VBMS
clearly show the effectiveness of variable bias management. An inherent
difficulty in this scheme, however, is that its performance depends on a
robust training set of natural application domains to produce an effective
mapping. Although Rendell et al. (1987) reported favorable results for
VBMS, subsequent experimentation by his research group confirmed that VBMS
learns too slowly; that is, it is not easy to find enough natural data
sets with a broad enough distribution in the problem space to result in
effective learning. Although many application domains have thousands of
instances (e.g., see Seshu et al., 1989), the fact is that researchers
don't have access to thousands (or even hundreds) of these domains. This
problem calls for a program like DGP/2 to overcome the limitation of
scarce natural application domains.

In Benedict (1990) DGP/2 was used in two separate experiments to assess
the validity of learning bias optimization rules with synthetic data. The
first experiment demonstrates similar performance between a natural
application domain and a synthetic application domain in response to
varying biases. The second experiment generates multiple synthetic
application domains, runs a hill-climbing based bias optimizer, converts
the results into bias selection rules, and measures their accuracy using
the training/testing set methodology. The results are very encouraging.
Current work in progress, though not reported here, is measuring the
performance of these bias selection rules on natural application domains.
In addition, in an unpublished manuscript I have used DGP/2 to generate
application domains for a back-propagation network in an attempt to learn
bias selection rules for setting net parameters such as learning rate.
Again, the experiments have yielded encouraging results.

DGP/2's ancestor was also used in Rendell et al. (1988). The objective was
to compare the performance of different learning systems on various
application domains. Although a number of widely used data sets were
analyzed, the authors also found that a program which generated
quasi-random application domains could help in the analysis. This program,
the first Data Generation Program (DGP), had the ability to generate
application domains based on specific parameters. For example, the user
could specify the proportion of positive to negative instances or the
number of features. DGP generated instances around peaks and allowed for
specification of the means and standard deviations of the normally
distributed data. Interestingly, the performance conclusions drawn from
the experiments involving synthetic and natural data were similar,
indicating that synthetic data behaves like natural data.

DGP was used again in Rendell & Cho (1989) to survey the effect of varying
certain characteristics of the target concept on learning algorithms. For
example, the results showed that highly disjunctive concepts require more
data to learn, a result which is consistent with theoretically derived
computational bounds on learning sample sizes for PAC learning (see, e.g.,
Ehrenfeucht, Haussler, Kearns, & Valiant, 1988).

4. DGP/2 PARAMETERS.

DGP/2 is an improvement of DGP. It allows for additional parameters and
automates the setting of the standard deviation parameter, which is not
easily done by the user.
In particular, DGP/2 allows for variation in the number of instances, the
number of features, the range of feature values, the number of peaks, the
percentage of positive instances desired, and a radius around the peaks
within which these instances will fall (this controls instance density,
and determines the standard deviation value for the normal distribution
function). The best way to explain DGP/2's parameters is to give an
example of an actual parameter file.

;***************************************************************************
; Sample parameter file for DGP/2 Version 1.0
; Author: Powell Benedict
;***************************************************************************
; This is a sample parameter file for DGP/2 V1.0. Notice first that comments
; begin with a semi-colon. Next, blank lines are allowed anywhere. No line
; in this file may be more than 80 characters long. Finally, comments may
; be placed at the end of an actual parameter line.
;
; Parameter format is:
;     parm_name = parm_value
; You may have any number of intervening spaces between the fields of a
; parm line, provided it doesn't add up to more than 80 characters total.
; You may specify the parameters in any order. Parameters which are absent,
; or which are incorrectly formed, will be pointed out by DGP/2. DGP/2 does
; in no case supply default values.

num_features      = 6        ; Number of features
max_feature_value = 10       ; Maximum feature value
num_peaks         = 1        ; Number of peaks in the final space
num_instances     = 100      ; Number of instances generated
proto_seed        = 2398     ; Initial random seed
range             = 3        ; Range for positive class membership
percentage        = 67       ; Percentage of positive instances
trunc_flag        = 0        ; Out of range instance disposition flag
out_file_name     = test.out ; Filename for instances
stat_file_name    = stat.out ; Filename for run statistics

; End of parm file
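For readers who want to adapt the format, here is a minimal sketch of how
a file in this format might be read. It is an illustration only; DGP/2's
actual reader is the function READ_INPUT_PARAMETERS, whose implementation
may differ, and read_parms is a name invented here. The individual
parameters are described after the sketch.

#include <stdio.h>
#include <string.h>

/* Illustrative parameter-file reader: strip ';' comments, skip blank
   lines, and scan "parm_name = parm_value" pairs.  A real reader would
   match the names against the expected parameters and complain about
   missing or malformed ones, as DGP/2 does. */
void read_parms(const char *fname)
{
    char line[128], name[81], value[81];   /* lines are at most 80 chars */
    FILE *fp = fopen(fname, "r");

    if (fp == NULL) { perror(fname); return; }
    while (fgets(line, sizeof line, fp) != NULL) {
        char *semi = strchr(line, ';');    /* comment runs to end of line */
        if (semi != NULL)
            *semi = '\0';
        if (sscanf(line, " %80[^= \t] = %80s", name, value) == 2)
            printf("%s -> %s\n", name, value);
    }
    fclose(fp);
}

int main(int argc, char *argv[])
{
    /* PARMS.DGP is DGP/2's documented default (see the user's guide). */
    read_parms(argc > 1 ? argv[1] : "PARMS.DGP");
    return 0;
}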
num_features - This parameter allows the user to specify the number of
features desired in the final instance set (equivalent to the number of
dimensions).

max_feature_value - This value specifies the maximum feature space value
for all features, 0 being the minimum. Note that because the instances are
generated with a normal distribution about pre-selected peaks, it is
possible for a feature value to be outside of the range
[0,max_feature_value]. The disposition of these instances is specified by
the trunc_flag parameter.

num_peaks - This specifies the number of peaks in the positive data. Peaks
may be thought of as distinct concentrations of positive instances. The
height of a peak corresponds to the density of the positive instances
about the centroid of the peak. For example, if we had an instance set of
men, where the features were height and weight, we could have three peaks
corresponding to the norm for men (the highest peak), basketball players,
and football players. Each peak has a centroid which is the average of the
instances in the given peak. In DGP/2, one centroid point is generated for
each peak requested. The centroids are used as the mean values in the
normal distribution function for generating instances. Placement of the
centroids is currently done by DGP/2. Centroid generation is constrained
to a region from -0.5 * max_feature_value to +0.5 * max_feature_value.
This prevents centroids from being too close to the instance space
boundaries, which would cause many out of range instances (see the
discussion of the parameter trunc_flag).

num_instances - This parameter allows you to specify the number of
instances in the total sample, both positive and negative.

proto_seed - The random number generator used in DGP/2 is taken from the
Communications of the ACM (see the code for an exact reference). It
requires three seeds. For user convenience, however, I have requested only
a single seed, the proto_seed. This seed is used to seed the Turbo-C
random number generator, which then generates the three seeds required by
DGP/2's random number generator (see the code for details). Note that the
operation of DGP/2 is completely deterministic on the proto_seed: the same
proto_seed and parameter settings will produce the same instance file.

range - This value specifies a range around the centroids, within which
instances will be positive. This range, along with percentage (the number
of positive instances requested), determines the standard deviation value
in the normal distribution function. Using a small range with, say, 67%
positive will produce a higher peak; a larger range produces a flatter
peak. Note also that a range which is large with respect to
max_feature_value causes overlaps in the positive instances from each
peak. I generally set range to be about 10-20% of max_feature_value.

percentage - This allows you to specify what percentage of num_instances
are to be positive. The positive instances are divided evenly across the
number of peaks. The actual number of positive instances generated can
vary slightly (the exact value generated is reported in the stats file).
Variation can be due to the random nature of generation, but usually
occurs because of range overlap. If the centroids are too close and/or the
range is too large, then a given instance which is negative with respect
to the peak it was generated about may be picked up as a positive instance
by another peak. If DGP/2 refused to let another peak pick up such an
instance then we could wind up with negative instances right in the middle
of a peak. In general, based on my experimentation, this error is very low
(+/- 0.1%), but does vary as a function of the rest of the parameters.

trunc_flag - This parameter controls the disposition of instances which
have one or more feature values lying outside the bounds of the instance
space, [0,max_feature_value]. Because we use a true normal distribution
function, which is asymptotic to y=0, we can produce feature values which
are out of bounds. Trunc_flag can assume a value of 0, 1, or 2. A value of
0 causes DGP/2 to allow out of bound instances to be included in the final
instance set. Note that because of the constraints on the placement of the
centroids, the majority of out of bound instances will be negative
instances (given a reasonable value for range). A value of 1 causes DGP/2
to simply delete the offending instance. This preserves a correct
distribution but can result in fewer instances in the final instance set.
A value of 2 causes the out of bound feature value to be re-generated
until it is in bounds. This preserves the correct number of instances, but
can corrupt the distribution, resulting in more positive instances than
requested. Note that because of the randomness of generation this is the
same as regenerating the whole instance. Exactly what occurs due to the
value of trunc_flag is printed in the stats file.

out_file_name - This allows you to specify the name of the final instance
file. The format of this file is completely controlled by the function
print_instances. That function is extensively documented as to the current
format and how the format can be changed.

stat_file_name - This allows you to specify the name of the run statistics
file. This file contains complete information about the given run. Across
repeated runs this file is appended to, allowing a collection of run
statistics. To prevent this file from being kept, the null device can be
specified as the file name.
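To give a feel for how range and percentage jointly determine the standard
deviation, the following sketch works the relationship out in one
dimension. This is an illustration, not DGP/2's code: DGP/2 applies range
as a Euclidean distance in n dimensions (where distance from the centroid
follows a chi distribution), so its internal derivation differs, and
phi_inverse and sigma_for are names invented here.

#include <stdio.h>
#include <math.h>

/* Invert the standard normal CDF, Phi(z) = 0.5*(1 + erf(z/sqrt(2))),
   by bisection (erf() is provided by C99/POSIX math libraries).
   Valid for 0.5 < q < 1, which is all we need here. */
static double phi_inverse(double q)
{
    double lo = 0.0, hi = 10.0;

    while (hi - lo > 1e-9) {
        double mid = 0.5 * (lo + hi);
        if (0.5 * (1.0 + erf(mid / sqrt(2.0))) < q)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* One-dimensional version of the relationship: if a fraction p of a
   normal distribution is to fall within +/- range of its mean, then
   sigma = range / z, where Phi(z) = (1 + p) / 2. */
double sigma_for(double range, double percentage)
{
    double p = percentage / 100.0;      /* e.g., 67 -> 0.67 */
    return range / phi_inverse(0.5 * (1.0 + p));
}

int main(void)
{
    /* Sample file values: range = 3, percentage = 67 -> sigma ~ 3.08. */
    printf("sigma = %.3f\n", sigma_for(3.0, 67.0));
    return 0;
}

Under this one-dimensional reading, the sample parameter file above (range
= 3, percentage = 67) would give a standard deviation of roughly 3.1.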
5. DGP/2 OPERATION.

DGP/2 operates by generating n peak centroids within a subrange of the
total instance space range (as determined by the range of feature values)
and dividing the total number of instances across the n peaks. Feature
values are randomly generated, feature by feature, so that their frequency
of occurrence around the corresponding peak centroid value follows a
normal distribution. Specifying what percentage of instances should fall
within a given range of the peak centroids (making them positive
instances) automatically adjusts the standard deviation value in the
normal distribution function (the mean is determined by the location of
the peak centroids). Instances outside this distance are assigned to the
negative class. Note that because the range value is uniform and is
applied as a standard n-space Euclidean distance measure, the positive
instances form hyper-sphere clusters around the peaks. This, combined with
the distance between peaks, allows accurate specification of positive
instance densities.

The main routine in the code is extremely simple and reads exactly as it
runs. The code has been thoroughly commented to facilitate understanding
and modification as needed. It is suggested that the code be read through
to the level of the comments.
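The following sketch condenses this loop for a single peak. It is
illustrative rather than a copy of DGP/2's code: DGP/2 draws its normal
deviates from its three-seed generator, whereas this sketch uses rand()
with a Box-Muller transform, and std_normal, gen_instance, and the values
in main are all inventions of this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Standard normal deviate via the Box-Muller transform -- an assumption;
   DGP/2 builds its normal deviates on its own random number generator. */
static double std_normal(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* in (0,1) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979 * u2);
}

/* Generate one instance about a centroid, one feature at a time, and
   label it positive iff its Euclidean distance from the centroid is
   within range -- the hyper-sphere test described above. */
static void gen_instance(const double *centroid, int num_features,
                         double sigma, double range,
                         double *instance, int *is_positive)
{
    double dist2 = 0.0;
    int f;

    for (f = 0; f < num_features; f++) {
        double offset = sigma * std_normal();
        instance[f] = centroid[f] + offset;
        dist2 += offset * offset;
    }
    *is_positive = (sqrt(dist2) <= range);
}

int main(void)
{
    double centroid[6] = { 5, 5, 5, 5, 5, 5 };  /* illustrative values */
    double instance[6];
    int pos, i;

    srand(2398);                                /* stands in for proto_seed */
    for (i = 0; i < 5; i++) {
        gen_instance(centroid, 6, 1.0, 3.0, instance, &pos);
        printf("%s instance generated\n", pos ? "positive" : "negative");
    }
    return 0;
}

A full version would then apply the trunc_flag disposition to any feature
values falling outside [0,max_feature_value].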
6. USER'S GUIDE (such as it is).

DGP/2 takes the name of the parameter file from the command line. If no
parameter file name is given it defaults to PARMS.DGP. The parameter file
contains all run information. For example, if DGP/2 were compiled to
DGP2.EXE, then

    DGP2 MYPARMS.TXT

would run DGP/2 using the file MYPARMS.TXT as the parameter file.

Typically, the only modifications required to the DGP/2 code will be in
the function print_instances. Since this function prints the instances,
along with their class membership, it will have to be modified to reflect
individual users' needs. The function is extensively commented, indicating
both the current form of the instance file as well as where modifications
would most likely have to be made. Additionally, this function has the
responsibility of rounding the instance values. All instance, and
centroid, values are generated as reals. For my application I required
integers. To this effect, my version of print_instances rounds feature
values based on whether they are positive or negative and their location
relative to the peak they belong to. Again, the code contains extensive
documentation about this.

7. GENERAL INFORMATION.

DGP/2 was written in C using Borland's Turbo C++ compiler, although no C++
features were used. I have used only ANSI supported functions to enhance
portability, with the single exception of the Turbo C++ random number
generator (the code is appropriately documented to indicate where this
area is). The only other C compiler I have used is GCC (Gnu's C Compiler).
Although the code required a few modifications, it was reasonably
straightforward to convert. I have to the best of my ability debugged this
code and I am fairly sure it works correctly. Like any proud parent, I may
have overlooked something.

In all cases I would appreciate any and all correspondence regarding
DGP/2. This includes bugs (and hopefully, fixes), suggestions for
improvements, etc. In particular, if this code is translated to run under
different compilers I would be very interested in obtaining a copy. In
fact, anything you might want to say can be said to benedict@cs.uiuc.edu.

I will make periodic updates to this program. As they become available I
will post them to any central place this code has been posted. I am
keeping a list of people requesting updates. If you would like to be on
this list please let me know.

REFERENCES

Benedict, P.A., The use of synthetic data in dynamic bias selection, Proc.
of the 6th Aerospace Applications of Artificial Intelligence Conference,
Dayton, Ohio, October 1990.

Ehrenfeucht, A., Haussler, D., Kearns, M., Valiant, L., A general lower
bound on the number of examples needed for learning, Proc. Computational
Learning Theory, 1988, pp. 139-154.

Kononenko, I., Bratko, I., Roskar, E., Experiments in automatic learning
of medical diagnostic rules (Ljubljana, Yugoslavia: Jozef Stefan
Institute, 1984).

Michalski, R.S., Mozetic, I., Hong, J., Lavrac, N., The multipurpose
incremental learning system AQ15 and its testing application to three
medical domains, Proc. of the Fifth National Conference on Artificial
Intelligence, pp. 1041-1045, Morgan Kaufmann, Los Altos, CA, 1986.

Mitchell, T.M., The need for biases in learning generalizations, Technical
Report CBM-TR-117, May 1980.

Rendell, L.A., A new basis for state-space learning systems and a
successful implementation, Artificial Intelligence 20 (1983): 369-392.

Rendell, L.A., Cho, H.H., The effect of data character on empirical
concept learning, Proc. Fifth International Conference on Artificial
Intelligence Applications, March 1989.

Rendell, L.A., Benedict, P.A., Cho, H.H., Seshu, R., Improving the design
of rule-learning systems, Proc. Seventh International Conference on Expert
Systems and their Applications, June 1988.

Rendell, L., Seshu, R., Learning hard concepts, to appear in Computational
Intelligence, 1990.

Rendell, L.A., Seshu, R.M., Tcheng, D.K., Layered concept learning and
dynamically-variable bias management, Proc. Tenth International Joint
Conference on Artificial Intelligence, 1987.

Russell, S., Grosof, B., Declarative bias: An overview, in P. Benjamin
(Ed.), Change of Representation and Inductive Bias, Kluwer Academic Press,
1990.

Utgoff, P.E., Shift of bias for inductive concept learning, in Machine
Learning: An Artificial Intelligence Approach, Vol. II, 1986.

Utgoff, P.E., Mitchell, T.M., Acquisition of appropriate bias for
inductive concept learning, Proc. National Conference on Artificial
Intelligence, 1982.