Note: character.tar contains everything in this directory, as well as
      train and test sets.

1. TITLE: Artificial Character Database

2. SOURCES:

      Marco Botta
      Dipartimento di Informatica
      Università di Torino
      Corso Svizzera 185
      10149 Torino
      ITALY

      Tel. (+39)(11)7712002
      Fax. (+39)(11)751603
      email: botta@di.unito.it

      July, 1992

3. PAST USAGE:

   (a) M. Botta, A. Giordana, L. Saitta: "Learning Fuzzy Concept
       Definitions", submitted to IEEE-Fuzzy Conference.

       We ran three types of experiments with Smart+: with default
       values for the parameter constants in the fuzzy definitions of
       the predicates, with a local optimization algorithm, and with a
       genetic algorithm (GA) to acquire parameter values
       automatically.
       The local optimization algorithm and the GA are also described
       in:
       M. Botta, A. Giordana: "Learning Quantitative Feature in a
       Symbolic Environment", LNAI 542, 1991, pp. 296-305.

       The results obtained on this data set are the following:

       Type of Optimization  Recognition Rate  Error Rate  Ambiguity Rate
       No OPT                     41.48%          3.82%        54.70%
       Local OPT                  98.68%          0.12%         1.20%
       Local+GA OPT               99.70%          0.0%          0.30%

4. RELEVANT INFORMATION:

   This database has been artificially generated by using a first
   order theory which describes the structure of ten capital letters
   of the English alphabet and a random choice theorem prover which
   accounts for heterogeneity in the instances.
   The capital letters represented are the following:
   A, C, D, E, F, G, H, L, P, R.
   Each instance is structured and is described by a set of segments
   (lines) which resemble the way an automatic program would segment
   an image.
   Each instance is stored in a separate file whose format is the
   following:

      CLASS OBJNUM TYPE XX1 YY1 XX2 YY2 SIZE DIAG

   where CLASS is an integer number indicating the class as described
   below, OBJNUM is an integer identifier of a segment (starting
   from 0) in the instance, and the remaining columns represent
   attribute values.
   For further details contact the author.

5. NUMBER OF INSTANCES:

   1000 instances (100 per class) as learning set.
   5000 instances (500 per class) as test set.

6. NUMBER OF ATTRIBUTES:

   Each segment in an instance is described by seven attributes.
   Four of them are the most important, one is superfluous, and the
   other two can be computed from the important ones but are present
   for efficiency reasons.

7. ATTRIBUTE INFORMATION:

   TYPE: the first attribute describes the type of segment and is
         always set to the string "line".
         Its C language type is char.

   XX1,YY1,XX2,YY2: these attributes contain the initial and final
         coordinates of a segment in a cartesian plane.
         Their C language type is int.

   SIZE: this is the length of a segment computed by using the
         geometric distance between two points A(X1,Y1) and B(X2,Y2).
         Its C language type is float.

   DIAG: this is the length of the diagonal of the smallest rectangle
         which includes the picture of the character. The value of
         this attribute is the same for every object (segment) of an
         instance.
         Its C language type is float.

8. MISSING ATTRIBUTE VALUES: None

9. CLASS DISTRIBUTION:

   The class value (CLASS) can take ten different values. Each letter
   belongs to exactly one class.

      CLASS    NAME    TRAINING    TESTING    TOTAL
        1       A         100        500       600
        2       C         100        500       600
        3       D         100        500       600
        4       E         100        500       600
        5       F         100        500       600
        6       G         100        500       600
        7       H         100        500       600
        8       L         100        500       600
        9       P         100        500       600
       10       R         100        500       600
      -----------------------------------------------
      TOTAL              1000       5000      6000
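
EXAMPLE (illustrative only, not part of the original distribution):

   The instance file format from section 4 and the derived attributes
   from section 7 could be read and checked roughly as follows. This is
   a minimal Python sketch under several assumptions: columns are
   separated by whitespace with one segment per line, "geometric
   distance" means Euclidean distance, and the smallest enclosing
   rectangle is axis-aligned. The names Segment, parse_instance,
   segment_length and bounding_diagonal are hypothetical helpers, and
   SAMPLE is made-up data for a letter "L", not taken from the
   database.

```python
import math
from dataclasses import dataclass

# Class codes as listed in section 9 (CLASS DISTRIBUTION).
CLASS_NAMES = {1: "A", 2: "C", 3: "D", 4: "E", 5: "F",
               6: "G", 7: "H", 8: "L", 9: "P", 10: "R"}


@dataclass
class Segment:
    cls: int      # CLASS
    objnum: int   # OBJNUM, segment identifier starting from 0
    type: str     # TYPE, always "line"
    x1: int       # XX1
    y1: int       # YY1
    x2: int       # XX2
    y2: int       # YY2
    size: float   # SIZE, stored segment length
    diag: float   # DIAG, stored bounding-rectangle diagonal


def parse_instance(text):
    """Parse one instance (assumed: whitespace-separated, one segment per line)."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        c, n, t, x1, y1, x2, y2, size, diag = fields
        segments.append(Segment(int(c), int(n), t, int(x1), int(y1),
                                int(x2), int(y2), float(size), float(diag)))
    return segments


def segment_length(seg):
    """Recompute SIZE, assuming Euclidean distance between the endpoints."""
    return math.hypot(seg.x2 - seg.x1, seg.y2 - seg.y1)


def bounding_diagonal(segments):
    """Recompute DIAG, assuming an axis-aligned enclosing rectangle."""
    xs = [v for s in segments for v in (s.x1, s.x2)]
    ys = [v for s in segments for v in (s.y1, s.y2)]
    return math.hypot(max(xs) - min(xs), max(ys) - min(ys))


# Made-up instance: a vertical and a horizontal stroke forming an "L".
SAMPLE = """8 0 line 0 0 0 4 4.0 5.0
8 1 line 0 0 3 0 3.0 5.0"""

if __name__ == "__main__":
    segs = parse_instance(SAMPLE)
    print(CLASS_NAMES[segs[0].cls], len(segs), "segments")
    for s in segs:
        print(s.objnum, s.size, segment_length(s))
    print("DIAG", segs[0].diag, bounding_diagonal(segs))
```

   Comparing the recomputed values against the stored SIZE and DIAG
   columns is one way to test whether these assumptions hold for the
   real files. If they do not, the distance or rectangle definition
   used by the generator differs from the one assumed here.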