1. Title of Database: Document Understanding

2. Sources:
   (a) Donato Malerba
       Dipartimento di Informatica
       University of Bari
       via Orabona 4
       70126 Bari - Italy
       phone: +39 - 80 - 5443269
       fax: +39 - 80 - 5443196
       malerbad@vm.csata.it
   (b) Donor: Donato Malerba
   (c) Date: November 1994

3. Past Usage:
   These data were first used by the donor during his stay at ICS,
   University of California, Irvine (Sept.-Dec. 1992). The first results
   were published in a technical report of the ESPRIT Project 5203
   INTREPID (Innovative Techniques for REcognition and ProcessIng of
   Documents):

      Malerba D.
      Document Understanding: A Machine Learning Approach
      Technical Report, Esprit Project 5203 INTREPID, 4 March 1993.

   Experiments were performed with FOCL, hence the format of the data
   files. A summary of the results has also been published in:

      Esposito F., Malerba D., Semeraro G., & Pazzani M.
      A Machine Learning Approach to Document Understanding
      Proc. 2nd Int. Workshop on Multistrategy Learning,
      Harpers Ferry, WV, pp. 276-292, May 1993.

      Esposito F., Malerba D., & Semeraro G.
      Learning Contextual Rules in First-Order Logic
      Proc. 4th Italian Workshop on Machine Learning (GAA93),
      Milan, Italy, pp. 111-127, June 1993.

      Esposito F., Malerba D., & Semeraro G.
      Automated Acquisition of Rules for Document Understanding
      Proc. of the 2nd Int. Conf. on Document Analysis and Recognition,
      Tsukuba Science City, Japan, pp. 650-654, October 1993.

      Semeraro G., Esposito F., & Malerba D.
      Learning Contextual Rules for Document Understanding
      Proc. 10th IEEE Conf. on Artificial Intelligence for Applications,
      San Antonio, Texas, pp. 108-115, March 1994.

   There are five concepts, expressed as predicates, to be learned. They
   correspond to five logical components that can be identified in a
   sample of business letters, namely sender, receiver, logotype,
   reference number and date. The problem is complicated by the presence
   of dependencies among concepts.
   The problem can be cast as a multiple predicate learning problem.
   Experimental results show that learning contextual rules, that is,
   rules in which concept dependencies are explicitly considered, leads
   to better results. For a detailed presentation of the whole document
   processing system see also:

      Esposito F., Malerba D., & Semeraro G.
      Multistrategy Learning for Document Recognition
      Applied Artificial Intelligence, 8, pp. 33-84, 1994.

4. Relevant Information Paragraph:
   In the experiments, 30 single-page documents were considered. They
   are copies of letters sent by Olivetti. Six trials were performed,
   each by randomly selecting 20 documents for the training set and 10
   for the test set. Each document is identified by a letter (A to Z) or
   a pair of letters (AA, AB, AC, AD).

   Trial   Training documents
     1     A B C D E F G H I J K L M N O P Q R S T
     2     C D E F G H I M P R S V X Y W Z AA AB AC AD
     3     C D E F G H I J K P R S T U V Y W AA AB AC
     4     A B C D E F G J L M N O P Q T V X Z AB AD
     5     A B E F G I J K M N O P Q R T V X Z AA AD
     6     A B C D E F G I J M Q S T X Y Z AA AB AC AD

5. Number of Instances
   Since the problem concerns the classification of parts of a document,
   there are more than 20 training instances (positive and negative) per
   concept. More precisely, we have:

   Trial   No. of training instances   No. of test instances
     1               254                        110
     2               241                        123
     3               250                        114
     4               242                        122
     5               234                        130
     6               244                        120

   Moreover, there may be more than one instance of a concept in a
   document, since some logical components may have been fragmented into
   several layout blocks that the layout analysis was not able to group
   together.

6. Number of Attributes
   Each document page layout is described by means of the following
   target predicates:

      logic_type-sender(X)
      logic_type-receiver(X)
      logic_type-logo(X)
      logic_type-ref(X)
      logic_type-date(X)

   and the following operational predicates:

      width-very-very-small(X)        width of a block
      width-very-small(X)
      width-small(X)
      width-medium-small(X)
      width-medium(X)
      width-medium-large(X)
      width-large(X)
      width-very-large(X)
      width-very-very-large(X)

      height-smallest(X)              height of a block
      height-very-very-small(X)
      height-very-small(X)
      height-small(X)
      height-medium-small(X)
      height-medium(X)
      height-medium-large(X)
      height-large(X)
      height-very-large(X)
      height-very-very-large(X)
      height-largest(X)

      type-text(X)                    type of a block
      type-hor-line(X)
      type-picture(X)
      type-ver-line(X)
      type-graphic(X)
      type-mixture(X)

      position-top-left(X)            position of a block in the page
      position-top(X)
      position-top-right(X)
      position-left(X)
      position-center(X)
      position-right(X)
      position-bottom-left(X)
      position-bottom(X)
      position-bottom-right(X)

      part-of(X,Y)                    X denotes a document, Y a block
      on-top(X,Y)                     block X is on top of block Y
      to-right(X,Y)                   block X is to the right of block Y

      aligned-only-left-col(X,Y)      alignment between two blocks
      aligned-only-right-col(X,Y)
      aligned-only-middle-col(X,Y)
      aligned-both-columns(X,Y)
      aligned-only-upper-row(X,Y)
      aligned-only-lower-row(X,Y)
      aligned-only-middle-row(X,Y)
      aligned-both-rows(X,Y)

7. Missing Attribute Values: None.

8. Class Distribution: Differs from concept to concept and from trial
   to trial.

9. Additional Notes:
   The file FOIL.data contains the descriptions of the instances of all
   documents. Data are in a format readable by FOIL4.0.