1. Title of Database: Document Understanding
2. Sources:
   (a) Donato Malerba
       Dipartimento di Informatica
       University of Bari
       via Orabona 4
       70126 Bari - Italy
       phone: +39 - 80 - 5443269
       fax: +39 - 80 - 5443196
       malerbad@vm.csata.it
   (b) Donor: Donato Malerba
   (c) Date: November 1994
3. Past Usage:
   These data were used for the first time by the donor during his
   stage at ICS, University of California, Irvine (Sept.-Dec,1992).
   Initially, results were published in a technical report of the ESPRIT 
   Project 5203 INTREPID (Innovative Techniques for REcognition and 
   ProcessIng of Documents), entitled

     Malerba D.
     Document Understanding: A Machine Learning Approach
     Technical Report, Esprit Project 5203 INTREPID, 4 March 1993.

   Experiments were performed with FOCL, hence the format of the data files.
   A summary of the results has also been published in:

     Esposito F., Malerba D., Semeraro G., & Pazzani M.
     A Machine Learning Approach to Document Understanding
	 Proc. 2nd Int. Workshop on Multistrategy Learning, Harpers Ferry,
	 WV, pp. 276-292, May 1993.

	 Esposito F., Malerba D., & Semeraro G.
	 Learning Contextual Rules in First-Order Logic
	 Proc. 4th Italian Workshop on Machine Learning (GAA93), Milan, Italy,
	 pp. 111-127, June 1993.

	 Esposito F., Malerba D., & Semeraro G.
     Automated Acquisition of Rules for Document Understanding
	 Proc. of the 2nd Int. Conf. on Document Analysis and Recognition,
	 Tsukuba Science City, Japan, pp. 650-654, October 1993.

	 Semeraro G., Esposito F., & Malerba D.
	 Learning Contextual Rules for Document Understanding
	 Proc. 10th IEEE Conf. on Artificial Intelligence for Applications
	 San Antonio, Texas, pp. 108-115, March 1994.

   There are five concepts, expressed as predicates, to be learned.
   They concern five logical components that is possible to 
   identify in a sample of business letters, namely sender, receiver,
   logotype, reference number and date. The problem is complicated
   by the presence of dependencies among concepts. The problem can 
   be cast as a mulptiple predicate learning problem. Experimental
   results show that learning contextual rules, that is rules
   in which concept dependencies are explicitely considered,
   leads to better results. 
   For a detailed presentation of the whole document processing
   system see also:

	 Esposito F., Malerba D., & Semeraro G.
	 Multistrategy Learning for Document Recognition
	 Applied Artificial Intelligence, 8, pp. 33-84, 1994

4. Relevant Information Paragraph:

   In the experimentation, 30 single page documents were considered. 
   They are copies of letters sent by Olivetti. Six trials were
   performed by randomly selecting 20 documents for the training set
   and 10 for the test set. Each document is identified by a letter
   (A to Z) or a pair of letters (AA, AB, AC, AD).

   Trial  Training documents
     1    A B C D E F G H I J K L M N O P Q R S T
     2    C D E F G H I M P R S V X Y W Z AA AB AC AD
     3    C D E F G H I J K P R S T U V Y W AA AB AC
     4    A B C D E F G J L M N O P Q T V X Z AB AD
     5    A B E F G I J K M N O P Q R T V X Z AA AD
     6    A B C D E F G I J M Q S T X Y Z AA AB AC AD

5. Number of Instances

   Since the problem concerns the classification of parts of a
   document, there are more than 20 training instances (positive
   and negative) per concept. More precisely, we have:

   Trial  No. of training instances   No. of test instances
     1              254                    110
	 2              241                    123
	 3              250                    114
	 4              242                    122
	 5              234                    130
	 6              244                    120

   Moreover, there may be more than one instance of a concept in a
   document, since some logical components may have been fragmented into
   several layout blocks that the layout analysis was not able to group
   together. 

6. Number of Attributes 

   Each document page layout is described by means of the following 
   target predicates:
     logic_type-sender(X)
     logic_type-receiver(X)
     logic_type-logo(X)
     logic_type-ref(X)
     logic_type-date(X)
   and the following operational predicates:
     width-very-very-small(X)                width of a block
     width-very-small(X)
     width-small(X)
     width-medium-small(X)
     width-medium(X)
     width-medium-large(X)
     width-large(X)
     width-very-large(X)
     width-very-very-large(X)
	 height-smallest(X)                      height of a block
	 height-very-very-small(X)
	 height-very-small(X)
	 height-small(X)
	 height-medium-small(X)
	 height-medium(X)
	 height-medium-large(X)
	 height-large(X)
	 height-very-large(X)
	 height-very-very-large(X)
	 height-largest(X)
	 type-text(X)                            type of a block
	 type-hor-line(X)
	 type-picture(X)
	 type-ver-line(X)
	 type-graphic(X)
	 type-mixture(X)
	 position-top-left(X)                    position of a block in the page
	 position-top(X)
	 position-top-right(X)
	 position-left(X)
	 position-center(X)
	 position-right(X)
	 position-bottom-left(X)
	 position-bottom(X)
	 position-bottom-right(X)
     part-of(X,Y)                            X denotes a doc., Y a block
	 on-top(X,Y)                             block X on top block Y
	 to-right(X,Y)                           block X to right block Y
	 aligned-only-left-col(X,Y)              alignment between two blocks
	 aligned-only-right-col(X,Y)
	 aligned-only-middle-col(X,Y)
	 aligned-both-columns(X,Y)
	 aligned-only-upper-row(X,Y)
	 aligned-only-lower-row(X,Y)
	 aligned-only-middle-row(X,Y)
	 aligned-both-rows(X,Y)

7. Missing Attribute Values:  No missing value.
8. Class Distribution: different from concept to concept and from
                       trial to trial.
9. Additional Notes: the file FOIL.data contains the descriptions of the
                     instances of all documents. Data are in a format
                     readable by FOIL4.0.