University of Szeged Natural Language Processing Group Hungarian Academy of Sciences

Web Page of HomePage Corpus

This site supports the HomePage Corpus and its Annotation Tool. The corpus is manually and extensively annotated for Web Content Mining, and it is freely available for research purposes. We developed the Annotation Tool as a Firefox extension that lets annotators work with the pages in their original appearance. The tool handles the annotation hierarchy independently of the DOM tree of the web pages, and it allows annotations to overlap HTML tag boundaries. For more details, please read our article.
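To illustrate what annotating independently of the DOM tree can look like, here is a minimal sketch of standoff annotation in Python. It is not the Annotation Tool's actual implementation or storage format. The class names, the labels `person` and `organization`, the `parent` field, and the sample HTML are all assumptions made for this example. Annotations are stored as character offsets into the page text, so a span may begin inside one HTML element and end outside it.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


class TextExtractor(HTMLParser):
    """Collects the page text and the text range covered by each element."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.open_tags = []   # stack of (tag name, start offset in text)
        self.tag_spans = []   # (tag name, start offset, end offset)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, len(self.text)))

    def handle_endtag(self, tag):
        # Assumes well-formed, properly nested HTML for this sketch.
        if self.open_tags and self.open_tags[-1][0] == tag:
            name, start = self.open_tags.pop()
            self.tag_spans.append((name, start, len(self.text)))

    def handle_data(self, data):
        self.text += data


@dataclass(frozen=True)
class Annotation:
    """A labelled text span; the hierarchy lives in `parent`, not in the DOM."""
    label: str                              # hypothetical label name
    start: int                              # inclusive offset into the page text
    end: int                                # exclusive offset into the page text
    parent: Optional["Annotation"] = None   # hypothetical hierarchy link


def crosses_tag_boundary(ann, tag_spans):
    """True if the annotation partially overlaps an element (neither nested in the other)."""
    for _, start, end in tag_spans:
        overlaps = ann.start < end and start < ann.end
        nested = (start <= ann.start and ann.end <= end) or (
            ann.start <= start and end <= ann.end
        )
        if overlaps and not nested:
            return True
    return False


# Hypothetical page fragment: only part of the name is inside <b>.
html = "<p><b>Dr. Jane</b> Doe works at Example University</p>"
parser = TextExtractor()
parser.feed(html)
parser.close()
text = parser.text

name_start = text.index("Jane Doe")
person = Annotation("person", name_start, name_start + len("Jane Doe"))

org_start = text.index("Example University")
organization = Annotation(
    "organization", org_start, org_start + len("Example University"), parent=person
)

print(text[person.start:person.end], crosses_tag_boundary(person, parser.tag_spans))
print(
    text[organization.start:organization.end],
    crosses_tag_boundary(organization, parser.tag_spans),
)
```

In this sketch the `person` span starts inside the `<b>` element and ends after it, which a DOM-node-based annotation could not express directly, while the `organization` span lies entirely within the `<p>` element.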

HomePage Corpus

You can download the corpus here.

Articles and Statistics

Farkas, Richárd; Ormándi, Róbert; Jelasity, Márk; Csirik, János 2008: A manually annotated HTML corpus for a novel scientific trend analysis. In: Proceedings of the Eighth IAPR International Workshop on Document Analysis Systems (DAS2008): Extended abstracts, Nara, Japan.

Statistics (26/06/2008)

Annotation Tool

Annotation Tool as a Firefox extension download
Firefox Portable with Annotation Tool download