A Named Entity (NE) is a phrase in a text that uniquely refers to a real-world entity. The identification and classification of NEs serves as a basis for several NLP applications (especially in information extraction and machine translation). The group has constructed several manually annotated NE corpora for Hungarian and has developed an NER system, which has been successfully applied to Hungarian and English business news and to English clinical texts.
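The task can be illustrated with the common BIO labelling scheme, in which each token is tagged as beginning (B-), inside (I-), or outside (O) an entity of a given class. The sketch below is a generic illustration of this scheme, not the group's actual NER system:

```python
# Minimal illustration of NER output in the BIO scheme (generic sketch,
# not the group's system): recover (phrase, class) pairs from tagged tokens.

def bio_to_entities(tagged_tokens):
    """Collect (phrase, class) pairs from a BIO-tagged token sequence."""
    entities, current, label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):                 # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:   # the current entity continues
            current.append(token)
        else:                                    # token is outside any entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

sentence = [("John", "B-PER"), ("Smith", "I-PER"), ("joined", "O"),
            ("Nokia", "B-ORG"), ("in", "O"), ("Helsinki", "B-LOC")]
print(bio_to_entities(sentence))
# [('John Smith', 'PER'), ('Nokia', 'ORG'), ('Helsinki', 'LOC')]
```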
Many information extraction applications seek to extract factual information from text, which is why it is highly important to distinguish uncertain and/or negated text spans from factual statements. Since what users usually need is factual information, uncertain or negated text spans should be treated in a special way: depending on the task, the system should either discard such spans or separate them from the factual content (the user can later decide whether they are needed). Training and evaluating such systems requires corpora annotated for negation and speculation.
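A common baseline for this separation is cue-word matching: a sentence containing a negation or speculation cue is routed away from the factual output. The cue lists below are invented for illustration; this is a naive sketch, not the group's detector:

```python
# Naive cue-word baseline for separating factual sentences from
# negated or speculative ones. The cue lists are illustrative only.
NEGATION_CUES = {"no", "not", "never", "without", "absence"}
SPECULATION_CUES = {"may", "might", "possibly", "suggest", "likely"}

def classify_sentence(sentence):
    """Label a sentence as negated, speculative, or factual."""
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    if tokens & NEGATION_CUES:
        return "negated"
    if tokens & SPECULATION_CUES:
        return "speculative"
    return "factual"

print(classify_sentence("The scan shows no sign of fracture."))    # negated
print(classify_sentence("The results suggest a viral infection.")) # speculative
print(classify_sentence("The patient received the usual dose."))   # factual
```

Real systems must also resolve the scope of each cue (which part of the sentence it affects), which is precisely what scope-annotated corpora make learnable.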
Due to the exponentially growing number of publications, there is a strong need for automatic information extraction in the biomedical domain. The Group's activities in this field mainly focus on the disambiguation of biological terms and the detection of uncertain and negated assertions.
In medical documents (e.g. findings or case histories), a huge amount of information is encoded in free-text format, and automated processing of these texts would make the data easily accessible. The Group's results in this field involve the automatic coding of radiological findings and the anonymization of medical documents.
The toolkit called magyarlanc aims at the basic linguistic processing of Hungarian texts. The toolkit consists solely of Java modules (there are no wrappers for other programming languages), which guarantees its platform independence and its ability to be integrated into larger systems (e.g. web servers).
Free text tagging is the task of assigning a few natural language phrases to documents that summarize them and semantically represent their content. The Group has developed a solution for the automatic tagging of the [origo] news archive.
Keyphrases summarize the content of documents with their most important phrases, and automatic keyphrase extraction aims at assigning a set of such keyphrases to documents based on their content. For more effective processing of scientific papers, keywords should be supplied by the authors; however, manually assigned keyphrases are rarely provided, and creating them by hand would be costly and time-consuming. The constant growth in the number of publications has boosted interest in automatic keyphrase generation.
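As a rough illustration of unsupervised keyphrase extraction (not the group's method), candidate terms can be ranked by a TF-IDF-style score, favouring words frequent in one document but rare across the collection:

```python
# TF-IDF sketch for keyphrase candidates: score each word of one
# document against a small toy collection and return the top ones.
import math
from collections import Counter

def extract_keyphrases(documents, doc_index, top_k=3):
    """Rank the words of documents[doc_index] by TF-IDF over the collection."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter()                      # document frequency of each word
    for doc in tokenized:
        df.update(set(doc))
    tf = Counter(tokenized[doc_index])  # term frequency in the target document
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

docs = ["the treebank contains annotated sentences",
        "the corpus contains clinical records",
        "wordnet is a lexical database"]
print(extract_keyphrases(docs, 0))
```

Production systems typically score multiword candidate phrases rather than single words, and supervised extractors learn the ranking from author-assigned keyphrases instead of using a fixed formula.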
Extracting keyphrases from textual documents can be valuable in many application areas, ranging from information retrieval to topic detection and summarization.
The supervised keyphrase extractor introduced here was trained on the pros and cons that review authors assigned to their reviews on the epinions.com site. These pros and cons are ill-structured free-text annotations whose length, depth and style are extremely heterogeneous. To obtain clean gold-standard corpora, we manually revised the segmentation and contents of the pros and cons, yielding sets of tag-like keyphrases.
Scientific social web mining aims at extracting global patterns from the network of researchers in a given field. The Group has developed a method for analyzing the full-text regions of researchers' homepages, which makes it possible to acquire scientific social information automatically.
Disambiguating person names is a challenging task that can be seen as a special case of word sense disambiguation. On the one hand, names are ambiguous: thousands of people may share a first name or surname. On the other hand, certain names tend to occur in several variants. As a result, query results contain homepages that belong to different people with the same name, while certain homepages belonging to the intended person are not retrieved.
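A standard way to approach this is to cluster the retrieved pages by contextual similarity, so that pages about the same person end up together. The following is a simplified sketch with toy data and a greedy Jaccard-overlap criterion, not the group's method:

```python
# Greedy clustering of retrieved pages by word overlap: pages about
# the same person tend to share contextual vocabulary.

def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.2):
    """Greedily group pages whose word overlap exceeds the threshold."""
    clusters = []
    for words in pages:
        for cluster in clusters:
            if any(jaccard(words, member) >= threshold for member in cluster):
                cluster.append(words)
                break
        else:
            clusters.append([words])
    return clusters

pages = [
    {"guitar", "album", "tour", "band"},    # musician "John Smith"
    {"album", "concert", "band", "stage"},  # the same musician
    {"protein", "laboratory", "genome"},    # researcher "John Smith"
]
print(len(cluster_pages(pages)))  # 2 -- two distinct people
```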
Multiword expressions (MWEs) are lexical units that consist of two or more words (tokens) yet exhibit special syntactic, semantic, pragmatic or statistical features. From an NLP point of view, their treatment is not without problems: on the one hand, the system should recognize that they constitute a single lexical unit (rather than two or more independently connected words), so it is advisable to store them as one unit in the lexicon; on the other hand, special rules for their treatment should also be included in the system.
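Storing an MWE as one lexicon entry implies that segmentation should prefer the longest matching unit over its component words. The sketch below, with an invented mini-lexicon, illustrates this longest-match lookup:

```python
# Longest-match segmentation against an MWE lexicon: "kick the bucket"
# is returned as one lexical unit, not three separate words.
# The lexicon here is invented for illustration.
MWE_LEXICON = {("kick", "the", "bucket"), ("by", "and", "large")}
MAX_MWE_LEN = 3

def segment(tokens):
    """Split a token list into lexical units, preferring MWE matches."""
    units, i = [], 0
    while i < len(tokens):
        for n in range(MAX_MWE_LEN, 1, -1):          # try the longest span first
            if tuple(tokens[i:i + n]) in MWE_LEXICON:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:                                        # no MWE starts here
            units.append(tokens[i])
            i += 1
    return units

print(segment(["kick", "the", "bucket", "now"]))
# ['kick the bucket', 'now']
```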
Morphdb.hu is one of the most widely used morphological resources for Hungarian; it makes use of the KR morphological annotation system. However, the largest manually disambiguated corpus, the Szeged Treebank, is annotated with MSD codes. The two coding systems are not compatible, which means that if we want to exploit both resources in a statistical language parser (POS tagger, constituency parser, dependency parser, etc.), we have to rely on conversion rules, which leads to information loss. To avoid this, we harmonized the two coding systems (MSD and KR) and made their basic principles compatible as well. Thanks to the harmonized morphology, it is possible to build a morphological parser whose output is fully compatible with the Szeged Treebank; thus, higher-level text processing systems (such as the toolkits magyarlanc and hun*) can make use of all the morphological information encoded in the Szeged Treebank when training their statistical models.
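The core of the problem can be pictured as a conversion table between the two tagsets. The codes below are heavily simplified and partly invented for illustration; real MSD and KR codes carry far more features, which is exactly why naive one-way mapping loses information:

```python
# Toy illustration of tagset conversion between two coding systems.
# The codes are simplified/invented; real MSD and KR inventories are
# much richer, so an incomplete table silently drops distinctions.
MSD_TO_KR = {
    "Nc-sn": "NOUN<NOM>",         # common noun, singular, nominative
    "Nc-pn": "NOUN<PLUR><NOM>",   # common noun, plural, nominative
    "Vmis3s": "VERB<PAST><3SG>",  # verb, indicative past, 3rd singular
}

def convert(msd_code):
    """Map an MSD code to a KR code; None marks an unmappable code."""
    return MSD_TO_KR.get(msd_code)

print(convert("Nc-pn"))   # NOUN<PLUR><NOM>
print(convert("Afp-sn"))  # None -- code missing from the table
```

Harmonizing the coding systems themselves, as described above, removes the need for such lossy tables: both resources then express morphology in compatible terms.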
Our group members are experienced in constructing language resources and corpora: besides the two main language resources (the Szeged Treebank and the Hungarian WordNet), they have built several other databases as well.