In computational linguistics, especially in information extraction and retrieval, it is essential to distinguish uncertain statements from factual information. In most cases the user needs factual information, so uncertain and negated propositions should be treated in a special way: depending on the exact task, the system should either ignore such texts or separate them from factual information.
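The "ignore or separate" strategy described above can be illustrated with a minimal sketch. The cue lists and the function names `classify` and `separate` are hypothetical and chosen only for illustration; they are not taken from any of the corpora or systems discussed here. Simple word lookup of this kind is far weaker than a trained detector, since many cue words are ambiguous and the sketch ignores cue scope entirely.

```python
import re

# Hypothetical toy cue lists, for illustration only (not drawn from any corpus).
UNCERTAINTY_CUES = {"may", "might", "possibly", "probably", "suggest", "suggests"}
NEGATION_CUES = {"not", "no", "never"}


def classify(sentence):
    """Label a sentence 'uncertain', 'negated' or 'factual' by token lookup."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if tokens & UNCERTAINTY_CUES:
        return "uncertain"
    if tokens & NEGATION_CUES:
        return "negated"
    return "factual"


def separate(sentences):
    """Group sentences by label so that factual ones can be processed alone."""
    groups = {"factual": [], "uncertain": [], "negated": []}
    for sentence in sentences:
        groups[classify(sentence)].append(sentence)
    return groups


sentences = [
    "The protein binds to the receptor.",
    "The protein may bind to the receptor.",
    "The protein does not bind to the receptor.",
]
groups = separate(sentences)

# A system that ignores non-factual text would keep only groups["factual"];
# one that separates it would pass all three groups on with their labels.
for label, items in groups.items():
    print(label, items)
```

A system that needs only facts would read `groups["factual"]` and drop the rest, while one that separates the material would keep all three groups and their labels.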
Our researchers developed the BioScope corpus, in which biomedical texts are annotated for uncertainty and negation cues and their scopes. The objective of the CoNLL-2010 Shared Task, organized by the members of the human language technology group, was the automatic identification of uncertainty cues and their scopes. Following the annotation principles applied in the construction of the databases used in the shared task, we created a database of Hungarian Wikipedia articles annotated for uncertainty cues known as weasels. This corpus can play an essential role in implementing and evaluating uncertainty detectors for Hungarian.
Several corpora annotated for uncertainty have been constructed for different genres and domains (BioScope, FactBank, WikiWeasel and MPQA, to name just a few). However, these corpora cover different aspects of uncertainty and are grounded in different linguistic models, which makes it hard to exploit cross-domain knowledge in applications. These differences stem in part from the varied needs of application domains, since different types of uncertainty and different classes of linguistic expressions are relevant for different domains. A fine-grained categorization of semantic uncertainty enables the individual treatment of each subclass, which is less dependent on domain differences than using one coarse-grained uncertainty class.
Based on the above fine-grained categorization, we manually harmonized the uncertainty annotations of three corpora, yielding the Szeged Uncertainty Corpus:
The feasibility of this categorization of uncertainty phenomena was supported by training an accurate semantic uncertainty detector on the above corpora, that is, on texts from several domains and genres. Our experiments with domain adaptation techniques also show that the unified subcategorization and domain adaptation, taken together, offer an efficient solution for cross-domain and cross-genre semantic uncertainty recognition. Our results are reported in Szarvas et al. (2012).
The resources can be used free of charge under the Creative Commons Attribution-ShareAlike licence.
Farkas, Richárd; Vincze, Veronika; Móra, György; Csirik, János; Szarvas, György 2010: The CoNLL-2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Text. In: Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL-2010): Shared Task, Uppsala, Sweden, pp. 1-12.
Szarvas, György; Vincze, Veronika; Farkas, Richárd; Móra, György; Gurevych, Iryna 2012: Cross-Genre and Cross-Domain Detection of Semantic Uncertainty. Computational Linguistics - Special Issue on Modality and Negation, 38(2):335-367.
Saurí, Roser; Pustejovsky, James 2009: FactBank: a corpus annotated with event factuality. Language Resources and Evaluation 43:227-268.
Vincze, Veronika; Szarvas, György; Farkas, Richárd; Móra, György; Csirik, János 2008: The BioScope Corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics 9 (Suppl 11):S9 doi:10.1186/1471-2105-9-S11-S9
Vincze, Veronika 2010: Speculation and negation annotation in natural language texts: what the case of BioScope might (not) reveal. In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing (NeSp-NLP 2010), Uppsala, Sweden, pp. 28-31.
Farkas, Richárd; Vincze, Veronika; Móra, György; Csirik, János; Szarvas, György 2010: Bizonytalanságot jelölő kifejezések és hatókörük azonosítása természetes nyelvi szövegekben: a CoNLL-2010 verseny tapasztalatai. In: Tanács, Attila; Vincze, Veronika (eds.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem, pp. 354-357.
For further information, please contact György Szarvas (szarvas AT inf.u-szeged.hu), Richárd Farkas (rfarkas AT inf.u-szeged.hu) or Veronika Vincze (vinczev AT inf.u-szeged.hu).