University of Szeged Natural Language Processing Group Hungarian Academy of Sciences

The BioScope corpus

The BioScope corpus consists of medical and biological texts annotated for negation, speculation and their linguistic scope. The annotation is intended to support the development and comparison of systems for negation/hedge detection and scope resolution. The corpus is publicly available for research purposes.

BioNLP-2008 paper on BioScope (please cite if you make use of the corpus):

Veronika Vincze, György Szarvas, Richárd Farkas, György Móra, and János Csirik: The BioScope corpus: annotation for negation, uncertainty and their scope in biomedical texts, BMC Bioinformatics 2008, 9(Suppl 11):S9

The annotation guidelines: pdf
Annotation principles are also discussed in the following paper:

Vincze, Veronika 2010: Speculation and negation annotation in natural language texts: what the case of BioScope might (not) reveal. In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing (NeSp-NLP 2010), Uppsala, Sweden, pp. 28-31.

The corpus was also used as the training data for the CoNLL-2010 Shared Task, "Learning to detect hedges and their scope in natural language text".

Corpus download

The corpus consists of texts taken from three different sources in order to ensure that it captures the heterogeneity of language use in the biomedical domain. Here is the DTD for the XML files containing the annotations: DTD

Abstracts of the Genia corpus: xml v1.1 (In version 1.1 the Genia UIDs were replaced by PMIDs)

Full scientific articles, 5 articles from FlyBase (the same articles were used by Medlock and Briscoe (2007)) and 4 articles from the open access BMC Bioinformatics repository: xml

Clinical free-texts: The radiology report corpus that was used for the CMC clinical coding challenge. Due to licensing issues, the negation/hedge-annotated version of the corpus must be assembled by downloading the original 'ICD-9-CM coding' corpus from the Cincinnati Children's Hospital site and merging it with our annotation: readme, merger software.

The full corpus and the evaluation code in one file: zip

For the evaluation of the CoNLL-2010 Shared Task, 15 more biomedical articles were annotated for hedges and their scope. They can be accessed at the shared task website after registration.

Inter-annotator agreement analysis

The BioScope corpus was annotated by two independent linguists following the guidelines written by our linguist expert before the annotation of the corpus was initiated. These guidelines were further developed throughout the annotation process, as the annotators were often confronted with problematic issues. The annotators were not allowed to communicate with each other about the annotation, but they could turn to the expert when needed. Regular meetings were also held between the annotators and the linguist expert in order to discuss recurring problematic issues. When the two annotations for one subcorpus were finalized, differences between the two were resolved by the linguist expert, yielding the gold standard labeling of the subcorpus.

We measured the consistency level of the annotation using inter-annotator agreement analysis. The inter-annotator agreement rate is defined as the F-measure of one annotation, treating the second one as the gold standard. The evaluation has two levels: first keyword F-measures are calculated, then left/right/full scope F-measures are gathered around the true positive keyword matches.
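The two-level measure described above can be illustrated with a minimal Python sketch. It is not the released evaluation code, and it rests on assumptions that are not stated on this page: each annotation is represented as a mapping from a keyword span to its scope span (token offsets), a keyword counts as a true positive only when both annotators marked exactly the same span, and scope agreement is computed only over those matched keywords. The names `f_measure` and `agreement` are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start, end) token offsets; an assumed representation


def f_measure(true_pos: int, n_first: int, n_second: int) -> float:
    """F-measure of the first annotation, treating the second as gold."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_first
    recall = true_pos / n_second
    return 2 * precision * recall / (precision + recall)


def agreement(first: Dict[Span, Span], second: Dict[Span, Span]) -> Dict[str, float]:
    """Keyword-level and scope-level agreement between two annotations.

    Each dict maps a keyword span to the scope span annotated for it.
    """
    # Level 1: keywords marked identically by both annotators.
    matched = first.keys() & second.keys()
    rates = {"keyword": f_measure(len(matched), len(first), len(second))}

    # Level 2: scope boundaries, compared only around the matched keywords.
    n = len(matched)
    left = sum(first[k][0] == second[k][0] for k in matched)
    right = sum(first[k][1] == second[k][1] for k in matched)
    full = sum(first[k] == second[k] for k in matched)
    for name, hits in (("left scope", left), ("right scope", right), ("full scope", full)):
        rates[name] = f_measure(hits, n, n)
    return rates


# Toy data, invented for illustration only.
first = {(3, 4): (3, 9), (12, 13): (10, 15), (20, 21): (20, 25)}
second = {(3, 4): (3, 9), (12, 13): (11, 15), (30, 31): (28, 33)}
rates = agreement(first, second)
```

Under these assumptions, two of the three keywords in each toy annotation match, and the matched pair differs only in one left boundary. Because every matched keyword carries exactly one scope in each annotation, scope precision and recall coincide here, so the scope F-measure reduces to the fraction of matched keywords whose boundaries agree.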

The Java code for evaluating scope annotations: zip, readme

Corpus statistics

In the table below, each cell contains three agreement rates: the first number is the agreement rate between the two annotators, whereas the second and third numbers are the agreement rates between each of the two annotators and the chief annotator:
type        | clinical records      | abstracts             | full articles
keyword     | 90.70 / 94.56 / 95.81 | 91.46 / 91.71 / 98.05 | 79.42 / 86.77 / 91.71
left scope  | 86.27 / 86.86 / 97.95 | 97.78 / 97.90 / 100   | 83.44 / 82.42 / 95.87
right scope | 88.88 / 91.26 / 97.39 | 94.56 / 95.17 / 99.42 | 84.36 / 88.19 / 95.09
full scope  | 76.29 / 79.32 / 95.35 | 92.46 / 93.07 / 99.42 | 70.86 / 73.35 / 91.21

keyword     | 84.01 / 89.86 / 92.37 | 79.12 / 83.92 / 92.05 | 77.60 / 81.49 / 90.81
left scope  | 89.36 / 88.90 / 97.60 | 87.52 / 88.37 / 97.58 | 75.49 / 80.13 / 92.15
right scope | 91.28 / 92.64 / 97.90 | 87.13 / 89.92 / 96.16 | 82.40 / 83.28 / 96.97
full scope  | 81.90 / 82.88 / 95.54 | 76.72 / 80.07 / 94.04 | 62.50 / 66.72 / 89.67