The aim of this task is to identify sentences in texts which contain unreliable or uncertain information. In particular, the task is a binary classification problem, i.e. to distinguish factual versus uncertain sentences.
As training data
Since uncertainty cues play an important role in detecting sentences containing uncertainty, they are tagged in the training data to enhance training. They will not be given in the evaluation dataset, however, since cue tagging in submissions is not mandatory (although we encourage participants to provide it).
Additionally, unannotated (but pre-processed) paragraphs from Wikipedia are offered as well. These data do not contain any annotation for weasel cues and/or uncertainty. Using these data enables sampling from a large pool of Wikipedia articles. Since evaluation will be partly carried out on Wikipedia paragraphs, the exploitation of raw Wikipedia texts other than those offered here is PROHIBITED when training the systems.
Evaluation will be carried out on the sentence level, i.e. on whether a sentence contains hedge/weasel information or not; the F-measure of the uncertain class will be employed as the chief evaluation metric. In the submitted system outputs, we expect the certainty attribute of each sentence to be filled, and the official evaluation will be based on these attribute values. Providing ccue tags for Task1 as in the training data (i.e. linguistic evidence supporting the sentence-level decision) is NOT mandatory; however, we will evaluate them for those who submit them. The cue-level results will be used for information only: the official ranking will be based on the sentence-level F-measure of the "uncertain" class.
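The sentence-level metric can be illustrated with a minimal sketch. It assumes the balanced F-measure (F1), since the weighting is not stated here, and it assumes that the certainty attribute takes the values "certain" and "uncertain". The function name and the toy labels are hypothetical and are not part of any official scorer.

```python
def uncertain_f_measure(gold, predicted, positive="uncertain"):
    """Precision, recall and F1 for the positive ("uncertain") class.

    gold and predicted are equal-length lists of sentence-level labels.
    Assumption: balanced F-measure (beta = 1).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: five sentences, two of the three uncertain ones are found,
# and one certain sentence is wrongly flagged as uncertain.
gold = ["certain", "uncertain", "uncertain", "certain", "uncertain"]
pred = ["certain", "uncertain", "certain", "uncertain", "uncertain"]
precision, recall, f1 = uncertain_f_measure(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Only the uncertain class is scored, so correctly labelled certain sentences do not raise the score; they matter only when they are mislabelled as uncertain.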
Evaluation will be carried out
The motivation behind the cross-domain and the open challenges is to assess whether adding extra (i.e. not domain-specific) information to the systems can improve performance.
The biological evaluation set will consist of biomedical full articles (i.e. no abstracts are included in the evaluation dataset).
For the second task, in-sentence scope resolvers have to be developed. Biological scientific texts from the BioScope corpus, in which instances of speculation (that is, keywords and their scope) are annotated manually, serve as the training data. This task falls within the scope of semantic analysis of sentences exploiting syntactic patterns (hedge spans can usually be determined on the basis of syntactic patterns dependent on the keyword).
Task2 involves the annotation of "cue"+"xcope" tags in sentences. We expect the systems to add cue tags and the corresponding xcope tags, linked together by unique IDs as in the training data. Scope-level F-measure will be used as the chief metric, where true positives are scopes which match the gold standard cue words AND the gold standard scope boundaries assigned to the cue word. That is, correct scope boundaries with incorrect cue annotation AND correct cue words with bad scope boundaries will BOTH be considered as errors (see FAQ for examples).
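The exact-match criterion can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: each scope is represented as a hypothetical tuple (sentence ID, cue token span, scope token span), the spans are (start, end) token offsets, and the F-measure is assumed to be the balanced F1.

```python
def scope_f_measure(gold_scopes, predicted_scopes):
    """Scope-level precision, recall and F1 under exact matching.

    Each scope is a tuple (sentence_id, cue_span, scope_span). A predicted
    scope is a true positive only if the cue span AND the scope span both
    equal those of a gold scope in the same sentence.
    """
    gold = set(gold_scopes)
    pred = set(predicted_scopes)
    tp = len(gold & pred)

    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotation: one scope in each of three sentences.
gold_scopes = [
    ("S1", (2, 3), (2, 8)),
    ("S2", (0, 1), (0, 5)),
    ("S3", (4, 5), (4, 9)),
]
# Hypothetical system output.
predicted_scopes = [
    ("S1", (2, 3), (2, 8)),  # cue and scope both correct: true positive
    ("S2", (0, 1), (0, 4)),  # correct cue, wrong scope boundary: error
    ("S3", (3, 4), (4, 9)),  # wrong cue, correct scope boundary: error
]
scope_precision, scope_recall, scope_f1 = scope_f_measure(gold_scopes, predicted_scopes)
print(f"P={scope_precision:.3f} R={scope_recall:.3f} F={scope_f1:.3f}")
```

Under this matching, each partially correct prediction counts twice against the system: once as a false positive and once as a missed gold scope.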
Evaluation will be carried out using the same biomedical full articles we use for Task1 (but the level of analysis required for Task2 is different).