BioNLP, ML and other things György Móra

Publications

Gene name detection

A statistical and thesaurus-based hybrid Named Entity recognition system,
Richárd Farkas, György Móra
Proceedings of the First CALBC Workshop

We participated in the Named Entity Recognition task of the CALBC Challenge. Our system combines a Conditional Random Fields Named Entity Recogniser with the UMLS expert-crafted database. Our chief hypothesis is that biological information in taxonomies and other sorts of knowledge bases is not fully exploited in state-of-the-art Entity Recognisers. For instance, there are more than 1.7 million UMLS concepts under the CALBC group PRGE. Certainly, simple string matching is far from perfect; the context – the semantics and syntax – of the (possible) occurrences should be taken into account as well. In this preliminary work we combined machine learning approaches with knowledge bases in a simple way; we utilised matched UMLS concepts as features for a statistical NER system.
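To illustrate the idea of feeding dictionary matches to a statistical tagger, here is a minimal sketch. It is not the system from the paper: the tiny `lexicon`, the greedy longest-match strategy, and the function names `dictionary_features` and `token_features` are all assumptions made for this example. A real setup would load terms from UMLS and pass the resulting feature dictionaries to a CRF toolkit.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, ...], str]


def dictionary_features(tokens: List[str], lexicon: Lexicon, max_len: int = 5) -> List[str]:
    """Label each token with a B-/I- dictionary tag using greedy longest match."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            group = lexicon.get(tuple(lowered[i:i + n]))
            if group is not None:
                labels[i] = f"B-{group}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{group}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


def token_features(tokens: List[str], lexicon: Lexicon) -> List[dict]:
    """Combine simple surface features with the dictionary tag for each token."""
    dict_labels = dictionary_features(tokens, lexicon)
    return [
        {"word": tok.lower(), "is_capitalised": tok[:1].isupper(), "dict": lab}
        for tok, lab in zip(tokens, dict_labels)
    ]


# Hypothetical two-entry lexicon standing in for a UMLS-derived term list.
lexicon = {("tumor", "necrosis", "factor"): "PRGE", ("p53",): "PRGE"}
tokens = ["Tumor", "necrosis", "factor", "activates", "p53"]
for feats in token_features(tokens, lexicon):
    print(feats)
```

The dictionary tag is only one feature among others, so the statistical model can still learn from context when a string match is misleading.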

PDF

Assessment of NER solutions against the first and second CALBC Silver Standard Corpus,
Dietrich Rebholz-Schuhmann, Antonio Jimeno Yepes, Chen Li, Senay Kafkas, Ian Lewin, Ning Kang, Peter Corbett, David Milward, Ekaterina Buyko, Elena Beisswanger, Kerstin Hornbostel, Alexandre Kouznetsov, René Witte, Jonas B. Laurila, Christopher JO Baker, Cheng-Ju Kuo, Simone Clematide, Fabio Rinaldi, Richárd Farkas, György Móra, Kazuo Hara, Laura Furlong, Michael Rautschka, Mariana Lara Neves, Alberto Pascual-Montano, Qi Wei, Nigel Collier, Md. Faisal Mahbub Chowdhury, Alberto Lavelli, Rafael Berlanga, Roser Morante, Vincent Van Asch, Walter Daelemans, José Luís Marina, Erik van Mulligen, Jan Kors, Udo Hahn
Fourth International Symposium on Semantic Mining in Biomedicine (SMBM 2010)

Abstract

Background: Text mining challenges have been organised to measure the performance of automatic text mining solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly and the final corpus consists at most of a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups were chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge asking the participants to annotate the corpus with their annotation solutions.

Results: All four PPs from the CALBC project and in addition, 12 challenge participants (CPs) contributed annotated data sets for evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their annotation system, or could train a machine-learning approach on the provided preannotated data. In general, the performances of the annotation solutions were lower for the CHED and PRGE in comparison to the identification of DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that have been trained on the SSC-I. The data sets from participants were used to generate the harmonised Silver Standard Corpus II (SSC-II), if the participant did not make use of the annotated data set from the SSC-I for training purposes. The performances of the participants’ solutions were again measured against the SSC-II. The performances of the annotation solutions showed again better results for DISO and SPE in comparison to CHED and PRGE.

Conclusions: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs’ annotation solutions in comparison to the SSC-I.
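For readers unfamiliar with the F-measure reported above, the following sketch computes precision, recall and F1 for a set of predicted annotations against a reference set. It assumes exact matching on `(start, end, semantic_group)` triples; the matching criteria actually used in the CALBC evaluation may differ, and the function name `precision_recall_f1` and the sample spans are hypothetical.

```python
from typing import Iterable, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, semantic group)


def precision_recall_f1(reference: Iterable[Annotation],
                        predicted: Iterable[Annotation]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two annotation sets."""
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration only.
reference = [(0, 5, "PRGE"), (10, 15, "DISO"), (20, 25, "SPE")]
predicted = [(0, 5, "PRGE"), (10, 15, "CHED")]
print(precision_recall_f1(reference, predicted))
```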

PDF

Gene normalization and concept identification

Concept identification by machine learning aided dictionary-based named entity recognition and rule-based entity normalisation
György Móra
Second CALBC Workshop, Cambridge, UK

Species taxonomy for gene normalization,
György Móra and Richárd Farkas
Fourth International Symposium on Semantic Mining in Biomedicine (SMBM 2010)

Abstract

Background: The task of gene normalization is to assign a unique identifier from a database to the gene mentions. Using these identifiers a great deal of information can be gathered from external databases such as interactions, pathways, sequences and protein structures. Normalizing gene mentions in articles is a difficult task as the inter-species ambiguity of the gene mentions in biomedical publications is high. The experiences gained from the BioCreative II Gene Normalization Task indicate that the biggest challenge in gene normalization is the recognition of the species that a specific gene mention belongs to. In biomedical scientific articles the authors often use taxonomical entities besides concrete species mentions as references to different groups of organisms. Species taxonomies are hierarchical systems (trees) of living creatures and therefore provide a classification of species. Here we investigate the added value of the utilization of taxonomic entity mentions in the inter-species gene normalization task.

Results: We present a method which marks those words mentioning all taxonomic entities (genus, family, etc.) and applies filtering heuristics to select the taxonomic entities referring to species mentioned in the document. These entities are then treated as species mentions together with standard species annotations and we employ them in gene normalization.
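The abstract does not spell out the filtering heuristics, so the sketch below shows just one plausible heuristic, chosen for illustration: a higher-level taxon mention (a genus or family) is kept only if some species already mentioned in the document lies beneath it in the taxonomy. The `parent` map is a toy fragment rather than a real taxonomy database, and `ancestors` and `resolve_taxon_mentions` are hypothetical names.

```python
from typing import Dict, List


def ancestors(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return all ancestors of a taxon, nearest first (assumes an acyclic tree)."""
    chain = []
    while taxon in parent:
        taxon = parent[taxon]
        chain.append(taxon)
    return chain


def resolve_taxon_mentions(taxon_mentions: List[str],
                           species_mentions: List[str],
                           parent: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each taxon mention to the mentioned species it covers; drop the rest."""
    resolved = {}
    for taxon in taxon_mentions:
        covered = [s for s in species_mentions if taxon in ancestors(s, parent)]
        if covered:
            resolved[taxon] = covered
    return resolved


# Toy child -> parent taxonomy fragment (illustrative only).
parent = {
    "Mus musculus": "Mus", "Mus": "Muridae",
    "Rattus norvegicus": "Rattus", "Rattus": "Muridae",
    "Homo sapiens": "Homo", "Homo": "Hominidae",
}
print(resolve_taxon_mentions(["Mus", "Muridae", "Drosophila"],
                             ["Mus musculus", "Homo sapiens"], parent))
```

In this toy run, "Mus" and "Muridae" both resolve to the mentioned species "Mus musculus", while "Drosophila" is discarded because no mentioned species falls under it.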

Conclusion: After experiments were carried out on the BioCreative III Gene Normalization Task's data-set to investigate the contribution of the additional species mentions to the gene disambiguation task, we found that our approach improves the performance of the inter-species gene mention disambiguator, both in terms of precision and recall.

PDF

Biological information extraction

Exploring ways beyond the simple supervised learning approach for biological event extraction,
György Móra, Richárd Farkas, György Szarvas and Zsolt Molnár,
NAACL HLT 2009 BioNLP'09 Workshop

Abstract

Our paper presents the comparison of a machine-learnt and a manually constructed expert-rule-based biological event extraction system and some preliminary experiments to apply a negation and speculation detection system to further classify the extracted events. We report results on the BioNLP’09 Shared Task on Event Extraction evaluation datasets, and also on an external dataset for negation and speculation detection.

PDF

Speculation and negation

Cross-Genre and Cross-Domain Detection of Semantic Uncertainty,
György Szarvas, Veronika Vincze, Richárd Farkas, György Móra and Iryna Gurevych
Computational Linguistics (In press)

Abstract

Uncertainty is an important linguistic phenomenon that is relevant in various Natural Language Processing applications, in diverse genres from medical to community generated, newswire or scientific discourse and domains from science to humanities. The semantic uncertainty of a proposition can be identified in most cases by using a finite dictionary — i.e. lexical cues — and the key steps of uncertainty detection in an application include the steps of locating the (genre- and domain-specific) lexical cues, disambiguating them, and linking them with the units of interest for the particular application (e.g. identified events in information extraction).  In this study, we focus on the genre and domain differences of the context-dependent semantic uncertainty cue recognition task.

We introduce a unified subcategorization of semantic uncertainty as different domain applications can apply different uncertainty categories. Based on this categorization, we normalized the annotation of three corpora and present results with a state-of-the-art uncertainty cue recognition model for four fine-grained categories of semantic uncertainty.

Our results reveal the domain and genre dependence of the problem; nevertheless, we also show that even a distant source domain dataset can contribute to the recognition and disambiguation of uncertainty cues, efficiently reducing the annotation costs needed to cover a new domain. Thus, the unified subcategorization and domain adaptation for training the models offer an efficient solution for cross-domain and cross-genre semantic uncertainty recognition.

PDF

Linguistic scope-based and biological event-based speculation and negation annotations in the Genia Event and BioScope corpora,
Veronika Vincze, György Szarvas, György Móra, Tomoko Ohta and Richárd Farkas
Fourth International Symposium on Semantic Mining in Biomedicine (SMBM 2010)

Abstract

Background: The treatment of negation and hedging in natural language processing has received much interest recently, especially in the biomedical domain. However, open access corpora annotated for negation and/or speculation are hardly available for training and testing applications, and even if they are, they sometimes follow different design principles. In this paper, the annotation principles of the two largest corpora containing annotation for negation and speculation – BioScope and Genia Event – are compared. BioScope marks linguistic cues and their scopes for negation and speculation while in Genia biological events are marked for uncertainty and/or negation.

Results: Differences among the annotations of the two corpora are thematically categorized and the frequency of each category is estimated. We found that the largest number of differences arises because scopes – which cover text spans – deal with the key events and each argument (including events within events) of these events is under the scope as well. In contrast, Genia deals with the modality of events within events independently.

Conclusions: We think that the useful information for the biologist can be acquired from the key events, thus if we aim to detect “new knowledge”, an automatic scope-detector trained on BioScope can contribute to biomedical information extraction. However, for detecting the negation and speculation status of events (within events) syntax-based rules investigating the dependency path between the modality cue and the event cue may be employed.
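As a rough illustration of what "investigating the dependency path" between a modality cue and an event cue could involve, the sketch below finds the path between two tokens in a dependency tree through their lowest common ancestor. The tree encoding (a `head` map from token index to head index, with `None` for the root), the example sentence and its parse, and the function names are assumptions for this example; the paper does not prescribe this representation.

```python
from typing import Dict, List, Optional


def path_to_root(node: int, head: Dict[int, Optional[int]]) -> List[int]:
    """List the nodes from `node` up to the root, inclusive."""
    path = [node]
    while head[node] is not None:
        node = head[node]
        path.append(node)
    return path


def dependency_path(a: int, b: int, head: Dict[int, Optional[int]]) -> List[int]:
    """Return the token indices on the tree path from a to b."""
    up_a = path_to_root(a, head)
    up_b = path_to_root(b, head)
    on_b = set(up_b)
    # The first node on a's upward path that also lies on b's is the common ancestor.
    common = next(n for n in up_a if n in on_b)
    return up_a[:up_a.index(common) + 1] + list(reversed(up_b[:up_b.index(common)]))


# Hand-built parse of: "results suggest that X activates Y" (illustrative only).
tokens = ["results", "suggest", "that", "X", "activates", "Y"]
head = {0: 1, 1: None, 2: 4, 3: 4, 4: 1, 5: 4}

cue, event_trigger = 1, 4  # "suggest" as speculation cue, "activates" as event cue
print([tokens[i] for i in dependency_path(cue, event_trigger, head)])
```

A rule could then inspect the length of the path or the relations along it to decide whether the event falls under the cue.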

PDF

The CoNLL 2010 Shared Task: Learning to Detect Hedges and their Scope in Natural Language Texts,
Richárd Farkas, Veronika Vincze, György Móra, János Csirik, György Szarvas
Fourteenth Conference on Computational Natural Language Learning (CoNLL 2010)

Abstract

The CoNLL-2010 Shared Task was dedicated to the detection of uncertainty cues and their linguistic scope in natural language texts. The motivation behind this task was that distinguishing factual and uncertain information in texts is of essential importance in information extraction. This paper provides a general overview of the shared task, including the annotation protocols of the training and evaluation datasets, the exact task definitions, the evaluation metrics employed and the overall results. The paper concludes with an analysis of the prominent approaches and an overview of the systems submitted to the shared task.

PDF Shared Task Page

The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes,
Veronika Vincze, György Szarvas, Richárd Farkas, György Móra and János Csirik,
BMC Bioinformatics, 2008

Background: Detecting uncertain and negative assertions is essential in most BioMedical Text Mining tasks where, in general, the aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus).

Results: The corpus consists of three parts, namely medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief linguist – also responsible for setting up the annotation guidelines – who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences that were considered for annotation and over 10% of them actually contain one (or more) linguistic annotation suggesting negation or uncertainty.

Conclusion: Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is accessible for academic purposes and is free of charge. Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.

PDF

Dependency

Hungarian Dependency Treebank.
Veronika Vincze, Dóra Szauter, Attila Almási, György Móra, Zoltán Alexin, János Csirik
Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.

Abstract: Herein, we present the process of developing the first Hungarian Dependency TreeBank. First, short references are made to dependency grammars we considered important in the development of our Treebank. Second, mention is made of existing dependency corpora for other languages. Third, we present the steps of converting the Szeged Treebank into dependency-tree format: from the originally phrase-structured treebank, we produced dependency trees by automatic conversion, checked and corrected them thereby creating the first manually annotated dependency corpus for Hungarian. We also go into detail about the two major sets of problems, i.e. coordination and predicative nouns and adjectives. Fourth, we give statistics on the treebank: by now, we have completed the annotation of business news, newspaper articles, legal texts and texts in informatics, at the same time, we are planning to convert the entire corpus into dependency tree format. Finally, we give some hints on the applicability of the system: the present database may be utilized – among others – in information extraction and machine translation as well.

PDF