Research
MOTIVATION
The rapid growth in computer performance has widened the range of computer applications. This increasing computational capacity now permits automation in areas where tasks could previously be performed only at a high cost in human resources. One such computational challenge is the understanding and representation of natural language text and human speech.
Scientific communities communicate through scientific papers, conference proceedings and other scientific publications. International scientific papers are mostly written in English, so an enormous amount of written data is available in this language. Scientists often have difficulty finding the information they need. Human language technology offers a range of tools to help them, including search engines, ontologies and databases. In the biomedical sciences there is also a need for readily accessible, formalised information about biological processes, events and systems, but this information is mostly available only in written form, and processing the texts requires a great amount of human effort.
NLP AND IE
Computational linguistics deals with the implementation of models that describe natural languages. Besides linguistics and psycholinguistics, it also draws on artificial intelligence, computer science and informatics. Its goal is to understand, represent and reproduce human language, which gives it a role in human-machine interaction. Because the principles it applies are wide ranging, language technology itself has a wide range of applications.
The enormous amount of available and accessible data makes it impossible to handle it all with classical methods. Extracted information is more compact than plain text: it can be represented and stored more easily, and it is more usable. The goal of information extraction is to obtain data in this form. Coherent, well-structured and semantically defined data can be gathered from different sources, one of which is natural language text. Typical information extraction subtasks are the following:
- Named Entity Recognition (NER): Recognising the names of entities (such as names of people, places, organizations, technical and scientific expressions, or numeric values).
- Coreference resolution: Identification of noun phrases that refer to the same entity (e.g. anaphora).
- Expression extraction: Extraction of the relevant expressions of a given corpus.
- Relation extraction: Extraction of various relations between entities.
BIOLOGICAL INFORMATION EXTRACTION
One challenge of biological information extraction is the identification of protein names and other biological named entities in scientific papers. The state-of-the-art methods still need improving [17] [9] [8], but because of the huge amount of data it is possible to extract a lot of useful information.
Protein-protein interaction (PPI) extraction is a kind of relation extraction. It extracts data about interacting protein pairs that were identified via biological NER or other methods. The type of the relation is not always determined, so the results of two different systems are not always comparable.
[9] Tomohiro Mitsumori, Sevrani Fation, Masaki Murata, Kouichi Doi, and Hirohumi Doi. Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinformatics, 6(Suppl 1):S8, 2005.
[17] Yoshimasa Tsuruoka and Jun’ichi Tsujii. Improving the performance of dictionary-based approaches in protein name recognition. Journal of Biomedical Informatics, 37(6):461–470, 2004. Named Entity Recognition in Biomedicine.
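To make the relationship between biological NER and PPI extraction concrete, the following is a minimal sketch of a sentence-level co-occurrence baseline. It is not the method of any system cited here. The protein dictionary, the function names and the example text are all hypothetical, and a real system would replace the dictionary lookup with a trained recogniser.

```python
import re
from itertools import combinations

# Hypothetical mini-dictionary of protein names (an assumption for this sketch);
# a real system would use a large lexicon or a trained NER model instead.
PROTEIN_DICTIONARY = {"RAD51", "BRCA2", "p53", "MDM2"}


def find_proteins(sentence):
    """Return dictionary entries that occur as whole tokens, in order of first appearance."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    found = []
    for token in tokens:
        if token in PROTEIN_DICTIONARY and token not in found:
            found.append(token)
    return found


def candidate_pairs(text):
    """Return (protein_a, protein_b, sentence) for proteins that share a sentence."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for sentence in sentences:
        for a, b in combinations(find_proteins(sentence), 2):
            pairs.append((a, b, sentence))
    return pairs


text = (
    "BRCA2 binds RAD51 directly. "
    "MDM2 is expressed in many tissues. "
    "p53 is degraded after MDM2 binding."
)
for a, b, sentence in candidate_pairs(text):
    print(a, b)
```

Co-occurrence only proposes candidate pairs and says nothing about the type of the relation, which is one reason why the outputs of different systems can be hard to compare.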
EXTRACTING BIOLOGICAL EVENTS
Biological events are life science facts or statements that describe a genetic, molecular or cell biological process. They include, among others, protein-protein interactions, the transcription and expression of genes, and the processes of various cell lines, microorganisms and viruses.
Extracting these events is important because they carry information that scientists in the life sciences need. There are databases available that contain biological events and their parameters, but these need to be crafted by human experts. Automating this process has great business and research potential. Current systems mostly use some kind of statistical machine learning method, combining vector space model-based features with attributes extracted via language parsers [6] [12]. Others are based on language patterns [5] or employ rule-based systems [4].
[5] Yu Hao, Xiaoyan Zhu, Minlie Huang, and Ming Li. Discovering patterns to extract protein-protein interactions from the literature: part II. Bioinformatics, 21(15):3294–3300, 2005.
[6] Hyunchul Jang, Jaesoo Lim, Joon-Ho Lim, Soo-Jun Park, Kyu-Chul Lee, and Seon-Hee Park. Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics, 22(14):e220–e226, 2006.
[12] Rune Sætre, Kenji Sagae, and Jun’ichi Tsujii. Syntactic features for protein-protein interaction extraction. In Proceedings of LBM’07, volume 319, 2008.
BIONLP 2009 SHARED TASK ON EVENT EXTRACTION
We participated in this event extraction competition. Our system combined the classical machine learning approach with expert rules.
See the publications or the PDF version of the article.
BIOSCOPE CORPUS
The BioScope corpus contains medical reports, biological abstracts and free texts in which negation and speculation terms are annotated along with their linguistic scope. I provided the technical background for building the corpus (converters, tools, evaluation methods, XML format).
BioScope Corpus Page: http://www.inf.u-szeged.hu/rgai/bioscope
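As an illustration of what a cue-plus-scope annotation can look like in XML, here is a minimal sketch that reads a single annotated sentence. The element and attribute names (`sentence`, `xcope`, `cue`, `type`, `ref`), the sample sentence and the function name are assumptions chosen for this example. Consult the corpus page for the actual format before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: <xcope> wraps the scope and <cue> marks the
# negation or speculation term. The sentence itself is invented for this sketch.
SAMPLE = (
    '<sentence id="S1">The scan shows '
    '<xcope id="X1"><cue type="negation" ref="X1">no</cue> sign of infection</xcope>.'
    "</sentence>"
)


def extract_scopes(sentence_xml):
    """Return one record per cue: its type, its text and the text of its scope."""
    root = ET.fromstring(sentence_xml)
    results = []
    for scope in root.iter("xcope"):
        # itertext() concatenates all text inside the scope, including the cue.
        scope_text = "".join(scope.itertext())
        for cue in scope.findall("cue"):
            results.append(
                {"type": cue.get("type"), "cue": cue.text, "scope": scope_text}
            )
    return results


for record in extract_scopes(SAMPLE):
    print(record["type"], "|", record["cue"], "|", record["scope"])
```

Keeping the cue inside its scope element means that a converter or evaluation script can recover both from one pass over the tree.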

