The project futurICT (Infocommunicational technologies and the society of the future, TÁMOP-4.2.2.C-11/1/KONV-2012-0013) aims at developing and validating novel algorithms in the field of natural language processing. Our methods primarily focus on extracting information from huge textual and speech databases, with special emphasis on texts from the World Wide Web. To keep the methods as language-independent as possible, they are trained and tested on both English and Hungarian databases.
Within the framework of the BELAMI project, in 2009 we focused on text mining problems in Ambient Assisted Living applications. In a study, the research group identified two closely related data mining problems as the most important: the syntactic and semantic analysis of transcripts of audio material (produced by speech recognition), and the automatic collection of information from web sources. In 2010, emphasis was put on developing domain adaptation models. One of the basic assumptions in machine learning is that the training and test data are drawn from the same distribution. During the training phase, the model captures the patterns and connections that are characteristic of each class, so when the trained model is applied to data from another distribution, the classifier's performance can decrease considerably. For domain adaptation problems, we developed a novel algorithm whose core idea is model transformation. The resulting machine learning algorithm was tested on synthetic databases and on an opinion mining problem. By applying easily adaptable machine learning techniques, we provided solutions to text mining problems that hinder the development of Ambient Assisted Living applications.
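The effect of a distribution mismatch between training and test data can be illustrated with a small synthetic experiment. The sketch below is not the project's model transformation algorithm; it only demonstrates the underlying problem, and every name and parameter in it (the Gaussian class means, the shift of 3.0, the midpoint-threshold classifier) is an assumption chosen for illustration.

```python
import random
from statistics import mean


def sample_domain(rng, shift, n_per_class=500):
    """Draw labelled 1-D points; `shift` moves both class means (hypothetical setup)."""
    data = []
    for label, centre in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            data.append((rng.gauss(centre + shift, 1.0), label))
    return data


def train_threshold(data):
    """Nearest-centroid classifier in 1-D: the decision threshold is the
    midpoint between the two class means."""
    m0 = mean(x for x, y in data if y == 0)
    m1 = mean(x for x, y in data if y == 1)
    return (m0 + m1) / 2.0


def accuracy(threshold, data):
    """Fraction of points whose side of the threshold matches their label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
source_train = sample_domain(rng, shift=0.0)
source_test = sample_domain(rng, shift=0.0)   # same distribution as training
target_test = sample_domain(rng, shift=3.0)   # shifted "target domain"

threshold = train_threshold(source_train)
source_acc = accuracy(threshold, source_test)
target_acc = accuracy(threshold, target_test)

print(f"in-domain accuracy:     {source_acc:.3f}")
print(f"out-of-domain accuracy: {target_acc:.3f}")
```

The classifier is trained only on source data, so its threshold sits between the source class means; once both classes move in the target domain, most points of one class fall on the wrong side. Domain adaptation methods aim to close this gap.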
The MASZEKER Project (TECH_08_A2/2-2008-0092) started off in 2009 with the aim of developing a model-based semantic search system, primarily for English and Hungarian patents and folklore texts. In the first year, the members of the consortium selected the natural language parsers (POS-tagger for English, dependency parser, NE-recognizer) to be used in the search system and adapted them to the features of the subtasks and domains. During the second year of the project, our colleagues developed the prototype of the syntactic parser for English patents, and created a morphological parser that exploits the harmonized MSD-KR coding system together with a POS-tagger built on the output of that parser. Work on a dependency parser for Hungarian was also under way. In the last phase of the project, we concentrated primarily on the semantic processing of patents: a word sense disambiguation module was constructed and a semantic lexicon was built.
The objective of the Textrend project was to develop a business and governmental decision support toolbox using trend- and text-analysis tools. Within the framework of the project, text analysis tools for English were developed and adapted for the Textrend toolbox. Integrability was achieved using the UIMA toolkit. During the last year of the project, we integrated the text processing tools developed for Hungarian into the toolbox and carried out research on automatic keyphrase assignment and topic monitoring of document sets, distance-based visualizations of tagsets, automatic keyphrase extraction, and the extraction of persons' attributes from the web. The project was completed successfully in November 2010.
The project focused on examining historical narratives pertaining to traumatic events of the Hungarian past (Trianon, World War II, the Holocaust, '56) with respect to historically changing identity construction strategies. Such examinations require the application, adaptation and development of natural language analysis and processing methods. Therefore, the objective of the project was to implement software that enables researchers to extract information and draw conclusions pertaining to group identity and inter-group relations in narrative texts.
The objective of the project was to create the framework of a unified national ontology, which contains a freely available top concept set and a domain-specific (telecommunication and information services) ontology. The network of concepts is founded on a freely available ontology infrastructure, with its own ontology management methodology, tool set, practical guidelines and the cooperative institutional system necessary for the maintenance of the framework. The developed ontology infrastructure can be utilised in many other fields of research and application in the future, because any newly developed domain concept set can easily be attached to the developed top concept set.
The main aim of the project was to implement a Hungarian-English machine translation system. The system is based on the prototypes of three applications: (1) an example sentence translator, (2) software supporting free text comprehension, and (3) a form-filler translator. The system supports the filling of official forms and the translation of business letters into English, and helps Hungarian companies appear on the international scene. The developed system is generally expected to enhance the country's international integration, to make EU development resources more accessible, to increase the competitiveness of certain economic operators on the international market, and to encourage innovation activities of state-financed organisations.
The main objective of the project was to create an independent software system capable of three things: first, linguistically processing data from the scientific literature (Medline abstracts); second, extracting information from the processed texts; and third, identifying correlations using a graph-based, analytic representation of the extracted information. The questions raised by DNA-chip technology pose a new challenge for bioinformatics. As opposed to the static information stored in DNA databases, DNA-chip experiments provide large quantities of information on dynamic changes in the expression of thousands of genes. Our major objective was to use both sets of information to extract new types of results and correlations, one of the new vistas that have opened up in the bioinformatics of genomic research.
The objective of the project was to develop a knowledge-based Hungarian semantic search engine that overcomes the shortcomings of state-of-the-art search technologies (which typically operate only on surface features) by enabling in-depth understanding of texts dealing with special subjects. The proposed technology has been implemented as part of an intelligent traumatological information system so that the National Traumatology and Emergency Institute can actively benefit from it. The practical objective of the project was, therefore, to enable traumatologists and nurses to formulate their queries in free text, and to enable the system to answer these questions in the same form on the basis of documents available in the medical protocol and case repository.
The project's main objective was to develop a syntactic analysis method supported by machine learning algorithms. A further objective was to implement the method in the form of a program prototype. The development presupposed the existence of a syntactically fully analysed textual database, i.e. a treebank (see Szeged Treebank 2.0). In addition, the consortium endeavoured to develop a technology capable of recognising and managing special tokens and named entities (e.g., proper names, dates, figures, formulae, internet and e-mail addresses).
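To make the notion of special tokens more concrete, the following is a minimal rule-based sketch that finds a few of the token types listed above. It is an illustration only and does not reproduce the consortium's technology; the pattern set, the label names and the function `find_special_tokens` are all hypothetical, and the patterns are deliberately simplistic (for instance, only ISO-style dates are covered).

```python
import re

# Hypothetical, deliberately simple patterns; order matters because the
# first alternative that matches at a given position wins.
TOKEN_PATTERNS = [
    ("EMAIL", r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    ("URL", r"https?://[^\s,;]+"),
    ("DATE", r"\d{4}-\d{2}-\d{2}"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
]

# One combined regex with a named group per token type.
SPECIAL_TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def find_special_tokens(text):
    """Return (label, matched text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in SPECIAL_TOKEN_RE.finditer(text)]


found = find_special_tokens(
    "Write to info@example.org before 2010-11-30, "
    "see http://example.org/docs or call 42 times."
)
for label, token in found:
    print(label, token)
```

Hand-written patterns like these cover only regular, closed token classes; proper names and other open classes generally call for lexical resources or trained models.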
The project set three major objectives: first, to develop an information extraction technology for economic and business news; second, to implement the technology thus developed in the form of a program prototype; and finally (as a prerequisite for the former two points), to build a syntactically and semantically annotated database (see Szeged Treebank 1.0) that covers the given domain as representatively as possible.
The main objective of the project was to develop an effective Hungarian POS tagging method and a program prototype implementing it. As a prerequisite, it was necessary to develop a suitably sized, morpho-syntactically annotated and disambiguated textual database (see Szeged Corpus 1.0 and 2.0). The corpus served as the learning database for the machine learning algorithms that constituted the core of the automatic disambiguation method.
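How an annotated corpus can serve as a learning database for disambiguation can be sketched with the simplest possible learner: a baseline that assigns each word the tag it most often carries in the training data. This is not the method developed in the project; the toy English corpus, the tag names and the functions `train_baseline` and `tag` are hypothetical and stand in for a real morpho-syntactically annotated corpus.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn, for every word form, its most frequent tag in the corpus."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
            all_tags[pos] += 1
    lexicon = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen words
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its most frequent training tag, or the fallback."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


# Hypothetical toy corpus; "bark" is ambiguous between NOUN and VERB.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("rough", "ADJ")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("they", "PRON"), ("bark", "VERB"), ("loudly", "ADV")],
]

lexicon, default_tag = train_baseline(corpus)
tagged = tag(["The", "dogs", "bark"], lexicon, default_tag)
print(tagged)
```

The baseline ignores context entirely, so it always resolves an ambiguous word form the same way. Practical taggers add contextual features, which is where a large disambiguated corpus becomes essential.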