Richard FARKAS, PhD
University of Szeged
Department of Informatics
email: rfarkas SNAIL inf.u-szeged.hu
room: Irinyi building 45.
Biomedical Information Extraction
A significant part of biological and medical knowledge is stored in textual form such as publications, patents, medical records, etc. Processing the huge amount of available data requires automatic approaches. The aim of our research group is the automatic extraction of useful information from biomedical documents:
De-identification of Medical Records
The anonymization of medical records is of great importance in the human life sciences because a de-identified text can be made publicly available to non-hospital researchers as well, to facilitate research on human diseases. Our group developed a novel, machine-learning-based iterative de-identification model that can automatically remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act (HIPAA). Our system achieved outstanding accuracy (99.75%) on the standard evaluation dataset. [pdf]
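The publication describes the full iterative, machine-learning-based model; purely as an illustration of the de-identification step itself, a minimal rule-based PHI scrubber (with hypothetical patterns and placeholder tags of our own choosing, not the published system's feature set) might look like:

```python
import re

# Hypothetical pattern set: real PHI categories are broader (names,
# locations, ages over 89, etc.); each match is replaced by a category tag.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ID": re.compile(r"\bMRN\s*\d+\b"),
}

def deidentify(text: str) -> str:
    """Replace recognized PHI spans with category placeholders."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

A learning-based system like the one described above would replace these fixed patterns with a trained token classifier, iterating between labeling and retraining.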
Automatic ICD-9-CM Coding
ICD-9-CM codes are used for billing purposes by health institutes and are assigned to clinical records manually, following clinical treatment. Since this labeling task requires expert knowledge in the field of medicine, the process is costly and prone to errors, as human annotators have to consider thousands of possible codes when assigning the right ICD-9-CM labels to a document. Our machine-learning-based approach automatically extends the available hand-crafted rules of the ICD coding guide according to the labelled corpus. Automating the assignment of ICD-9-CM codes for radiology records was the subject of a shared task challenge organized by the Computational Medicine Center (CMC) in Cincinnati, Ohio in 2007. Our system won that challenge. [demo][pdf]
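The text does not detail how the hand-crafted rules are extended; as a hedged sketch of one simple possibility, new trigger terms for a code could be proposed from documents labelled with that code (the function name, data layout and threshold below are all illustrative, not the published method):

```python
from collections import Counter

def extend_rules(seed_terms, labeled_docs, code, min_count=2):
    """Propose new trigger terms for an ICD code: tokens occurring in at
    least `min_count` documents labelled with the code but missing from
    the hand-crafted seed term list.  Illustrative heuristic only."""
    counts = Counter()
    for text, codes in labeled_docs:
        if code in codes:
            counts.update(set(text.lower().split()))
    return {t for t, c in counts.items()
            if c >= min_count and t not in seed_terms}
```

The proposed terms would then be reviewed or filtered before being merged into the rule set.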
Classifying patient records according to whether the patient has a certain disease is a task similar to ICD coding. The Obesity Challenge in 2008, organized by Informatics for Integrating Biology and the Bedside (i2b2), asked participants to construct systems that could correctly replicate the textual and intuitive judgments of medical experts on obesity and its co-morbidities based on narrative patient records. An approach similar to the one used in the ICD coder was applied here, i.e. an extended dictionary-lookup-based system that also took into account the document structure and the context of disease terms for classification. To achieve this, we used statistical methods to pre-select the most common (and most confident) terms and abbreviations, and then evaluated outlier documents to discover infrequent terms and spelling variants. [demo][link]
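A minimal sketch of the dictionary-lookup idea with a crude context check (the negation cue list and window size here are assumptions for illustration, not the actual features of the system described above):

```python
NEGATION_CUES = {"no", "not", "denies", "without"}

def classify(text, disease_terms):
    """Dictionary lookup with a simple context check: a disease term
    counts as positive only if the few preceding words contain no
    negation cue."""
    words = text.lower().split()
    for i, w in enumerate(words):
        if w.strip(".,:") in disease_terms:
            window = {x.strip(".,:") for x in words[max(0, i - 3):i]}
            if not window & NEGATION_CUES:
                return "Y"
    return "N"
```

A real system would additionally use document structure (e.g. which section a term occurs in), as the paragraph above notes.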
Gene Symbol Disambiguation
The first task of an information extraction system is to recognize the entities in the text. A biomedical entity mention in articles and other free texts is often ambiguous. The task of Gene Symbol Disambiguation (GSD) is to assign a unique gene identifier to each identified gene name alias in biology-related articles. We achieved 98% accuracy on the reference GSD datasets by obtaining information from the inverse co-author graph. [pdf]
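The referenced work builds on the inverse co-author graph; a simplified sketch of the underlying intuition, with hypothetical data structures, scores each candidate gene identifier by how often the article's authors have used it in their other papers:

```python
def disambiguate(symbol, article_authors, author_profiles, candidates):
    """Pick the candidate gene ID best supported by the authors' history.
    `author_profiles` maps author -> {gene_id: usage count}, a stand-in
    for information gathered by traversing the inverse co-author graph."""
    def score(gene_id):
        return sum(author_profiles.get(a, {}).get(gene_id, 0)
                   for a in article_authors)
    return max(candidates, key=score)
```

The full method also follows links to co-authors (hence the graph), which this one-hop sketch omits.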
Web Mining
Free Text Tagging
Free text tagging is the task of assigning a few natural language phrases to a document that summarize it and semantically represent its content. The tags are useful for organizing, retrieving and linking different contents. We developed an automatic free text tagging solution for the online news archive of the Hungarian [origo] news portal. The 370 thousand articles in the news archive could not have been tagged manually, either by the community of readers or by the team of journalists. We showed that free text tagging can be carried out by an automatic system with a satisfactory accuracy of 77.5 percent.
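The published tagger is more elaborate than this; purely as an illustration of one standard ranking idea for such a task, candidate tags for a document can be scored by tf-idf against a background corpus:

```python
import math
from collections import Counter

def top_tags(doc, corpus, k=3):
    """Rank single-word tag candidates for `doc` by tf-idf, using
    `corpus` (a list of other documents) to estimate document frequency.
    Illustrative scoring only; real tags are multi-word phrases."""
    tf = Counter(doc.lower().split())
    n = len(corpus)
    def idf(term):
        df = sum(term in d.lower().split() for d in corpus)
        return math.log((n + 1) / (df + 1))
    ranked = sorted(tf.items(), key=lambda kv: kv[1] * idf(kv[0]),
                    reverse=True)
    return [t for t, _ in ranked[:k]]
```

A production system would add phrase extraction, stop-word filtering and a controlled merge with editorially curated tags.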
Social Web Mining
Scientific social network analysis seeks to discover global patterns in the network of researchers working in a particular field. Common approaches use bibliographic/scholarly data as the basis for this analysis. In the Textrend project, we explore the potential of exploiting other resources as information sources, such as the homepages of researchers. The information on homepages may be present in structured or natural text form. We focus on the detection and analysis of the full text regions of homepages, as they may contain a huge amount of information, although they require more sophisticated analysis than structured regions do. [pdf]
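The actual region detector is not described in this summary; a deliberately crude, hypothetical heuristic for separating free text regions from structured ones (contact blocks, link lists) might rely on word count and sentence punctuation alone:

```python
def is_free_text_region(block: str) -> bool:
    """Hypothetical heuristic: treat a homepage text block as 'free text'
    if it is long enough and contains several sentence-ending marks.
    Thresholds are arbitrary choices for illustration."""
    words = block.split()
    sentence_marks = sum(block.count(c) for c in ".!?")
    return len(words) >= 20 and sentence_marks >= 2
```

A real detector would use richer cues (HTML structure, link density, capitalization patterns) and a trained classifier rather than fixed thresholds.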