Publications of Róbert Busa-Fekete


2016


1.) Consistency of Probabilistic Classifier Trees

Dembczynski, K., Kotlowski, W., Waegeman, W., Busa-Fekete, R., Hüllermeier, E.
Proceedings of ECML/PKDD, accepted, 2016
Abstract: Label tree classifiers are commonly used for efficient multi-class and multi-label classification. They represent a predictive model in the form of a tree-like hierarchy of (internal) classifiers, each of which is trained on a simpler (often binary) subproblem, and predictions are made by (greedily) following these classifiers' decisions from the root to a leaf of the tree. Unfortunately, this approach does not normally assure consistency for different losses on the original prediction task, even if the internal classifiers are consistent for their subtask. In this paper, we thoroughly analyze a class of methods referred to as probabilistic classifier trees (PCT). Thanks to training probabilistic classifiers at internal nodes of the hierarchy, these methods allow for searching the tree-structure in a more sophisticated manner, thereby producing predictions of a less greedy nature. Our main result is a regret bound for 0/1 loss, which can easily be extended to ranking-based losses. In this regard, PCT nicely complements a related approach called filter trees (FT), and can indeed be seen as a natural alternative thereof. We compare the two approaches both theoretically and empirically.
[link] [pdf]


2.) Determining Native Language and Deception Using Phonetic Features and Classifier Combination

Gosztolya, G., Grósz, T., Busa-Fekete, R., Tóth, L.
Proceedings of INTERSPEECH, accepted, 2016
Abstract: For several years, the Interspeech ComParE Challenge has focused on paralinguistic tasks of various kinds. This paper deals with the Native Language and the Deception sub-challenges of ComParE 2016, where the goal is to identify the native language of the speaker, and to recognize deceptive speech. As both tasks can be treated as classification ones, we experiment with several state-of-the-art machine learning methods (Support Vector Machines, AdaBoost.MH and Deep Neural Networks), and also test a simple-yet-robust combination method. Furthermore, we assume that the native language of the speaker affects the pronunciation of specific phonemes in the language he is currently using. To exploit this, we extract phonetic features for the Native Language task. For the Deception Sub-Challenge, however, we compensate for the highly unbalanced class distribution by instance sampling. By these techniques we are able to significantly outperform the baseline SVM on the unpublished test set.
[link] [pdf]


3.) Extreme F-measure Maximization using Sparse Probability Estimates

Jasinska, K., Dembczyński, K., Busa-Fekete, R., Pfannschmidt, K., Klerx, T., Hüllermeier, E.
Proceedings of International Conference on Machine Learning (ICML), JMLR, W & CP, accepted, 2016
Abstract: We consider the problem of (macro) F-measure maximization in the context of extreme multi-label classification (XMLC), i.e., multi-label classification with extremely large label spaces. We investigate several approaches based on recent results on the maximization of complex performance measures in binary classification. According to these results, the F-measure can be maximized by properly thresholding conditional class probability estimates. We show that a naïve adaptation of this approach can be very costly for XMLC and propose multi-label classifiers that efficiently deliver sparse probability estimates (SPEs), that is, probability estimates restricted to the most probable labels. Empirical results provide evidence for the strong practical performance of this approach.
[link] [pdf]
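
A minimal Python sketch of the thresholding idea summarized in the abstract (illustrative only, not the authors' code): it assumes each instance already comes with conditional probability estimates for just its most probable labels, applies a per-label threshold, and evaluates the resulting predictions with the macro F1 score.

from collections import defaultdict

def predict_from_spe(sparse_probs, thresholds):
    """sparse_probs: one dict {label: P(label|x)} per instance, holding only top labels."""
    return [{y for y, p in probs.items() if p >= thresholds.get(y, 0.5)}
            for probs in sparse_probs]

def macro_f1(y_true, y_pred, labels):
    """y_true, y_pred: lists of label sets; labels: all label ids."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        for y in p - t: fp[y] += 1
        for y in t - p: fn[y] += 1
        for y in t & p: tp[y] += 1
    def f1(y):
        denom = 2 * tp[y] + fp[y] + fn[y]
        return 2 * tp[y] / denom if denom else 0.0
    return sum(f1(y) for y in labels) / len(labels)

# Toy usage with hypothetical sparse estimates over three labels.
probs = [{0: 0.9, 2: 0.4}, {1: 0.7}, {0: 0.2, 1: 0.6, 2: 0.8}]
truth = [{0}, {1, 2}, {1, 2}]
print(macro_f1(truth, predict_from_spe(probs, {0: 0.5, 1: 0.5, 2: 0.5}), [0, 1, 2]))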


2015


4.) Extreme F-measure Maximization

Jasinska, K., Pfannschmidt, K., Busa-Fekete, R., Dembczyński, K.
Proceedings of NIPS Workshop on Multi-class and Multi-label Learning in Extremely Large Label Spaces, 2015
Abstract: We consider a problem of macro F-measure maximization in extreme multi-label classification (XMLC), i.e., multi-label classification with large label spaces. We investigate several approaches that are based on recent results on maximization of complex performance measures in binary classification. According to these results, the F-measure can be maximized by a proper tuning of a threshold defined on class probability estimates. We show that a naive adaptation of this approach can be very costly for XMLC. To reduce the complexity we solve this problem by using multi-label classifiers that efficiently deliver sparse probability estimates (SPE), i.e., the top probability estimates or the probability estimates that exceed given thresholds. We provide preliminary empirical results showing that the introduced approaches work well in practice.
[link]


5.) Online F-Measure Optimization

Busa-Fekete, R., Szörényi, B., Dembczyński, K., Hüllermeier, E.
Proceedings of Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 595--603, 2015
Abstract: The F-measure is an important and commonly used performance metric for binary prediction tasks. By combining precision and recall into a single score, it avoids disadvantages of simple metrics like the error rate, especially in cases of imbalanced class distributions. The problem of optimizing the F-measure, that is, of developing learning algorithms that perform optimally in the sense of this measure, has recently been tackled by several authors. In this paper, we study the problem of F-measure maximization in the setting of online learning. We propose an efficient online algorithm and provide a formal analysis of its convergence properties. Moreover, first experimental results are presented, showing that our method performs well in practice.
[link] [pdf]
[Supplementary material]
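
The following is a small Python sketch of the flavour of online F-measure optimization discussed above; it is my paraphrase of the thresholding idea (the F-measure-optimal threshold equals F*/2), not necessarily the exact algorithm analyzed in the paper. It assumes a probabilistic classifier supplies an estimate of P(y=1|x) for each arriving instance.

import random

def online_f1(stream, eta):
    """stream: iterable of (x, y) pairs; eta(x): estimate of P(y=1 | x)."""
    a = b = 0          # a: true positives, b: predicted + actual positives
    tau = 0.0          # current threshold, kept at a/b = F1/2
    for x, y in stream:
        y_hat = 1 if eta(x) >= tau else 0
        a += y * y_hat
        b += y + y_hat
        tau = a / b if b else 0.0
    return 2 * a / b if b else 0.0   # online F1 of the produced predictions

# Toy usage: the "model" outputs the true conditional probability.
random.seed(0)
stream = []
for _ in range(5000):
    p = random.random()
    stream.append((p, 1 if random.random() < p else 0))
print(online_f1(stream, lambda x: x))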


6.) Online Rank Elicitation for Plackett-Luce: A Dueling Bandits Approach

Szörényi, B., Busa-Fekete, R., Paul, A., Hüllermeier, E.
Proceedings of Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 604--612, 2015
Abstract: We study the problem of online rank elicitation, assuming that rankings of a set of alternatives obey the Plackett-Luce distribution. Following the setting of the dueling bandits problem, the learner is allowed to query pairwise comparisons between alternatives, i.e., to sample pairwise marginals of the distribution in an online fashion. Using this information, the learner seeks to reliably predict the most probable ranking (or top-alternative). Our approach is based on constructing a surrogate probability distribution over rankings based on a sorting procedure, for which the pairwise marginals provably coincide with the marginals of the Plackett-Luce distribution. In addition to a formal performance and complexity analysis, we present first experimental studies.
[link] [pdf]
[Supplementary material]
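
As a toy illustration of the setting above (not the authors' algorithm), the Python sketch below simulates Plackett-Luce pairwise duels, in which alternative i beats j with probability v_i/(v_i + v_j), answers each comparison by a majority vote over a fixed number of duels, and feeds these noisy answers to an ordinary sort; the fixed per-pair budget is an assumption made here for simplicity.

import random
from functools import cmp_to_key

def duel(i, j, v, rng):
    return rng.random() < v[i] / (v[i] + v[j])     # True iff i beats j

def noisy_cmp(i, j, v, rng, m=31):
    wins_i = sum(duel(i, j, v, rng) for _ in range(m))
    return -1 if wins_i > m / 2 else 1             # majority vote decides the order

def rank_by_duels(v, m=31, seed=0):
    rng = random.Random(seed)
    return sorted(range(len(v)),
                  key=cmp_to_key(lambda i, j: noisy_cmp(i, j, v, rng, m)))

# Toy usage: four alternatives with Plackett-Luce parameters v; the true order is 2, 0, 3, 1.
print(rank_by_duels(v=[2.0, 0.5, 4.0, 1.0]))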


7.) Assessing the Degree of Nativeness and Parkinson's Condition Using Gaussian Processes and Deep Rectifier Neural Networks

Grósz, T., Busa-Fekete, R., Gosztolya, G., Tóth, L.
Proceedings of INTERSPEECH, 919--923, 2015
Abstract: The Interspeech 2015 Computational Paralinguistics Challenge includes two regression learning tasks, namely the Parkinson's Condition Sub-Challenge and the Degree of Nativeness Sub-Challenge. We evaluated two state-of-the-art machine learning methods on the tasks, namely Deep Neural Networks (DNNs) and Gaussian Process Regression (GPR). We also experimented with various classifier combination and feature selection methods. For the Degree of Nativeness sub-challenge we obtained a far better Spearman correlation value than the one presented in the baseline paper. As regards the Parkinson's Condition Sub-Challenge, we showed that both DNNs and GPR are competitive with the baseline SVM, and that the results can be improved further by combining the classifiers. However, we obtained by far the best results when we applied a speaker clustering method to identify the files that belong to the same speaker.
[link] [pdf]
[Supplementary material]


8.) Qualitative Multi-Armed Bandits: A Quantile-Based Approach

Szörényi, B., Busa-Fekete, R., Weng, P., Hüllermeier, E.
Proceedings of International Conference on Machine Learning (ICML), JMLR, W & CP Vol. 37, 1660--1668, 2015
Abstract: We formalize and study the multi-armed bandit (MAB) problem in a generalized stochastic setting, in which rewards are not assumed to be numerical. Instead, rewards are measured on a qualitative scale that allows for comparison but invalidates arithmetic operations such as averaging. Correspondingly, instead of characterizing an arm in terms of the mean of the underlying distribution, we opt for using a quantile of that distribution as a representative value. We address the problem of quantile-based online learning both for the case of a finite (pure exploration) and infinite time horizon (cumulative regret minimization). For both cases, we propose suitable algorithms and analyze their properties. These properties are also illustrated by means of first experimental studies.
[link] [pdf]
[Supplementary material]
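
Below is a rough, UCB-flavoured Python sketch of the quantile-based idea described above; the particular optimism rule (inflating the target quantile by a confidence term) is illustrative and not the index analyzed in the paper. Rewards only need to be order-comparable.

import math, random

def empirical_quantile(samples, tau):
    s = sorted(samples)
    return s[min(len(s) - 1, max(0, math.ceil(tau * len(s)) - 1))]

def quantile_bandit(arms, tau=0.5, horizon=2000, seed=0):
    rng = random.Random(seed)
    history = [[arm(rng)] for arm in arms]          # pull every arm once
    for t in range(len(arms), horizon):
        def index(k):
            bonus = math.sqrt(math.log(t + 1) / (2 * len(history[k])))
            return empirical_quantile(history[k], min(1.0, tau + bonus))  # optimism
        k = max(range(len(arms)), key=index)
        history[k].append(arms[k](rng))
    return [len(h) for h in history]                # pull counts per arm

# Toy usage: the second arm has the higher mean, but the first has the higher median,
# so a median (tau = 0.5) criterion should favour the first arm.
arms = [lambda rng: rng.choice([0, 1, 1]),
        lambda rng: rng.choice([0, 0, 3])]
print(quantile_bandit(arms, tau=0.5))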


2014


9.) A Survey of Preference-based Online Learning with Bandit Algorithms

Busa-Fekete, R., Hüllermeier, E.
Proceedings of Algorithmic Learning Theory (ALT) Vol. 8776, 18--39, 2014
Abstract: In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are difficult to provide---instead, only weaker information such as qualitative comparisons between pairs of alternatives is available. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, which we refer to as preference-based multi-armed bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our systematization is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.
[link] [pdf]


10.) Small Peptide inhibitors of beta-amyloid toxicity

Fülöp, L., Penke, B., Zarándi, M., Bozsó, Zs., Virók, D., Janáky, T., Verdier, Y., Datki, Zs., Szegedi, V., Busa-Fekete, R., Soós, K., Kasza, Á., Kocsor, A., Borbély, E.
Patent HU: P1300317/WO2014184596 A3, 2014
Abstract: The invention relates to peptides and peptidomimetics useful for reducing the neurotoxicity of amyloid peptide aggregates or prion-like (prionoid) protein aggregates, the invention further relates to pharmaceutical compositions containing said peptides and/or peptidomimetics, to the use of said peptides and/or peptidomimetics in the manufacture of medicines useful for the prophylaxis or treatment of diseases that can be cured by protection against the detrimental effect of β-peptides / abnormal β aggregation leading to amyloid plaque formation, and it relates to the use of said peptides and/or peptidomimetics in the prophylaxis or treatment of the above mentioned diseases. The invention aimed at developing treatments of neurodegenerative diseases, particularly Alzheimer's disease (AD), during which accumulation of misfolded and/or aggregated proteins occurs.
[link]


11.) Detecting the Intensity of Cognitive and Physical Load Using AdaBoost and Deep Rectifier Neural Networks

Gosztolya, G., Grósz, T., Busa-Fekete, R., Tóth, L.
Proceedings of INTERSPEECH, 452--456, 2014
Abstract: The Interspeech ComParE 2014 Challenge consists of two machine learning tasks, which have quite a small number of examples. Due to our good results in ComParE 2013, we considered AdaBoost a suitable machine learning meta-algorithm for these tasks; in addition, we also experimented with Deep Rectifier Neural Networks. These differ from traditional neural networks in that the former have several hidden layers, and use rectifier neurons as hidden units. With AdaBoost we achieved competitive results, whereas with the neural networks we were able to outperform baseline SVM scores in both Sub-Challenges.
[link] [pdf]


12.) Genome-wide analysis captures the determinants of the antibiotic cross-resistance interaction network

Lázár, V., Nagy, I., Spohn, R., Csörgő, B., Györkei, Á., Nyerges, Á., Horváth, B., Vörös, A., Busa-Fekete, R., Hrtyan, M., Bogos, B., Méhi, O., Fekete, G., Szappanos, B., Kégl, B., Papp, B., Pál, Cs.
Nature Communications 5:4352, doi:10.1038/ncomms5352
Abstract: Understanding how evolution of antimicrobial resistance increases resistance to other drugs is a challenge of profound importance. By combining experimental evolution and genome sequencing of 63 laboratory evolved lines, we charted a map of cross-resistance interactions between antibiotics in Escherichia coli, and explored the driving evolutionary principles. We demonstrate that 1) convergent molecular evolution is prevalent across antibiotic treatments, 2) resistance-conferring mutations simultaneously enhance sensitivity to many other drugs, and 3) 27% of the accumulated mutations generate proteins with compromised activities, suggesting that antibiotic adaptation can partly be achieved without gain of novel function. By using knowledge on antibiotic properties, we examined the determinants of cross-resistance and identified chemogenomic profile similarity between antibiotics as the strongest predictor. In contrast, cross-resistance between two antibiotics is independent of whether they show synergistic effects in combination. These results have important implications for the development of novel antimicrobial strategies.
[link]


13.) Preference-based Reinforcement Learning: Evolutionary Direct Policy Search using a Preference-based Racing Algorithm

Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., Hüllermeier, E.
Machine Learning 97(3):327--351
Abstract: We introduce a novel approach to preference-based reinforcement learning, namely a preference-based variant of a direct policy search method based on evolutionary optimization. The core of our approach is a preference-based racing algorithm that selects the best among a given set of candidate policies with high probability. To this end, the algorithm operates on a suitable ordinal preference structure and only uses pairwise comparisons between sample rollouts of the policies. Embedding the racing algorithm in a rank-based evolutionary search procedure, we show that approximations of the so-called Smith set of optimal policies can be produced with certain theoretical guarantees. Apart from a formal performance and complexity analysis, we present first experimental studies showing that our approach performs well in practice.
[link] [pdf]


14.) Antagonism is prevalent between bacteriostatic and bactericidal antibiotics

Ocampo, P. S., Lázár, V., Papp, B., Arnoldini, M., Abel zur Wiesch, P., Busa-Fekete, R., Fekete, G., Pál, Cs., Ackermann, M., Bonhoeffer, S.
Antimicrobial Agents and Chemotherapy 58(8):4573--82
Abstract: Combination therapy is rarely used to counter resistance evolution in bacterial infections. An expansion of combination therapy requires knowledge of how drugs interact at inhibitory concentrations. More than 50 years ago it was noted that if bactericidal drugs are most potent on actively dividing cells, then the inhibition of growth induced by a bacteriostatic drug should result in an overall reduction of drug efficacy when used in combination with a bactericidal drug. Our goal here was to investigate this hypothesis systematically. We first constructed time-kill curves using five different antibiotics at clinically relevant concentrations and observed antagonism between bactericidal and bacteriostatic drugs. We extended our investigation by performing a screen of pairwise combinations across 21 different antibiotics at sub-inhibitory concentrations, and found that strong antagonistic interactions are enriched significantly amongst combinations of bacteriostatic and bactericidal drugs. Finally, since our hypothesis relies on a phenotypic effect produced by different drug classes, we recreated these experiments in a microfluidic device and performed time-lapse microscopy to directly observe and quantify growth and division of individual cells under controlled antibiotic concentrations. While our single-cell observations supported the antagonism between bacteriostatic and bactericidal drugs, they revealed an unexpected variety of cellular responses to antagonistic drug combinations, suggesting that multiple mechanisms underlie this interaction.
[link]


15.) Preference-Based Rank Elicitation using Statistical Models: The Case of Mallows

Busa-Fekete, R., Hüllermeier, E., Szörényi, B.
Proceedings of International Conference on Machine Learning (ICML), JMLR, W & CP Vol. 32(2), 1071--1079, 2014
Abstract: We address the problem of rank elicitation assuming that the underlying data generating process is characterized by a probability distribution on the set of all rankings (total orders) of a given set of items. Instead of asking for complete rankings, however, our learner is only allowed to query pairwise preferences. Using information of that kind, the goal of the learner is to reliably predict properties of the distribution, such as the most probable top-item, the most probable ranking, or the distribution itself. More specifically, learning is done in an online manner, and the goal is to minimize sample complexity while guaranteeing a certain level of confidence.
[link] [pdf]
[Supplementary material]


16.) PAC Rank Elicitation through Adaptive Sampling of Stochastic Pairwise Preferences

Busa-Fekete, R., Szörényi, B., Hüllermeier, E.
Proceedings of Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI-14), 1701--1707, 2014
Abstract: We introduce the problem of PAC rank elicitation, which consists of sorting a given set of options based on adaptive sampling of stochastic pairwise preferences. More specifically, we assume the existence of a ranking procedure, such as Copeland's method, that determines an underlying target order of the options. The goal then is to predict a ranking that is, with high probability, sufficiently close to this target order in terms of a given distance measure. We instantiate this setting with combinations of two different distance measures and ranking procedures for determining the target order. For these instantiations, we devise efficient sampling strategies of pairwise preferences and analyze the corresponding sample complexity. We also present first experiments to illustrate the practical performance of our methods.
[link] [pdf]
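
A compact Python sketch of the Copeland-style target order mentioned in the abstract (with the paper's adaptive sampling replaced, for brevity, by a fixed budget per pair, which is an assumption made here): estimate P(i beats j) from repeated noisy duels and rank the options by how many opponents they beat with estimated probability above one half.

import random

def copeland_ranking(true_p, budget=200, seed=0):
    """true_p[i][j] = P(option i beats option j) in a single noisy duel."""
    rng = random.Random(seed)
    n = len(true_p)
    wins = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            for _ in range(budget):
                if rng.random() < true_p[i][j]:
                    wins[i][j] += 1
                else:
                    wins[j][i] += 1
    score = [sum(wins[i][j] > budget / 2 for j in range(n) if j != i) for i in range(n)]
    return sorted(range(n), key=lambda i: -score[i])   # Copeland order

# Toy usage with a three-option preference matrix; the Copeland order here is 0, 1, 2.
P = [[0.5, 0.8, 0.6],
     [0.2, 0.5, 0.7],
     [0.4, 0.3, 0.5]]
print(copeland_ranking(P))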


2013


17.) Preference-based Evolutionary Optimization Using Generalized Racing Algorithms

Busa-Fekete, R., Fober, T., Hüllermeier, E.
Proceedings of 23. Workshop Computational Intelligence, Dortmund, Germany, 2013
Abstract: We propose a generic approach to evolutionary optimization that is suitable for problems in which candidate solutions are difficult to assess: Instead of a deterministic, numerical evaluation of the fitness of individual candidates, we proceed from stochastic, qualitative evaluations in the form of pairwise comparisons between competing candidates. Our extension is based on a proper specification of the selection operator under these conditions and makes use of a preference-based version of an adaptive sampling scheme known as racing algorithms.
[link] [pdf]


18.) Preference-based Evolutionary Direct Policy Search

Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., Hüllermeier, E.
Proceedings of ECML Workshop on Reinforcement Learning with Generalized Feedback: Beyond Numeric Rewards, 2013
Abstract: We present a novel approach to preference-based reinforcement learning, namely a preference-based variant of a direct policy search method based on evolutionary optimization. The core of our approach is a preference-based racing algorithm that selects the best among a given set of candidate policies with high probability. To this end, the algorithm operates on a suitable ordinal preference structure and only uses pairwise comparisons between sample rollouts of the policies. We present first experimental studies showing that our approach performs well in practice.
[link] [pdf]


19.) Interactive Q-Learning with Ordinal Rewards and Unreliable Tutor

Weng, P., Busa-Fekete, R., Hüllermeier, E.
Proceedings of ECML Workshop on Reinforcement Learning with Generalized Feedback: Beyond Numeric Rewards, 2013
Abstract: Conventional reinforcement learning (RL) requires the specification of a numeric reward function, which is often a difficult task. In this paper, we extend the Q-learning approach toward the handling of ordinal rewards. The method we propose is interactive in the sense of allowing the agent to query a tutor for comparing sequences of ordinal rewards. More specifically, this method can be seen as an extension of a recently proposed interactive value iteration (IVI) algorithm for Markov Decision Processes to the setting of reinforcement learning; in contrast to the original IVI algorithm, our method is tolerant toward unreliable and inconsistent tutor feedback.
[link] [pdf]


20.) Bacterial evolution of antibiotic hypersensitivity

Lázár, V., Singh, G. P., Spohn, R., Nagy, I., Horváth, B., Hrtyan, M., Busa-Fekete, R., Bogos, B., Méhi, O., Csörgő, B., Pósfai, Gy., Fekete, G., Szappanos, B., Kégl, B., Papp, B., Pál, Cs.
Molecular Systems Biology 9:700
Abstract: The evolution of resistance to a single antibiotic is frequently accompanied by increased resistance to multiple other antimicrobial agents. In sharp contrast, very little is known about the frequency and mechanisms underlying collateral sensitivity. In this case, genetic adaptation under antibiotic stress yields enhanced sensitivity to other antibiotics. Using large-scale laboratory evolutionary experiments with Escherichia coli, we demonstrate that collateral sensitivity occurs frequently during the evolution of antibiotic resistance. Specifically, populations adapted to aminoglycosides have an especially low fitness in the presence of several other antibiotics. Whole-genome sequencing of laboratory-evolved strains revealed multiple mechanisms underlying aminoglycoside resistance, including a reduction in the proton-motive force (PMF) across the inner membrane. We propose that as a side effect, these mutations diminish the activity of PMF-dependent major efflux pumps (including the AcrAB transporter), leading to hypersensitivity to several other antibiotics. More generally, our work offers an insight into one of the mechanisms that drives the evolution of negative trade-offs under antibiotic selection.
[link] [pdf]


21.) Learning to rank lexical substitutions

Szarvas, Gy., Busa-Fekete, R., Hüllermeier, E.
Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 1926--1932, 2013
Abstract: The problem of replacing a word with a synonym that fits well in its sentential context is known as the lexical substitution task. In this paper, we tackle this task as a supervised ranking problem. Given a dataset of target words, their sentential contexts and the potential substitutions for the target words, the goal is to train a model that accurately ranks the candidate substitutions based on their contextual fitness. As a key contribution, we customize and evaluate several learning-to-rank models for the lexical substitution task, including classification-based and regression-based approaches. On two datasets widely used for lexical substitution, our best models significantly advance the state-of-the-art.
[link] [pdf]


22.) Tune and mix: learning to rank using ensembles of calibrated multi-class classifiers

Busa-Fekete, R., Kégl, B., Éltető, T., Szarvas, Gy.
Machine Learning 93(2--3):261--292
Abstract: In subset ranking, the goal is to learn a ranking function that approximates a gold standard partial ordering of a set of objects (in our case, a set of documents retrieved for the same query). The partial ordering is given by relevance labels representing the relevance of documents with respect to the query on an absolute scale. Our approach consists of three simple steps. First, we train standard multi-class classifiers (AdaBoost.MH and multi-class SVM) to discriminate between the relevance labels. Second, the posteriors of multi-class classifiers are calibrated using probabilistic and regression losses in order to estimate the Bayes-scoring function which optimizes the Normalized Discounted Cumulative Gain (NDCG). In the third step, instead of selecting the best multi-class hyperparameters and the best calibration, we mix all the learned models in a simple ensemble scheme.
Our extensive experimental study is itself a substantial contribution. We compare most of the existing learning-to-rank techniques on all of the available large-scale benchmark data sets using a standardized implementation of the NDCG score. We show that our approach is competitive with conceptually more complex listwise and pairwise methods, and clearly outperforms them as the data size grows. As a technical contribution, we clarify some of the confusing results related to the ambiguities of the evaluation tools, and propose guidelines for future studies.
[link] [pdf]


23.) Closure Enhancement in a Model Network with Orientation Tuned Long-Range Connectivity

Reiz, B., Busa-Fekete, R., Pongor, S., Kovács, I.
Bistable Perception Special Issue of Learning and Perception 5(2):119--148
Abstract: The primary visual cortex (V1) of the mammalian brain is equipped with a specifically connected network of neurons that can potentially solve difficult image processing tasks. These neurons are selectively tuned for locations in visual space and also for line orientation. The coupling of location and orientation tuning results in the neural representation of the visual world in terms of local features. These local features, e.g., oriented line segments, will have to be linked together in order to parse the visual world into regions corresponding to object and ground. Although standard models of V1 do not address the issue of interacting neuronal populations, we suggest that the long-range connectivity pattern of V1 provides an architecture where spreading neural activity may lead to pertinent figure-ground segmentation. The model relies on the fact that in addition to the processing units, their connections are also selectively tuned for space and orientation. From the computational point of view, the model uses a minimalist approach that applies the fundamental concepts of Gestalt psychology - proximity, similarity and continuity - to the spreading of neuronal activation signals. This model is successful in predicting psychophysical performance of human observers, and provides an account of the computational power of V1.
[link] [pdf]


24.) Detecting Autism, Emotions and Social Signals Using AdaBoost

Gosztolya, G., Busa-Fekete, R., Tóth, L.
Proceedings of INTERSPEECH, 220--224, 2013
Abstract: In the area of speech technology, tasks that involve the extraction of non-linguistic information have been receiving more attention recently. The Computational Paralinguistics Challenge (ComParE 2013) sought to develop techniques to efficiently detect a number of paralinguistic events, including the detection of non-linguistic events (laughter and fillers) in speech recordings as well as categorizing whole (albeit short) recordings by speaker emotion, conflict or the presence of developmental disorders (autism). We treated these sub-challenges as general classification tasks and applied the general-purpose machine learning meta-algorithm, AdaBoost.MH, and its recently proposed variant, AdaBoost.MH.BA, to them. The results show that these boosting algorithms convincingly outperform baseline SVM scores.
[link] [pdf]
[Supplementary material]


25.) Top-k Selection based on Adaptive Sampling of Noisy Preferences

Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., Hüllermeier, E.
Proceedings of International Conference on Machine Learning (ICML), JMLR, W & CP Vol. 28(3), 1094--1102, 2013
Abstract: We consider the problem of reliably selecting an optimal subset of fixed size from a given set of choice alternatives, based on noisy information about the quality of these alternatives. Problems of similar kind have been tackled by means of adaptive sampling schemes called racing algorithms. However, in contrast to existing approaches, we do not assume that each alternative is characterized by a real-valued random variable, and that samples are taken from the corresponding distributions. Instead, we only assume that alternatives can be compared in terms of pairwise preferences. We propose and formally analyze a general preference-based racing algorithm that we instantiate with three specific ranking procedures and corresponding sampling schemes. Experiments with real and synthetic data are presented to show the efficiency of our approach.
[link] [pdf]
[Supplementary material]


26.) Gossip-based distributed stochastic bandit algorithms

Szörényi, B., Busa-Fekete, R., Hegedűs, I., Ormándi, R., Jelasity, M., Kégl, B.
Proceedings of International Conference on Machine Learning (ICML), JMLR, W & CP Vol. 28(3), 28--36, 2013
Abstract: The multi-armed bandit problem has attracted remarkable attention in the machine learning community and many efficient algorithms have been proposed to handle the so-called exploitation-exploration dilemma in various bandit setups. At the same time, significantly less effort has been devoted to adapting bandit algorithms to particular architectures, such as sensor networks, multi-core machines, or peer-to-peer (P2P) environments, which could potentially speed up their convergence. Our goal is to adapt stochastic bandit algorithms to P2P networks. In our setup, the same set of arms is available in each peer. In every iteration each peer can pull one arm independently of the other peers, and then some limited communication is possible with a few random other peers. As our main result, we show that our adaptation achieves a linear speedup in terms of the number of peers participating in the network. More precisely, we show that the probability that a suboptimal arm is played at a peer in iteration $t = \Omega(\log N)$ is $O(1/(d^2 N t))$, where $N$ denotes the number of peers, and $d$ is a lower bound on the gap between the expected reward on an optimal arm and any suboptimal arm. The theoretical results are supported by simulation experiments showing that our algorithm scales gracefully with the size of network.
[link] [pdf]
[Supplementary material]
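
The Python sketch below gives a deliberately simplified picture of the setup described above (it is not the paper's algorithm): every peer runs plain UCB on its local statistics and, after each round, gossips with one uniformly chosen peer, averaging per-arm counts and reward sums so that information diffuses through the network.

import math, random

def gossip_ucb(arm_means, n_peers=10, rounds=200, seed=0):
    rng = random.Random(seed)
    K = len(arm_means)
    counts = [[1e-9] * K for _ in range(n_peers)]    # per-peer pull counts
    sums = [[0.0] * K for _ in range(n_peers)]       # per-peer reward sums
    for t in range(1, rounds + 1):
        for p in range(n_peers):                     # local UCB decision and pull
            ucb = [sums[p][k] / counts[p][k] +
                   math.sqrt(2 * math.log(t * n_peers + 1) / counts[p][k])
                   for k in range(K)]
            k = max(range(K), key=lambda a: ucb[a])
            counts[p][k] += 1.0
            sums[p][k] += 1.0 if rng.random() < arm_means[k] else 0.0
        for p in range(n_peers):                     # gossip: pairwise averaging
            q = rng.randrange(n_peers)
            for k in range(K):
                c = (counts[p][k] + counts[q][k]) / 2
                s = (sums[p][k] + sums[q][k]) / 2
                counts[p][k] = counts[q][k] = c
                sums[p][k] = sums[q][k] = s
    best = max(range(K), key=lambda a: arm_means[a])
    return sum(max(range(K), key=lambda a: sums[p][a] / counts[p][a]) == best
               for p in range(n_peers)) / n_peers    # peers identifying the best arm

print(gossip_ucb([0.2, 0.5, 0.8]))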


2012


27.) MultiBoost: a Multi-purpose Boosting Package

Benbouzid, D., Busa-Fekete, R., Casagrande, N., Collin, F.-D., Kégl, B.
Journal of Machine Learning Research 13(1):549--553
Abstract: The MultiBoost package provides a fast C++ implementation of multi-class/multi-label/multi-task boosting algorithms. It is based on AdaBoost.MH but it also implements popular cascade classifiers and FilterBoost. The package contains common multi-class base learners (stumps, trees, products, Haar filters). Further base learners and strong learners following the boosting paradigm can be easily implemented in a flexible framework.
[link] [pdf]


28.) An apple-to-apple comparison of Learning-to-rank algorithms in terms of Normalized Discounted Cumulative Gain

Busa-Fekete, R., Szarvas, Gy., Éltető, T., Kégl, B.
Proceedings of ECAI-12 Workshop, Preference Learning: Problems and Applications in AI, 2012
Abstract: The Normalized Discounted Cumulative Gain (NDCG) is a widely used evaluation metric for learning-to-rank (LTR) systems. NDCG is designed for ranking tasks with more than one relevance level. There are many freely available, open source tools for computing the NDCG score for a ranked result list. Even though the definition of NDCG is unambiguous, the various tools can produce different scores for ranked lists with certain properties, which distorts the empirical tests in many published papers and makes the results reported in different studies difficult to compare. In this study, first, we identify the major differences between the various publicly available NDCG evaluation tools. Second, based on a set of comparative experiments using a common benchmark dataset in LTR research and 6 different LTR algorithms, we demonstrate how these differences affect the overall performance of different algorithms and the final scores that are used to compare different systems.
[link] [pdf]
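
To make the kind of discrepancy discussed above concrete, here is a small Python example (my own, not taken from the paper): two common NDCG conventions differ in the gain function (2^rel - 1 versus rel), and a query whose documents are all non-relevant has an ideal DCG of zero, a corner case that some tools score as 0 and others as 1.

import math

def dcg(rels, exponential_gain=True):
    gain = (lambda r: 2 ** r - 1) if exponential_gain else (lambda r: r)
    return sum(gain(r) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels, exponential_gain=True, empty_query_score=0.0):
    ideal = dcg(sorted(rels, reverse=True), exponential_gain)
    if ideal == 0:
        return empty_query_score                    # the ambiguous corner case
    return dcg(rels, exponential_gain) / ideal

ranked = [3, 2, 3, 0, 1, 2]                         # relevance labels in ranked order
print(ndcg(ranked, exponential_gain=True), ndcg(ranked, exponential_gain=False))
print(ndcg([0, 0, 0], empty_query_score=0.0), ndcg([0, 0, 0], empty_query_score=1.0))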


29.) Fast classification using sparse decision DAGs

Benbouzid, D., Busa-Fekete, R., Kégl, B.
Proceedings of International Conference on Machine Learning (ICML) Vol. 29, 951--958, 2012
Abstract: In this paper we propose an algorithm that builds sparse decision DAGs (directed acyclic graphs) out of a list of base classifiers provided by an external learning method such as AdaBoost. The basic idea is to cast the DAG design task as a Markov decision process. Each instance can decide to use or to skip each base classifier, based on the current state of the classifier being built. The result is a sparse decision DAG where the base classifiers are selected in a data-dependent way. The method has a single hyperparameter with a clear semantics of controlling the accuracy/speed trade-off. The algorithm is competitive with state-of-the-art cascade detectors on three object-detection benchmarks, and it clearly outperforms them in the regime of low number of base classifiers. Unlike cascades, it is also readily applicable for multi-class classification. Using the multi-class setup, we show on a benchmark web page ranking data set that we can significantly improve the decision speed without harming the performance of the ranker.
[link] [pdf]


30.) Peer-to-Peer Multi-Class Boosting

Hegedűs, I., Busa-Fekete, R., Ormándi, R., Jelasity, M., Kégl, B.
Proceedings of European Conference on Parallel and Distributed Computing (EURO-PAR) Vol. LNCS, 7484, 389--400, 2012
Abstract: We focus on the problem of data mining over large-scale fully distributed databases, where each node stores only one data record. We assume that a data record is never allowed to leave the node it is stored at. Possible motivations for this assumption include privacy or a lack of a centralized infrastructure. To tackle this problem, earlier we proposed the generic gossip learning framework (GoLF), but so far we have studied only basic linear algorithms. In this paper we implement the well-known boosting technique in GoLF. Boosting techniques have attracted growing attention in machine learning due to their outstanding performance in many practical applications. Here, we present an implementation of a boosting algorithm that is based on FilterBoost. Our main algorithmic contribution is a derivation of a pure online multi-class version of FilterBoost, so that it can be employed in GoLF. We also propose improvements to GoLF, with the aim of maximizing the diversity of the evolving models gossiped in the network, a feature that we show to be important. We evaluate the robustness and the convergence speed of the algorithm empirically over three benchmark databases. We compare the algorithm with the sequential AdaBoost algorithm and we test its performance in a failure scenario involving message drop and delay, and node churn.
[link] [pdf]


2011


31.) MDDAG: designing sparse decision DAGs using Markov decision processes

Benbouzid, D., Busa-Fekete, R., Kégl, B.
Proceedings of NIPS workshop on Deep Learning and Unsupervised Feature Learning, 2011
Abstract: In this paper we propose an algorithm that builds sparse decision DAGs (directed acyclic graphs) out of a list of features or base classifiers. The basic idea is to cast the DAG design task as a Markov decision process. Each instance can decide to use or to skip each base classifier, based on the current state of the classifier being built. The result is a sparse decision DAG where the base classifiers are selected in a data-dependent way. The development of the algorithm was directly motivated by improving the traditional cascade design in applications where the computational requirements of classifying a test instance are as important as the performance of the classifier itself. Besides outperforming classical cascade designs on benchmark data sets, the algorithm also produces interesting deep structures where similar input data follows the same path in the DAG, and subpaths of increasing length represent features of increasing complexity.
[link] [pdf]


32.) A Robust Ranking Methodology Based on Diverse Calibration of AdaBoost

Busa-Fekete, R., Kégl, B., Éltető, T., Szarvas, Gy.
Proceedings of ECML PKDD 11, Vol. LNCS, 6911, 263--279, 2011
Abstract: In subset ranking, the goal is to learn a ranking function that approximates a gold standard partial ordering of a set of objects (in our case, relevance labels of a set of documents retrieved for the same query). In this paper we introduce a learning to rank approach to subset ranking based on multi-class classification. Our technique can be summarized in three major steps. First, a multi-class classification model (AdaBoost.MH) is trained to predict the relevance label of each object. Second, the trained model is calibrated using various calibration techniques to obtain diverse class probability estimates. Finally, the Bayes-scoring function (which optimizes the popular Information Retrieval performance measure NDCG), is approximated through mixing these estimates into an ultimate scoring function. An important novelty of our approach is that many different methods are applied to estimate the same probability distribution, and all these hypotheses are combined into an improved model. It is well known that mixing different conditional distributions according to a prior is usually more efficient than selecting one ``optimal'' distribution. Accordingly, using all the calibration techniques, our approach does not require the estimation of the best suited calibration method and is therefore less prone to overfitting. In an experimental study, our method outperformed many standard ranking algorithms on the LETOR benchmark datasets, most of which are based on significantly more complex learning to rank algorithms than ours.
[link] [pdf]


33.) Ranking by calibrated AdaBoost

Busa-Fekete, R., Kégl, B., Éltető, T., Szarvas, Gy.
Proceedings of YAHOO Learning-to-Rank Challenge, JMLR, Workshop and Conference Proceedings Vol. 14, 25--35, 2011
Abstract: This paper describes the LAL team's approach in the Yahoo learning-to-rank challenge. Our technique is essentially pointwise with a listwise touch at the last combination step. The main ingredients of our approach are 1) preprocessing (querywise normalization), 2) multi-class AdaBoost.MH, 3) regression calibration, and 4) an exponentially weighted forecaster for model combination. In post-challenge analysis we found that preprocessing and training AdaBoost with a wide variety of hyperparameters improved individual models significantly, the final listwise ensemble step was crucial, whereas calibration helped only in creating diversity.
[link]


34.) Reconstructing $N_{\mu}^{19}(1000)$

Kégl, B., Busa-Fekete, R., Louedec, K., Bardenet, R., Garrido, X., Marics, I. C., Monnier-Ragaigne, D., Dagoret-Campagne, S., Urban, M.
Technical Report 2011-054, Auger Project Technical Note, 2011
[link]


2010


35.) Fast boosting using adversarial bandits

Busa-Fekete, R., Kégl, B.
Proceedings of International Conference on Machine Learning (ICML) Vol. 27, 143--150, 2010
Abstract: In this paper we apply multi-armed bandits (MABs) to improve the computational complexity of AdaBoost. AdaBoost constructs a strong classifier in a stepwise fashion by selecting simple base classifiers and using their ``vote'' to determine the final classification. We model this stepwise base classifier selection as a sequential decision problem, and optimize it with MABs, where each arm represents a subset of the base classifier set. The MAB gradually learns the ``usefulness'' of the subsets, and selects one of the subsets in each iteration. AdaBoost then searches only this subset instead of optimizing the base classifier over the whole space. The main improvement of this paper over a previous approach is that we use an adversarial bandit algorithm instead of stochastic bandits. This choice allows us to prove a weak-to-strong-learning theorem, which means that the proposed technique remains a boosting algorithm in a formal sense. We demonstrate on benchmark datasets that our technique can achieve a generalization performance similar to standard AdaBoost for a computational cost that is an order of magnitude smaller.
[pdf]
[Supplementary material]


36.) Palmprint Classification using Wavelets and ADABOOST

Chen, G.Y., Zhu, W. P., Kégl, B., Busa-Fekete, R.
Proceedings of the 7th International Symposium on Neural Networks (ISNN) Vol. LNCS 6064, 178--183, 2010
Abstract: A new palmprint classification method is proposed in this paper by using the wavelet features and AdaBoost. The method outperforms all other classification methods for the PolyU palmprint database. The novelty of the method is two-fold. On one hand, the combination of wavelet features with AdaBoost has never been proposed for palmprint classification before. On the other hand, a recently developed base learner (products of base classifiers) is included in this paper. Experiments are conducted in order to show the effectiveness of the proposed method for palmprint classification.
[link]


37.) A nonparametric approach to estimate Xmax using Gaussian process regression

Kégl, B., Unger, M., Busa-Fekete, R.
Technical Report 2010-034, Auger Project Technical Note, 2010
[link]


2009


38.) Semi-automated Construction of Decision Rules to Predict Morbidities from Clinical Texts

Farkas, R., Szarvas, G., Hegedűs, I., Almási, A., Vincze, V., Ormándi, R., Busa-Fekete, R.
JAMIA 16(4):601--5
Abstract: Objective: In this study the authors describe the system submitted by the team of the University of Szeged to the second i2b2 Challenge in Natural Language Processing for Clinical Data. The challenge focused on the development of automatic systems that analyzed clinical discharge summary texts and addressed the following question: "Who's obese and what co-morbidities do they (definitely/most likely) have?". Target diseases included obesity and its 15 most frequent comorbidities exhibited by patients, while the target labels corresponded to expert judgments based on textual evidence and intuition (separately).
Design: The authors applied statistical methods to preselect the most common and confident terms and evaluated outlier documents by hand to discover infrequent spelling variants. The authors expected a system with dictionaries gathered semi-automatically to have a good performance with moderate development costs (the authors examined just a small proportion of the records manually).
Measurements: Following the standard evaluation method of the second Workshop on Challenges in Natural Language Processing for Clinical Data, the authors used both macro- and micro-averaged Fβ=1 measure for evaluation.
Results: The authors' submission achieved a micro-averaged Fβ=1 score of 97.29 for classification based on textual evidence (macro-averaged Fβ=1 = 76.22) and 96.42 for intuitive judgments (macro-averaged Fβ=1 = 67.27).
Conclusions: The results demonstrate the feasibility of the authors' approach and show that even very simple systems with a shallow linguistic analysis can achieve remarkable accuracy scores for classifying clinical records on a limited set of concepts.
[link] [pdf]


39.) A MATLAB program for cluster analysis using graph theory

Todd, S. C., Tóth, M. T., Busa-Fekete, R.
Computers & Geosciences 36(6):1205--1213
Abstract: Cluster analysis is used in numerous scientific disciplines. A method of cluster analysis based on graph theory is discussed and a MATLAB code for its implementation is presented. The algorithm is based on the number of variables that are similar between samples. By changing the similarity criterion in a stepwise fashion, a hierarchical group structure develops, and can be displayed by a dendrogram. Three indexes describe the homogeneity of a given variable in a group, the heterogeneity of that variable between two groups, and the usefulness of that variable in distinguishing two groups. The algorithm is applied to both a synthetic dataset and a set of trace element analyses of lavas from Mount Etna in order to compare GraphClus to other cluster analysis algorithms.
[link] [pdf]
[Supplementary material]


40.) Bandit-Aided Boosting

Busa-Fekete, R., Kégl, B.
Proceedings of 2nd NIPS Workshop on Optimization for Machine Learning, 2009
Abstract: In this paper we apply multi-armed bandits (MABs) to accelerate AdaBoost. AdaBoost constructs a strong classifier in a stepwise fashion by selecting simple base classifiers and using their weighted ``vote'' to determine the final classification. We model this stepwise base classifier selection as a sequential decision problem, and optimize it with MABs. Each arm represents a subset of the base classifier set. The MAB gradually learns the ``utility'' of the subsets, and selects one of the subsets in each iteration. AdaBoost then searches only this subset instead of optimizing the base classifier over the whole space. The reward is defined as a function of the accuracy of the base classifier. We investigate how the MAB algorithms (UCB, UCT) can be applied in the case of boosted stumps, trees, and products of base classifiers. On benchmark datasets, our bandit-based approach achieves only slightly worse test errors than the standard boosted learners for a computational cost that is an order of magnitude smaller than with standard AdaBoost.
[link] [pdf]


41.) Accelerating AdaBoost using UCB

Busa-Fekete, R., Kégl, B.
Proceedings of KDD Cup (2009), JMLR Workshop and Conference Proceedings Vol. 7, 111--122, 2009
Abstract: This paper explores how multi-armed bandits (MABs) can be applied to accelerate AdaBoost. AdaBoost constructs a strong classifier in a stepwise fashion by adding simple base classifiers to a pool and using their weighted ``vote'' to determine the final classification. We model this stepwise base classifier selection as a sequential decision problem, and optimize it with MABs. Each arm represents a subset of the base classifier set. The MAB gradually learns the ``utility'' of the subsets, and selects one of the subsets in each iteration. AdaBoost then searches only this subset instead of optimizing the base classifier over the whole space. The reward is defined as a function of the accuracy of the base classifier. We investigate how the well-known UCB algorithm can be applied in the case of boosted stumps, trees, and products of base classifiers. The KDD Cup 2009 was a large-scale learning task with a limited training time, thus this challenge offered us a good opportunity to test the utility of our approach. During the challenge our best results came in the Up-selling task where our model was within 1% of the best AUC rate. After more thorough post-challenge validation the algorithm performed as well as the best challenge submission on the small data set in two of the three tasks.
[link] [pdf]
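
The Python sketch below illustrates the bandit-accelerated boosting idea in a stripped-down form (it is not the MultiBoost implementation, and plain UCB1 stands in for the bandit variants the papers study): the features are partitioned into subsets, each subset is an arm, the chosen arm restricts the stump search in that boosting round, and the stump's weighted edge serves as the bandit reward.

import math, random

def best_stump(X, y, w, feats):
    """Weighted decision stump restricted to the given feature indices."""
    best = (0, 0.0, 1, -1.0)                        # (feature, threshold, sign, |edge|)
    for f in feats:
        for thr in sorted({x[f] for x in X})[::max(1, len(X) // 10)]:
            edge = sum(wi * yi * (1 if x[f] >= thr else -1)
                       for x, yi, wi in zip(X, y, w))
            if abs(edge) > best[3]:
                best = (f, thr, 1 if edge > 0 else -1, abs(edge))
    return best

def bandit_adaboost(X, y, subsets, rounds=15):
    n, K = len(X), len(subsets)
    w = [1.0 / n] * n
    pulls, rewards, model = [0] * K, [0.0] * K, []
    for t in range(rounds):
        if t < K:
            a = t                                   # initialization: try each subset once
        else:
            a = max(range(K), key=lambda k:         # UCB1 index over subsets
                    rewards[k] / pulls[k] + math.sqrt(2 * math.log(t) / pulls[k]))
        f, thr, sign, edge = best_stump(X, y, w, subsets[a])
        edge = min(edge, 1 - 1e-9)
        alpha = 0.5 * math.log((1 + edge) / (1 - edge))
        pulls[a] += 1
        rewards[a] += edge                          # reward: the stump's weighted edge
        h = lambda x, f=f, thr=thr, s=sign: s if x[f] >= thr else -s
        model.append((alpha, h))
        w = [wi * math.exp(-alpha * yi * h(x)) for x, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(al * h(x) for al, h in model) >= 0 else -1

# Toy usage: the labels depend only on feature 0, so the subset [0, 1] should win most pulls.
rnd = random.Random(1)
X = [[rnd.random() for _ in range(4)] for _ in range(80)]
y = [1 if x[0] > 0.5 else -1 for x in X]
clf = bandit_adaboost(X, y, subsets=[[0, 1], [2, 3]])
print(sum(clf(x) == yi for x, yi in zip(X, y)) / len(X))   # training accuracy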


42.) Boosting products of base classifiers

Kégl, B., Busa-Fekete, R.
Proceedings of International Conference on Machine Learning (ICML) Vol. 26, 497--504, 2009
Abstract: We show how to boost products of base learners. Similarly to trees, we call the base learner as a subroutine but in an iterative rather than recursive fashion. In experiments on relatively large data sets we clearly outperform boosted trees. We also propose an improved base learner for nominal features and show that boosting the product of two of these new subset indicator base learners solves the maximum margin matrix factorization problem.
[link] [pdf]


43.) A One-Class Classification Approach for Protein Sequences and Structures

Bánhalmi, A., Busa-Fekete, R., Kégl, B.
Proceedings of 5th International Symposium on Bioinformatics Research and Applications (ISBRA) Vol. LNCS, 4701, 310--322, 2009
Abstract: The One-Class Classification (OCC) approach is based on the assumption that samples are available only from a target class in the training phase. OCC methods have been applied with success to problems where the classes are very different in size. As class-imbalance problems are typical in protein classification tasks, we were interested in testing one-class classification algorithms for the detection of distant similarities in protein sequences and structures. We found that the OCC approach brought about a small improvement in classification performance compared to binary classifiers (SVM, ANN, Random Forest). More importantly, there is a substantial (50 to 100 fold) improvement in the training time. OCCs may provide an especially useful alternative for processing those protein groups where discriminative classifiers cannot be easily trained.
[link] [pdf]


2008


44.) Evolutionary Tree Reconstruction and its Applications in Protein Classification

Busa-Fekete, R.
Ph.D. thesis, University of Szeged, 2008
[pdf]


45.) Extracting Human Protein Information from MEDLINE Using a Full-Sentence Parser

Busa-Fekete, R., Kocsor, A.
Acta Cybernetica 18(3):391--402
Abstract: Today, a fair number of systems are available for the task of processing biological data. The development of effective systems is of great importance since they can support both the research and the everyday work of biologists. It is well known that biological databases are large both in size and number, hence data processing technologies are required for the fast and effective management of the contents stored in databases like MEDLINE. A possible solution for content management is the application of natural language processing methods to help make it easier. With our approach we would like to learn more about the interactions of human genes using full-sentence parsing. Given a sentence, the syntactic parser assigns to it a syntactic structure, which consists of a set of labelled links connecting pairs of words. The parser also produces a constituent representation of a sentence (showing noun phrases, verb phrases, and so on). Here we show experimentally that using the syntactic information of each abstract, the biological interactions of genes can be predicted. Hence, it is worth developing the kind of information extraction (IE) system that can retrieve information about gene interactions just by using syntactic information contained in these texts. Our IE system can handle certain types of gene interactions with the help of machine learning (ML) methodologies (Hidden Markov Models, Artificial Neural Networks, Decision Trees, Support Vector Machines). The experiments and practical usage show clearly that our system can provide a useful intuitive guide for biological researchers in their investigation and in the design of their experiments.
[link]


46.) Protein Classification based on Propagation on Unrooted Binary Trees

Kocsor, A., Busa-Fekete, R., Pongor, S.
Protein and Peptide Letters 15(5):428--34
Abstract: We present two efficient network propagation algorithms that operate on a binary tree, i.e., a sparse-edged substitute of an entire similarity network. TreeProp-N is based on passing increments between nodes while TreeProp-E employs propagation to the edges of the tree. Both algorithms improve protein classification efficiency.
[link] [pdf]


47.) Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities

Busa-Fekete, R., Kertész-Farkas, A., Kocsor, A., Pongor, S.
Journal of Biochemical and Biophysical Methods 70(6):1210--4
Abstract: Identification of problematic protein classes (domain types, protein families) that are difficult to predict from sequence is a key issue in genome annotation. ROC (Receiver Operating Characteristic) analysis is routinely used for the evaluation of protein similarities; however, its results - the area under curve (AUC) values - are differentially biased for the various protein classes that are highly different in size. We show the bias can be compensated for by adjusting the length of the top-list in a class-dependent fashion, so that the number of negatives within the top list will be equal to (or proportional with) the size of the positive class. Using this balanced protocol the problematic classes can be identified by their AUC values, or by a scatter diagram in which the AUC values are plotted against the positive/negative ratio of the top-list. Using likelihood-ratio scoring (Kaján et al., Bioinformatics, 22, 2865--2869, 2007), the bias caused by class imbalance can be further decreased.
[link] [pdf]
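
The following Python sketch is one reading of the balanced protocol described above (not the authors' code, and the normalization chosen here is just one reasonable option): rank all candidates by similarity score, walk down the list until the number of negatives seen equals the size of the positive class, and compute the pairwise-ordering AUC restricted to that truncated top list.

def balanced_auc(scores, labels):
    """scores: similarity to the class; labels: 1 for class members, 0 otherwise."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = area = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
            area += tp                   # positives ranked above this counted negative
        if fp == n_pos:                  # stop once #negatives equals the class size
            break
    return area / (n_pos * fp) if fp else 0.0

# Toy usage: a class of four positives embedded in a larger negative set.
scores = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(balanced_auc(scores, labels))      # 0.625 for this toy list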


2007


48.) Tree-Based Protein Classification

Busa-Fekete, R., Kocsor, A., Pongor, S.
Computational Intelligence in Bioinformatics, Studies in Computational Intelligence series, Chapter 6, 165--82, 2007
Abstract: The problem of protein sequence classification is one of the crucial tasks in the interpretation of genomic data. Many high-throughput systems were developed with the aim of categorizing the proteins based only on their sequences. However, modelling how the proteins have evolved can also help in the classification task of sequenced data. Hence phylogenetic analysis has gained importance in the field of protein classification. This approach does not just rely on the similarities in sequences, but it also considers the phylogenetic information stored in a tree (e.g. in a phylogenetic tree). Eisen was the first to use phylogenetic trees in protein classification, and his work has revived the discipline of phylogenomics. In this chapter we provide an overview of this area, and in addition we propose two algorithms well suited to this task. Both are based on a weighted binary tree representation of protein similarity data. TreeInsert assigns the class label to the query by determining the minimum cost necessary to insert the query in the (precomputed) trees representing the various classes. TreeNN assigns the label to the query based on an analysis of the query's neighborhood within a binary tree containing members of the known classes. The algorithms were tested in combination with various sequence similarity scoring methods (BLAST, Smith-Waterman, Local Alignment Kernel as well as various compression-based distance scores) using a large number of classification tasks representing various degrees of difficulty. Both TreeNN and TreeInsert exceed the performance of simple similarity search (1NN) as determined by ROC analysis, at the expense of a modest computational overhead. Combined with a fast tree-building method, both algorithms are suitable for web-based server applications.
[link] [pdf]


49.) State-of-the-art anonymisation of medical data with an iterative machine learning model/framework

Szarvas, Gy., Farkas, R., Busa-Fekete, R.
JAMIA 14(5):574--80
Abstract: Objective: The anonymisation of medical records is crucial in human life sciences because a de-identified text can be made publicly available for non-hospital researchers as well, to facilitate research on human diseases. The authors developed a de-identification model that can successfully remove personal health information (PHI) from discharge records to make them conform to the guidelines of the Health Insurance Portability and Accountability Act. Design: We introduce here a novel, machine learning based iterative Named Entity Recognition (NER) approach intended for use on semi-structured documents like discharge records. Our method identifies PHI in several steps. First, it labels all entities whose tags can be inferred from the structure of the text and then it utilises this information to find further PHI phrases in the flow text parts of the document. Measurements: Following the standard evaluation method of the first Workshop on Challenges in Natural Language Processing for Clinical Data, we used token-level Precision, Recall and Fβ=1 measure metrics for evaluation. Results: Our system achieved outstanding accuracy on the standard evaluation dataset of the de-identification challenge, finishing second with an F-measure of 99.75. Conclusion: The difference between our system and the best performing anonymisation system is below 0.01, so we can say our system is competitive with the state-of-the-art, while we describe here several techniques that can be beneficial in other problems dealing with structured documents like clinical records.
[link] [pdf]
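The two-pass idea of the Design section above (first harvest PHI whose tag follows from the record's structure, then re-label later free-text occurrences of the same phrases) can be caricatured in a few lines. This is a hypothetical simplification, not the published system; the header patterns and helper names (structured_phi, propagate_to_text) are made up for illustration.

# Hypothetical two-pass PHI tagger: structured headers first, then propagation into free text.
import re

HEADER_PATTERNS = {
    "DATE": re.compile(r"^(Admission|Discharge) Date:\s*(?P<value>.+)$", re.I),
    "DOCTOR": re.compile(r"^Attending:\s*(?P<value>.+)$", re.I),
    "ID": re.compile(r"^Record ID:\s*(?P<value>.+)$", re.I),
}

def structured_phi(lines):
    """Pass 1: collect PHI whose tag is implied by the record's structure."""
    found = {}
    for line in lines:
        for tag, pattern in HEADER_PATTERNS.items():
            m = pattern.match(line.strip())
            if m:
                found.setdefault(tag, set()).add(m.group("value").strip())
    return found

def propagate_to_text(lines, found):
    """Pass 2: re-label later occurrences of the phrases found in pass 1."""
    labelled = []
    for line in lines:
        for tag, values in found.items():
            for value in values:
                if value and value in line:
                    line = line.replace(value, f"[{tag}]")
        labelled.append(line)
    return labelled

record = ["Attending: John Smith", "...", "Patient was seen by John Smith on follow-up."]
print(propagate_to_text(record, structured_phi(record)))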


50.) Counter-Example Generation-Based One-Class Classification

Bánhalmi, A., Kocsor, A., Busa-Fekete, R.
Proceedings of European Conference on Machine Learning (ECML) Vol. LNAI, 4701, 543--550 , 2007
Abstract: For One-Class Classification problems several methods have been proposed in the literature. These methods all share the feature that the decision boundary is learnt using only a set of positive examples. Here we propose a method that extends the training set with a counter-example set, which is generated directly from the set of positive examples. Using the extended training set, a binary classifier (here a ν-SVM) is applied to separate the positive and the negative points. The results of this novel technique are compared with those of One-Class SVM and the Gaussian Mixture Model on several One-Class Classification tasks. (An illustrative counter-example generation sketch follows this entry.)
[link] [pdf]
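The central idea above, manufacturing artificial negatives from the positive set so that an ordinary binary classifier can be trained, is illustrated below. The generation rule used here (pushing each positive point outward from the positives' centroid) and the use of scikit-learn's SVC are assumptions made for the sketch; the paper's construction and its ν-SVM training are more involved.

# Hypothetical counter-example generation for one-class classification:
# push each positive point outward from the centroid, then train a binary SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
positives = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # toy positive class

centre = positives.mean(axis=0)
directions = positives - centre
norms = np.linalg.norm(directions, axis=1, keepdims=True) + 1e-12
counter_examples = centre + directions / norms * (norms.max() * 1.5)  # placed outside the data cloud

X = np.vstack([positives, counter_examples])
y = np.hstack([np.ones(len(positives)), np.zeros(len(counter_examples))])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict([[0.1, -0.2], [5.0, 5.0]]))   # a point inside the cloud and one outside it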


51.) Whitening-based Feature Space Transformations and their Effect on Real-time Phoneme Classification

Kocsor, A., Busa-Fekete, R., Bánhalmi, A.
Proceedings of International Conference on Text, Speech and Dialogue (TSD) Vol. LNAI, 4629, 222--229 , 2007
Abstract: It is quite common to use feature extraction methods prior to classification. Here we deal with three algorithms defining uncorrelated features. The first is the so-called whitening method, which transforms the data so that the covariance matrix becomes an identity matrix. The second, the well-known Fast Independent Component Analysis (FastICA), searches for orthogonal directions along which the value of the non-Gaussianity measure is large in the whitened data space. The third, the Whitening-based Springy Discriminant Analysis (WSDA), is a novel method combination which provides orthogonal directions for better class separation. We compare the effects of the above methods on a real-time vowel classification task. Based on the results we conclude that the WSDA transformation is especially suitable for this task. (A minimal whitening sketch follows this entry.)
[link] [pdf]
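The whitening step named first in the abstract has a short concrete form: rotate the centred data onto the eigenbasis of its covariance and rescale each direction to unit variance, so that the empirical covariance becomes the identity. A minimal NumPy sketch (not the paper's code) follows.

# PCA whitening: after the transform the empirical covariance is (close to) the identity.
import numpy as np

def whiten(X, eps=1e-8):
    """Center X, rotate onto the eigenbasis of the covariance, rescale to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)      # whitening matrix
    return Xc @ W

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])   # correlated toy features
Z = whiten(X)
print(np.round(np.cov(Z, rowvar=False), 2))   # approximately the identity matrix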


52.) An Automatic Retraining Method for Speaker Independent Hidden Markov Models

Bánhalmi, A., Busa-Fekete, R., Kocsor, A.
Proceedings of International Conference on Text, Speech and Dialogue (TSD) Vol. LNAI, 4629, 382--389 , 2007
Abstract: When training speaker-independent HMM-based acoustic models, a lot of manually transcribed acoustic training data must be available from a good many different speakers. These training databases show great variation in the speakers' pitch, articulation and speaking rate. In practice, the speaker-independent models are used for bootstrapping the speaker-dependent models built by speaker adaptation methods. Thus the performance of the adaptation methods is strongly influenced by the performance of the speaker-independent model and by the accuracy of the automatic segmentation, which also depends on the base model. In practice, the performance of the speaker-independent models can vary a great deal across the test speakers. Here our goal is to reduce this performance variability by increasing the performance for speakers with low values, at the price of allowing a small drop in the highest performance values. For this purpose we propose a new method for the automatic retraining of speaker-independent HMMs.
[link] [pdf]


53.) A Multi-Stack Based Phylogenetic Tree Building Method

Busa-Fekete, R., Kocsor, A., Bagyinka, Cs.
Proceedings of Bioinformatics Research and Applications (ISBRA) Vol. LNCS, 4463, 49--60 , 2007
Abstract: Here we introduce a new Multi-Stack (MS) based phylogenetic tree building method. The Multi-Stack approach organizes the candidate subtrees (i.e. those having the same number of leaves) into limited priority queues, always selecting the K-best subtrees according to their distance estimation error. Using the K-best subtrees, our method iteratively applies a novel subtree joining strategy to generate candidate higher-level subtrees from the existing lower-level ones. This new MS method uses the Constrained Least Squares Criterion (CLSC), which guarantees the non-negativity of the edge weights. The method was evaluated on real-life datasets as well as on artificial data. Our empirical study covers three very different biological domains, and the artificial tests were carried out by applying a model population generator which evolves the sequences according to the predetermined branching pattern of a randomly generated model tree. The MS method was compared with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Neighbor-Joining (NJ), Maximum Likelihood (ML) and Fitch-Margoliash (FM) methods in terms of Branch Score Distance (BSD) and Distance Estimation Error (DEE). The results show clearly that the MS method can achieve improvements in building phylogenetic trees. (A sketch of the bounded priority-queue bookkeeping follows this entry.)
[link] [pdf]
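The bookkeeping behind the Multi-Stack approach, one bounded priority queue of candidate subtrees per subtree size, keeping only the K best by estimated error, can be sketched as below. The subtree joining strategy and the constrained least-squares scoring of the paper are not reproduced; the error values and tree encodings here are placeholders.

# Sketch of the Multi-Stack bookkeeping: one bounded 'stack' of candidate subtrees
# per subtree size, keeping only the K best (lowest estimation error) candidates.
import heapq
import itertools
from collections import defaultdict

K = 5                        # capacity of each priority queue
_counter = itertools.count() # tie-breaker so heap entries never compare subtrees directly
stacks = defaultdict(list)   # number of leaves -> heap of (-error, tie_breaker, subtree)

def push_candidate(n_leaves, error, subtree):
    """Insert a candidate subtree; evict the worst one if the queue is over capacity."""
    heapq.heappush(stacks[n_leaves], (-error, next(_counter), subtree))
    if len(stacks[n_leaves]) > K:
        heapq.heappop(stacks[n_leaves])   # drop the candidate with the largest error

def k_best(n_leaves):
    """Return the stored candidates for a given size, best (smallest error) first."""
    return [t for _, _, t in sorted(stacks[n_leaves], reverse=True)]

push_candidate(3, error=0.42, subtree=("A", ("B", "C")))
push_candidate(3, error=0.17, subtree=(("A", "B"), "C"))
print(k_best(3)[0])   # the lower-error topology for 3 leaves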


2006


54.) Phylogenetic Tree Building using a Novel Compression-based Non-Symmetric Dissimilarity Measure

Busa-Fekete, R., Kocsor, A., Bagyinka, Cs.
Applied Ecology and Environmental Research 4(2):21--30 , 2006
Abstract: One approach to building phylogenetic trees is to define a distance function based on the amino acid sequences of distinct proteins. The aim of this approach is to determine a weighted tree topology that approximates the functional similarity relations between proteins, as defined by the distance function. In this case, by definition, the similarity relation is symmetric. However, the assumption of symmetry is not always appropriate, because non-symmetric similarity relations between proteins might have biological significance. This notion inspired us to define a novel, compression-based, non-symmetric dissimilarity measure and to modify the ubiquitous Unweighted Pair Group Method with Arithmetic Mean (UPGMA)-based tree-building algorithm so that the new measure can be applied. (An illustrative compression-based dissimilarity follows this entry.)
[link] [pdf]
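A compression-based dissimilarity of the kind described above can be written with any general-purpose compressor. The particular asymmetric formula below (how much is saved when x is compressed after y, normalised by the compressed size of x) is an illustrative assumption, not necessarily the measure defined in the paper.

# Illustrative, non-symmetric, compression-based dissimilarity using zlib.
import zlib

def csize(s: bytes) -> int:
    """Length of the zlib-compressed byte string (a crude proxy for description length)."""
    return len(zlib.compress(s, 9))

def asym_dissimilarity(x: bytes, y: bytes) -> float:
    """Roughly: how little y helps when compressing x. Not symmetric in x and y."""
    return (csize(y + x) - csize(y)) / csize(x)

a = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ" * 10
b = b"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA" * 10
c = b"GATTACA" * 40
print(asym_dissimilarity(a, b), asym_dissimilarity(a, c))   # the related pair scores lower
print(asym_dissimilarity(a, b), asym_dissimilarity(b, a))   # the measure is not symmetric in general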


55.) An Iterative Method for the De-identification of Structured Medical Text

Szarvas, Gy., Farkas, R., Iván, Sz., Kocsor, A., Busa-Fekete, R.
Proceedings of the AMIA I2B2 NLP Workshop , 2006
Abstract: The process of removing personal health information (PHI) from clinical records is called de-identification. There are many methodologies in use for de-identification, and most of them are based on a named entity recognition (NER) technique. We introduce here a novel, iterative NER approach intended for use on semi-structured documents like discharge records, which identifies PHI in several steps. First, our method looks for semantic information, labelling all entities whose tags can be inferred from the structure of the text, and then it utilises this information to find further PHI phrases in the document.
[pdf]


56.) Automatic Extraction of Semantic Content from Medical Discharge Records

Szarvas, Gy., Farkas, R., Iván, Sz., Kocsor, A., Busa-Fekete, R.
Proceedings of the AMIA I2B2 NLP Workshop , 2006
Abstract: Semi-structured medical texts like discharge summaries are rich sources of information that can be exploited, through the statistical analysis of similar cases, to support the research of physicians. In this paper we introduce a system based on Machine Learning (ML) algorithms that successfully classifies discharge records according to the smoking status of the patient (we distinguish between the classes current smoker, past smoker, smoker /where a decision between the former two classes cannot be made/, non-smoker and unknown /where the document contains no data on smoking status/). Such systems are useful for examining the connection between certain social habits and diseases like cancer or asthma. We trained and tested our model on the shared task organized by the I2B2 (Informatics for Integrating Biology and the Bedside) research center, and despite the very small amount of labelled training data available, our system shows promising results in identifying the smoking habits of patients based on their medical discharge summaries. (A toy bag-of-words baseline for this kind of task is sketched after this entry.)
[pdf]
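Purely as an illustration of the task setup, a bag-of-words baseline for smoking-status classification could look like the toy pipeline below; the features, the classifier and the tiny training set are stand-ins and bear no relation to the system evaluated in the paper.

# Toy smoking-status classifier over discharge-summary snippets (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "patient smokes one pack per day",
    "quit smoking ten years ago",
    "denies any tobacco use",
    "tobacco history not documented",
]
train_labels = ["current smoker", "past smoker", "non-smoker", "unknown"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print(model.predict(["patient stopped smoking in 1998", "no history of tobacco use"]))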


57.) New, Linguistics-based, Ontology-enabled Approaches in Biological Information Management

Csendes, D., Alexin, Z., Busa-Fekete, R., Kocsor, A., Kovács, K.
Proceedings of Exploiting the Knowledge Economy: Issues, Applications, Case Studies 1352--1359 , 2006
Abstract: The paper presents an information extraction and visualisation technology used for the processing of biological and biomedical data. We offer a solution for the discovery of knowledge from publication abstracts of the electronic MedLine database, primarily relating to genetics. The system is designed so that the extracted information is rendered in a suitable, easily browsable form. Its output is an interactively browsable, graph-like visualisation of genes and their interrelations, where the distance and the edge width between the individual objects represent the type and strength of their interaction. After an initial graph is determined and projected onto the user interface, the result can further be influenced by a series of user interactions. The system is intended to help experts in biology and pharmacy both in their research and in their everyday work.


2005


58.) Locally Linear Embedding and its Variants for Feature Extraction

Busa-Fekete, R., Kocsor, A.
Proceedings of IEEE International Workshop on Soft Computing Applications 216--222 , 2005
Abstract: Many problems in machine learning are hard to manage without applying some pre-processing or feature extraction method. Two popular forms of dimensionality reduction are principal component analysis (PCA) and multidimensional scaling (MDS). In this paper we examine Locally Linear Embedding (LLE), an unsupervised, non-linear dimension reduction method that was originally proposed for visualisation. We show that LLE is capable of feature extraction if we choose the right parameter values. In addition, we extend the original algorithm for more efficient classification. We then apply the methods to several databases available from the UCI repository and show that there is a significant improvement in classification performance. (An LLE-based feature extraction pipeline is sketched after this entry.)
[link] [pdf]
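The use of LLE as a feature extractor ahead of a classifier, as in the abstract above, can be reproduced in outline with scikit-learn; the dataset, neighbour count and downstream classifier below are arbitrary choices for the sketch, not the authors' setup.

# LLE as a feature-extraction step before a simple classifier (illustrative pipeline).
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=15, n_components=10, random_state=0)
Z_train = lle.fit_transform(X_train)          # learn the embedding on training data
Z_test = lle.transform(X_test)                # map unseen points with the same model

clf = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
print("accuracy on embedded features:", round(clf.score(Z_test, y_test), 3))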


Updated: 04-Jul-2016