The research seminar takes place on Tuesdays between 15:00 and 16:30 in Room 6 (Basement, Árpád tér 2). The talks are about 30 minutes long, with time for questions afterwards, and after 16:30 there is room for free discussion as well. Snacks and refreshments will be served (bring your own mug, if possible).
Everyone is welcome!
Are these descriptions referring to the same entity or just to a similar one?
In recent years, Language Models have ruled the NLP scene; however, little research has gone into how well they represent similar terms. The main focus of my presentation is how to differentiate two concepts with the same meaning from two that are merely similar to each other. I took on the task of graph matching, where, given two graphs, one must find the pairs of nodes that represent the same concept, and built a multi-step system that proposes such pairs with the help of Language Models, resulting in the best-performing system on 2 out of 5 datasets of the OAEI Knowledge Graph matching track.
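The abstract does not spell out how candidate pairs are scored; as a minimal, hypothetical sketch (the function names and the threshold are my own, not the system's), a pipeline of this kind might compare Language-Model embeddings of the two graphs' concepts with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two (non-zero) embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def propose_pairs(left, right, threshold=0.8):
    # left, right: dicts mapping concept id -> embedding vector,
    # one dict per knowledge graph. Propose cross-graph pairs whose
    # embeddings are similar enough to plausibly denote the same entity.
    pairs = []
    for lid, lvec in left.items():
        for rid, rvec in right.items():
            if cosine_similarity(lvec, rvec) >= threshold:
                pairs.append((lid, rid))
    return pairs
```

A real matcher would of course add further filtering steps on top of this raw similarity proposal stage.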
WECALM: A Special Structural-Based Weighted Network Approach for the Analysis of Protein Complexes
The detection and analysis of protein complexes is essential for understanding functional mechanisms and cellular integrity. Recently, several techniques for detecting and analysing protein complexes from Protein-Protein Interaction (PPI) datasets have been developed. Most of those techniques are inefficient: they struggle to detect overlapping complexes, exclude attachment proteins from complex cores, fail to detect the inherent structures of the underlying complexes, suffer from high false-positive rates, and lack an enrichment analysis. To address these limitations, we introduce a special structural-based weighted network approach for the analysis of protein complexes based on Weighted Edge, Core-Attachment and Local Modularity structures (WECALM), implemented in six steps. First, we construct the PPI network with a weighted-edge approach using the Jaccard coefficient similarity. Second, we identify the overlapping proteins by the average node degree and betweenness values of their immediate neighbors. Third, we identify local structural modularity by a modularity score function. Fourth, we identify the core protein complex using the structural similarities of the seed protein and its immediate neighbors. Fifth, we provide an efficient method to detect protein complexes by appending attachment proteins to the detected core protein complexes. Finally, in the sixth step, we calculate the p-value to validate the biological significance of the detected protein complexes by a functional enrichment analysis. Our simulation results indicate that WECALM outperforms existing algorithms in terms of accuracy, computational time, and p-value. A functional enrichment analysis also shows that WECALM is able to identify a large number of biologically significant protein complexes. Overall, WECALM outperforms eight other approaches by striking a better balance of accuracy and efficiency in the detection of protein complexes.
Keywords: Protein complexes; Core-attachment; Local modularity structure; Weighted PPI network
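The first step, edge weighting by Jaccard coefficient similarity, can be sketched as follows. This is an illustrative reading only: including each endpoint in its own neighbourhood is one common convention in PPI edge weighting, not necessarily WECALM's exact choice.

```python
def jaccard_weight(graph, u, v):
    # graph: dict mapping protein -> set of interacting proteins.
    # Weight the edge (u, v) by the Jaccard similarity of the two
    # neighbourhoods, with each endpoint counted in its own set.
    nu = graph[u] | {u}
    nv = graph[v] | {v}
    return len(nu & nv) / len(nu | nv)

def weighted_edges(graph):
    # Return each undirected edge once, annotated with its Jaccard weight.
    seen = set()
    edges = []
    for u, nbrs in graph.items():
        for v in nbrs:
            if (v, u) not in seen:
                seen.add((u, v))
                edges.append((u, v, jaccard_weight(graph, u, v)))
    return edges
```

Proteins whose interaction partners overlap strongly thus get heavier edges, which the later core-detection steps can exploit.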
A novel approach to FAIR guidelines for research tools
Program slicing is one of the most important applications of source code analysis. Over the years, different variations of it have been implemented in many tools, depending on the goal to be achieved. What these programs have in common, however, is that they were not intended for publication, and their subsequent use may be problematic. In this paper, we present a set of principles that builds on the FAIR guidelines and describes the basics and common criteria for publishing research software in a more accurate and informative way.
Down the Rabbit Hole: Segmentation Metric Misinterpretations in Bioimage Analysis
In today's scientific environment, with increasing attention on AI solutions for imaging problems, a plethora of new image segmentation and object detection methods emerge. Thus, quantitative evaluation is crucial for an objective assessment of algorithms. Often, object detection and segmentation tasks utilize evaluation metrics with the same name but a different meaning, due to the differences between object-level and pixel-level classification, or simply because multiple interpretations coexist. One could argue that in most cases the meaning should be clear from the context; however, specific and often under-specified characteristics of the circumstances (e.g. small variations of the task) can make it hard for readers to understand the exact meaning of the metrics. My presentation focuses on the various interpretations that have emerged in the research communities around some segmentation scores. We could identify 5 different definitions of the “average precision (AP)” metric and 6 different interpretations of “mean average precision (mAP)” in the literature. To make things even more complicated, even when methods work with the same dataset, the metrics used to evaluate performance are not necessarily the same. The aims of my presentation are to shed light on some of the main issues with the current state of segmentation and object detection metrics, and to investigate the reasons for the ambiguous use of classification concepts. I am also going to point out the problems of using similar metrics with nuanced differences by evaluating the 2018 Data Science Bowl (DSB) and 2021 Sartorius neuron segmentation challenge submissions with metrics of similar meaning but slightly differing interpretations.
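To make the ambiguity concrete, here is a minimal sketch (names and the matching rule are my own, illustrating just one of the coexisting interpretations) of object-level precision at a fixed IoU threshold; a pixel-level reading of "precision" over the same masks would generally give a different number.

```python
def iou(mask_a, mask_b):
    # Intersection-over-Union of two binary masks given as sets of pixels.
    inter = len(mask_a & mask_b)
    union = len(mask_a | mask_b)
    return inter / union if union else 0.0

def object_level_precision(predictions, ground_truth, threshold=0.5):
    # One common reading: a predicted object counts as a true positive
    # if it overlaps a so-far-unmatched ground-truth object with
    # IoU >= threshold; precision = TP / number of predictions.
    matched = set()
    tp = 0
    for pred in predictions:
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(pred, gt) >= threshold:
                matched.add(i)
                tp += 1
                break
    return tp / len(predictions) if predictions else 0.0
```

Changing the threshold, the matching order, or averaging over thresholds yields the families of AP/mAP variants discussed in the talk.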
A better approach for reproducible research in unexplored research fields
In recent years, the hype around artificial intelligence has grown tremendously, and many research fields have shown a will to adopt data-driven, and more specifically deep learning, methods. On the one hand, these new approaches can give an important advantage in those fields; on the other hand, machine learning methods are non-trivial to use and deploy effectively, so it is crucial to pay attention to some potential issues in cross-domain publications, especially the reproducibility of the experiments and the reusability of the published code. In this presentation, an analysis of the research branch focused on the application of deep learning models in the biomedical research field will be shown. Not all fields are equally developed when it comes to deep learning; a lot of attention is given to image processing and speech-related tasks in general, but in other, more specific fields we are left with much less literature: we can only find a few, small, publicly available datasets, and the applied modeling techniques are usually considered obsolete.
Improving vulnerability prediction with Transfer Learning
One of the biggest obstacles in the way of successfully detecting vulnerabilities using deep neural networks is the relatively small amount of training data. Training these networks requires a lot of data to get the best results, so an insufficient amount of data leads to poor performance. This is a well-known problem in the field of deep learning, and there are multiple ways of overcoming it; one of them is transfer learning. In this research, we are building a large pretraining dataset of SonarQube warnings to use for transfer learning in order to improve our vulnerability prediction performance.
Who needs tests? Not me
Call graphs are fundamental for many higher-level code analyses. The selection of the most appropriate call graph construction tool for an analysis is not always straightforward and depends on how the results will be used. The choice of call graph construction tool has a great effect on the execution time, memory usage, and result quality of the subsequent tasks. This research compares the resulting static and dynamic Java call graphs to assist in the selection of the most appropriate tools. Static call graphs, as their name suggests, are constructed by static analysis, based on the source code or the bytecode, without executing tests or any code parts. This means that the project can be analyzed in its early stages and with fewer resources, but there is concern that this will result in less accurate, noisier graphs, since the dynamic behavior of the programs is estimated by static algorithms. Inaccuracies can greatly affect analyses based on call graphs. On the other hand, dynamic call graphs are created during the actual execution of the program. The calls that are included as edges in the graph are exactly those that were executed during the run, so one can expect the result to be more accurate. However, dynamic analysis requires more resources and the execution of code via test cases that provide high test coverage. In this work, we investigated the relationship between dynamic and static call graphs. Is the graph generated by dynamic analysis really better? Can static graphs approximate or even complement dynamic call graphs with sound results? To find the answers to these questions, we compared the results of five static analyzers and one dynamic analyzer. They were evaluated on three projects of different sizes and test coverage. We also included in the comparison a merged graph created by combining the different static analyzer outputs.
Not only did we compare static graphs to the dynamic results, we also validated the calls in a dynamic graph and found that these graphs can mislead the user. The results show that dynamic graphs should be considered good, although not a gold standard, since they contain phantom calls: calls that are not present in the source code. Such calls are not limited to synthetic calls. Static analyzers cannot be applied without consideration either, but a combination of static call graphs does tend to contain calls similar to those in the dynamic graphs, with no phantom calls.
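The idea of a static call graph can be illustrated with a toy extractor. This is a Python sketch of my own (the tools in the talk target Java), and its crudeness also shows why static graphs are incomplete: attribute calls and dynamic dispatch are simply invisible to it.

```python
import ast

def static_call_graph(source):
    # Build a crude static call graph from Python source: map each
    # top-level function name to the set of plain-name calls in its body.
    # Attribute calls (obj.method()) and dynamic dispatch are ignored,
    # which is exactly why static graphs can miss real edges.
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls.add(sub.func.id)
            graph[node.name] = calls
    return graph
```

A dynamic graph, by contrast, would record exactly the edges exercised by the executed tests, including reflective calls this extractor can never see.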
Automated Program Repair with the GPT Family, including GPT-2, GPT-3 and Codex
OpenGL API Call Trace Reduction with the Minimizing Delta Debugging Algorithm
Debugging an application that uses a graphics API and faces a rendering error is a hard task, even if we manage to record a trace of the API calls that lead to the error. Checking every call is not a feasible or scalable option, since there are potentially millions of calls in a recording. In this paper, we focus on the question of whether the number of API calls that need to be examined can be reduced by automatic techniques, and we describe how this can be achieved for the OpenGL API using the minimizing Delta Debugging algorithm. We present the results of an experiment on a real-life rendering issue, using a prototype implementation, showing a drastic reduction of the trace size (i.e., to less than 1‰ of the original number of calls) and positive impacts on the resource usage of the replay of the trace.
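A simplified sketch of the minimizing Delta Debugging idea follows. This complement-only variant omits the subset tests of the full ddmin algorithm, and `fails` stands in for replaying the reduced trace and checking whether the rendering error still reproduces:

```python
def ddmin(calls, fails):
    # Simplified minimizing Delta Debugging: split the failing call
    # trace into n chunks, try dropping one chunk at a time, and keep
    # any reduced trace on which the failure still reproduces.
    n = 2
    while len(calls) >= 2:
        chunk = max(1, len(calls) // n)
        subsets = [calls[i:i + chunk] for i in range(0, len(calls), chunk)]
        reduced = False
        for i in range(len(subsets)):
            complement = [c for j, s in enumerate(subsets) if j != i for c in s]
            if fails(complement):            # failure persists without chunk i
                calls = complement
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(calls):              # finest granularity reached
                break
            n = min(n * 2, len(calls))       # refine the partition
    return calls
```

On a trace of millions of OpenGL calls, each `fails` check is a trace replay, so the number of replays, not the code above, dominates the reduction cost.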
Determining the axes of rotation of the lower jaw
Recording and reproducing mandibular movements have been of key importance in the practice of dentistry for over a century. Recently, it has become possible to use digital technologies for these tasks. In our study, we have investigated the possibilities of 3D intraoral scanners and optimization-based algorithms for this problem. This is a novel approach compared to most of the published studies, where CBCT and live MRI were used, mostly with the purely geometric, rule-based Reuleaux method. This work is a cooperation with the Department of Oral and Maxillofacial Surgery of the Albert Szent-Györgyi Medical School, SZTE.
Accelerating Transformer Inference Time through Layer Stitching and Reduction
Compact models are important in areas where low latency is necessary or training/evaluation is highly constrained by computational resources. There are already existing solutions that tackle the problem in different ways, such as quantization, pruning, or knowledge distillation. We assume that each layer contributes to the overall objective at different rates, and we try to identify and remove the less influential layers. We propose two approaches to improve the inference time in a task-specific aspect. Following previous studies, we propose a self-stitching mechanism which adds new weights to the model with the constraint that only these weights can be trained. The trained parameters are responsible for stitching the two ends of the same model while skipping several layers in-between. The other approach aims to discard highly specialized layers before fine-tuning, which can help the convergence of the model while reducing the overall inference time. We have coined the term 'DiscardBERT' to refer to this approach.
Optimization of Combinatorial Systems and Processes, Development of Algorithms
The process of network synthesis is a commonly used method for decreasing material and energy consumption and mitigating negative environmental impacts, thereby increasing profitability. However, finding the optimal synthesis with minimal cost presents a challenge, as the problem is NP-hard. To address this challenge, a branch-and-bound algorithm is employed, taking advantage of the structural characteristics of the possible syntheses to reduce the solution space significantly. While this acceleration technique is beneficial, other aspects of the algorithm have not been extensively examined. In light of this, we propose a new lower bound that gives a tighter estimate than previous relaxed versions by taking the optimal structural aspects into account. This lower bound is particularly crucial when using non-linear or stochastic units in the synthesis, as these require more time to evaluate. Additionally, we extend our acceleration technique to address specific synthesis problems, such as the construction and operation of electric transmission networks, which involve undirected connections. We expand the unit model and adapt the MSG algorithm and neutral extension finding to accelerate the problem.
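As a hypothetical, much-simplified illustration of branch-and-bound pruning (the actual P-graph/MSG machinery is far richer; the bound used below is just the accumulated cost, whereas the talk argues for tighter structural bounds): choose a minimum-cost set of units whose outputs cover a required product set.

```python
def cheapest_cover(units, required):
    # Tiny branch-and-bound over include/exclude decisions per unit.
    # units: list of (cost, set_of_products); required: set of products.
    # Returns (best_cost, chosen_unit_indices); (inf, None) if infeasible.
    best = [float("inf"), None]

    def search(i, cost, covered, chosen):
        if cost >= best[0]:            # prune: cannot beat the incumbent
            return
        if required <= covered:        # feasible: record a new incumbent
            best[0], best[1] = cost, list(chosen)
            return
        if i == len(units):
            return
        unit_cost, outputs = units[i]
        chosen.append(i)               # branch 1: include unit i
        search(i + 1, cost + unit_cost, covered | outputs, chosen)
        chosen.pop()
        search(i + 1, cost, covered, chosen)   # branch 2: exclude unit i

    search(0, 0, set(), [])
    return best[0], best[1]
```

A tighter lower bound would replace the `cost >= best[0]` test with `cost + bound(remaining) >= best[0]`, pruning far more of the tree, which is the point of the proposed estimate.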
Fourier Domain CT Reconstruction with Complex Valued Neural Networks
In computed tomography, several well-known techniques exist that can reconstruct a cross section of an object from a finite set of its projections, the sinogram. This task – the numerical inversion of the Radon transform – is well understood, with state-of-the-art algorithms mostly relying on back-projection. Even though back-projection carries a significant computational burden compared to the family of direct Fourier reconstruction methods, the latter class of algorithms is less popular due to the complications related to frequency-space resampling. Moreover, interpolation errors during resampling in the frequency domain can lead to artifacts in the reconstructed image. Here, we present a novel neural-network-assisted reconstruction method that reconstructs the object in frequency space while also taking the well-understood Fourier slice theorem into account. In our case, the details of the approximated resampling are learned by the network for peak performance. We show that with this method it is possible to achieve comparable, and in some cases better, reconstruction quality than with another state-of-the-art algorithm also working in the frequency domain.
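For reference, the Fourier slice theorem the method builds on states that the 1-D Fourier transform of a parallel projection of the object equals a radial slice of the object's 2-D Fourier transform:

```latex
% Parallel projection of f at angle \theta (the Radon transform):
\[
  p_\theta(t) = \int_{-\infty}^{\infty}
      f(t\cos\theta - s\sin\theta,\; t\sin\theta + s\cos\theta)\,\mathrm{d}s
\]
% Its 1-D Fourier transform equals the slice of \hat{f} along angle \theta:
\[
  \hat{p}_\theta(\omega)
    = \int_{-\infty}^{\infty} p_\theta(t)\, e^{-i\omega t}\,\mathrm{d}t
    = \hat{f}(\omega\cos\theta,\; \omega\sin\theta)
\]
```

The sinogram thus populates frequency space along radial lines, and the resampling onto a Cartesian grid is exactly the step the network is asked to approximate.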
The Fritz-John condition system in the Interval Branch and Bound method
The Interval Branch and Bound (IBB) method is a good choice when a rigorous solution is required. This method handles computational errors in calculations. Few IBB implementations use the Fritz-John (FJ) optimality conditions to eliminate non-optimal boxes in a constrained nonlinear programming problem. The FJ optimality condition effectively means a solution to an interval-valued system of equations. In the best case, the solution is an empty set if the interval box does not contain an optimum. In many cases, solving this system of equations fails. This can be caused by the fact that the interval box contains many solutions, the defined system of equations contains unnecessary conditions, or the interval Gauss-Seidel method fails. These unsuccessful attempts have a negative outcome and only increase the computation time. In this talk, we propose four modifications to reduce the runtime and computational complexity of the Interval Branch and Bound method. In addition, we focus on a preliminary test ensuring that the Fritz-John system of equations is solved only if we are sure that a solution exists in the interval box. We describe a method for constructing a conic hull from the enclosure of the gradients of the active constraints. The conic hull is used to determine whether each interval box contains an optimal solution. If the test is satisfied, we solve the Fritz-John system of equations and reduce the interval box. Otherwise, we can discard the interval box because it does not contain an optimal solution. We present the effectiveness of the modifications and the preliminary test with experimental results.
Impact of branching strategies on productivity in Mono/Multi Repository Structures
Productivity is a key aspect of the project development process and has been analyzed from different angles. However, none of those analyses has focused on branching strategies and repository structures. This paper analyzes more than 3 million GitHub repositories and creates a solid database of over 50,000 selected mono/multi-repository projects. Based on this database, three main branching strategies have been identified among the most productive projects. The results of this paper are grouped according to the team size, development environment, and repository structure of the projects.
CURVED line Detector/Descriptor
A detector-descriptor pipeline has been proposed for detecting and matching curved lines in equirectangular images. The detector in the pipeline is a neural network that produces a heatmap indicating the presence of curved lines in the equirectangular images. The line extractor then extracts the line segments from the heatmap by thresholding and clustering the heatmap values. Once the line segments are extracted, the descriptor extracts the line features from a patch of size 48×32. The matching of line pairs between two images is performed using a distance-based approach, where the distance between the descriptors of the lines is calculated. The pose estimation is obtained by using a RANSAC-based algorithm, specifically Cayley RANSAC, to estimate the camera pose. The proposed pipeline has been evaluated on the KITTI-360 dataset of equirectangular images.
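The matching stage can be sketched as follows. This is a generic mutual-nearest-neighbour scheme under Euclidean distance; the actual 48×32-patch descriptor and any thresholding used in the pipeline are not reproduced here.

```python
def l2(a, b):
    # Euclidean distance between two descriptor vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def match_lines(desc_a, desc_b):
    # Distance-based matching: pair each line descriptor in image A with
    # its nearest neighbour in image B, keeping only mutual best matches
    # to suppress one-sided, likely spurious pairings.
    def nearest(d, pool):
        return min(range(len(pool)), key=lambda i: l2(d, pool[i]))
    matches = []
    for i, da in enumerate(desc_a):
        j = nearest(da, desc_b)
        if nearest(desc_b[j], desc_a) == i:   # mutual consistency check
            matches.append((i, j))
    return matches
```

The surviving line correspondences are what a RANSAC-based solver would then consume for pose estimation.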
PCA improves the adversarial robustness of neural networks
Deep neural networks perform well in many visual recognition tasks, but they are sensitive to adversarial input perturbations. More robust models can be learned when attacks are applied to the training data or when preprocessing is used. However, the effect of preprocessing is frequently underestimated, and it has not received sufficient attention, as it usually does not affect the network's clean accuracy. Here, we seek to demonstrate that preprocessing can play a role in improving adversarial robustness. Our empirical results show that principal component analysis, a simple yet effective preprocessing method, can significantly improve neural networks' robustness under both regular and adversarial training.
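A minimal, NumPy-free sketch of the kind of PCA preprocessing step evaluated here (power iteration recovers only the first principal component of non-degenerate data; a real pipeline would keep several components before feeding the network):

```python
def pca_top_component(data, iters=100):
    # First principal component of mean-centred data via power iteration
    # on the covariance matrix (pure Python, assumes non-zero variance).
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centred = [[row[k] - mean[k] for k in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def project(data, v):
    # Project each sample onto the principal direction: the
    # dimensionality-reducing preprocessing applied before the network.
    return [sum(x * vk for x, vk in zip(row, v)) for row in data]
```

Discarding the low-variance directions is precisely what can strip away small adversarial perturbations while leaving clean accuracy largely intact.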
Fluctuation Enhanced Gas Sensing
Fluctuation Enhanced Sensing has been an active field of research for some years among those who study the applicability of noise. Our goal is to combine the principle with machine learning to build an application that is capable of odor detection. In our work, we examined the feasibility of a microcontroller-based, long-lifetime application, mainly from the perspective of power consumption, and reviewed the already available technologies. Our current research aims to detect different odors using multiple commercially available gas sensors by examining their output resistance signals in both the time domain and the frequency domain.
General characterisation of human activity and comparison of its determination methods
We participated in an extensive research project in cooperation with psychiatrists and biophysicists. The collaboration involved the measurement of raw acceleration signals on the non-dominant wrist of 42 healthy, free-living subjects over 10 days to measure their locomotor activity. The data acquisition had several interdisciplinary objectives. Firstly, we were interested in determining how activity signals could be quantified from the acceleration signals. Such actigraphic measurements are an important part of research in different disciplines, yet the procedure for determining activity values is unexpectedly not standardized in the literature. The acceleration data can be preprocessed in diverse ways, and then the activity values can be calculated using various activity metrics. Therefore, several types of activity signals can be determined from the same recording. To resolve these methodological inconsistencies, we executed a detailed and comprehensive comparison of the activity calculation procedures by assessing the relationship between the different types of activity signals derived from the previously mentioned dataset. The correlation pattern revealed that most activity metrics produce closely related activity signals from identically preprocessed acceleration recordings, but in practice, the data preparation varies between manufacturers and methods. In the world of human dynamics analysis, the scale-free nature of temporal and spatial patterns is a recurring motif. This universality has already been identified by our research group in human location displacement data in the form of 1/f-type noise, which is a special form of power-law scaling in the frequency domain. The scale-free properties of human activity have also been identified through statistical analysis (e.g., the distribution of passive periods); however, the description of human activity's spectral characteristics was incomplete.
To explore the general spectral nature of human activity in greater detail, we analyzed their fluctuations. We revealed that the spectra of the different types of activity signals generally follow a universal characteristic, including 1/f noise at frequencies above the circadian rhythmicity. Moreover, we discovered that the spectrum of the raw acceleration signal has this same characteristic, and therefore the scale-free nature is generally inherent to the motor activity of healthy, free-living humans.
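The 1/f-type (scale-free) spectral characteristic referred to above is a power law in the power spectral density, which appears as a straight line of slope $-\beta$ on a log-log plot:

```latex
% Scale-free (1/f-type) spectrum: power-law power spectral density,
% with exponent \beta \approx 1 for 1/f noise.
\[
  S(f) \propto \frac{1}{f^{\beta}}
  \quad\Longleftrightarrow\quad
  \log S(f) = -\beta \log f + \mathrm{const.}
\]
```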