In recent years, the QUIC protocol has become widely popular as an alternative to TCP, and its multipath extension, the MPQUIC protocol, is now being actively implemented and researched. The central component of MPQUIC is the scheduler, which decides over which path and at what time to send the next data packets. Existing scheduler implementations are based both on heuristic rules and on reinforcement learning. The behavior of schedulers in various environments has been studied in detail with respect to path characteristics; however, their effectiveness depending on the congestion control algorithms in use has not been sufficiently covered. This paper presents the implementation of several schedulers and a study of their effectiveness under different congestion control algorithms. The results obtained suggest that a scheduler may work effectively in a network environment with one congestion control algorithm yet be ineffective in an environment with a different one.
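To illustrate how a heuristic scheduler interacts with congestion control, the following is a minimal sketch of a minRTT-style path selection rule; the `Path` fields and the congestion-window check are simplifying assumptions and do not reproduce any scheduler implemented in the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Path:
    """Simplified view of one MPQUIC path as seen by a scheduler."""
    name: str
    srtt_ms: float        # smoothed round-trip time estimate
    cwnd: int             # congestion window (bytes), set by the CC algorithm
    bytes_in_flight: int  # unacknowledged bytes currently on the path

def min_rtt_scheduler(paths: list[Path]) -> Optional[Path]:
    """Pick the lowest-RTT path whose congestion window still has room.

    Congestion control influences the decision only through cwnd and
    bytes_in_flight, which is why the same scheduler can behave very
    differently under different CC algorithms.
    """
    available = [p for p in paths if p.bytes_in_flight < p.cwnd]
    if not available:
        return None  # every path is currently blocked by its congestion window
    return min(available, key=lambda p: p.srtt_ms)

if __name__ == "__main__":
    paths = [
        Path("wifi", srtt_ms=18.0, cwnd=60_000, bytes_in_flight=60_000),
        Path("lte",  srtt_ms=45.0, cwnd=80_000, bytes_in_flight=10_000),
    ]
    chosen = min_rtt_scheduler(paths)
    print("send next packet on:", chosen.name if chosen else "no path available")
```

Here the lower-latency path is skipped because its congestion window is full, so the choice of congestion control algorithm directly shapes the scheduler's decisions.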
In the Astra Linux operating system (OS), in addition to traditional Discretionary Access Control, the PARSEC security subsystem implements Mandatory Access Control policies such as Mandatory Integrity Control (MIC) and Multilevel Security (MLS). Given the variety of entities (objects, files, directories, sockets, etc.) and subjects (processes) available in this OS, these policies often have complicated logic, which makes them difficult to verify by manual testing. This problem is especially acute given the need to follow the Security Development Lifecycle in compliance with the requirements of the highest protection classes and trust levels established by the FSTEC of Russia regulatory documents. Moreover, the MIC and MLS policies of this OS are based on the mandatory entity-role model of access and information flows security control in OS of the Linux family (MROSL DP-model), described both in classical mathematical notation and in a formalized notation using the formal Event-B method. Therefore, the authors of this paper have developed and refined the approach recommended by GOST R 59453.4-2025, taking into account the specifics of current releases of Astra Linux OS; it consists of tracing system calls and translating them into the language of the formal model in order to verify that the functioning access control policies comply with the model. The results of this work are described in this paper, which, firstly, outlines the development and verification of the so-called lower-level representation of the MROSL DP-model (PARSEC-model) used for testing, written in Event-B and in essence representing a functional specification of the access-control-related system calls of the OS. Secondly, it describes a testing system that includes a Linux kernel module for tracing system calls, software for translating them into model traces, an animator of model traces based on the ProB toolkit, and software for generating test results in the format of the Allure toolkit. Thirdly, the paper considers an approach to using eBPF technology for parallelizing testing.
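To make the translation step more concrete, here is a minimal illustration of mapping a traced access-related system call to a model-level operation call that a trace animator could replay; the record format, operation names, and argument syntax are hypothetical and do not reproduce the PARSEC-model specification.

```python
from dataclasses import dataclass

@dataclass
class SyscallRecord:
    """One traced access-related system call (hypothetical simplified form)."""
    pid: int
    syscall: str      # e.g. "openat"
    path: str
    flags: str        # e.g. "O_RDONLY" or "O_WRONLY"

# Hypothetical mapping from syscall + flags to an abstract model operation;
# the real PARSEC-model operations and their arguments differ.
ACCESS_KIND = {
    ("openat", "O_RDONLY"): "read_access",
    ("openat", "O_WRONLY"): "write_access",
}

def to_model_event(rec: SyscallRecord) -> str:
    """Render a traced syscall as a model-level operation call.

    The resulting string is meant to be replayed against the formal model
    by a trace animator (e.g. one built on the ProB toolkit).
    """
    op = ACCESS_KIND.get((rec.syscall, rec.flags))
    if op is None:
        raise ValueError(f"no model operation for {rec.syscall}/{rec.flags}")
    return f'{op}(subject := p{rec.pid}, entity := "{rec.path}")'

if __name__ == "__main__":
    rec = SyscallRecord(pid=1234, syscall="openat", path="/etc/hosts", flags="O_RDONLY")
    print(to_model_event(rec))   # read_access(subject := p1234, entity := "/etc/hosts")
```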
The paper presents the implementation of static analysis for the Visual Basic .NET language within the industrial tool SharpChecker. Using the Roslyn compiler framework, support for the Visual Basic .NET language was integrated into SharpChecker, enabling static analysis of Visual Basic .NET source code. As part of this work, a representative set of synthetic tests was created, comprising over 2000 test cases. Testing was conducted both on this synthetic dataset and on a collection of real-world open-source projects totaling more than 1.6 million lines of code. A total of 7926 new warnings were detected in Visual Basic .NET source code, of which 1093 were manually reviewed and labeled. The final analysis accuracy reached 84.72%. Additionally, warnings related to code written in both C# and Visual Basic .NET were discovered, demonstrating the feasibility of cross-language analysis in projects that include both .NET platform languages. It was also found that adding Visual Basic .NET language support to SharpChecker had no impact on the performance or the quality of analysis for the C# language.
Program analysis and automated testing have recently become an essential part of the SSDLC. Directed greybox fuzzing is one of the most popular automated testing methods; it focuses on error detection in predefined code regions. However, it still lacks the ability to overcome difficult program constraints. This problem can be effectively addressed by symbolic execution, but at the cost of lower performance. Thus, combining directed fuzzing and symbolic execution can lead to more efficient error detection.
In this paper, we propose a hybrid approach to directed fuzzing with a novel seed scheduling algorithm based on target-related interestingness and coverage. The approach also performs minimization and sorting of objective seeds according to target-related information. We implement our approach in the Sydr-Fuzz tool, using LibAFL-DiFuzz as the directed fuzzer and Sydr as the dynamic symbolic executor. We evaluate our approach with the Time to Exposure metric and compare it with pure LibAFL-DiFuzz, AFLGo, and other directed fuzzers. According to the results, the Sydr-Fuzz hybrid approach to directed fuzzing shows high performance and helps to improve directed fuzzing efficiency.
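The role of target-related seed scheduling can be illustrated with a minimal sketch; the scoring formula and seed statistics below are illustrative assumptions rather than the algorithm implemented in Sydr-Fuzz.

```python
from dataclasses import dataclass

@dataclass
class Seed:
    """A fuzzing input with statistics relevant to directed scheduling."""
    data: bytes
    target_distance: float  # estimated distance to the target code region (lower is better)
    new_edges: int          # coverage edges this seed discovered
    hits_target: bool       # whether execution reached the target region

def interestingness(seed: Seed) -> float:
    """Illustrative score combining target proximity and coverage novelty."""
    proximity = 1.0 / (1.0 + seed.target_distance)
    bonus = 2.0 if seed.hits_target else 0.0
    return bonus + proximity + 0.1 * seed.new_edges

def schedule(queue: list[Seed]) -> list[Seed]:
    """Sort the queue so the most target-relevant seeds are fuzzed first."""
    return sorted(queue, key=interestingness, reverse=True)

if __name__ == "__main__":
    queue = [
        Seed(b"AAA", target_distance=12.0, new_edges=3, hits_target=False),
        Seed(b"BBB", target_distance=2.5,  new_edges=0, hits_target=False),
        Seed(b"CCC", target_distance=0.0,  new_edges=1, hits_target=True),
    ]
    for s in schedule(queue):
        print(s.data, round(interestingness(s), 3))
```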
Training high-quality classifiers in domains with limited labeled data remains a fundamental challenge in machine learning. While large language models (LLMs) have demonstrated strong zero-shot capabilities, their use as direct predictors suffers from high inference cost, prompt sensitivity, and limited interpretability. Weak supervision, in contrast, provides a scalable alternative through the aggregation of noisy labeling functions (LFs), but authoring and refining these rules traditionally requires significant manual effort. We introduce LLM-Guided Iterative Weak Labeling (LGIWL), a novel framework that integrates prompting with weak supervision in an iterative feedback loop. Rather than using an LLM for classification, we use it to synthesize and refine labeling functions based on downstream classifier errors. The generated rules are filtered using a small development set and applied to unlabeled data via a generative label model, enabling high-quality training of discriminative classifiers with minimal human annotation. We evaluate LGIWL on a real-world text classification task involving Russian-language customer service dialogues. Our method significantly outperforms keyword-based Snorkel heuristics, zero-shot prompting with GPT-4, and even a supervised CatBoost classifier trained on a full labeled dev set. In particular, LGIWL achieves strong recall while yielding a notable improvement in precision, resulting in a final F1 score of 0.863 with a RuModernBERT classifier, demonstrating both robustness and practical scalability.
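To make the feedback loop concrete, the following is a self-contained toy sketch of one LGIWL-style iteration; the keyword-rule "LLM", the majority-vote label model, and the accuracy threshold are stand-ins for the LLM prompting, generative label model, and classifier training described in the paper.

```python
ABSTAIN = None

def fake_llm_generate_lfs(error_examples):
    """Stand-in for LLM prompting: propose keyword labeling functions.
    The real system conditions the prompt on downstream classifier errors."""
    rules = [("money back", "refund"), ("courier", "delivery"), ("promo code", "discount")]
    return [lambda t, kw=kw, lab=lab: lab if kw in t.lower() else ABSTAIN
            for kw, lab in rules]

def lf_accuracy(lf, dev):
    """Accuracy of a labeling function on the small development set (ignoring abstains)."""
    votes = [(lf(t), y) for t, y in dev if lf(t) is not ABSTAIN]
    return sum(p == y for p, y in votes) / len(votes) if votes else 0.0

def aggregate(lfs, text):
    """Majority vote over non-abstaining LFs (stand-in for a generative label model)."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

def lgiwl_round(lfs, dev, unlabeled, min_acc=0.7):
    """One iteration: collect dev errors, synthesize and filter new LFs,
    then weakly label the unlabeled pool for classifier training."""
    errors = [(t, y) for t, y in dev if aggregate(lfs, t) != y]
    candidates = fake_llm_generate_lfs(errors)
    lfs = lfs + [lf for lf in candidates if lf_accuracy(lf, dev) >= min_acc]
    weak_labels = {t: aggregate(lfs, t) for t in unlabeled}
    return lfs, weak_labels, errors

if __name__ == "__main__":
    dev = [("I want my money back", "refund"), ("The courier never arrived", "delivery")]
    unlabeled = ["Money back please", "Where is the courier?", "My promo code failed"]
    lfs, weak_labels, errors = lgiwl_round([], dev, unlabeled)
    print(weak_labels)
```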
The evolution of query compilation in database management systems traces back to System R, which pioneered a code generation scheme in which small machine code fragments were stitched together to form a specialized routine for processing a given SQL statement. Subsequent approaches shifted to generating C code, compiling it with system compilers like GCC into dynamic libraries, and loading them at runtime. The current state-of-the-art standard for dynamic query compilation is the LLVM framework, which bypasses frontend compiler overhead by directly generating intermediate representation, enabling machine-independent optimizations and efficient machine code generation. However, LLVM, designed primarily as an optimizing compiler, is resource-intensive, which can lead to compilation times that are orders of magnitude longer than query execution times; this is particularly problematic for queries with millisecond-level interpretation costs. This paper evaluates two lightweight code generation frameworks for the x86-64 architecture as alternatives to LLVM in PostgreSQL, assessing their code generation speed and the quality of the emitted machine code. We present a qualitative comparison with LLVM, analyzing trade-offs between compilation latency and runtime performance across databases of varying sizes. Experimental results demonstrate that lightweight code generation can not only outperform LLVM on small-scale datasets but also maintain competitive performance on larger ones.
Handling missing values in tabular data remains a critical challenge for building robust machine learning models. This paper presents a novel approach to imputation based on unary classification. The proposed method employs an ensemble of perceptrons trained independently for each class to estimate the likelihood of reconstructed values with respect to the empirical support of that class. A uniform distribution over a bounded region of the feature space is used as a background model, enabling the interpretation of the model’s output as an approximation of the posterior probability that an object belongs to a given class. This probabilistic interpretation is then leveraged within an iterative procedure for missing value imputation and classifier training. The theoretical validity of the proposed estimator is rigorously justified. Experiments on synthetic two-dimensional datasets with missing values generated under the MCAR (Missing Completely At Random) mechanism demonstrate the superiority of the proposed method over classical imputation techniques, particularly in scenarios with high missingness rates and complex class boundaries.
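As an illustration of the class-versus-uniform-background idea, here is a small sketch in which a logistic unit over quadratic features stands in for the paper's perceptron ensemble; the bounded region, feature map, and grid search over candidate values are simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def train_class_vs_background(class_points, bounds, n_background=2000):
    """Fit a simple probabilistic model (a logistic unit over quadratic features,
    standing in for the perceptron ensemble) that separates samples of one class
    from uniform background samples drawn over a bounded region. Its output can be
    read as an approximate posterior probability that a point belongs to the class,
    with the uniform distribution playing the role of the background model."""
    lo, hi = bounds
    background = rng.uniform(lo, hi, size=(n_background, class_points.shape[1]))
    X = np.vstack([class_points, background])
    y = np.concatenate([np.ones(len(class_points)), np.zeros(n_background)])
    feats = PolynomialFeatures(degree=2, include_bias=False)
    clf = LogisticRegression(max_iter=1000).fit(feats.fit_transform(X), y)
    return lambda x: clf.predict_proba(feats.transform(np.atleast_2d(x)))[0, 1]

def impute_one(x, j, score, bounds, n_candidates=200):
    """Fill missing coordinate j with the candidate value the class model finds most plausible."""
    def filled(v):
        out = x.copy()
        out[j] = v
        return out
    best = max(np.linspace(bounds[0], bounds[1], n_candidates), key=lambda v: score(filled(v)))
    return filled(best)

if __name__ == "__main__":
    # Synthetic 2-D class concentrated near (2, 2) inside the box [0, 5]^2.
    pts = rng.normal(loc=2.0, scale=0.3, size=(300, 2))
    score = train_class_vs_background(pts, bounds=(0.0, 5.0))
    x = np.array([2.1, np.nan])                                 # second feature missing (MCAR)
    print(impute_one(x, j=1, score=score, bounds=(0.0, 5.0)))   # imputed value should land near 2
```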
Tables are widely used to represent and store data, but they are typically not accompanied by the explicit semantics necessary for machine interpretation of their contents. Semantic table interpretation is critical for integrating structured data with knowledge graphs, but existing methods struggle with Russian-language tables due to limited labeled data and linguistic specificity. This paper proposes a contrastive learning-based approach to reduce dependency on manual labeling and improve column annotation quality for rare semantic types. The proposed approach adapts contrastive learning to tabular data using augmentations (removing/shuffling cells) and a distilled multilingual DistilBERT model trained on the unlabeled RWT corpus (7.4M columns). The learned table representations are integrated into the RuTaBERT pipeline, which reduces computational costs. Experiments show a micro-F1 of 0.974 and a macro-F1 of 0.924, outperforming some baselines. This highlights the approach's efficiency in handling data sparsity and the specifics of the Russian language. The results confirm that contrastive learning captures semantic column similarities without explicit supervision, which is crucial for rare data types.
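The cell-level augmentations can be illustrated with a short sketch that builds two views of the same column for contrastive training; the drop probability and the [SEP]-joined serialization are illustrative assumptions, not the exact RuTaBERT preprocessing.

```python
import random

def augment_column(cells, drop_prob=0.15, shuffle=True, seed=None):
    """Produce one augmented 'view' of a table column by randomly dropping
    and shuffling cells; two views of the same column form a positive pair
    for contrastive training, while views of other columns serve as negatives."""
    rng = random.Random(seed)
    view = [c for c in cells if rng.random() > drop_prob]
    if not view:                      # never return an empty column
        view = [rng.choice(cells)]
    if shuffle:
        rng.shuffle(view)
    return view

def serialize(view, max_cells=32):
    """Flatten a column view into a single string for a BERT-style encoder."""
    return " [SEP] ".join(str(c) for c in view[:max_cells])

if __name__ == "__main__":
    column = ["Москва", "Санкт-Петербург", "Казань", "Новосибирск"]
    v1, v2 = augment_column(column, seed=1), augment_column(column, seed=2)
    print(serialize(v1))
    print(serialize(v2))
```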
The paper proposes a metric for evaluating the performance of feature point extraction algorithms in rough terrain conditions with no clearly defined landmarks or corners. Various feature point detection algorithms are compared for subsequent integration into a SLAM algorithm on board an unmanned aerial vehicle (UAV). The proposed metric, along with other algorithm parameters, is evaluated through experiments conducted in a controlled environment. The advantages of algorithms based on machine learning models are demonstrated.
Improving the level of safety of railway traffic is directly related to the need for prompt detection of structural anomalies in track elements. This task is addressed through regular inspections using non-destructive testing methods. Among the modern technologies used for this purpose, eddy current flaw detection stands out. The flaw detector generates a multi-channel discrete signal called a defectogram. Defectograms require analysis, that is, the identification of useful signals produced by defects or structural elements of the rail. This paper investigates the use of YOLO (You Only Look Once) family convolutional neural networks for automated detection of useful signals in eddy current rail defectograms. The main objective was to evaluate the effectiveness of different transformations of multichannel time-series data into two-dimensional images suitable for YOLO processing, and to explore the trade-off between detection accuracy and computational cost. Four transformation methods are examined: the Threshold Transform, based on amplitude comparisons against a threshold of twice the noise level, the Short-Time Fourier Transform, the Continuous Wavelet Transform, and the Hilbert–Huang Transform. The dataset comprises defectogram fragments of 50 thousand counts with annotated useful signals from three classes (flash butt welds, aluminothermic welds, and bolt joints), split into training, validation, and test sets. YOLO models trained on these data achieved high mean Average Precision scores in useful signal detection for all considered transformation methods. The Continuous Wavelet Transform yielded the best scores, while the Threshold Transform proved to be the least computationally expensive. The Short-Time Fourier Transform offered the best balance between precision and recall. The Hilbert–Huang Transform showed slightly lower effectiveness. These results demonstrate the suitability of YOLO networks for eddy current defectogram analysis and useful signal detection in general.
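The STFT-based transformation can be illustrated with a short sketch that turns a multichannel fragment into a grayscale image; the sampling rate, window length, and vertical stacking of channel spectrograms are illustrative assumptions rather than the paper's actual image construction.

```python
import numpy as np
from scipy.signal import stft

def defectogram_to_image(channels, fs=1000.0, nperseg=64):
    """Convert a multichannel 1-D defectogram fragment into a 2-D image:
    each channel's STFT magnitude (in dB) becomes one horizontal band of the
    resulting grayscale picture that a YOLO detector can consume."""
    bands = []
    for x in channels:
        _, _, Z = stft(x, fs=fs, nperseg=nperseg)
        mag_db = 20 * np.log10(np.abs(Z) + 1e-9)
        lo, hi = mag_db.min(), mag_db.max()
        bands.append((mag_db - lo) / (hi - lo + 1e-12))   # normalize each band to [0, 1]
    img = np.vstack(bands)                                # stack channels vertically
    return (img * 255).astype(np.uint8)

if __name__ == "__main__":
    t = np.linspace(0, 1, 1000, endpoint=False)
    noise = 0.1 * np.random.default_rng(0).standard_normal((3, t.size))
    burst = np.where((t > 0.45) & (t < 0.55), np.sin(2 * np.pi * 120 * t), 0.0)
    channels = noise + burst                              # a "useful signal" burst on 3 channels
    image = defectogram_to_image(channels)
    print(image.shape, image.dtype)                       # (n_channels * n_freq_bins, n_time_frames)
```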
The article presents the principles of compiling a dictionary of hydronyms, proper names of water bodies in the Republic of Sakha (Yakutia), for further work with it on the LingvoDoc platform. The dictionary is compiled using a comprehensive approach from the perspectives of lexicography, lexicology, semantics, morphology, etymology, and cartography. The article describes the methodology of selecting and analyzing toponymic material, the problems of distortion of names during mapping according to the rules of the Russian language, the main structural types of hydronyms, and the principles of identifying semantic features and dividing them into groups. The article presents maps created based on the data of toponym dictionaries uploaded to the LingvoDoc platform. The hydronym dictionary is the first attempt to systematize the names of water bodies in the Republic of Sakha (Yakutia) on the LingvoDoc platform.
The article describes a project, the implementation of which began this year at the Institute of Language, Literature and History of the Karelian Scientific Centre of the Russian Academy of Sciences: “The Language of the Monuments of the Baltic-Finnic Literature of the 17th-19th Centuries: A Comprehensive Analysis Based on the LingvoDoc Linguistic Platform.” The LingvoDoc platform is a digital repository designed for storing and preserving language data. Its tools allow for the simultaneous processing of language material and the online analysis of phonetic, morphological, and lexical features of the language. Placing texts from Karelian and Vepsian scripts on the LingvoDoc platform will not only enable research tasks (textual analysis, identifying dialectal specifics, creating concordances, etc.) but also address issues of language documentation. Big data processing will ensure the relevance of the results.
The paper provides a comprehensive review of contemporary methods for automatic cognate detection, integrating deep learning techniques with traditional linguistic analyses. The primary objective is to systematize existing architectures, assess their strengths and limitations, and propose an integrative model combining phonetic, morphological, and semantic representations of lexical data. To this end, we critically analyze studies published between 2015 and 2025, selected via a specialized parser from the arXiv repository. The review addresses three core tasks: (1) evaluating the accuracy and robustness of Siamese convolutional neural networks (CNNs) and transformer-based models in transferring phonetic patterns across diverse language families; (2) comparing the effectiveness of orthographic metrics (e.g., LCSR, normalized Levenshtein distance, Jaro–Winkler index) with semantic embeddings (fastText, MUSE, VecMap, XLM-R); and (3) examining hybrid architectures that incorporate morphological layers and transitive modules for identifying partial cognates. Our findings indicate that a combination of phonetic modules (Siamese CNNs + transformers), morphological processing (BiLSTM leveraging UniMorph data), and learnable semantic vectors yields the best accuracy and stability across various language pairs, including low-resource scenarios. We propose an integrative architecture capable of adapting to linguistic diversity and effectively measuring word relatedness. The outcome of this research includes both an analytical report on state-of-the-art methods and a set of recommendations for advancing automated cognate detection in large-scale linguistic applications.
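For reference, the orthographic metrics mentioned above can be computed as in the following sketch of the normalized Levenshtein distance and the LCSR (longest common subsequence ratio); the Jaro–Winkler index is omitted for brevity.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means the strings are identical."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: |LCS(a, b)| / max(|a|, |b|)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / max(len(a), len(b)) if (a or b) else 0.0

if __name__ == "__main__":
    # A classic cognate pair: English "night" and German "Nacht".
    a, b = "night", "nacht"
    print(round(normalized_levenshtein(a, b), 3), round(lcsr(a, b), 3))   # 0.4 0.6
```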
Suicide is a terrifying act committed by a person misled by their own mental state. This problem exists in many countries, and, sadly, Russia also has a rather high number of people who have committed suicide. Fortunately, some of these people write about their struggles on social media, which offers a way to find and help them. However, these valuable texts are buried among many irrelevant ones, which considerably slows down the assessment of a person's suicidal risk. To tackle this problem, in this work we present a detailed methodology for building a dataset for detecting texts that contain presuicidal and anti-suicidal signals. The methodology covers the creation of the annotation instructions and the class table, as well as the processes of annotation, verification, and post-annotation correction. Guided by this methodology, we collect and annotate a large-scale Russian dataset of more than 50 thousand texts from social media. We provide summary statistics of the dataset as well as common problems encountered during annotation. We also conduct baseline experiments with classification models to show the performance achievable at different levels of annotation. Furthermore, we make the dataset, code, and all materials publicly available.
We present a novel method for aligning reads in whole-genome sequencing (WGS), aimed at improving alignment accuracy and the practical efficiency of this stage of genomic analysis. Unlike graph-based approaches, the proposed algorithm directly integrates knowledge of known genetic variants into the alignment process, enabling more accurate mapping of reads to the reference genome without constructing complex graph structures. The method has demonstrated high effectiveness on real sequencing data: we observed a consistent improvement in read alignment quality in highly variable and difficult-to-map regions of the genome. In particular, using variant information allows more precise alignment of reads that contain alternative alleles, reducing the number of mapping errors in these regions. At the same time, the required computational resources remain at an acceptable level, making this solution applicable in standard WGS pipelines without a significant increase in workload. The alignment speed of the algorithm is comparable to traditional solutions, which facilitates its integration into existing analytical pipelines.
The practical value of the method lies in the improved alignment accuracy, which directly affects the quality of downstream variant calling and other analyses. The proposed approach can serve as an effective alternative to current graph-based alignment methods, providing comparable improvements in alignment quality with lower complexity of implementation. Future work will include optimizing the algorithm’s performance, expanding the set of genetic variants accounted for, and conducting in-depth comparisons with other tools. These steps are intended to further increase the method’s efficiency and reliability, reinforcing its significance for practical use in genomics.
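The general idea of letting known variants influence alignment scoring can be illustrated with a simplified, ungapped sketch; the scoring values and the position-to-alleles map are illustrative assumptions and do not reproduce the proposed algorithm.

```python
def variant_aware_score(read, ref, pos, known_alts, match=1, mismatch=-4):
    """Score an ungapped placement of `read` at reference offset `pos`.

    A mismatching base is not penalized if a known variant at that reference
    position lists it as an alternative allele, so reads carrying known ALT
    alleles are no longer pushed away from their true location.
    `known_alts` maps reference position -> set of alternative bases.
    """
    score = 0
    for i, base in enumerate(read):
        ref_base = ref[pos + i]
        if base == ref_base or base in known_alts.get(pos + i, set()):
            score += match
        else:
            score += mismatch
    return score

if __name__ == "__main__":
    ref = "ACGTACGTACGT"
    known_alts = {5: {"T"}}                      # a known SNP: C>T at reference position 5
    read = "ACGTATGT"                            # read carries the ALT allele at offset 5
    naive = sum(1 if b == ref[i] else -4 for i, b in enumerate(read))
    print("naive:", naive, "variant-aware:", variant_aware_score(read, ref, 0, known_alts))
```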
The substantial cost of training visual foundation models (VFMs) from scratch on vast datasets motivates model owners to protect their intellectual property via ownership verification methods. In this work, we propose ExpressPrint, a novel approach to watermarking VFMs based on fine-tuning the expressive layers of a VFM together with a small encoder-decoder network that embeds digital watermarks into a set of input images. Our method involves a small modification of the expressive layers together with training an encoder-decoder neural network to extract user-specific binary messages from the hidden representations of certain input images. This makes it possible to distinguish the foundation model provided to a user from independent models, thereby preventing unauthorized use of the model by third parties. We discover that the ability to correctly extract the encoded binary messages from images transfers from a watermarked VFM to its functional copies obtained via pruning and fine-tuning; at the same time, we experimentally show that non-watermarked VFMs do not share this property.
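The ownership check itself can be sketched as a bit-accuracy test over trigger images; the decoder, model, and threshold below are placeholders, not the trained ExpressPrint components.

```python
import numpy as np

def bit_accuracy(extracted: np.ndarray, expected: np.ndarray) -> float:
    """Fraction of correctly recovered watermark bits."""
    return float(np.mean(extracted == expected))

def verify_ownership(decoder, model, trigger_images, expected_bits, threshold=0.95):
    """Claim ownership if the decoder recovers the user-specific binary message
    from the suspect model's hidden representations of the trigger images.

    `model(images)` returns hidden representations and `decoder(features)`
    returns predicted bits; both are placeholders for trained components.
    """
    features = model(trigger_images)
    accuracies = [bit_accuracy(decoder(f), expected_bits) for f in features]
    return float(np.mean(accuracies)) >= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expected = rng.integers(0, 2, size=64)

    def toy_model(images):
        """Stand-in VFM: produce one random hidden representation per image."""
        return [rng.standard_normal(768) for _ in images]

    def toy_decoder(features):
        """Stand-in decoder that recovers the message, as a watermarked model's would."""
        return expected.copy()

    print(verify_ownership(toy_decoder, toy_model, ["img1", "img2"], expected))   # True
```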
The Shazam algorithm has proven its reliability and efficiency in audio identification tasks. In this paper, we adapt the core principles of the Shazam algorithm to the problem of partial video copy detection. We propose a novel method for aligning video fingerprints when searching for partial copies of a query video in a video database. Key advantages of this method are fast CPU execution and simplicity combined with high effectiveness. Experimental results on publicly available video datasets demonstrate that our approach achieves high accuracy in detecting partial and modified video copies, with competitive performance in terms of speed and scalability. Our findings suggest that Shazam-inspired fingerprinting can serve as an effective tool for large-scale video copy detection applications.
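The Shazam-style alignment idea behind such methods can be sketched as offset-histogram voting over matching fingerprint hashes; the toy per-frame hashes below are an assumption for illustration only.

```python
from collections import Counter, defaultdict
import random

def build_index(fingerprints):
    """Map each fingerprint hash to the frame times at which it occurs in a reference video."""
    index = defaultdict(list)
    for t, h in fingerprints:
        index[h].append(t)
    return index

def best_alignment(query_fps, ref_index):
    """Shazam-style alignment: every hash match votes for a time offset
    (reference time minus query time); the tallest histogram bin gives the
    most likely alignment, and its height serves as a match score."""
    votes = Counter()
    for tq, h in query_fps:
        for tr in ref_index.get(h, ()):
            votes[tr - tq] += 1
    if not votes:
        return None, 0
    offset, score = votes.most_common(1)[0]
    return offset, score

if __name__ == "__main__":
    # Toy fingerprints: each "hash" is a tuple of three consecutive per-frame features,
    # and the query is a 50-frame fragment of the reference starting at frame 100.
    rng = random.Random(0)
    content = [rng.randrange(256) for _ in range(300)]
    ref = [(t, tuple(content[t:t + 3])) for t in range(len(content) - 2)]
    query = [(t, tuple(content[t + 100:t + 103])) for t in range(50)]
    offset, score = best_alignment(query, build_index(ref))
    print("offset:", offset, "score:", score)   # expected: offset 100 with a strong score
```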





