The term Big Data causes a lot of controversy among specialists, many of whom note that it refers only to the volume of accumulated data; however, the technical side should not be forgotten: the field also includes technologies for computing, storage, and related services. Big Data is a term that denotes technologies for processing large volumes of unstructured and structured data to obtain results that are understandable and useful to humans. In business, Big Data is used to support decision-making by a manager (for example, based on an analysis of financial indicators from an accounting system) or a marketer (for example, based on an analysis of customer preferences from social networks). Big Data algorithms themselves appeared with the introduction of the first mainframes (high-performance servers), which have the resources necessary for operational data processing and are suitable for computations and subsequent data analysis. As the number of embedded computers grows due to falling processor prices and the ubiquity of the Internet, so does the amount of data transferred and then processed (often in real time). Therefore, we can assume that the importance of cloud computing and the Internet of Things will increase in the coming years. It should be noted that Big Data processing technology boils down to three main areas that solve three types of tasks: (1) transfer and storage of incoming information in gigabytes, terabytes, petabytes, etc. for its processing, storage, and practical application; (2) structuring of disparate content, namely photos, texts, audio, video, and all other types of data; (3) analysis of Big Data and the application of different methods of processing unstructured data, as well as the creation of various analytical reports. In essence, the application of Big Data covers all areas of work with large volumes of highly disparate data that are constantly updated and scattered across various sources. The goal is quite simple: the most efficient work, the introduction of new products, and increased competitiveness. In this article, we consider the features of solving the problems of using Big Data in international business.
Despite the fact that software development uses various technologies and approaches to diagnose errors in the early stages of development and testing, some errors are discovered only during operation. To the user, such errors often look like a program crash at run time. To collect reports on program crashes, a special analysis component is built into the operating system. Such a component is present both in Windows and in Linux-based operating systems, in particular Ubuntu. An important parameter is the severity of the error found; this information is useful both to the developer of the distribution and to the user. In particular, with such diagnostics users can take organizational and technical measures before the software developer releases a bug fix. The article introduces CASR: a tool for analyzing a memory image taken at the moment of process termination (coredump) and reporting errors. The tool makes it possible to assess the severity of the detected crash by analyzing the memory image, as well as to collect the information the developer needs to fix the defect. Such information includes the OS distribution version, the package version, the process memory map, the state of the registers, the values of environment variables, the call stack, the number of the signal that led to the abnormal termination, etc. Severity assessment enables the software developer to fix the most dangerous errors first. CASR can also detect files and network connections that were open at the time of the crash. This information helps reproduce the error and helps users and administrators take action in the event of an attack on the system. The tool is designed to work on Linux, supports the x86/x86-64 and armv7 architectures, and can be supplied as a package for Debian-based distributions. The tool has been successfully tested on several open-source bugs.
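As an illustration of how a signal-based severity heuristic of this kind can work, the following minimal sketch classifies a crash from its signal number and fault context. The categories, thresholds, and function name are illustrative assumptions, not CASR's actual rules.

```python
import signal

def estimate_severity(sig, fault_addr=None, at_call_or_ret=False):
    """Return a rough severity class for an abnormal process termination."""
    if sig in (signal.SIGSEGV, signal.SIGBUS):
        if at_call_or_ret:
            return "EXPLOITABLE"            # crash while transferring control flow
        if fault_addr is not None and fault_addr < 0x1000:
            return "NOT_EXPLOITABLE"        # looks like a NULL-pointer dereference
        return "PROBABLY_EXPLOITABLE"       # invalid memory access elsewhere
    if sig == signal.SIGABRT:
        return "NOT_EXPLOITABLE"            # assertion failure or heap-consistency abort
    if sig == signal.SIGFPE:
        return "NOT_EXPLOITABLE"            # arithmetic error such as division by zero
    return "UNKNOWN"

# Example: a SIGSEGV at address 0x10 is most likely a harmless NULL dereference.
print(estimate_severity(signal.SIGSEGV, fault_addr=0x10))
```

A real triage tool would refine such rules with the call stack and disassembly around the faulting instruction; the sketch only shows the general shape of the decision.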
The paper presents a debugger for parallel programs written in C/C++ or FORTRAN and executed on high-performance computers. The debugger's program components and the mechanism of their interaction are described. The capabilities of the graphical user interface are discussed, and the profiling procedure using built-in profiling tools is described. The paper describes new capabilities of the parallel debugger, such as a tree-like communication scheme connecting its components, a non-interactive debugging mode, and support for NVIDIA graphics accelerators. Currently, the debugger launches debug jobs through batch job processing systems such as OpenPBS/Torque, SLURM, and CSP JAM, but it can be configured for other systems. The PD debugger allows the user to debug program processes and threads, manage breakpoints and watchpoints, logically divide program processes into subsets and manage them, view and change variables, and profile the debugged program using the free Google Performance Tools and mpiP. The PD debugger is written in the Java programming language, is intended for debugging programs on Unix/Linux operating systems, and uses free software components such as SwingX, JHDF5, Jzy3D, RSyntaxTextArea, and OpenGL.
This paper presents the results of applying a convolutional neural network to diagnose left atrial and left ventricular hypertrophy by analyzing 12-lead electrocardiograms (ECG). During the study, a new unique dataset containing 64 thousand ECG records was collected and processed. Labels for the two classes under consideration, left ventricular hypertrophy and left atrial hypertrophy, were generated from the accompanying medical reports. The set of signals and the obtained labels were used to train a deep convolutional neural network with residual blocks; the resulting model is capable of detecting left ventricular hypertrophy with an F1-score above 0.82 and left atrial hypertrophy with an F1-score above 0.78. In addition, a search for the optimal neural network architecture was carried out, and the effect of including patient metadata in the model and of signal preprocessing was evaluated experimentally. The paper also provides a comparative analysis of the difficulty of detecting left ventricular and left atrial hypertrophy relative to two other frequently occurring heart activity disorders, namely atrial fibrillation and left bundle branch block.
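The following PyTorch sketch shows the general shape of a 1D residual network over 12-lead ECG signals of the kind described above. The layer widths, kernel sizes, block count, and class names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock1d(nn.Module):
    def __init__(self, channels, kernel_size=15, dropout=0.2):
        super().__init__()
        padding = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Skip connection: the block learns a residual added to its input.
        y = self.drop(self.act(self.bn1(self.conv1(x))))
        y = self.bn2(self.conv2(y))
        return self.act(x + y)

class EcgNet(nn.Module):
    def __init__(self, n_leads=12, n_labels=2, width=64, n_blocks=4):
        super().__init__()
        self.stem = nn.Conv1d(n_leads, width, kernel_size=15, padding=7)
        self.blocks = nn.Sequential(*[ResidualBlock1d(width) for _ in range(n_blocks)])
        self.head = nn.Linear(width, n_labels)   # one logit per label: LVH, LAH

    def forward(self, x):             # x: (batch, 12, samples)
        h = self.blocks(self.stem(x))
        h = h.mean(dim=-1)            # global average pooling over time
        return self.head(h)           # train with BCEWithLogitsLoss (multi-label)
```

Because a record may show both hypertrophies at once, the head produces independent logits rather than a single softmax over classes.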
The amount of news has been growing rapidly in recent years, and people can no longer process it effectively. This is the main reason why automatic methods of news stream analysis have become an important part of modern science. The paper is devoted to the part of news stream analysis called "event detection". An "event" is a group of news items dedicated to one real-world event. We study news from Russian news agencies. We treat this task as news clustering and compare algorithms using external clustering metrics. The paper introduces a novel approach to detecting events in Russian-language news. We propose a two-stage clustering method: a "rough" clustering algorithm at the first stage and a clarifying classifier at the second stage. At the first stage, a combination of the shingles method and naive named-entity-based clustering is used. We also present a labeled news event detection dataset based on the «Yandex News» service. This manually labeled dataset can be used to estimate the performance of event detection methods. Empirical evaluation on this corpus demonstrated the effectiveness of the proposed method for event detection in news texts.
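A minimal sketch of the first, "rough" stage described above: documents are grouped by overlapping word shingles with a Jaccard similarity threshold. The shingle length, threshold value, and function names are illustrative assumptions; the named-entity component and the second-stage classifier are not shown.

```python
def shingles(text, k=4):
    """Set of k-word shingles of a news text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def rough_clusters(news, threshold=0.3, k=4):
    """Greedy single-pass clustering of news texts by shingle overlap."""
    clusters = []   # each cluster keeps the union of its documents' shingles
    for doc in news:
        sh = shingles(doc, k)
        for cluster in clusters:
            if jaccard(sh, cluster["shingles"]) >= threshold:
                cluster["docs"].append(doc)
                cluster["shingles"] |= sh
                break
        else:
            clusters.append({"docs": [doc], "shingles": set(sh)})
    return [c["docs"] for c in clusters]
```

The greedy pass deliberately over-merges or over-splits in borderline cases; in a two-stage scheme such errors are left to the clarifying classifier at the second stage.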
Logical structure extraction from various documents has been a long-standing research topic because of its high influence on a wide range of practical applications. The huge variety of document types and, as a consequence, of possible document structures makes this task particularly difficult. The purpose of this work is to show one way to represent and extract the structure of documents of a special type. We consider scanned documents without a text layer. This means that the text in such documents cannot be selected or copied, nor can their content be searched. However, a huge number of scanned documents exist that one needs to work with. Understanding the information in such documents may be useful for their analysis, e.g., for effective search within documents, navigation, and summarization. To cope with a large collection of documents, the task should be performed automatically. The paper describes a pipeline for scanned document processing. The method is based on multiclass classification of document lines. The set of classes includes textual lines, headers, and lists. First, text and bounding boxes for document lines are extracted using OCR methods; then different features are generated for each line, which are the input of the classifier. We also make available a dataset of documents that includes bounding boxes and labels for each document line, evaluate the effectiveness of our approach on this dataset, and describe possible future work in the field of document processing.
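A minimal sketch of the line-classification step: simple per-line features are computed from the OCR output and fed to a standard classifier. The specific features, the dictionary keys, and the choice of a random forest here are illustrative assumptions, not the features used in the paper.

```python
import re
from sklearn.ensemble import RandomForestClassifier

def line_features(line):
    """line: dict with OCR "text" and bounding box "left", "top", "width", "height"."""
    text = line["text"]
    return [
        line["height"],                                   # proxy for font size
        line["left"],                                     # indentation
        len(text),
        sum(c.isupper() for c in text) / max(len(text), 1),
        1.0 if re.match(r"^\s*(\d+[.)]|[-•*])", text) else 0.0,   # list marker
        1.0 if text.strip().endswith((".", ";", ",")) else 0.0,   # sentence-like ending
    ]

# Class labels: 0 = regular text line, 1 = header, 2 = list item.
clf = RandomForestClassifier(n_estimators=200)
# X_train = [line_features(l) for l in train_lines]
# clf.fit(X_train, y_train)
```

In practice such hand-crafted features are often combined with textual features of the line itself, but the overall pipeline (OCR, per-line features, multiclass classifier) stays the same.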
In this paper, we propose an approach to document image segmentation in the case of a limited set of real training data. The main idea of our approach is to use artificially created data for training together with a post-processing stage. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals, and technical specifications. As part of the task of automatic document analysis, we solve the document layout analysis (DLA) segmentation problem. In the paper, we train the well-known Faster R-CNN \cite{ren2015faster} model to segment text blocks, tables, stamps, and captions in images of the domain. The aim of the paper is to generate synthetic data similar to the real data of the domain; this is necessary because the model needs a large training dataset, and preparing such a dataset manually is highly labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of models trained on different datasets (with/without synthetic data, small/large set of real data, with/without the post-processing stage). As a result, we show that generating synthetic data and applying post-processing increase the quality of the model when little real training data is available.
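For orientation, the sketch below shows one common way to fine-tune a Faster R-CNN detector on page images using torchvision; the framework choice, class list, and hyperparameters are assumptions for illustration, not necessarily the setup used in the paper.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 5  # background + text block, table, stamp, caption
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
# Each batch: a list of image tensors and a list of targets
# {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N]}, mixing real and synthetic pages.
# for images, targets in mixed_real_and_synthetic_loader:
#     losses = model(images, targets)       # dict of detection losses
#     loss = sum(losses.values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The synthetic pages enter the pipeline simply as additional (image, boxes, labels) triples, which is what makes the approach attractive when real annotations are scarce.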
Nowadays, the problem of legal regulation of automatic collection of information from websites is being actively addressed. This means that interest in tools and programs for automatic data collection is growing, and with it interest in programs for automatically solving CAPTCHAs («Completely Automated Public Turing test to tell Computers and Humans Apart»). Despite the creation of more advanced types of CAPTCHA, text CAPTCHAs are still quite common; for instance, such large services as Yandex, Google, Wikipedia, and VK continue to use them. Many methods of breaking text CAPTCHAs are described in the literature; however, most of them require a priori knowledge of the length of the text in the image, which is not always available in the real world. Also, many methods are multi-stage, which makes them harder to implement and apply. Moreover, some methods use a large number of labeled pictures for training, whereas in reality, collecting data requires the ability to solve CAPTCHAs from different sites, so manually labeling tens of thousands of training examples for each new type of CAPTCHA is unrealistically labor-intensive. In this paper we propose a one-step algorithm for attacking text CAPTCHAs. The approach does not require a priori knowledge of the length of the text in the image. It has been shown experimentally that using this algorithm in conjunction with adversarial learning allows one to achieve high quality on real data using a small number (200-500) of labeled training examples. An experimental comparison of the developed method with modern analogs showed that, using the same number of real examples for training, our algorithm achieves comparable or higher quality while being faster both at inference and in training.
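To illustrate how a single-pass recognizer can avoid assuming a fixed text length, the sketch below uses a CNN over the image followed by a CTC loss on per-column character distributions. This is only one common way to handle variable-length output; it is not the authors' exact architecture or adversarial-learning setup, and the alphabet size and layer shapes are assumptions.

```python
import torch
import torch.nn as nn

N_CHARS = 36  # e.g. digits and lowercase Latin letters; the alphabet is an assumption

class CaptchaRecognizer(nn.Module):
    def __init__(self, n_chars=N_CHARS, height=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # One character-or-blank distribution per horizontal position.
        self.proj = nn.Linear(64 * (height // 4), n_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                     # x: (batch, 1, H, W), grayscale images
        f = self.backbone(x)                  # (batch, 64, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(3, 0, 1, 2).reshape(w, b, c * h)   # (time, batch, features)
        return self.proj(f).log_softmax(dim=-1)

# CTC aligns per-column predictions with target strings of arbitrary length,
# so the number of characters does not have to be known in advance.
criterion = nn.CTCLoss(blank=N_CHARS)
```

Decoding at inference time collapses repeated symbols and removes blanks, which is what yields answers of varying length from a fixed-size prediction grid.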
Currently, the Russian Federation is actively developing its northern territories. Studying the physical processes of icing is therefore relevant, since climatic conditions affect the surfaces of the objects under study (power lines, residential buildings, power plants, aircraft), human safety, and the environment. In clouds, liquid droplet particles can appear and move. When studying two-phase flows containing a suspension of aerosol particles (the dispersed phase) in a carrier medium (the dispersion medium) in the atmosphere, it is important to correctly evaluate the main parameters that define the system and to adequately describe the real process using a formulated mathematical model. This article is devoted to the development of a new iceFoam solver within the OpenFOAM v1912 package for modeling the icing process at a typical particle size of about 40 microns, which corresponds to Annex C of the AP-25 aviation rules. The Euler-Lagrangian approach and the finite volume method are used to describe the dynamics of liquid droplets. A modified liquid film model based on shallow water theory is used as the thermodynamic model. Calculation results are presented for flow around a cylinder and around the NACA 0012 airfoil using the URANS method and the Spalart-Allmaras turbulence model. Two grids are constructed in the computational domain: one for the external gas-droplet flow and one for the thin liquid film, one cell thick. Ice thickness distribution patterns are given. When developing the source code in C++, the inheritance mechanism, i.e. the creation of base and derived classes, was used. As a result, a parallel iceFoam solver was developed for simulating the motion of liquid particles and the formation of ice on body surfaces. Each test case was computed on 8-32 cores of the ISP RAS HPC cluster.
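For reference, Lagrangian droplet tracking of the kind described above typically integrates, for each droplet parcel, a momentum equation of the standard drag form below; the exact drag correlation used in iceFoam is not stated in the abstract, so this is only an illustration of the approach.

\[
  m_p \frac{d\mathbf{u}_p}{dt}
  = m_p\,\frac{18\,\mu_g}{\rho_p d_p^{2}}\,\frac{C_D\,\mathrm{Re}_p}{24}\,
    \left(\mathbf{u}_g - \mathbf{u}_p\right),
  \qquad
  \mathrm{Re}_p = \frac{\rho_g\,\lvert\mathbf{u}_g - \mathbf{u}_p\rvert\, d_p}{\mu_g},
\]

where \(\mathbf{u}_p\) and \(\mathbf{u}_g\) are the droplet and gas velocities, \(d_p\) the droplet diameter, \(\rho_p\) and \(\rho_g\) the droplet and gas densities, \(\mu_g\) the gas viscosity, and \(C_D\) the drag coefficient.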
The creation of high-performance methods for calculating the interaction of aerosol flows with a solid is of great practical interest for problems of preventing surface icing, predicting climatic phenomena, metallurgy, and astronomical processes. One method of reducing icing is the use of hydrophobic coatings, which, as a rule, work effectively when the ratio of inertial forces to the surface tension forces of the liquid near the relief of the streamlined body is insignificant. However, when the surface density of the kinetic energy of a supercooled drop exceeds a certain critical value, the ice-phobic properties lead to negative effects due to the penetration of the supercooled liquid into the surface depressions and its solidification there. A method is developed for calculating the interaction of supercooled drops with relief solids of various degrees of hydrophobicity. Basic criteria for matching the results of molecular modeling to physical reality are formulated. The need to develop algorithms for numerical simulation stems from the fact that significant computational resources are required even to calculate small droplets several tens of nanometers in size. Numerical estimates of the relief parameters of a hydrophobic solid surface are obtained depending on the dimensionless dynamic parameters of the impact of supercooled drops. The moving interface, i.e. the crystallization front in supercooled metastable liquid droplets, has specific properties. On the basis of previously performed experimental studies, theoretical estimates, and the analytical and experimental data of other researchers, mathematical models of the crystallization features of a supercooled metastable liquid are developed in the present work. Estimates of the parameters of the processes accompanying the movement of the crystallization front in supercooled metastable water droplets are obtained, with application to the problem of aircraft icing.
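The ratio of inertial to surface tension forces mentioned above is conventionally expressed by the Weber number; the abstract does not name the specific dimensionless parameters it uses, so the formula is given here only as a reference point.

\[
  \mathrm{We} = \frac{\rho_l\, v^{2}\, d}{\sigma},
\]

where \(\rho_l\) is the liquid density, \(v\) the impact velocity of the drop, \(d\) its diameter, and \(\sigma\) the surface tension coefficient; hydrophobic relief typically remains effective only below a critical value of \(\mathrm{We}\).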
The paper investigates the implementation of virtual networks on the SDN data plane, which is modeled by a graph of physical connections between network nodes. A virtual network is defined as a set of ordered host pairs (sender, receiver) and is implemented by a set of host-host paths that uniquely determine the switch settings. A set of paths is perfect if any subset of its paths can be implemented loop-free, i.e., without the endless movement of packets in a loop, without duplicate paths, where a host receives the same packet several times, and without unintended paths, where a host receives a packet that was directed to another host. For the case when the switch graph is a complete graph, sufficient conditions are established for the existence of a largest perfect set of paths connecting all pairs of different hosts. Algorithms for constructing such a largest perfect set are proposed, together with estimates of their complexity. The paper also presents preliminary results of computer experiments showing that the proposed sufficient conditions are not necessary.
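As a toy illustration of the three anomalies named above, the sketch below checks a single given set of paths under the simplifying assumption that every switch forwards a packet by destination host only. The paper's forwarding model is richer and its perfection property must hold for every subset of paths, so this is only a sketch of the definitions, not the paper's algorithm.

```python
def check_path_set(paths):
    """paths: list of node sequences [sender_host, switch, ..., receiver_host]."""
    rules = {}                                   # (switch, destination host) -> next hop
    for p in paths:
        dst = p[-1]
        for cur, nxt in zip(p[1:-1], p[2:]):     # forwarding entries along the path
            if rules.setdefault((cur, dst), nxt) != nxt:
                return False                     # conflicting rules on the same switch
    for p in paths:                              # replay every intended (sender, receiver) pair
        node, dst, seen = p[1], p[-1], set()
        while (node, dst) in rules:
            if node in seen:
                return False                     # packets would loop forever
            seen.add(node)
            node = rules[(node, dst)]
        if node != dst:
            return False                         # delivered to an unintended host (or dropped)
    return True

# Example: two disjoint host-host paths over switches s1 and s2 can coexist.
print(check_path_set([["h1", "s1", "s2", "h2"], ["h3", "s2", "s1", "h4"]]))  # True
```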
In this paper, we present a method for state space reduction of dense-time Petri nets (TPNs), an extension of Petri nets in which a time interval for firing is attached to every transition. The time elapsing and memory policies define different semantics for TPNs, and the decidability of many standard problems for TPNs depends on the choice of semantics. In general, the state space of a TPN is infinite and non-discrete, which makes the analysis of its behavior rather complicated. To cope with this problem, we elaborate a state space discretization technique and develop a partial order semantics for TPNs equipped with the weak time elapsing and intermediate memory policies.