Preview

Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Advanced search
Vol 26, No 4 (2014)
View or download the full issue PDF (Russian)
7-20
Abstract
Automatic term acquisition is an important task for many applications related to domain-specific texts processing. At present there are many methods for automatic term acquisition, but they are highly dependent on language and domain of input text collection. Also these methods, in general, use domain-specific text collection only, while many external resources are underutilized. We argue that one of the most promising external resources for automatic term acquisition is the online encyclopedia Wikipedia. In this paper we propose two new features: "Hyperlink probability" - normalized frequency showing how often the candidate terms is a hyperlink in Wikipedia articles; and "Semantic relatedness to the domain key concepts" - arithmetic mean of semantic relatedness to the key concepts of a given domain; those key concepts are determined automatically on the basis of input domain-specific text collection. In addition, we propose a new method for automatic term acquisition. It is based on semi-supervised machine learning algorithm, but it does not require labeled data. Outline of the method is to extract the best 100-300 candidates presented in Wikipedia by using a special method for term acquisition, and then to use these candidates as positive examples to construct a model for a classifier based on positive-unlabeled learning algorithm. An experimental evaluation conducted for the four domains (board games, biomedicine, computer science, agriculture) shows that the proposed method significantly outperforms existed one and is domain-independent: the average precision is higher by 5-17% than that of the best method for a particular data set.
21-32
Abstract
Graph partitioning is required for solving tasks on graphs that need to be split across disks or computers. This problem is well studied, but most results are not suitable for processing graphs with billons of nodes on commodity clusters, since they require shared memory or low-latency messaging. One approach suitable for cluster computing is Balanced Label Propagation, based on distributed label propagation algorithm for community detection. In this work we show how multi-level optimization can be used to improve partitioning quality of Balanced Label Propagation. One of major difficulties with distributed multi-level optimization is finding a matching in the graph. The matching is needed to choose pairs of vertices for collapsing in order to produce a smaller graph. As this work shows, simply splitting graph into several parts and finding matching in these parts independently is enough to improve the quality of partitioning generated by Balanced Label Propagation. Proposed algorithm can be implemented within any framework that supports MapReduce. In our experiments, when graphs were partitioned into 32 parts, ratio of edges that donтАЩt cross partitions increased from 54-60% to 66-70%. One of significant problems of our implementation is performance тАУ work time of multi-level algorithm was approximately twice that of the original algorithm. It seems likely that implementation can be improved so that multi-level algorithm would achieve better computational performance as well as partitioning quality.
33-44
Abstract
This article is dedicated to automation of cluster creation and management for Apache Spark MapReduce implementation in Openstack environments. As a result of this project open-source (Apache 2.0 license) implementation of toolchain for virtual cluster on-demand creation in Openstack environments was presented. The article contains an overview of existing solutions for clustering automation in cloud environments by the start of 2014 year. The article provides a shallow overview of issues and problems in Openstack Heat project that provides a compatibility layer for Amazon EC2 API. The final implementation provided in the article is almost strainforward port of existing toolchain for cluster creation automation for Apache Spark in Amazon EC2 environment with some improvements. Also prepared base system virtual machine image for Openstack is provided. Plans for further work are connected with Ansible project. Using Ansible for observed problem will make possible to implement versatile environment-agnostic solution that is able to work using any cloud computing services provider, set of Docker containers or bare-metal clusters without any dependencies for prepared operating system image. Current article doesn't use Ansible due to the lack of key features at the moment of the project start. The solution provided in this article has been already tested in production environment for graph theory research arcticle.
45-54
Abstract
This article is an overview of scalable infrastructure for storage and processing of genome data in genetics problems. The overview covers used technologies descriptions, the organization of unified access to genome processing API of different underlying services. The article also covers methods for scalable and cloud computing technologies support. The first service in virtual genome processing laboratory is provided and presented. The service solves transcription factors bindning sites prediction problem. The main principles of service construction are provided. Basic requirements for underlying comptutaion software in virtual laboratory environments are provided. Overview describes the implemented web-service (https://api.ispras.ru/demo/gen) for transcription factors binding site prediction. Provided solution is based on ISPRAS API project as an API gateway and load-balancer; the middle-ware task-manager software for pool of workers support and for communications with Openstack infrastructure; OpenZFS as an intermediate storage with transparent compression support. The described solution is easy to extend with new services fitting the basic requirements.
55-72
Abstract
This paper presents an experimental evaluation of the state-of-the-art approaches for automatic term recognition based on multiple features: machine learning method and voting algorithm. We show that in most cases machine learning approach obtains the best results and needs little data for training; we also find the best subsets of all popular features.
73-90
Abstract
In this paper we consider multidimensional indexing with the additional constraint of lexicographical ordering. In order to deal with this problem we discuss two well-known tree data structures: R-tree and B-tree. We study the problem in the transactional environment with read committed isolation level. To evaluate these approaches we had implemented these structures (modified GiST ensures concurrency) and provide extensive experiments.
91-98
Abstract
In this paper, we compare three approaches of clustering partial ordered subsets of a set of items. First approach was k-medoids clustering algorithm with distance function based on Levenshtein distance. The second approach was k-means algorithm with cosine distance as distance function after vectorization of partial orders. And the third one was k-medoids algorithm with Kendall's tau as a distance function. We use Adjusted Rand Index as a measure of quality of clustering and find out that clustering with all three methods get stable results when variance of number of items ranked is high. Vectorization of partial orders get best results if number of items ranked is low.
99-112
Abstract
Hand motion driven human-computer interface based on novel time-invariant gesture description is proposed. Description is represented as a sequence of overthreshold motion distribution histograms. Such description utilizes information about gesture spatial configuration and motion dynamics. K-nearest-neighbour classifier was trained on six gesture types. Application for remote slideshow control was developed based on the proposed algorithm.
113-122
Abstract
Suicide is a major, preventable public health problem. Particularly the problem is critical for young people. In Russia every year thousands of teenagers commit suicide. In most of the cases it can be prevented if a risky state is detected. Nowadays internet becomes a major way of communication, mainly in the text form. Therefore we suggest a method to detect a tendency to suicide based on text messages. Our main approach is to study indicators of such condition and based on it use machine learning approach to build a classifier that could determine, whether the person is about to commit a suicide. Our experiments are based on the analysis of texts of Russian writers for past 100 years that committed suicide.
123-136
Abstract
The paper deals with keyphrase extraction problem for single documents, e.g. scientific abstracts. Keyphrase extraction task is important and its results could be used in a variety of applications: data indexing, clustering and classification of documents, meta-information extraction, automatic ontologies creation etc. In the paper we discuss an approach to keyphrase extraction, itsтАЩ first step is building of candidate phrases which are then ranked and the best are selected as keyphrases. The paper is focused on the evaluation of weighting approaches to candidate phrases in the unsupervised ex-traction methods. A number of in-phrase word weighting procedures is evaluated. Unsuitable approaches to weighting are identified. Testing of some approaches shows their equivalence as applied to keyphrase extraction. A feature, which allows to increase the quality of extracted keyphrases and shows better results in comparison to the state of the art, is proposed. Experiments are based on Inspec dataset.


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)