Text sampling strategies for predicting missing bibliographic links

The paper proposes various strategies for sampling text data when performing automatic sentence classification for the purpose of detecting missing bibliographic links. We construct samples based on sentences as semantic units of the text and add their immediate context, which consists of several neighboring sentences. We examine a number of sampling strategies that differ in context size and position. The experiment is carried out on a collection of STEM scientific papers. Including the context of sentences in the samples improves their classification. We automatically determine the optimal sampling strategy for a given text collection by implementing ensemble voting when classifying the same data sampled in different ways. A sampling strategy that takes the sentence context into account, combined with a hard-voting procedure, leads to a classification accuracy of 98% (F1-score). This method of detecting missing bibliographic links can be used in recommendation engines of applied intelligent information systems.


Introduction
Scientific research is impossible without correlating the results obtained with the work of other scientists. Related works are acknowledged by inserting bibliographic links into the article. Experts in scientometrics rationalize the need to establish such links between studies and formulate various citation theories.
The normative theory of citation, which draws on the principles of scientific ethics formulated by Merton (1973), assumes that references in scientific papers are made in order to indicate the works that are the basis for research or topically related, describe the research methods used and are necessary to discuss the results. According to the reflexive theory, links between scientific works indicate the state of science and help to create its formalized representation, e.g. maps of science (Akoev et al. 2014). Authors of scientific papers choose the sources for citation and positions for the links by themselves and at present, this process is not automated. In this work, we investigate the possibility of creating a recommendation algorithm that allows one to find missing bibliographic references in a scientific article, that is, to identify those text fragments where it is necessary to mention another research work. For this purpose, we estimate the probability of link presence in fragments of the text using a semisupervised machine learning approach. The formal statement of the problem under consideration is the following: it is required to automatically find in the text of a scientific article those fragments (sentences) where the link is absent, but necessary, using a set of labeled fragments with and without links as training data.
The task of classifying text fragments with respect to the presence of links in them is methodologically similar to the task of Sentiment Analysis, in which texts are automatically classified, mainly as positive or negative, according to their emotional characteristics. Beyond dividing fragments into positive and negative, the sentiment analysis approach is used to distinguish other classes, including citations. Related work (2020) studied topical classification and showed that models taking context as input performed better than context-free models. In those works, the context size is fixed in advance based on some prior assumption and may not be optimal for a given text corpus.
The method introduced in this work can also be considered a kind of resampling technique. Until now, resampling has been used mainly to balance the class distribution in training datasets in order to improve the accuracy of class prediction, which is negatively affected by imbalanced data. Resampling methods are classified into three types, namely undersampling, oversampling, and hybrid methods. Importantly, all of the above algorithms do not take into account the natural structural units of texts (i.e. sentences and paragraphs), since these algorithms are adjusted to a fixed context size measured in words, while the size of sentences and paragraphs varies.

Methods
The task of determining missing links is formalized as finding text fragments where the link is absent, but necessary, or, conversely, is present, but not needed.
We solve the problem using automatic binary classification with two classes, positive and negative. For each fragment of a scientific article, our algorithm determines the probability of a bibliographic link in it. A collection of text documents is given such that each document consists of fragments. A fragment is a sequence of words (terms) of varying length. Fragments can overlap each other and vary in size.
Each fragment is a sample and is labeled as one of the two possible classes: positive or negative. The class label corresponds to whether or not the given fragment contains a bibliographic link. The task of this study is to find a strategy for constructing samples from fragments that gives the highest accuracy in determining the class labels with a certain classifier.
The hypothesis of our study is the following: text sampling strategies that take into account the context increase the accuracy of sentence classification used to predict missing bibliographic links in scientific articles.
We suggest that a positive sample consists of a bibliographic link surrounded by its context from the original text, and a negative sample is a fragment with no bibliographic link in it. In order to avoid duplication of samples, we consider a sentence with two or more links only once. The context of a link is either limited to the sentence containing it or extended to include neighboring sentences as well.
The best option is when the boundaries of a link context coincide with the boundaries of the complete author's statement to which this link belongs. In this case, a semantic unit of text can be either one or several sentences, which makes it difficult to set the size of the context. Nevertheless, to approach the specified goal, in the proposed algorithm we consider as context a fragment whose size is determined by the number of sentences, not words, unlike neural network algorithms. Thus, in our algorithm context is formed on the basis of natural structural units of text.
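The sentence-based context construction described above can be sketched as follows. This is a minimal illustration, assuming that n denotes the number of preceding and m the number of following sentences included in a sample (the function name and this interpretation of the parameters are our assumptions, not the authors' code); the window is clipped at the document boundaries.

```python
def build_samples(sentences, n, m):
    """For each sentence, build a sample consisting of the sentence itself
    plus n preceding and m following sentences (clipped at text edges)."""
    samples = []
    for i in range(len(sentences)):
        start = max(0, i - n)          # do not run past the first sentence
        end = min(len(sentences), i + m + 1)  # nor past the last one
        samples.append(" ".join(sentences[start:end]))
    return samples

sents = ["S1.", "S2.", "S3.", "S4."]
print(build_samples(sents, n=1, m=1))
# the sample built around S2 is "S1. S2. S3."
```

With n = m = 0 this degenerates to the single-sentence baseline (strategy #0 in the experiment).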
The feature space is constructed automatically based on vocabulary statistics within the Bag-of-Words model (BoW). The vocabulary of the model includes words and all the original punctuation marks and typographical symbols. As additional features, we consider named entities.

Algorithm
The algorithm consists of the following stages.

Labeling of sentences
• Sentences are labeled as positive or negative depending on whether they contain a citation marker;
• After labeling sentences, citation markers are removed.

Named Entities processing
• Detection of named entities in the text;
• Replacement of named entities with special marks.

The visualization of the contents of the samples is shown in Figure 1.
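The replacement step can be illustrated as below. The paper does not specify the NER tool or the mark inventory, so this sketch uses a toy dictionary of entity surface forms; in practice the entity spans would come from a NER library, and the mark names (PERSON, ORG) are illustrative assumptions.

```python
import re

def mask_entities(text, entities):
    """Replace each detected entity surface form with its class mark.
    entities: dict mapping entity string -> special mark."""
    for surface, mark in entities.items():
        text = re.sub(re.escape(surface), mark, text)
    return text

s = "Merton formulated the principles at Columbia University."
print(mask_entities(s, {"Merton": "PERSON", "Columbia University": "ORG"}))
# "PERSON formulated the principles at ORG."
```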

Classification of samples
• Balancing the class distribution in the training set is done by random undersampling.
• For each sample, a vector model is built using a count vectorizer as the fastest and most computationally efficient text representation.
• The vectorized set of samples is processed by a classifier.
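The random undersampling step above can be sketched as follows. This is a minimal stdlib-only illustration (libraries such as imbalanced-learn provide an equivalent `RandomUnderSampler`); the function name and the fixed seed are our assumptions.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are balanced."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    major, minor = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    kept = rng.sample(major, len(minor)) + minor  # keep all of the minority class
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]
```

Applied to the experimental data below (24% positive, 76% negative), this keeps all positive samples and a random equally sized subset of the negative ones.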

Optimal sampling strategy determination
Further, an ensemble method is used to automatically determine the optimal sampling strategy. We give the same data sampled in different ways to estimators of the same type and implement a voting procedure.
The flowchart of the whole algorithm is shown in Figure 2. Each BoW j corresponds to one sampling strategy, and for each sampling strategy we run its own estimator. All the estimators implement the same classification method but take different types of samples as input data.

Experiment
To test the hypothesis experimentally, we took a dataset of STEM journal articles. In our experiment, we consider sentences that are more than 30 words long. With this restriction we obtained a set of 458,774 sentences in total.
Sentences containing citation markers @xcite are assigned to the positive class ("With links"), and after that the citation markers are removed. Sentences without citation markers are labeled as negative ("Without links"). The ratio of classes is 24% positive sentences to 76% negative sentences. This is taken as sampling strategy #0, and the classification result on data sampled this way is considered the baseline: the classification accuracy with sampling strategy #0, measured by F1-score, is 0.7866.
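The labeling step can be sketched as follows; the function name is our illustrative assumption, while the @xcite marker convention is the one described above.

```python
def label_sentence(sentence, marker="@xcite"):
    """Label a sentence 1 ("With links") if it contains a citation marker,
    else 0 ("Without links"); the marker is stripped after labeling."""
    label = 1 if marker in sentence else 0
    # remove the marker and normalize the whitespace left behind
    cleaned = " ".join(sentence.replace(marker, " ").split())
    return cleaned, label

print(label_sentence("This extends prior work @xcite on sampling."))
# ('This extends prior work on sampling.', 1)
```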
After establishing the baseline, we test various strategies of data sampling in order to improve the classification accuracy. The main idea of sampling is to take into account some context of the sentences with a link. Different strategies of sampling assume various directions, positions, and sizes of the context, determined by a number of surrounding sentences. In different sampling strategies, each sentence [i] is included in different types (variants) of samples.
In the experiment we test 10 strategies with the following parameters: n: [0, 1, 2, 3, 4, 5], m: [0, 1, 2, 3, 4], k: [1, 3]. All the sample types corresponding to the chosen sampling strategies are presented in Table 1. The distribution of length (number of words) in positive and negative samples of different types is shown in Figure 3. After equalizing the classes by random undersampling, the data is divided into training and test sets with the proportion parameter test_size=0.33.
The vector representation is built using the CountVectorizer method of the Scikit-learn library. The vocabulary includes unigrams and bigrams and is reduced by frequency with the parameters min_df=3 and max_df=0.7. For classification we use a multilayer perceptron neural network (the MLPClassifier method of the Scikit-learn library). Classification performance depending on the sampling strategy used is presented in Table 2.

Result
The formulated research hypothesis has been confirmed experimentally. We have shown that the choice of the sampling strategy affects the result of text classification.
The baseline is established with sampling strategy #0. In this case, the classification performance measured by the F1-score is only 79%, which is not sufficient for practical use in industrial information systems.
The improvement is achieved by a data sampling strategy that determines the optimal sample type automatically, by applying a voting procedure to the decisions made by different estimators. The proposed algorithm shows 98% accuracy (F1-score), which is comparable to state-of-the-art results for NER and other text classification tasks. Importantly, the proposed algorithm provides high accuracy without requiring huge computational resources.

Conclusion
The paper proposes a new method of determining the probability of a bibliographic link in fragments of a scientific article. The approach assumes sentence classification with ensemble voting, in which different data sampling strategies correspond to estimators implementing the same classification method. The problem statement made by the authors is close to the well-studied areas of NER and sentiment analysis but is new from the point of view of real application.
The main innovation of the proposed method is finding the link context that maximally affects the probability of detecting a missing bibliographic link in a sentence.
In the proposed algorithm, the best size and position of the context are determined automatically. The size is based on the boundaries of semantic units of the text and is measured in sentences, not words; thus we utilize the fact that a sentence is a more semantically capacious (meaningful) unit than a word. Most existing text classification methods do not treat fragment context as significantly important, but this study shows the critical importance of taking it into account. The considerable impact of the context on classification performance demonstrates that the semantics related to a bibliographic link can be localized in fragments of different lengths.
The accuracy of the proposed algorithm reaches 98% (F1-score). It is important to note the high computational efficiency of the described method in comparison with convolutional artificial neural networks. This advantage is achieved due to the bigger size of samples. The investigated approach to text analysis expands the principle of the attention mechanism aimed at training a language model to understand the impact of global and local context. Automatic determination of the context boundaries correlates with the idea of automatic selection of significant features in artificial neural networks.
The proposed method can be used in recommendation engines of applied intelligent information systems, including assistance for constructing documents and composing texts with probable links to other documents, or help in checking the document correctness. Such functions are useful in many fields e.g. science, law, or journalism, where documents contain statements that should be confirmed by references to legal acts or other sources.
In accordance with the company's policy, we do not publish the source code.