Ranking in the keyphrase extraction problem: is it suitable to use statistics of word occurrences?

The paper deals with the keyphrase extraction problem for single documents, e.g. scientific abstracts. Keyphrase extraction is an important task, and its results can be used in a variety of applications: data indexing, clustering and classification of documents, meta-information extraction, automatic ontology creation, etc. We discuss an approach to keyphrase extraction in which candidate phrases are built first and then ranked, and the best-ranked candidates are selected as keyphrases. The paper focuses on evaluating approaches to weighting candidate phrases in unsupervised extraction methods. A number of in-phrase word weighting procedures are evaluated. Unsuitable approaches to weighting are identified. Tests show that some of the approaches are equivalent when applied to keyphrase extraction. We propose a feature that increases the quality of the extracted keyphrases and gives better results than the state of the art. Experiments are based on the Inspec dataset.


Introduction
The paper deals with the keyphrase extraction problem for single documents. We define a keyphrase as a word or a group of words that reflects the domain specifics of the text. Extracted keyphrases can be used in different natural language processing applications such as data indexing [1], document clustering [2][3][4], automatic ontology creation, etc. We use the results of this paper in an academic search system [4]. We are mainly interested in extracting keyphrases from abstracts of scientific papers, because most abstracts are freely available whereas the full texts of papers usually are not. We focus on the analysis of approaches to keyphrase selection from a set of candidates built for a document [5][6][7][8]. A weighting approach is used to evaluate the quality of each candidate; after the ranking procedure, the best candidates are selected as keyphrases.
In the paper we use only statistical information related to word frequency in single documents and in a document collection. It is also shown that a number of measures are not adequate and that some other measures are almost equivalent. We show that some measures regarded by researchers as suitable in fact lead to a situation where phrases are selected almost randomly, and thus such measures can be considered equivalent for the annotation task. The novel feature proposed in the paper is based on the exclusion of one-word phrases from the candidates, which significantly increases annotation quality. The remainder of the paper is organized as follows. Section 2 is dedicated to the state of the art. Section 3 describes the experiment and the test collection. Section 4 presents and discusses the results of the experiment. Section 5 presents and discusses an additional experiment and its results. Section 6 contains conclusions.

State-of-the-Art
There are two main approaches to the keyphrase extraction task. The first is based on ranking single words, selecting the best words, and concatenating best words that follow each other in the text [9][10][11][12]. The dominant approach [5][6][7][8][12][13][14][15][16] consists of two stages: a selection stage, in which candidate phrases are selected, and a classification or ranking stage. At the selection stage a number of procedures are used to extract candidate phrases: n-gram extraction, noun phrase extraction, word sequence extraction, or their combinations, subject to some limitations. Examples of such limitations are a limit on phrase length (usually no more than 4-5 words per phrase), part-of-speech restrictions, etc. It has been shown that keyphrases should consist of nouns and adjectives to achieve the best results, and this result is widely used. In [14] the author proposes using part-of-speech information in the classification process. In pioneering systems, supervised methods were used at the second stage to decide for each candidate whether it is a keyphrase. In [15] a Naive Bayes classifier is used. In [16] the keyphrase extraction process is based on a number of threshold values of some variables, which are optimized using a genetic algorithm. These methods [14][15][16] can be used when there is a set of documents with keyphrases already extracted by an expert. At the ranking stage all candidate phrases are weighted and ranked. Then the k best candidate phrases are selected as keyphrases. Ranking methods are usually based on phrase weight measurement [5-7, 12, 13]. In this case, statistical measurements are often used for phrases and phrase words, as well as information about the first position of a phrase in the text and the size of a phrase together with its frequency. However, researchers have not analyzed the different ways of evaluating a phrase's weight from the weights of its words. In this paper, we fill this gap. We evaluate several approaches to phrase weighting and use a number of statistical measures for this task. Experiments have shown that the selected statistical measures do not allow correct keyphrases to be identified among other phrases. It seems more efficient simply to exclude a set of candidates in which most phrases are known a priori not to be correct keyphrases. In this paper we show that the set of one-word candidate phrases is a set of this kind and that its exclusion leads to relatively good results. As a result of this research we suggest possible reasons why information about the length of a phrase influences the result of keyphrase extraction.

Candidate Phrase Ranking
One of the goals of this paper is to analyze a number of approaches to phrase weight measurement. We deal only with weight measurement based on the evaluation of in-phrase words. We use the following notation. A phrase with n words is denoted as $(w_1, w_2, \ldots, w_n)$, where each $w_i$ is a single word. The phrase weight is denoted as $weight(w_1, w_2, \ldots, w_n)$ and the weight of a word as $weight(w)$. We measure weights of phrases as:
1. Average weight among in-phrase words:
$$weight(w_1, w_2, \ldots, w_n) = \frac{1}{n} \sum_{i=1}^{n} weight(w_i) \qquad (1)$$
2. Geometric mean of word weights in the phrase:
$$weight(w_1, w_2, \ldots, w_n) = \left( \prod_{i=1}^{n} weight(w_i) \right)^{1/n} \qquad (2)$$
3. Degree of relationship between the words in a phrase and the main word of the phrase.
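The first two phrase-weighting schemes, the average and the geometric mean of in-phrase word weights, can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; the function names are hypothetical, and the word weights are assumed to be non-negative numbers already computed by one of the word measures.

```python
import math
from typing import Sequence


def average_weight(word_weights: Sequence[float]) -> float:
    """Phrase weight as the arithmetic mean of its word weights."""
    if not word_weights:
        raise ValueError("a phrase must contain at least one word")
    return sum(word_weights) / len(word_weights)


def geometric_mean_weight(word_weights: Sequence[float]) -> float:
    """Phrase weight as the geometric mean of its word weights."""
    if not word_weights:
        raise ValueError("a phrase must contain at least one word")
    if any(w < 0 for w in word_weights):
        # Assumption: weights are non-negative, otherwise the root is undefined.
        raise ValueError("word weights must be non-negative")
    return math.prod(word_weights) ** (1.0 / len(word_weights))
```

For a one-word phrase both functions return the weight of that word, so the two schemes can differ only on phrases of two or more words.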
For case 3 (degree of relationship between the words in a phrase and the main word of the phrase), the six measures described below were used to determine the main word of a phrase. A word w is taken as $w_{main}$ for the phrase if its weight is the best among the weights of the words in the phrase. Once the main word has been chosen, the relationship value between every other word w in the phrase and the main word $w_{main}$ is calculated. In our research two measures were used to calculate the relation between words:
• Pointwise mutual information, calculated between the main word $w_{main}$ and every other word w in the phrase:
$$MI(w_{main}, w) = \log \frac{p(w_{main}, w)}{p(w_{main})\, p(w)} \qquad (3)$$
where $p(w_{main}, w)$ is the probability of meeting the word $w_{main}$ next to the other in-phrase word w (within a window of 3), and $p(w_{main})$ and $p(w)$ are the probabilities of meeting the words $w_{main}$ and w. The phrase weight is defined as the average of the obtained values:
$$weight(w_1, w_2, \ldots, w_n) = \frac{1}{n-1} \sum_{i:\, w_i \neq w_{main}} MI(w_{main}, w_i) \qquad (4)$$
• The relationship between the word $w_{main}$ and the word w:
$$rel(w_{main} \mid w) = p(w_{main} \mid w) \qquad (5)$$
$$p(w_1 \mid w_2) = \frac{\sum_{d \in D_{w_2}} tf_d(d, w_1)}{\sum_{d \in D_{w_2}} \sum_{w'} tf_d(d, w')} \qquad (6)$$
where $D_{w_2}$ is the set of all documents that contain $w_2$, $tf_d(d, w_1)$ is the number of occurrences of the word $w_1$ in the document d, and $w'$ ranges over the words in $D_{w_2}$. The phrase weight is defined as the average of the obtained values, as in (4), but with $rel(w_{main} \mid w)$ used instead of $MI(w_{main}, w)$.
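The main-word scheme with pointwise mutual information can be illustrated with a short sketch. It assumes that word weights and the probabilities p(w) and p(w_main, w) have already been estimated from the collection and are passed in as dictionaries; the function names, the dictionary layout, the use of the natural logarithm and the small probability floor for unseen pairs are assumptions of this sketch and are not taken from the paper. The main word is taken here as the word with the highest weight, which matches all word measures except word context.

```python
import math
from typing import FrozenSet, Mapping, Sequence


def pmi(p_joint: float, p_a: float, p_b: float) -> float:
    """Pointwise mutual information log(p(a, b) / (p(a) * p(b)))."""
    return math.log(p_joint / (p_a * p_b))


def pmi_phrase_weight(
    phrase: Sequence[str],
    word_weight: Mapping[str, float],
    p_word: Mapping[str, float],
    p_pair: Mapping[FrozenSet[str], float],
    floor: float = 1e-12,
) -> float:
    """Average PMI between the main word and every other word of the phrase."""
    if len(phrase) < 2:
        raise ValueError("the main-word scheme needs at least two words")
    # Main word: position of the word with the highest weight (assumption:
    # higher is better; for word context the lowest weight would be chosen).
    main_idx = max(range(len(phrase)), key=lambda i: word_weight[phrase[i]])
    main = phrase[main_idx]
    scores = []
    for i, word in enumerate(phrase):
        if i == main_idx:
            continue
        # Assumption: unseen pairs get a tiny probability instead of zero.
        joint = max(p_pair.get(frozenset((main, word)), 0.0), floor)
        scores.append(pmi(joint, p_word[main], p_word[word]))
    return sum(scores) / len(scores)
```

The function refuses one-word phrases, which mirrors the observation that this kind of weight needs at least two words to be calculated.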
To evaluate the weight of a word $weight(w)$ in a text d for (1) and (2), or to select the main word of a phrase, we use the following six values:
• the number of documents in which the word w occurs at least once (df);
• the frequency of the word w within the collection (tf);
• the frequency of the word w within the document d ($tf_d$);
• the ratio tf/df;
• tf-idf [17]:
$$tfidf(w, d) = tf_d \cdot \log \frac{N}{df} \qquad (7)$$
where N is the number of documents in the collection;
• the evaluation of the narrowness of the context of the word w (word context).
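The frequency-based word measures can be computed from a tokenized collection as sketched below. The sketch assumes documents are given as lists of tokens; the function names are hypothetical, and the tf-idf variant shown (within-document frequency times the natural logarithm of N/df, with no smoothing) is an assumption, since other variants of tf-idf exist.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def collection_stats(docs: List[List[str]]) -> Tuple[Counter, Counter]:
    """Return (df, tf): document frequency and collection frequency per word."""
    df: Counter = Counter()
    tf: Counter = Counter()
    for doc in docs:
        tf.update(doc)       # every occurrence counts for tf
        df.update(set(doc))  # each document counts once for df
    return df, tf


def word_measures(word: str, doc: List[str], df: Counter, tf: Counter,
                  n_docs: int) -> Dict[str, float]:
    """Frequency-based weights of `word` in `doc` (word must occur in the collection)."""
    tf_d = doc.count(word)
    return {
        "df": float(df[word]),
        "tf": float(tf[word]),
        "tf_d": float(tf_d),
        "tf/df": tf[word] / df[word],
        "tf-idf": tf_d * math.log(n_docs / df[word]),
    }
```

A word that occurs in every document gets a tf-idf of zero under this variant, whatever its frequency in the document.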
The concept of narrow context is borrowed from [18]. Words with narrow context are domain-specific. For example, "motherboard" is a word with narrow context. If a document contains this word, we can conclude with high probability that the document is about computer hardware. The word "computer" has wide context. If a document contains such a word, it is difficult to determine the content of the document: it can be about hardware, art, health, etc. with almost the same probabilities. Simplifying the method of detecting words with narrow context [18], we define for each word w its context $p(Y \mid w)$ by using $p(y \mid w)$ (6), where y belongs to the vocabulary of the collection. Entropy H is then calculated for every obtained context. Based on the assumption that the context of a word with narrow context has low entropy, we use the entropy of a word's context to evaluate words:
$$H(Y \mid w) = -\sum_{y} p(y \mid w) \log p(y \mid w) \qquad (8)$$
The best word weight in a phrase is the highest weight for df, tf, $tf_d$, tf/df and tf-idf, and the lowest weight for word context.
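The word context measure can be sketched in a few lines. The sketch assumes that the context distribution of a word is estimated from the pooled word counts of all documents containing that word, which is our reading of the simplified method; the function name and the use of the natural logarithm are assumptions of this illustration.

```python
import math
from collections import Counter
from typing import List


def context_entropy(word: str, docs: List[List[str]]) -> float:
    """Entropy of the word distribution over all documents that contain `word`."""
    context: Counter = Counter()
    for doc in docs:
        if word in doc:
            context.update(doc)  # pool the counts of every word in the document
    total = sum(context.values())
    if total == 0:
        raise ValueError(f"word {word!r} does not occur in the collection")
    # Lower entropy means a narrower, more domain-specific context.
    return -sum((c / total) * math.log(c / total) for c in context.values())
```

Because lower entropy indicates a narrower context, a ranking that uses this measure has to treat the lowest value as the best one, unlike the frequency-based measures.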

Data Preprocessing and Candidate Phrase Extraction
This paper focuses on the problem of ranking candidate phrases. We therefore used a basic algorithm for candidate extraction, described as follows. POS-tagged text is fed to the algorithm (we used the Stanford POS tagging tool [19]). Sequences of nouns and adjectives are extracted from the text. Stop words, punctuation and all parts of speech other than nouns and adjectives are used at this stage as delimiters. The length of the obtained sequences is limited to 5 words. All extracted sequences are considered candidate phrases.
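The extraction procedure described above can be sketched as follows. The sketch assumes the input is a list of (token, tag) pairs with Penn Treebank tags, where noun tags start with "NN" and adjective tags with "JJ". The function name, the lowercasing of tokens and the decision to discard (rather than split or truncate) sequences longer than five words are assumptions of this illustration, since the text only states that the length is limited to 5.

```python
from typing import FrozenSet, Iterable, List, Tuple


def extract_candidates(
    tagged: Iterable[Tuple[str, str]],
    stop_words: FrozenSet[str] = frozenset(),
    max_len: int = 5,
) -> List[Tuple[str, ...]]:
    """Collect maximal runs of nouns and adjectives as candidate phrases."""
    candidates: List[Tuple[str, ...]] = []
    current: List[str] = []

    def flush() -> None:
        # Assumption: runs longer than max_len are dropped, not truncated.
        if 0 < len(current) <= max_len:
            candidates.append(tuple(current))
        current.clear()

    for token, tag in tagged:
        word = token.lower()
        if tag.startswith(("NN", "JJ")) and word not in stop_words:
            current.append(word)
        else:
            flush()  # stop words, punctuation and other tags act as delimiters
    flush()
    return candidates
```

With this procedure a one-word candidate arises exactly when a single noun or adjective is surrounded by delimiters on both sides.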

Dataset
We used the Inspec dataset for our research, because this paper focuses on keyphrase extraction from abstracts of scientific articles. Inspec contains abstracts of scientific articles in English (from the disciplines "Computers and Control" and "Information technology"). The Inspec collection contains three sub-collections: a training dataset (1000 documents), an evaluation dataset (500 documents) and a testing dataset (500 documents). Each text has a gold standard, which contains phrases extracted by an expert. The gold standard includes two types of annotations: the contr set and the uncontr set. As in most other papers [9,12,14,20,21], the test dataset and the uncontr gold standard set are used in this paper. A detailed description of the collection is given in [14].

Evaluation
To measure the quality of the extracted keyphrases we use the traditional approach based on F-score, which combines Precision and Recall [17] and is one of the most popular quality measures in the keyphrase extraction domain:
$$Precision = \frac{\text{number of correctly extracted keyphrases}}{G}, \qquad Recall = \frac{\text{number of correctly extracted keyphrases}}{C}, \qquad F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
where G is the number of keyphrases automatically extracted from all documents and C is the number of all keyphrases extracted by the expert (the number of phrases in the gold standard). When the number of extracted keyphrases is smaller than the number in the gold standard, Precision is used instead of F-score, since it depends only on the number of correct phrases among the extracted keyphrases; F-score would otherwise decline as the number of extracted phrases decreases, because Recall declines as well. When the number of extracted keyphrases is the same as in the gold standard, F-score and Precision are identical because G equals C.
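The evaluation can be sketched as a micro-averaged computation over all documents, following the definitions of G and C given above. The function name is hypothetical, and matching extracted phrases to gold phrases by exact string equality is an assumption of this sketch; the text does not specify whether any normalization such as stemming is applied before matching.

```python
from typing import List, Set, Tuple


def evaluate(extracted: List[Set[str]], gold: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_score) pooled over all documents."""
    if len(extracted) != len(gold):
        raise ValueError("one set of extracted phrases is needed per gold set")
    correct = sum(len(e & g) for e, g in zip(extracted, gold))
    n_extracted = sum(len(e) for e in extracted)  # G in the text
    n_gold = sum(len(g) for g in gold)            # C in the text
    precision = correct / n_extracted if n_extracted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

When the number of extracted phrases equals the number of gold phrases, the three returned values coincide, which matches the remark that F-score and Precision are identical in that case.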

Experiment
At the first stage, candidate phrases were extracted for each text using the approach described in Section 3.2. For each phrase in a document its weight is calculated. The weight is calculated using the strategies described in 3.1: as the average weight of all words in the phrase (1), as the geometric mean of the weights of all words in the phrase (2), or as the average weight of the relation between the main word and the other words in the phrase (3)(4)(5)(6). One of the six measures presented in 3.1 was used to evaluate a word's weight: tf-idf (7), df, tf, $tf_d$, tf/df, word context (8).
It is important to note that if a phrase contains only one word, then (3) and (6) are not usable, because they need at least two words to be calculated. For these cases one-word phrases were excluded. To compare this weight evaluation approach with the other approaches, we conducted an experiment in which one-word phrases were filtered for each of the approaches mentioned above. This experiment has shown interesting results, which are presented further. After weight evaluation, phrases were ranked according to their weights and the k best were selected as keyphrases. We examined a number of cases to determine k:
• k was taken equal to the number of phrases in the gold standard [12];
• k equals 7;
• all candidate phrases are selected as keyphrases (no ranking was performed).
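The selection step, with optional filtering of one-word phrases, can be sketched as follows. The function and parameter names are hypothetical; phrases are assumed to be tuples of words and the phrase weight is supplied as a callable. Passing `k=None` corresponds to the case in which all candidates are selected without ranking, and `lowest_is_best` covers measures such as word context where a lower value is better.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Phrase = Tuple[str, ...]


def select_keyphrases(
    candidates: Sequence[Phrase],
    weight: Callable[[Phrase], float],
    k: Optional[int] = None,
    drop_single_words: bool = False,
    lowest_is_best: bool = False,
) -> List[Phrase]:
    """Optionally filter one-word phrases, rank by weight and keep the k best."""
    pool = [p for p in candidates if not (drop_single_words and len(p) == 1)]
    if k is None:
        return pool  # no ranking: every remaining candidate is a keyphrase
    ranked = sorted(pool, key=weight, reverse=not lowest_is_best)
    return ranked[:k]
```

For example, `select_keyphrases(candidates, weights.get, k=7, drop_single_words=True)` would return the seven highest-weighted multi-word candidates, given a dictionary `weights` that maps each phrase to its weight.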

4 Experimental Results and Discussion

Experiment Results
Results of the keyphrase extraction experiment are presented in Tables 1 and 2, where the weight of a phrase was calculated as the average of the weights of the words contained in the phrase (1). Six approaches to calculating the word weight were tested (3.1), and the corresponding results are presented in columns. The number of phrases to select was defined as follows: the same number as in the gold standard, or 7 (shown in rows). Table 1 presents results when no phrases were filtered. Table 2 presents results when all one-word phrases were filtered. The experiments whose results are presented in Tables 3 and 4 differ from those in Tables 1 and 2 only in the phrase weight function: the geometric mean (2) was used. Table 5 presents results of experiments in which the phrase weight was calculated using the main word, chosen among the words of the phrase; pointwise mutual information (3) was then calculated for each pair consisting of the main word and one of the other words in the phrase. One-word phrases were filtered. The main word was selected as the word with the best weight in the phrase. The measures described in 3.1 were used to evaluate word weights: tf-idf (7), df, tf, $tf_d$, tf/df, word context (8). Table 6 shows results of a similar experiment in which the relationship of each word with the main word was calculated (5). Table 7 contains results for the case when the candidate phrases were not ranked and all of them were selected as keyphrases. Table 8 contains results for the case when keyphrases were selected randomly from the set of candidate phrases and the number of extracted keyphrases was equal to the number of keyphrases in the gold standard.
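The random baseline, in which keyphrases are drawn at random from the candidate set in the same number as in the gold standard, can be sketched as below. The function name and the fixed seed are assumptions of this illustration; the text does not state how the random selection was seeded or whether it was repeated and averaged.

```python
import random
from typing import List, Sequence, Tuple

Phrase = Tuple[str, ...]


def random_keyphrases(candidates: Sequence[Phrase], k: int, seed: int = 0) -> List[Phrase]:
    """Draw up to k distinct candidates uniformly at random, reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    pool = list(candidates)
    return rng.sample(pool, min(k, len(pool)))
```

Here k would be set per document to the number of phrases in that document's gold standard.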

Discussion
Results presented in Table 1 and Table 3 show that using tf (within-collection term frequency) and df (within-collection document frequency) to evaluate word weights decreases the quality of the extracted keyphrases even in comparison with arbitrary selection (Table 8). The results of the other measures do not differ much regardless of how the phrase weight is calculated; we discuss these measures below. Experiments show that the results in Tables 2, 4, 5 and 6 are very similar. Thus we can conclude that, once one-word phrases are filtered, all methods give nearly the same results, regardless of how words are weighted and regardless of the number of extracted keyphrases. A slightly better result is achieved when the keyphrase weight is calculated as a geometric mean and tf/df is used.
Another interesting observation is that filtering one-word phrases significantly increases the quality of the remaining keyphrases and improves on state-of-the-art results [9,12,14]. It is interesting that if we only filter out all one-word keyphrases, without performing any ranking at all, we get F-score = 0.40, the same result as with ranking. So it seems that ranking does not improve the quality of keyphrases.
In fact, experiments show that filtering one-word keyphrases has a significantly greater impact than phrase weighting based on the statistics mentioned above. We also made the assumption that all the ranking approaches mentioned above essentially select keyphrases randomly, and that this is why the results of the different approaches are very close. To test this, an additional experiment was conducted, whose goal was to show that the ratio between correct and incorrect keyphrases remains almost the same before and after ranking.

5 Additional Experiment

Experiment Description
The goal of the additional experiment is to show that all phrase-ranking approaches used to select keyphrases in this paper essentially select keyphrases randomly. The input to the experiment is the set of candidate phrases before ranking. For this set, the number of phrases of each length is known, as well as the number of correct and incorrect phrases among them. The ranking algorithm produces the output, which is the set of selected keyphrases together with the number of selected phrases of each length and information about how many of them are correct. The number of selected keyphrases is the same as in the gold standard. The goal is to evaluate the ratio between all phrases and correct phrases before and after the keyphrase selection step.
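The per-length bookkeeping of this experiment can be sketched as follows. The function name is hypothetical; phrases are assumed to be tuples of words and the gold standard a set of such tuples, matched exactly. Applying the function once to the candidate set and once to the selected keyphrases gives the ratios before and after ranking.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Phrase = Tuple[str, ...]


def ratio_by_length(phrases: Iterable[Phrase],
                    gold: Set[Phrase]) -> Dict[int, Tuple[int, int, float]]:
    """Map phrase length to (number of phrases, number correct, phrases per correct one)."""
    total: Counter = Counter()
    correct: Counter = Counter()
    for phrase in phrases:
        total[len(phrase)] += 1
        if phrase in gold:
            correct[len(phrase)] += 1
    return {
        n: (total[n], correct[n],
            total[n] / correct[n] if correct[n] else float("inf"))
        for n in sorted(total)
    }
```

A ratio that stays roughly constant for a given phrase length between the two calls indicates that ranking did not favour correct phrases of that length.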

Experiment Results and Discussion
Because the experiments in Section 4 give almost the same results for a number of measures, here we use only one of them, the $tf_d$ (within-document frequency) measure. Experimental results are given in Table 9. The first column shows the phrase length and whether one-word phrases were included in the experiment or filtered. The other columns show the number of candidate phrases, how many of them are correct, the ratio between the number of candidates and the number of correct ones among them, and the same information for the case when ranking is performed. For keyphrases 2-4 words long, the ratio of the number of phrases to the number of correct keyphrases lies in the range 2-3 (both before and after ranking), whereas for one-word phrases this ratio is close to 8 on the input data and close to 6 on the output data. This means that the set of one-word keyphrases contains many more incorrect keyphrases than correct ones. Note that one-word phrases make up about one third of all phrases in the input data. It thus becomes clear why filtering one-word phrases yields much better results. When we filter one-word phrases and arbitrarily select as many keyphrases as there are in the gold standard, F-score = 0.38, which is better than state-of-the-art results for Inspec obtained with complex ranking techniques [9][12][14]. Analysis of the experimental results in Table 9 shows that the ratio between all keyphrases and correct keyphrases is only slightly better after ranking than before ranking. Taking into account this fact and the results from Section 4 (where it was shown that, with one-word phrase filtering, the results of all methods are nearly the same), we can conclude that the results of all methods investigated in this paper (excluding tf and df) are quite close to the results of randomly picking phrases from the initial set. This result also shows that methods that weight phrases using information about phrase length should work well on the Inspec dataset (longer phrases usually receive more weight than short phrases, so one-word phrases are effectively filtered out). Recall that a one-word phrase consists of a single noun or adjective that is separated from other nouns and adjectives by punctuation, stop words or words of other parts of speech.

Conclusion
The results of this research show that the investigated approaches to phrase weighting (excluding tf and df) give almost equal results and improve only slightly on random selection from the candidate phrases. They differ mostly in how they rank one-word phrases. If one-word phrases are excluded, all methods give rather similar results. Exclusion of one-word candidate phrases increases extraction quality, because the ratio of all phrases to correct keyphrases is significantly higher among one-word phrases than among phrases of other lengths.
Experiments were based on the Inspec dataset, which is popular for the task of keyphrase extraction from scientific abstracts. The experiments show that for this collection good results are obtained by algorithms that filter one-word phrases, even if the other phrases are ranked randomly. This result should be taken into account when working with the Inspec collection and when further evaluating the approaches investigated in this paper.

Table 1. Results: keyphrase weight was calculated as the average of in-phrase word weights

Table 2. Results: keyphrase weight was calculated as the average of in-phrase word weights, when one-word phrases were filtered

Table 3. Results: keyphrase weight was calculated as the geometric mean of in-phrase word weights

Table 4. Results: keyphrase weight was calculated as the geometric mean of in-phrase word weights, when one-word phrases were filtered

Table 5. Results: the main word was selected, then pointwise mutual information was calculated between the main word and the other in-phrase words, and the average value was taken as the score of the phrase

Table 6. Results: the main word was selected, the word relationship (5) was calculated between the main word and the other in-phrase words, and the average value was taken as the score of the phrase

Table 7. Results: all candidate phrases were selected as keyphrases

Table 8. Results: keyphrases were selected randomly

Table 9. Results of additional experiment