Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)
Trudy Instituta sistemnogo programmirovaniâ

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)

Texterra: A Framework for Text Analysis

Denis Turdakov, Nikita Astrakhantsev, Yaroslav Nedumov, Andrey Sysoev, Ivan Andrianov, Vladimir Mayorov, Denis Fedorenko, Anton Korshunov, Sergey Kuznetsov

https://doi.org/10.15514/ISPRAS-2014-26(1)-18

Full Text:

PDF (Rus)

Abstract

This paper presents a framework for fast text analytics developed in the Texterra project. Texterra provides a scalable text-processing solution based on novel methods that exploit knowledge extracted from the Web and from text documents. The paper describes the project in detail, along with its use cases and evaluation results for all of the developed tools.

Keywords

Text mining, natural language processing, Wikipedia

About the Authors

Denis Turdakov

Institute for System Programming of RAS
Russian Federation

Nikita Astrakhantsev

Institute for System Programming of RAS
Russian Federation

Yaroslav Nedumov

Institute for System Programming of RAS
Russian Federation

Andrey Sysoev

Institute for System Programming of RAS
Russian Federation

Ivan Andrianov

Institute for System Programming of RAS
Russian Federation

Vladimir Mayorov

Institute for System Programming of RAS
Russian Federation

Denis Fedorenko

Institute for System Programming of RAS
Russian Federation

Anton Korshunov

Institute for System Programming of RAS
Russian Federation

Sergey Kuznetsov

Institute for System Programming of RAS
Russian Federation

For citations:

Turdakov D., Astrakhantsev N., Nedumov Ya., Sysoev A., Andrianov I., Mayorov V., Fedorenko D., Korshunov A., Kuznetsov S. Texterra: A Framework for Text Analysis. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2014;26(1):421-438. (In Russ.) https://doi.org/10.15514/ISPRAS-2014-26(1)-18

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
