Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Texterra: A Framework for Text Analysis

https://doi.org/10.15514/ISPRAS-2014-26(1)-18

Abstract

This paper presents a framework for fast text analytics developed in the Texterra project. Texterra provides a scalable text-processing solution based on novel methods that exploit knowledge extracted from the Web and from text documents. The paper describes the project in detail, its use cases, and the evaluation results for all developed tools.

About the Authors

Denis Turdakov
Institute for System Programming of RAS
Russian Federation


Nikita Astrakhantsev
Institute for System Programming of RAS
Russian Federation


Yaroslav Nedumov
Institute for System Programming of RAS
Russian Federation


Andrey Sysoev
Institute for System Programming of RAS
Russian Federation


Ivan Andrianov
Institute for System Programming of RAS
Russian Federation


Vladimir Mayorov
Institute for System Programming of RAS
Russian Federation


Denis Fedorenko
Institute for System Programming of RAS
Russian Federation


Anton Korshunov
Institute for System Programming of RAS
Russian Federation


Sergey Kuznetsov
Institute for System Programming of RAS
Russian Federation


For citations:


Turdakov D., Astrakhantsev N., Nedumov Ya., Sysoev A., Andrianov I., Mayorov V., Fedorenko D., Korshunov A., Kuznetsov S. Texterra: A Framework for Text Analysis. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2014;26(1):421-438. (In Russ.) https://doi.org/10.15514/ISPRAS-2014-26(1)-18



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)