Preview

Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Advanced search

A category-driven approach to deriving domain specific subsets of Wikipedia

Abstract

While many researchers attempt to build up different kinds of ontologies by means of Wikipedia, the possibility of deriving high-quality domain specific subset of Wikipedia using its own category structure still remains undervalued. We prove the necessity of such processing in this paper and also propose an appropriate technique. As a result, the size of knowledge base for our text processing framework has been reduced by more than order, while the precision of disambiguating musical metadata (ID3 tags) has decreased from 98% to 64%.

About the Authors

Anton V. Korshunov
ISP RAS, Moscow
Russian Federation


Denis Yu. Turdakov
ISP RAS, Moscow
Russian Federation


Jinguk Jeong
Convergence Solution Team, DMC R&D Center, Samsung Electronics Co., Ltd.
Korea, Republic of


Minho Lee
Convergence Solution Team, DMC R&D Center, Samsung Electronics Co., Ltd.
Korea, Republic of


Changsung Moon
Convergence Solution Team, DMC R&D Center, Samsung Electronics Co., Ltd.
Korea, Democratic People's Republic of


References

1. List of Wikipedias - Meta. http://meta.wikimedia.org/wiki/List_of_Wikipedias

2. M. Shirakawa, K. Nakayama, T. Hara, S. Nishio. Concept Vector Extraction from Wikipedia Category Network. In Proceedings of 3rd International Conference on Ubiquitous Information Management and Communication (ICUIMC 2009), pp. 71-79, 2009.

3. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives. Dbpedia: A nucleus for a web of open data. In ISWC, volume 4825 of LNCS, pages 722–735. Springer, 2007.

4. Simone P. Ponzetto, Michael Strube. Deriving a large scale taxonomy from Wikipedia. In AAAI'07: Proceedings of the 22nd national conference on Artificial intelligence, pp. 1440-1445, 2007.

5. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. YAGO: A Large Ontology from Wikipedia and WordNet. In Elsevier Journal of Web Semantics, Vol. 6, No. 3, pp. 203-217, 2008.

6. Unified Medical Language System (UMLS) - Home. http://www.nlm.nih.gov/research/umls/

7. P. Buitelaar, P. Cimiano, B. Magnini (Eds.). Ontology Learning from Text: Methods, Evaluation and Applications. In Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.

8. A. Gregorowicz, M. A. Kramer. Mining a Large-Scale Term-Concept Network from Wikipedia. Technical Report #06-1028, The MITRE Corp., Oct. 2006.

9. Cäcilia Zirn, Vivi Nastase, Michael Strube. Distinguishing between instances and classes in the Wikipedia taxonomy. In Proc. of ESWC-08, pages 376-387, 2008.

10. Gaoying Cui, Qin Lu, Wenjie Li, Yi-Rong Chen. Mining Concepts from Wikipedia for Ontology Construction. In Proceedings of Web Intelligence/IAT Workshops, pp.287-290, 2009.

11. J. Hoffart, F. Suchanek, K. Berberich, G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Research Report MPI-I-2010-5-007, Max-Planck-Institut für Informatik, November 2010.

12. A. Budanitsky, G. Hirst. Evaluating WordNet-based measures of semantic distance. In Computational Linguistics, 32(1), pp. 13-47, March 2006.

13. D. Turdakov, P. Velikhov. Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation. In Proc. of SYRCoDIS, 2008.

14. T. Zesch, I. Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT), 2007.

15. S. Chernov, T. Iofciu, W. Nejdl, X. Zhou. Extracting Semantic Relationships between Wikipedia Categories. In Proceedings of the First International Workshop on Semantic Wikis - From Wiki To Semantics, June 2006.

16. M. Strube, S. P. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st national conference on Artificial intelligence (AAAI'06), pp. 1419-1424, 2006.

17. Z. Syed, T. Finin, and A. Joshi. Wikipedia as an Ontology for Describing Documents. In Proceedings of the Second International Conference on Weblogs and Social Media, 2008.

18. G. Y. Cui, Q. Lu, W. J. Li, Y. R. Chen. Corpus Exploitation from Wikipedia for Ontology Construction. In LREC 2008, Marrakech, pp. 2125-2132, 2008.

19. Wikipedia:Overcategorization - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Overcategorization

20. Wikipedia:Categorization - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Categorization

21. Catgraph. http://toolserver.org/~dapete/catgraph/

22. Wikipedia:WikiProject Categories/uncategorized - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Categories/uncategorized

23. Wikipedia:Database reports/Uncategorized categories - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Database_reports/Uncategorized_categories

24. Category:Better category needed - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Category:Better_category_needed

25. J. Soto. Wikipedia: A Quantitative Analysis. PhD thesis, 2009.

26. T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. Section 22.3: Depth-first search, pp. 540–549.

27. L.-E. Thorelli. An algorithm for computing all paths in a graph. In BIT 6, 347—349, 1966.

28. M. Migliore , V. Martorana , F. Sciortino. An algorithm to find all paths between two nodes in a graph. In Journal of Computational Physics, v.87 n.1, pp.231-236, March 1990.

29. R. Simoes. APAC: An exact algorithm for retrieving cycles and paths in all kinds of graphs. In Tékhne, no.12, p.39-55, 2009.

30. JUNG - Java Universal Network/Graph Framework. http://jung.sourceforge.net/

31. JGraphT - a free Java graph library. http://www.jgrapht.org/

32. neo4j open source nosql graph database. http://neo4j.org/

33. WebGraph. http://webgraph.dsi.unimi.it/

34. Wikipedia:Academic studies of Wikipedia - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Wikipedia:Academic_studies_of_Wikipedia

35. Academic studies about Wikipedia - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Academic_studies_about_Wikipedia#Natural_language_processing


Review

For citations:


Korshunov A.V., Turdakov D.Yu., Jeong J., Lee M., Moon Ch. A category-driven approach to deriving domain specific subsets of Wikipedia. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2011;21. (In Russ.)



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)