Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Extracting Objects and Their Attributes from Tables in Text Documents

Abstract

Extracting information from tables is an important and fairly complex part of information retrieval. For the task of extracting objects from HTML tables, we introduce methods for determining table orientation and for processing aggregating objects (such as Total) and scattered headers (super row labels and subheaders).
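
To make two of the tasks named above concrete, the following is a minimal sketch of how table orientation might be guessed and how aggregating rows such as Total might be filtered out. It is not the method of the paper, which the abstract does not describe. It assumes that the table has already been parsed into a grid of strings, that the first row and first column hold labels, and that attribute values of the same kind share a type (number or text). The names guess_orientation, drop_aggregates and AGGREGATE_LABELS are hypothetical.

```python
from collections import Counter

# Assumed list of labels that mark aggregating rows; purely illustrative.
AGGREGATE_LABELS = {"total", "sum", "average", "overall"}


def cell_type(cell):
    """Classify a cell as 'empty', 'number' or 'text'."""
    text = cell.strip().replace(",", "").replace("%", "")
    if not text:
        return "empty"
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def homogeneity(lines):
    """Average share of the most common cell type within each line."""
    scores = []
    for line in lines:
        types = [cell_type(c) for c in line]
        if types:
            scores.append(Counter(types).most_common(1)[0][1] / len(types))
    return sum(scores) / len(scores) if scores else 0.0


def drop_aggregates(grid):
    """Remove rows whose first cell is an assumed aggregate label."""
    return [
        row for row in grid
        if not row or row[0].strip().lower() not in AGGREGATE_LABELS
    ]


def guess_orientation(grid):
    """Return 'rows' if each row looks like one object, else 'columns'.

    The first row and first column are skipped as presumed labels.
    Ragged rows are truncated to the shortest row by zip().
    """
    body_rows = [row[1:] for row in grid[1:]]
    body_cols = [list(col) for col in zip(*body_rows)]
    # Type-consistent columns suggest that attributes run down the columns,
    # so every row describes one object. Ties default to 'rows'.
    if homogeneity(body_cols) >= homogeneity(body_rows):
        return "rows"
    return "columns"


table = [
    ["Country", "Population", "Capital"],
    ["France", "68000000", "Paris"],
    ["Japan", "125000000", "Tokyo"],
    ["Total", "193000000", ""],
]
cleaned = drop_aggregates(table)
orientation = guess_orientation(cleaned)
```

Type consistency is only one possible signal. A table whose body is entirely numeric gives a tie, which this sketch resolves by defaulting to row-wise objects.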

About the Author

Nikita Astrakhantsev
ISP RAS, Moscow
Russian Federation



For citations:


Astrakhantsev N. Extracting Objects and Their Attributes from Tables in Text Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2011;21. (In Russ.)



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)