Extracting Objects and Their Attributes from Tables in Text Documents

Nikita Astrakhantsev

Abstract

Extracting information from tables is an important and rather complex part of information retrieval. For the task of extracting objects from HTML tables, we introduce methods for determining table orientation and for processing aggregating objects (such as Total rows) and scattered headers (super row labels, subheaders).
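
The abstract names the subtasks but does not describe how they are solved, so the following is only a minimal sketch of two of them under stated assumptions, not the method of the paper. It assumes a table is already parsed into a list of rows of strings, guesses orientation by checking whether cell types are more uniform along columns or along rows, and drops aggregating rows by matching the first cell against a small label list. The names `cell_type`, `line_consistency`, `guess_orientation`, `drop_aggregates`, and `AGGREGATE_LABELS` are all hypothetical.

```python
import re

# Hypothetical label list; a real system would need a richer, language-aware one.
AGGREGATE_LABELS = {"total", "sum", "average", "overall"}


def cell_type(cell):
    """Classify a cell as 'num', 'text', or 'empty'."""
    text = cell.strip()
    if not text:
        return "empty"
    if re.fullmatch(r"[-+]?\d[\d,]*(\.\d+)?%?", text):
        return "num"
    return "text"


def line_consistency(lines):
    """Mean share of the most frequent cell type within each line."""
    scores = []
    for line in lines:
        types = [cell_type(c) for c in line]
        scores.append(max(types.count(t) for t in set(types)) / len(types))
    return sum(scores) / len(scores)


def guess_orientation(table):
    """Guess whether objects are laid out in rows or in columns.

    Assumption: values of one attribute share a type, so the direction
    with more uniform types is the attribute direction. The first row
    and first column are skipped as presumed headers.
    """
    body_rows = [row[1:] for row in table[1:]]
    body_cols = [list(col) for col in zip(*body_rows)]
    if line_consistency(body_cols) >= line_consistency(body_rows):
        return "objects_in_rows"
    return "objects_in_columns"


def drop_aggregates(table):
    """Remove rows whose first cell is an aggregate label such as 'Total'.

    Only meaningful when objects are laid out in rows.
    """
    return [row for row in table
            if row[0].strip().lower() not in AGGREGATE_LABELS]


table = [
    ["Country", "Capital", "Population"],
    ["France", "Paris", "67,000,000"],
    ["Japan", "Tokyo", "125,000,000"],
    ["Total", "", "192,000,000"],
]
orientation = guess_orientation(table)
objects = drop_aggregates(table)[1:]  # skip the header row
```

In this example the columns are more type-uniform than the rows, so the sketch treats each row as one object and discards the Total row before extraction. Scattered headers are not handled here.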

Keywords

information extraction, information retrieval, natural language processing, table processing, table extraction, semi-structured information extraction, html, wiki markup

About the Author

Nikita Astrakhantsev

ISP RAS, Moscow
Russian Federation

For citations:

Astrakhantsev N. Extracting Objects and Their Attributes from Tables in Text Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2011;21. (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
