Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Information Retrieval and Analysis for a Modern Organization

https://doi.org/10.15514/ISPRAS-2016-28(4)-1

Abstract

With the growing volume of and demand for data, a major concern for an organization is to discover what data it actually has, what that data contains, and how and by whom it is being used. The number and complexity of the disparate systems that handle this data grow every year, and unifying these systems becomes increasingly difficult. In this work we describe an intelligent search engine system designed specifically to tackle the problem of information retrieval and sharing in a large, multifaceted organization that already has many systems in place for each department. The search engine is an integral part of a joint Operational Data Platform (ODP) for data exploration and processing.

About the Author

Artyom Topchyan
Yerevan State University
Armenia



For citations:


Topchyan A. Information Retrieval and Analysis for a Modern Organization. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2016;28(4):7-28. https://doi.org/10.15514/ISPRAS-2016-28(4)-1



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)