Automatic Keyphrase Extraction from Text: A Walk-through

This is a tutorial by E. Papagiannopoulou, R. Campos and G. Tsoumakas, organized under the umbrella of ECAI 2020.


Abstract: Keyphrases are multipurpose knowledge gems, which makes keyphrase extraction a very important document processing task. They constitute a concise summary of a document that is extremely useful both for human inspection and machine consumption, in support of a number of tasks in the fields of Natural Language Processing, Machine Learning and Information Retrieval. In this regard, the aim of this tutorial is two-fold: (a) to provide an overview of the automatic keyphrase extraction task and (b) to familiarize participants with the keyphrase extraction process. Specifically, we will provide a well-structured review of the existing work, offer insights on the different evaluation approaches, highlight open issues such as the need for evaluation approaches that take into account the semantic similarity of predicted and gold keyphrases, present a comparative experimental study of popular techniques, and walk the audience through the keyphrase extraction process via a demo in a Jupyter notebook, which will be available to the audience during the practical part of the tutorial. We expect the tutorial to help newcomers and veterans alike navigate the large amount of prior art and grasp its evolution.

Brief resume of the presenters

Eirini Papagiannopoulou is a final-year Ph.D. student in the School of Informatics of the Aristotle University of Thessaloniki (AUTH) in Greece. Her research is in the field of text mining and natural language processing. She also holds a BSc in Informatics from the University of Ioannina, Greece (2011) and an MSc in Informatics from AUTH (2013). She has participated in international and private sector funded R&D projects and has published 2 international journal papers and 3 conference papers. Since September 2018 she has also been working as a Research Associate at CyRIC (Cyprus Research & Innovation Center Ltd) in the context of RISE (Marie Skłodowska-Curie action). Her research interests include Data Mining, Machine Learning, Natural Language Processing and Semantic Technologies.

Ricardo Campos is an assistant professor at the ICT Departmental Unit of the Polytechnic Institute of Tomar and a lecturer at the Porto Business School, where he teaches in the Business Intelligence and Analytics Post-Graduate Programme. He is an integrated researcher of LIAAD-INESC TEC, the Artificial Intelligence and Decision Support Lab of U. Porto, and a collaborator of Ci2.ipt, the Smart Cities Research Center of the Polytechnic of Tomar. He holds a PhD in Computer Science from the University of Porto (U. Porto). His PhD on temporal information retrieval led him to win the Fraunhofer Portugal Challenge 2013 and to be distinguished as an “outstanding” researcher by the INESC TEC research lab. He has over 10 years of research experience in Information Retrieval and Natural Language Processing. In 2018, he received the best short paper award at ECIR’18 and the 1st prize of the Arquivo.pt Award for the project Conta-me Histórias. In 2019, he received the Best Demo Presentation and the Recognized Reviewer Award at ECIR’19, and was named an outstanding reviewer of the NAACL-HLT’19 conference. He is an editorial board member of the Information Processing & Management Journal (Elsevier), has co-chaired international conferences and workshops, and is a regular member of the scientific committee of several international conferences.

Grigorios Tsoumakas is an Assistant Professor of Machine Learning and Knowledge Discovery at the School of Informatics of the Aristotle University of Thessaloniki (AUTH) in Greece. He received a degree in Computer Science from AUTH in 1999, an MSc in Artificial Intelligence from the University of Edinburgh, United Kingdom, in 2000 and a PhD in computer science from AUTH in 2005. His research expertise focuses on supervised learning techniques (ensemble methods, multi-target prediction) and text mining (semantic indexing, sentiment analysis, topic modeling). He has published more than 100 research papers and according to Google Scholar he has more than 10,000 citations and an h-index of 42. Dr. Tsoumakas is a senior member of the ACM, an action editor of the Data Mining and Knowledge Discovery journal, and a member of the editorial board of the Frontiers of Computer Science journal. He is an advocate of applied research that matters and has worked as a machine learning and data mining developer, researcher and consultant in several national and private sector funded R&D projects.

Point-form outline of the tutorial

Part I: Theory

Part II: Practice

In the practical part of the tutorial, we will provide Jupyter notebooks that demonstrate several algorithms and their results to attendees. We will consider various state-of-the-art datasets from which keywords may be extracted, the corresponding pre-computed models, and state-of-the-art evaluation metrics.
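As a rough illustration of the kind of evaluation metric covered in the practical part, the following sketch computes exact-match precision, recall and F1 over the top-k predicted keyphrases. It is not taken from the tutorial notebooks: the function names `normalize` and `prf_at_k`, the normalization (lowercasing and whitespace collapsing only, no stemming) and the toy data are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(phrase.lower().split())


def prf_at_k(predicted: List[str], gold: Iterable[str], k: int) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over the top-k ranked predictions.

    Hypothetical helper for illustration; real evaluations often also stem
    the phrases before matching.
    """
    gold_set = {normalize(g) for g in gold}

    # Keep the ranking order and skip duplicates that appear after normalization.
    top_k: List[str] = []
    for phrase in predicted:
        if len(top_k) >= k:
            break
        norm = normalize(phrase)
        if norm not in top_k:
            top_k.append(norm)

    correct = sum(1 for phrase in top_k if phrase in gold_set)
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy data made up for this example (not results from the tutorial).
predicted = ["keyphrase extraction", "Graph-based ranking", "word embeddings", "documents"]
gold = ["keyphrase extraction", "graph-based ranking", "semantic similarity"]

p, r, f1 = prf_at_k(predicted, gold, k=3)
print(f"P@3={p:.2f} R@3={r:.2f} F1@3={f1:.2f}")
```

Under exact matching, a prediction such as "similarity of meaning" would earn no credit against the gold phrase "semantic similarity", which is the limitation behind the open issue on semantically aware evaluation mentioned in the abstract.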

Further reading

  1. Bennani‐Smires, K., Musat, C., Hossmann, A., Baeriswyl, M., & Jaggi, M. (2018). Simple unsupervised keyphrase extraction using sentence embeddings. Paper presented at proceedings of the 22nd conference on computational natural language learning, Brussels, Belgium, October, 2018, 221–229. Brussels, Belgium: Association for Computational Linguistics. url: https://www.aclweb.org/anthology/K18-1022

  2. Boudin, F. (2016). Pke: An open source python‐based keyphrase extraction toolkit. Paper presented at COLING 2016, 26th international conference on computational linguistics, proceedings of the conference system demonstrations, Osaka, Japan, December 11–16, 2016, 69–73. url: http://aclweb.org/anthology/C/C16/C16-2015.pdf

  3. Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph‐based topic ranking for keyphrase extraction. Paper presented at proceedings of the 6th international joint conference on natural language processing, IJCNLP 2013, Nagoya, Japan, October 14-18, 2013, 543–551. url: http://aclweb.org/anthology/I/I13/I13-1062.pdf

  4. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A. M., Nunes, C., & Jatowt, A. (2018). A text feature based automatic keyword extraction method for single documents. Paper presented at advances in information retrieval—40th European conference on IR research, ECIR 2018, Grenoble, France, March 26–29, 2018, proceedings, 684–691. url: https://doi.org/10.1007/978-3-319-76941-7_63

  5. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A. M., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257–289. url: https://doi.org/10.1016/j.ins.2019.09.013

  6. Florescu, C., & Caragea, C. (2017). PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. Paper presented at proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, 2017, Volume 1: Long Papers, 1105–1115. url: https://doi.org/10.18653/v1/P17-1102

  7. Gollapalli, S. D., Li, X., & Yang, P. (2017). Incorporating expert knowledge into keyphrase extraction. Paper presented at proceedings of the 31st AAAI conference on artificial intelligence, San Francisco, California, USA, February 4–9, 2017, 3180–3187. url: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14628

  8. Hasan, K. S., & Ng, V. (2010). Conundrums in unsupervised keyphrase extraction: Making sense of the state‐of‐the‐art. Paper presented at proceedings of the 23rd international conference on computational linguistics, COLING 2010, Beijing, China, August 23–27, 2010, Posters Volume, 365–373. url: http://aclweb.org/anthology/C/C10/C10-2042.pdf

  9. Hasan, K. S., & Ng, V. (2014). Automatic keyphrase extraction: A survey of the state of the art. Paper presented at proceedings of the 52nd annual meeting of the association for computational linguistics, ACL 2014, Baltimore, MD, USA, June 22–27, 2014, Volume 1: Long Papers, 1262–1273. url: http://aclweb.org/anthology/P/P14/P14-1119.pdf

  10. Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. Paper presented at proceedings of the 2003 conference on empirical methods in natural language processing, EMNLP 2003, Stroudsburg, PA, USA, 2003, 216–223. Association for computational linguistics. url: https://doi.org/10.3115/1119355.1119383

  11. Medelyan, O., Frank, E., & Witten, I. H. (2009). Human‐competitive tagging using automatic keyphrase extraction. Paper presented at proceedings of the 2009 conference on empirical methods in natural language processing, EMNLP 2009, Singapore, August 6–7, 2009, A meeting of SIGDAT, a special interest group of the ACL, 1318–1327. url: http://www.aclweb.org/anthology/D09-1137

  12. Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., & Chi, Y. (2017). Deep keyphrase generation. Paper presented at proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, 2017, Volume 1: Long Papers, 582–592. url: https://doi.org/10.18653/v1/P17-1054

  13. Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into text. Paper presented at proceedings of the 2004 conference on empirical methods in natural language processing, EMNLP 2004, Barcelona, Spain, July 25–26, 2004, A meeting of SIGDAT, a special interest group of the ACL, held in conjunction with ACL 2004, 404–411. url: http://www.aclweb.org/anthology/W04-3252

  14. Papagiannopoulou, E., & Tsoumakas, G. (2018). Local word vectors guiding keyphrase extraction. Information Processing & Management, 54, 888–902. url: https://doi.org/10.1016/j.ipm.2018.06.004

  15. Papagiannopoulou, E., & Tsoumakas, G. (2020). A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(2), e1339. url: https://onlinelibrary.wiley.com/doi/10.1002/widm.1339

  16. Sterckx, L., Caragea, C., Demeester, T., & Develder, C. (2016). Supervised keyphrase extraction as positive unlabeled learning. Paper presented at proceedings of the 2016 conference on empirical methods in natural language processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, 1924–1929. url: http://aclweb.org/anthology/D/D16/D16-1198.pdf

  17. Sterckx, L., Demeester, T., Deleu, J., & Develder, C. (2015). Topical word importance for fast keyphrase extraction. Paper presented at proceedings of the 24th international conference on World Wide Web companion, WWW 2015, Florence, Italy, May 18–22, 2015—Companion Volume, 121–122. url: http://doi.acm.org/10.1145/2740908.2742730

  18. Wan, X., & Xiao, J. (2008). Single document keyphrase extraction using neighborhood knowledge. Paper presented at proceedings of the 23rd AAAI conference on artificial intelligence, AAAI 2008, Chicago, Illinois, USA, July 13–17, 2008, 855–860. url: http://www.aaai.org/Library/AAAI/2008/aaai08-136.php

  19. Wang, R., Liu, W., & McDonald, C. (2015). Using word embeddings to enhance keyword identification for scientific publications. Paper presented at proceedings of the databases theory and applications—26th Australasian database conference, ADC 2015, Melbourne, VIC, Australia, June 4–7, 2015, 257–268. url: https://doi.org/10.1007/978-3-319-19548-3_21

  20. Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill‐Manning, C. G. (1999). KEA: Practical automatic keyphrase extraction. Paper presented at proceedings of the 4th ACM conference on digital libraries, Berkeley, CA, USA, August 11–14, 1999, 254–255. url: http://doi.acm.org/10.1145/313238.313437

Acknowledgments

Ricardo Campos was financed by the ERDF – European Regional Development Fund through the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 and by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-03185).