Automatic Keyphrase Extraction from Text: A Walk-through

Abstract: Keyphrases are multipurpose knowledge gems, which makes keyphrase extraction a very important document processing task. They constitute a concise summary of a document that is extremely useful both for human inspection and for machine consumption, in support of a number of tasks in the fields of Natural Language Processing, Machine Learning and Information Retrieval. In this regard, the aim of this tutorial is two-fold: (a) to provide an overview of the automatic keyphrase extraction task and (b) to familiarize participants with the keyphrase extraction process. Specifically, we will provide a well-structured review of the existing work, offer interesting insights on the different evaluation approaches, highlight open issues such as the need for evaluation approaches that take into account the semantic similarity of predicted and golden keyphrases, present a comparative experimental study of popular techniques, and familiarize the audience with the keyphrase extraction process via a demo presentation using a Jupyter notebook, which will be available to the audience during the practical part of the tutorial. We expect the tutorial to help newcomers and veterans alike navigate the large amount of prior art and grasp its evolution.

Brief resume of the presenters

Eirini Papagiannopoulou is a final-year Ph.D. student in the School of Informatics at the Aristotle University of Thessaloniki (AUTH) in Greece. Her research is in the field of text mining and natural language processing. She also holds a BSc in Informatics from the University of Ioannina, Greece (2011) and an MSc in Informatics from AUTH (2013). She has participated in international and private sector funded R&D projects and has published 2 international journal papers and 3 conference papers. Since September 2018 she has also been working as a Research Associate at CyRIC (Cyprus Research & Innovation Center Ltd) in the context of RISE (Marie Skłodowska-Curie action). Her research interests include Data Mining, Machine Learning, Natural Language Processing and Semantic Technologies.

Ricardo Campos is an assistant professor at the ICT Departmental Unit of the Polytechnic Institute of Tomar and a lecturer at the Porto Business School, where he teaches in the Business Intelligence and Analytics Post-Graduate Programme. He is an integrated researcher of LIAAD-INESC TEC, the Artificial Intelligence and Decision Support Lab of U. Porto, and a collaborator of Ci2.ipt, the Smart Cities Research Center of the Polytechnic of Tomar. He holds a PhD in Computer Science from the University of Porto (U. Porto). His PhD on temporal information retrieval led him to win the Fraunhofer Portugal Challenge 2013 and to be distinguished as an “outstanding” researcher by the INESC TEC research lab. He has over 10 years of research experience in Information Retrieval and Natural Language Processing. In 2018, he was awarded the best short paper award at ECIR’18 and the 1st prize of the Arquivo.pt Award for the project Conta-me Histórias. In 2019, he was awarded the Best Demo Presentation and the Recognized Reviewer Award at ECIR’19, and was nominated outstanding reviewer of the NAACL-HLT’19 conference. He is an editorial board member of the Information Processing & Management Journal (Elsevier), has co-chaired international conferences and workshops, and is a regular member of the scientific committee of several international conferences.

Grigorios Tsoumakas is an Assistant Professor of Machine Learning and Knowledge Discovery at the School of Informatics of the Aristotle University of Thessaloniki (AUTH) in Greece. He received a degree in Computer Science from AUTH in 1999, an MSc in Artificial Intelligence from the University of Edinburgh, United Kingdom, in 2000 and a PhD in computer science from AUTH in 2005. His research expertise focuses on supervised learning techniques (ensemble methods, multi-target prediction) and text mining (semantic indexing, sentiment analysis, topic modeling). He has published more than 100 research papers and according to Google Scholar he has more than 10,000 citations and an h-index of 42. Dr. Tsoumakas is a senior member of the ACM, an action editor of the Data Mining and Knowledge Discovery journal, and a member of the editorial board of the Frontiers of Computer Science journal. He is an advocate of applied research that matters and has worked as a machine learning and data mining developer, researcher and consultant in several national and private sector funded R&D projects.

Point-form outline of the tutorial

Part I: Theory

We give a systematic presentation of both unsupervised and supervised keyphrase extraction methods via comprehensive categorization schemes based on the main properties of these methods. In addition, we contribute a timeline of unsupervised and supervised methods to shed light on their evolution, as well as a presentation of the main types of features employed in supervised methods, along with a discussion of the issue of class imbalance.
We present different approaches that can be followed for evaluating keyphrase extraction methods, as well as different evaluation measures that exist, along with their popularity in the literature.
We provide a list of popular keyphrase extraction datasets, including their sources and properties, as well as a comprehensive catalogue of commercial APIs and free software related to keyphrase extraction.
We present a thorough empirical study, both quantitative and qualitative, of commercial APIs and state-of-the-art unsupervised methods, which allows us to gain a deeper understanding of how the results are affected by different evaluation approaches, evaluation measures and ground truth standards.
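To make the evaluation measures mentioned above concrete, here is a minimal sketch of exact-match precision, recall and F1 over the top-k predicted keyphrases. It is an illustration only, not the evaluation code used in the tutorial: the function name prf_at_k, the lowercasing and whitespace normalization, and the example phrases are all assumptions made for this sketch.

```python
def prf_at_k(predicted, gold, k=10):
    """Exact-match precision, recall and F1 over the first k unique predictions.

    `predicted` is a ranked list of keyphrases; `gold` is the list of golden
    keyphrases. Normalization here is only lowercasing and whitespace
    collapsing (an assumption of this sketch; no stemming is applied).
    """
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    # Collect the first k distinct predictions, preserving rank order.
    top_k = []
    for phrase in predicted:
        norm = normalize(phrase)
        if norm not in top_k:
            top_k.append(norm)
        if len(top_k) == k:
            break

    gold_set = {normalize(p) for p in gold}
    if not top_k or not gold_set:
        return 0.0, 0.0, 0.0

    # A prediction counts as correct only if it matches a golden phrase exactly.
    correct = sum(1 for p in top_k if p in gold_set)
    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example data, invented for illustration.
predicted = ["keyphrase extraction", "Neural Networks", "graph ranking", "evaluation"]
gold = ["keyphrase extraction", "neural network", "evaluation measures"]

precision, recall, f1 = prf_at_k(predicted, gold, k=3)
print(precision, recall, f1)
```

In this example only "keyphrase extraction" is counted as correct: "Neural Networks" does not match the golden "neural network" under exact matching. Near-misses like this are one reason why stricter or looser matching schemes can change reported scores.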

Presentation slides

Part II: Practice

In the practical part of the tutorial, we will use our package. Attendees will be offered two options:

Standalone: requires downloading the datasets and models, installing the package and its dependencies, and executing the notebook.
Docker: a Docker image is available here, with everything installed and ready to play with the notebook.

Acknowledgments

Ricardo Campos was financed by the ERDF – European Regional Development Fund through the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 and by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-03185).