Break Down Resumes into Sections to Extract Data and Perform Text Analysis using Python

Arvind Kumar Sinha
Md. Amir Khusru Akhtar
Mohit Kumar

Abstract

The objective of AI-based resume screening is to automate the screening process, for which text extraction, keyword extraction, and named entity recognition are critical. This paper discusses segmenting resumes into sections in order to extract data and perform text analysis. The raw CV file is imported, and the resume data is cleaned to remove extra spaces, punctuation, and stop words. Regular expressions are used to extract names from resumes. We also use the spaCy library, which is considered among the most accurate natural language processing libraries and includes pretrained models for entity recognition, parsing, and tagging. The experiments use resume data sourced from Kaggle and from an external source (MTIS).
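The cleaning and name-extraction steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names `clean_resume_text` and `extract_name`, and the name pattern are all assumptions made for illustration, and only the Python standard library is used (the spaCy-based entity recognition is not shown).

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would
# use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for", "on", "at", "is"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_resume_text(raw: str) -> str:
    """Lowercase the text, remove punctuation and stop words, collapse spaces."""
    no_punct = raw.translate(_PUNCT_TO_SPACE)
    tokens = [tok for tok in no_punct.lower().split() if tok not in STOP_WORDS]
    return " ".join(tokens)


# Assumed heuristic: a name is a whole line made of two to four
# capitalized words, e.g. "Jane Doe".
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+(?:\s+[A-Z][a-z]+\.?){1,3}$")


def extract_name(raw: str):
    """Return the first line that looks like a personal name, or None."""
    for line in raw.splitlines():
        candidate = line.strip()
        if NAME_PATTERN.match(candidate):
            return candidate
    return None


if __name__ == "__main__":
    sample = (
        "Jane Doe\n"
        "Email: jane@example.com\n"
        "Experienced in Python, and the   analysis of data."
    )
    print(extract_name(sample))
    print(clean_resume_text(sample.splitlines()[2]))
```

A pattern this simple also matches capitalized section headings such as "Work Experience", so in practice it would need to be combined with position cues or a named entity recognizer.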

How to Cite
Sinha, A. K., Akhtar, M. A. K., & Kumar, M. (2023). Break Down Resumes into Sections to Extract Data and Perform Text Analysis using Python. International Journal on Recent and Innovation Trends in Computing and Communication, 11(6s), 391–400. https://doi.org/10.17762/ijritcc.v11i6s.6945