Compute the Match Percentage of a Candidate for a Job role

Given a Job description we need to calculate the Match percentage of all the submitted Resumes.

Sanjay Chouhan
18 min read · Feb 19, 2022

This will be an end-to-end machine learning project. That means we will go from raw data to deployment of the best model. This blog is a summary of the whole process; you can view the GitHub repo for more detailed explanations.

Table of Contents

  1. Literature Survey & Data Acquisition
    1.1. Problem Definition
    1.2. Data Set Description
    1.3. Key Performance Indicator (KPI) Metric
    1.4. Real-world Challenges and Constraints
    1.5. Literature Review
  2. EDA and Feature Extraction
    2.1. Basic EDA (Exploratory Data Analysis)
    2.2. Data Cleaning
    2.3. Feature Extraction
    2.4. Feature Encoding
  3. Modeling and Error Analysis
    3.1. Models with BoW Features
    3.2. Models with Average Word2Vec Features
    3.3. Stacking Ensemble (Stack of Best Models)
  4. Advanced Modeling and Feature Engineering
    4.1. Feed Forward Neural Network (FFN Network) with BoW and Average W2V Features
    4.2. BERT with Feed Forward Neural Network
  5. Deployment and Productionization
    5.1. The Model
    5.2. Optimisation
    5.3. Architecture Diagram
    5.4. Flask
    5.5. Gunicorn
    5.6. Amazon EC2
    5.7. Scalability and Latency
    5.8. Demo
  6. References

1. LITERATURE SURVEY & DATA ACQUISITION

1.1. PROBLEM DEFINITION

The main problem a company faces while hiring new candidates is selecting suitable candidates, based on their resumes and the job description, for the first round of the interview process. It’s like finding a needle in a haystack, especially for big companies like MAANG or other popular MNCs.

Companies need a sufficient number of hiring managers who can go through all the resumes and select the most qualified candidates. More hiring managers mean higher costs. And the two ways a company can increase profit are either by decreasing costs or by increasing sales.

In this project, we will be calculating the match percentage of the resumes for a job post. Based on the match percentage, a company can invite the top n candidates for the first round of the interview, or it can choose the candidates with a match percentage greater than a certain threshold.

There are already Applicant Tracking Systems (ATS) in the industry for this task. But the machine learning field is changing rapidly, and we hope to improve on the performance of ATS systems with the help of new technologies in NLP.

1.2. DATA SET DESCRIPTION

To tackle this problem I will be using a dataset titled “A Perfect Fit” [1], which is available on Kaggle. It was published by Mukund in 2021 under the CC0: Public Domain license. The dataset originally belonged to HackerEarth’s monthly machine learning challenge.

This dataset has only 1 job description/job role and 90 resumes that have output values, i.e., the match percentage. The job description and the resumes are PDF files. The resumes come in different formats/templates, which is a nice thing because in the real world we see resumes of different styles.

There is also a CSV file with two columns, “CandidateID” and “Match Percentage”. The “CandidateID” is the same as the file name of the resume, and the “Match Percentage” is a numeric value between 0 and 100.

1.3. KEY PERFORMANCE INDICATOR (KPI) METRIC

There are mainly two types of metrics for machine learning use cases.
One is the business metric, which is mainly used by the business folks to determine how much profit the company will make.
The other type is the performance metric, which is used during model training and evaluation.

Generally we track multiple performance metrics to compare the results. The Key Performance Indicator (KPI) is the performance metric that is used for training the model.

Different types of tasks such as classification and regression have different performance metrics. Since we are predicting numeric values, this problem is a regression task. Regression models have various interesting performance metrics such as mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), coefficient of determination (R²), Huber loss, etc. Each of these performance metrics has its own pros and cons.

I decided to use MSE as the KPI because we are okay with small errors but want to avoid large errors. MSE is a little difficult to interpret, so we will also track MAE and R².

The formula for MSE: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
The formula for MAE: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
The formula for R² (coefficient of determination): $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
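For reference, all three metrics are available in scikit-learn. A minimal sketch (the values below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative true and predicted match percentages
y_true = np.array([72.0, 15.5, 88.0, 40.0])
y_pred = np.array([70.0, 20.0, 85.0, 43.0])

mse = mean_squared_error(y_true, y_pred)   # KPI: penalises large errors quadratically
mae = mean_absolute_error(y_true, y_pred)  # easier to interpret, in percentage points
r2 = r2_score(y_true, y_pred)              # fraction of variance explained

print(f"MSE={mse:.2f}, MAE={mae:.2f}, R2={r2:.3f}")
```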

1.4. REAL-WORLD CHALLENGES AND CONSTRAINTS

In the real world, we don’t have any strict latency requirements. We can calculate the match percentage between the resumes and the job description every night, every hour, or once the submission deadline has passed.

Similarly, scalability is not a big concern here because there is no strict latency requirement. Interpretability of the model is not a must but a good-to-have feature. For interpretability, we could extract the keywords from the resume, or use LIME or SHAP for explanations.

The most concerning challenge we could face with the model is that we could miss out on some good candidates if the model outputs a very small value for a fit candidate, or end up with many unfit candidates if the model outputs a high value for an unfit candidate.
So basically we need to avoid large errors. Hence I have used MSE as the key performance indicator.

1.5. LITERATURE REVIEW

The first cut and the simplest solution that we could think of was using one-hot encoding and then some similarity based metric like cosine similarity. We could use this as a baseline model and build on top of it.

We could also try some simple models such as linear regression with BoW, TF-IDF, or Word2Vec representations, or advanced machine learning techniques such as XGBoost regression, random forest regression, etc.

Then we could also try advanced deep-learning-based techniques such as BERT to get embeddings of the resumes and job descriptions, and train a few simple MLPs on those embeddings.

Most of the blog posts and research papers have used an approach similar to our first-cut solution: they do preprocessing steps like tokenization, stop word removal, lemmatization, etc., then use text featurization techniques like BoW and TF-IDF, and finally use cosine similarity to sort the candidates based on relevance.

2. EDA AND FEATURE EXTRACTION

2.1. BASIC EDA (EXPLORATORY DATA ANALYSIS)

In basic EDA, we explore only the given data, i.e., we check its high-level stats.

2.1.1. Job Description

  • The job description PDF is 2 pages long.
  • It mentions the required work experience, educational qualifications, and the must-have and nice-to-have skills.

2.1.2. Resumes

  • We have 90 data points.
  • All the resumes are in pdf format.
  • We have the resumes and their corresponding match percentage.
  • There is no missing or duplicate data.
  • There are no irregularities or outliers in the match percentages.
  • The match percentage has a bimodal distribution.
  • There is no resume with a match percentage between 15 and 35.

2.2. DATA CLEANING

In the real world, we do not find data in a clean format, so we perform various data cleaning operations to make sure we have good quality data. If we do not have good quality data, then whatever model we build will perform poorly. The cleaning steps are listed below, followed by a small illustrative sketch.

  • I have converted everything to lower case so that words like “Data”, “data” and “DATA” are treated the same.
  • I have replaced unusual quotes like ` with ‘ (single quote).
  • I have concatenated spaced-out words like “D A T A S C I E N C E” into “DATA SCIENCE” with regex.
  • I have also removed hyperlinks.
  • I have converted education degrees like B.Tech or BTech to a standardized form.
  • I have converted skills with special symbols into plain words, e.g., C++ to cplusplus.
  • I have replaced all non-alphanumeric characters and new lines with spaces.
  • I have removed English stop words.
  • I have removed inflections with a lemmatizer.
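A minimal sketch of what such a cleaning pipeline could look like (the exact regexes and degree/skill mappings in the project may differ; it also assumes the NLTK stopwords and WordNet corpora are already downloaded):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()
    text = text.replace("`", "'")                   # normalise unusual quotes
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"\bb\.?tech\b", "btech", text)   # standardise degrees (illustrative rule)
    text = text.replace("c++", "cplusplus")         # skills with special symbols
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # drop non-alphanumeric characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]  # remove inflections
    return " ".join(tokens)

print(clean_text("B.Tech in C++ and Data Science. See https://example.com"))
```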

2.3. FEATURE EXTRACTION

Feature extraction is the creative part of data science and one of the most useful skills. With good feature extraction, even a simple model might outperform a complex one.

I have created the following 12 features (a sketch of how most of them could be computed follows the list):

  • resume_word_num : total number of words in resume.
  • total_unique_word_num : total number of unique words in job description and resumes.
  • common_word_num : total number of common words in job description and resumes.
  • common_word_ratio : total number of common words divided by total number of unique words combined in both job description and resumes.
  • common_word_ratio_min : total number of common words divided by minimum number of unique words between job description and resumes.
  • common_word_ratio_max : total number of common words divided by maximum number of unique words between job description and resumes.
  • fuzz_ratio : fuzz.WRatio from the fuzzywuzzy library.
  • fuzz_partial_ratio : fuzz.partial_ratio from the fuzzywuzzy library.
  • fuzz_token_set_ratio : fuzz.token_set_ratio from the fuzzywuzzy library.
  • fuzz_token_sort_ratio : fuzz.token_sort_ratio from the fuzzywuzzy library.
  • is_fresher : whether a candidate is a fresher or experienced.
  • from_reputed_college : whether a candidate is a fresher from a reputed college.
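A minimal sketch of the word-overlap and fuzzy-matching features (is_fresher and from_reputed_college would need extra heuristics on the resume text, which are omitted here):

```python
from fuzzywuzzy import fuzz

def extract_features(jd: str, resume: str) -> dict:
    """Compute word-overlap and fuzzy-matching features for one (JD, resume) pair."""
    jd_words, resume_words = set(jd.split()), set(resume.split())
    common = jd_words & resume_words
    total_unique = jd_words | resume_words
    return {
        "resume_word_num": len(resume.split()),
        "total_unique_word_num": len(total_unique),
        "common_word_num": len(common),
        "common_word_ratio": len(common) / len(total_unique),
        "common_word_ratio_min": len(common) / min(len(jd_words), len(resume_words)),
        "common_word_ratio_max": len(common) / max(len(jd_words), len(resume_words)),
        "fuzz_ratio": fuzz.WRatio(jd, resume),
        "fuzz_partial_ratio": fuzz.partial_ratio(jd, resume),
        "fuzz_token_set_ratio": fuzz.token_set_ratio(jd, resume),
        "fuzz_token_sort_ratio": fuzz.token_sort_ratio(jd, resume),
    }
```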

2.3.1. Univariate Analysis

  • The resume_word_num is approximately log-normally distributed with mean 102.789.
  • The minimum resume_word_num is 63 and the maximum is 168.
  • The minimum common_word_num is 9 and the maximum is 36.
  • For both freshers and experienced candidates, the PDF of the match percentage peaks at the same points. That means the match percentage is not affected by whether a candidate is a fresher or experienced.
  • The match percentage is also not affected by whether a candidate is from a reputed or a non-reputed college.
  • All of the numerical features are approximately Gaussian distributed, with some exceptions.
  • The resume_word_num is approximately log-normally distributed.
  • The fuzz_ratio has a bimodal distribution.

2.3.2. Bi-variate Analysis

For bi-variate analysis I have plotted a scatter plot between each feature and the output. I have also used the SRCC (Spearman Rank Correlation Coefficient) to check for correlation between the features and the output. The advantage of the scatter plot is that we can see the correlations clearly, while the SRCC helps in quantifying them.

  • The features which have a good correlation with “Match Percentage” are common_word_num, common_word_ratio, common_word_ratio_max and fuzz_token_sort_ratio. All of these features have an SRCC greater than 0.5 with “Match Percentage”.
  • The features related to common words have a good correlation with the output, but they are also correlated with each other. So I have removed the common_word_ratio and common_word_ratio_max columns.
  • We are left with 10 extracted features.

2.4. FEATURE ENCODING

For feature encoding we have various options like BoW, TF-IDF, Word2Vec, BERT-based embeddings, etc. For the initial phase I have opted for binary bag of words (BoW) and average Word2Vec.

That means we will have two sets of data, one for binary BoW and the other for average Word2Vec, and we will see which performs best.

2.4.1. Binary BoW

In binary BoW, we create a vector based on the presence or absence of a word: if the word is present, the corresponding cell is 1, otherwise it is 0. The encodings are very sparse and high-dimensional. (The count BoW instead stores the number of occurrences of a word in the document.) The settings I used are listed below, followed by a small sketch.

  • I have used both job description and resume text to create the vocabulary.
  • I have used uni-gram, bi-gram and tri-gram to get some sequence information as well.
  • The minimum document frequency is 4, which helps in removing some non-useful words like the names of candidates.
  • The maximum document frequency is 99%, which means words that are very frequent will be ignored.
  • The vocab size is 716. So we will get 716 dimensional output for both job description and resumes.
  • We have created two new features cosine_similarity and euclidean_distance.
  • cosine_similarity : It represents cosine similarity score between sentence embeddings (based on BoW) of job description and resume.
  • euclidean_distance : It represents euclidean distance between sentence embeddings (based on BoW) of job description and resume.
  • The SRCC between cosine_similarity and Match Percentage is 0.506, which is good.
  • Now we have a total of 12 extracted features, and the total feature dimension is 10+2+716+716 = 1444.
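A minimal sketch of this encoding with scikit-learn, assuming cleaned_resumes (a list of cleaned resume strings) and jd (the cleaned job description) already exist:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Binary BoW with uni/bi/tri-grams, min_df=4 and max_df=0.99 as described above
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), min_df=4, max_df=0.99)
vectorizer.fit(cleaned_resumes + [jd])

jd_vec = vectorizer.transform([jd])
resume_vecs = vectorizer.transform(cleaned_resumes)

cos_sim = cosine_similarity(resume_vecs, jd_vec).ravel()     # extra feature: cosine_similarity
euc_dist = euclidean_distances(resume_vecs, jd_vec).ravel()  # extra feature: euclidean_distance
```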

2.4.2. Average Word2Vec

Word2Vec produces dense word embeddings, usually of a small size like 300 dimensions. I have used the model pre-trained on Google News data; I did not have much data, which is why I did not train my own Word2Vec model. After getting the word embedding of each word in the document, I summed them and divided by the total number of words to get the average Word2Vec representation (a small sketch follows the list below).

  • We have created two new features cosine_similarity and euclidean_distance.
  • cosine_similarity : It represents cosine similarity score between word embeddings (based on w2v) of job description and resume.
  • euclidean_distance : It represents euclidean distance between word embeddings (based on w2v) of job description and resume.
  • There is high SRCC between cosine_similarity and Match Percentage.
  • Now again we have a total of 12 extracted features, and the total feature dimension is 10+2+300+300 = 612.
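A minimal sketch of the averaging step using gensim’s pre-trained Google News vectors (loaded here via the gensim-data API; the project may have loaded the binary file directly):

```python
import numpy as np
import gensim.downloader as api

# 300-dimensional vectors pre-trained on Google News (large download on first use)
w2v = api.load("word2vec-google-news-300")

def avg_word2vec(text: str) -> np.ndarray:
    """Average the Word2Vec vectors of all in-vocabulary words in the text."""
    vecs = [w2v[word] for word in text.split() if word in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)
```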

3. MODELING AND ERROR ANALYSIS

3.1. MODELS WITH BOW FEATURES

3.1.1. Train Test split and Preprocessing

We have a very small dataset, so I decided to use K-fold cross-validation for hyperparameter tuning, and I have done a 70-30 train-test split. Both the train and the test data have the same distribution of the dependent variable.

In preprocessing, I have done standardization, i.e., mean centering and variance scaling.
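A minimal sketch of the split and scaling, assuming X (the feature matrix) and y (the match percentages) are already prepared:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 70-30 split; the random_state is illustrative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()          # mean centering and variance scaling
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # fit only on train to avoid data leakage
```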

3.1.2. Error Analysis

When I was building the basic model it was not showing great results. So I did forward feature selection and found that selecting only the top 100 features gives good results.

Now we have gone from 1444 features down to 100, which means we have kept only 6.93% of the original features.

When we view the selected features, we see that some of the extracted features are among them. Also, we have more features (words) from the resumes than from the JD, because we have only one JD for the whole dataset.
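One way to run such a forward feature selection is scikit-learn’s SequentialFeatureSelector (the project may have used a different implementation, e.g., mlxtend); a minimal sketch:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Greedily add features one at a time until 100 are selected,
# scoring each candidate set with cross-validated negative MSE (our KPI)
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=100,
    direction="forward",
    scoring="neg_mean_squared_error",
    cv=5,
)
sfs.fit(X_train, y_train)

X_train_sel = sfs.transform(X_train)
X_test_sel = sfs.transform(X_test)
```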

3.1.3. Modeling

I tried almost all the regression techniques from simple models to complex ones.

Here are the performances of the models on test data,

Performance on BoW features

Here we can see that the best model is the linear-kernel SVR, followed by linear regression with an L2 regularizer.

3.2. MODELS WITH AVERAGE WORD2VEC FEATURES

3.2.1. Train Test split and Preprocessing

Just like before, I have done a 70-30 train-test split and standardized the columns.

3.2.2. Error Analysis

From the previous experience and from the fact that we have only one jd, I had a hunch that we can get better results after doing forward feature selection.

I used the linear regression for forward feature selection and selected 200 features out of 612 features.

3.2.3. Modeling

I tried all the classical models just like before and here are the performances,

Performance on avg w2v features.

From the above metrics we can see that the RBF-kernel SVR performs the best, but it was overfitting a lot. So the best model based on average Word2Vec features is the L2-regularized linear regression.

3.3. STACKING ENSEMBLE (STACK OF BEST MODELS)

I was not much impressed with the results. So I wondered whether we could somehow combine the best of both feature sets to create a better model.

That reminded me of the stacking ensemble. So I chose the following models for level-0 (base model) of stacking ensemble,

  • Support Vector Regression (Linear Kernel) model based on BoW features
  • Linear Regression (L2 Regularized) model based on Word2Vec features

3.3.1. Finding the Meta model (level-1)

After standardizing the outputs of the base models, I tried simple models as the meta model, and the KNN regressor performed the best as the meta model for the stacking ensemble.
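A minimal sketch of this setup, assuming the BoW and Word2Vec feature matrices (X_bow_*, X_w2v_*) and targets are already prepared. For simplicity the meta model here is trained on in-sample base-model predictions; a more rigorous version would use out-of-fold predictions:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Level-0 (base) models, each trained on its own feature set
svr_bow = SVR(kernel="linear").fit(X_bow_train, y_train)
ridge_w2v = Ridge().fit(X_w2v_train, y_train)

# Level-1 (meta) model trained on the standardized base-model outputs
meta_train = np.column_stack([svr_bow.predict(X_bow_train), ridge_w2v.predict(X_w2v_train)])
meta_scaler = StandardScaler().fit(meta_train)
meta_model = KNeighborsRegressor().fit(meta_scaler.transform(meta_train), y_train)

# Final prediction on the test data
meta_test = np.column_stack([svr_bow.predict(X_bow_test), ridge_w2v.predict(X_w2v_test)])
y_pred = meta_model.predict(meta_scaler.transform(meta_test))
```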

3.3.2. Performance of the stacking ensemble

Our best model is the stacking ensemble, where the base (level-0) models are the Support Vector Regression (linear kernel) model based on BoW features and the Linear Regression (L2 regularized) model based on Word2Vec features, and the meta (level-1) model is a KNN regressor. We get the following results:

Performance of the stacking ensemble

As you can see the performance of the stacking model is very impressive.

Distribution of the errors on test data:
  • The minimum value is -9.91
  • The maximum value is 9.39
  • 75% of the errors are between -1.6875 and 4.6275

4. ADVANCED MODELING AND FEATURE ENGINEERING

4.1. FEED FORWARD NEURAL NETWORK (FFN Network) WITH BOW AND AVERAGE W2V FEATURES

In advanced modeling, I first started with the final BoW-based features, but that was not performing well.

So I thought of training a feed-forward neural network with the final BoW and the final average Word2Vec features. Here “final” means the features we got after forward feature selection.

4.1.1. Data set

As I mentioned earlier, we have 100 features for the final BoW and 200 features for the final average Word2Vec, which brings our total feature dimension to 300.

I have divided the data into three parts: train, cross validation (cv) and test.

4.1.2. Model Architecture

I created a simple feed-forward neural network as shown below,

model architecture

After a lot of trials I settled on this architecture. It has a total of 40,609 trainable parameters, and all three dense layers use the ReLU activation.

In regression tasks, we generally use a linear activation in the output layer. But since this is a match-percentage prediction task, where there are no negative values, it makes sense to use ReLU instead of a linear activation.
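A sketch of such a network in Keras, assuming the train/validation splits (X_train, y_train, X_cv, y_cv) exist; the layer widths below are illustrative and will not exactly reproduce the 40,609 parameters mentioned above:

```python
import tensorflow as tf

# Input: 100 selected BoW features + 200 selected average Word2Vec features = 300 dims
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(300,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),   # ReLU output: match percentage is never negative
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, validation_data=(X_cv, y_cv), epochs=20, batch_size=8)
```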

4.1.3. Training and Evaluation

At first, the model ran for 59 epochs and it was overfitting. So I decided to train it for only 20 epochs, and we got a better result.

This is very impressive. As we can see this model is very comparable to the stacking ensemble model. This makes sense because both models are based on the same features.

4.2. BERT WITH FEED FORWARD NEURAL NETWORK

4.2.1. What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a SOTA (state of the art) technique for NLP tasks. It is based on the transformer architecture and uses an encoder-only structure. And as the name indicates, it is bidirectional. BERT-base can take an input of at most 512 tokens and outputs a 768-dimensional representation for each token.

BERT is trained on a huge dataset of books, Wikipedia and other internet text. It is trained with the Masked Language Modeling task and the Next Sentence Prediction task, so it naturally assumes the sequential nature of text. But here we do not have very sequential data, because the resumes contain very few full sentences. So even though BERT is SOTA for NLP tasks, we should not expect it to work well here.

BERT combines ideas from multiple research papers, such as:

I) Semi supervised sequence learning —
It does semi-supervised sequence learning with Masked Language Modeling and Next Sentence Prediction tasks.

II) Contextualised word embeddings —
It got the idea of contextualised word embeddings from ELMO paper. Where each word embedding is generated based on the whole sentence.

III) Transfer Learning —
It got the idea of transfer learning from ULM-FiT. We get the pre-trained BERT model and then we can do fine tuning as per our need.

IV) Encoder Only Transformer —
This idea came in response to the OpenAI transformer (GPT), which is a decoder-only transformer; BERT uses the encoder stack instead.

Here I have used DistilBERT, which is a smaller and faster version of BERT with a very minimal decrease in performance. I have loaded distilbert-base-uncased, which means the case of the text does not matter.

4.2.2. Data set

I have loaded the PDFs and then done encoding with DistilBERT based tokenizer with max length = 300 and zero padding if required. Zero padding is done so that we can perform batch training.

Then, using the pre-trained model, I have extracted the output corresponding to the [CLS] token, which is 768-dimensional. I have done this for both the job description and the resumes.
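A minimal sketch of this extraction with the Hugging Face transformers library (PyTorch shown here; the original project may have used a different stack):

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")

def cls_embedding(text: str) -> torch.Tensor:
    """Return the 768-d hidden state of the [CLS] token for the given text."""
    inputs = tokenizer(text, max_length=300, padding="max_length",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # first token is [CLS]
```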

4.2.3. Model Architecture

Just like before, the model has three dense layers, all with ReLU activation. The total number of trainable parameters is 198,817.

In all the models I have used the Adam optimiser, because it generally works well out of the box.

The model architecture is as shown below,

model architecture

4.2.4. Training and Evaluation

As expected the performance is not that good.

5. DEPLOYMENT AND PRODUCTIONIZATION

5.1. THE MODEL

We have two models which are good enough for deployment: one is the stacking ensemble model and the other is the FFN network (based on BoW and average Word2Vec features). The performance is close for both, but the stacking ensemble model performs a little better, so I will use that for deployment.

5.2. OPTIMISATION

I have used some simple optimisation hacks to run the website on a small server.

Since the job description is fixed, I have preprocessed the job description and calculated BoW and average word2vec representation of the job description. This will certainly improve the prediction time.

The pre-trained Word2Vec model is 3.7 GB, which means we won’t be able to load it on a system with 1 GB RAM. So I limited the vocabulary size to 300,000 (3 lakh) words and changed the data type of the vectors from float32 to float16. With this I was able to reduce the size to 190 MB.
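A minimal sketch of this trick with gensim, assuming the Google News binary file has been downloaded (the file and output names are illustrative):

```python
import numpy as np
from gensim.models import KeyedVectors

# Load only the 300,000 most frequent words from the pre-trained vectors
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=300_000
)

# Halve the memory footprint by storing the vectors as float16
w2v.vectors = w2v.vectors.astype(np.float16)
w2v.save("w2v_small.kv")   # a much smaller artifact to ship with the app
```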

5.3. ARCHITECTURE DIAGRAM

server architecture

5.4. FLASK

I have decided to use Flask, which is a micro web framework in Python. It is generally used for creating APIs and building small websites. We are building a simple web interface, so Flask fits our task well.

5.4.1. match.py

In match.py I have initialised flask. It will work as an entry point. Here I have mentioned the constants like upload folder, max size of the resume, etc.

I have created a route for the homepage which will be shown when the user browses either / or /index. For the homepage we have just rendered the template.

Another route is for the predict page (/predict) which only accepts POST requests. Here at first I have added some constraints which will redirect back to the homepage if not met.

Then the resume file is uploaded to the server, the PDF is loaded and preprocessed, and the preprocessed text is passed through the models to get the prediction, which is displayed on the prediction page.
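A condensed sketch of what match.py could look like; the helper functions (load_pdf_text, preprocess, predict_match), template names and size limit are hypothetical placeholders, not the project’s actual code:

```python
import os
from flask import Flask, request, render_template, redirect, url_for

# Hypothetical helpers; the real project keeps preprocessing in helper.py
from helper import load_pdf_text, preprocess, predict_match

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
app.config["MAX_CONTENT_LENGTH"] = 2 * 1024 * 1024   # illustrative max resume size (2 MB)

@app.route("/")
@app.route("/index")
def index():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    file = request.files.get("resume")
    if file is None or not file.filename.lower().endswith(".pdf"):
        return redirect(url_for("index"))             # constraint not met, go back home
    path = os.path.join(app.config["UPLOAD_FOLDER"], file.filename)
    file.save(path)
    text = preprocess(load_pdf_text(path))
    match = predict_match(text)
    return render_template("predict.html", match=match)
```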

5.4.2. helper.py

The helper.py file contains all the preprocessing-related functions. I have created it so that match.py does not get cluttered.

5.4.3. models folder

This contains all the preprocessors, models and some static values.

5.4.4. static folder

The static folder is for css, js and favicon.

5.4.5. templates folder

We have stored all the html templates in this folder.

5.4.6. uploads folder

This folder will be used to save the user uploaded resume files. This is a temporary folder so we should clean it from time to time.

5.5. GUNICORN

Gunicorn ‘Green Unicorn’ is a Python WSGI HTTP server for UNIX. It provides a good balance of performance, flexibility, and configuration simplicity. With its workers parameter, it lets us run multiple processes in parallel for scalability.

5.6. AMAZON EC2

Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in AWS.

First I launched a t2.micro instance of amazon EC2 which is a free tier instance. It has only 1 CPU and 1 GB RAM.

Then I created an Elastic IP address and assigned it to the EC2 instance. Because the IP address of the instance might change if we restart the system.

Also we need to create and assign a security group which allows TCP connection to the port on which we will be running the Gunicorn. Otherwise we won’t be able to access the instance from the browser.

After the instance setup I transferred all the files of the web app to the EC2 with SFTP. Then I connected to the instance’s terminal with SSH and ran the gunicorn command to serve the website.

5.7. SCALABILITY AND LATENCY

5.7.1. Scalability

We can easily scale the website because we are hosting on EC2 instance and using gunicorn. So whenever required we can change the instance type and use a larger instance. Also we can spawn more instances of the application with gunicorn’s worker parameter.

5.7.2. Latency

On the EC2 instance, where we have lots of constraints, the latency is 3.9 seconds including the file upload time, which is good.

On local system where we don’t have any constraints the latency is only 1.2 seconds including the file upload time.

We can use caching for the homepage to make it faster. Also we can store the word and the corresponding word2vec representation in an indexed table. This will help with RAM and the speed because then we don’t have to load the word2vec model and we can get the representation in O(1) time.

5.8. DEMO

You can find the demo at http://13.234.90.146:8080/

5.8.1. Homepage

I have built a homepage where the user can upload the resume. The interface is very simple as we can see below.

homepage

5.8.2. Predict page

On the predict page I have simply displayed the match percentage and the prediction time.

predict page

6. REFERENCES

Thanks for reading the blog! You can reach me through my LinkedIn account and can view my other works on my personal website.
