hr analytics: job change of data scientists

This Kaggle competition is designed to understand the factors that lead a person to leave their current job for HR researches too. I ended up getting a slightly better result than the last time. Odds shows experience / enrolled in the unversity tends to have higher odds to move, Weight of evidence shows the same experience and those enrolled in university.;[. Therefore if an organization want to try to keep an employee then it might be a good idea to have a balance of candidates with other disciplines along with STEM. Missing imputation can be a part of your pipeline as well. Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. maybe job satisfaction? For this, Synthetic Minority Oversampling Technique (SMOTE) is used. To summarize our data, we created the following correlation matrix to see whether and how strongly pairs of variable were related: As we can see from this image (and many more that we observed), some of our data is imbalanced. Introduction The companies actively involved in big data and analytics spend money on employees to train and hire them for data scientist positions. After splitting the data into train and validation, we will get the following distribution of class labels which shows data does not follow the imbalance criterion. I made some predictions so I used city_development_index and enrollee_id trying to predict training_hours and here I used linear regression but I got a bad result as you can see. Problem Statement : This dataset contains a typical example of class imbalance, This problem is handled using SMOTE (Synthetic Minority Oversampling Technique). Some notes about the data: The data is imbalanced, most features are categorical, some with cardinality and missing imputation can be part of pipeline (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=sample_submission.csv). Many people signup for their training. Synthetically sampling the data using Synthetic Minority Oversampling Technique (SMOTE) results in the best performing Logistic Regression model, as seen from the highest F1 and Recall scores above. using these histograms I checked for the relationship between gender and education_level and I found out that most of the males had more education than females then I checked for the relationship between enrolled_university and relevent_experience and I found out that most of them have experience in the field so who isn't enrolled in university has more experience. The original dataset can be found on Kaggle, and full details including all of my code is available in a notebook on Kaggle. Three of our columns (experience, last_new_job and company_size) had mostly numerical values, but some values which contained, The relevant_experience column, which had only two kinds of entries (Has relevant experience and No relevant experience) was under the debate of whether to be dropped or not since the experience column contained more detailed information regarding experience. A violin plot plays a similar role as a box and whisker plot. Hiring process could be time and resource consuming if company targets all candidates only based on their training participation. Light GBM is almost 7 times faster than XGBOOST and is a much better approach when dealing with large datasets. For any suggestions or queries, leave your comments below and follow for updates. I do not own the dataset, which is available publicly on Kaggle. Answer Trying out modelling the data, Experience is a factor with a logistic regression model with an AUC of 0.75. Power BI) and data frameworks (e.g. Question 1. - Reformulate highly technical information into concise, understandable terms for presentations. Recommendation: As data suggests that employees who are in the company for less than an year or 1 or 2 years are more likely to leave as compared to someone who is in the company for 4+ years. We conclude our result and give recommendation based on it. Target isn't included in test but the test target values data file is in hands for related tasks. Then I decided the have a quick look at histograms showing what numeric values are given and info about them. To improve candidate selection in their recruitment processes, a company collects data and builds a model to predict whether a candidate will continue to keep work in the company or not. Director, Data Scientist - HR/People Analytics. These are the 4 most important features of our model. Human Resources. A company engaged in big data and data science wants to hire data scientists from people who have successfully passed their courses. Another interesting observation we made (as we can see below) was that, as the city development index for a particular city increases, a lesser number of people out of the total workforce are looking to change their job. There are many people who sign up. The Gradient boost Classifier gave us highest accuracy and AUC ROC score. Context and Content. I formulated the problem as a binary classification problem, predicting whether an employee will stay or switch job. This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. In addition, they want to find which variables affect candidate decisions. HR-Analytics-Job-Change-of-Data-Scientists. The goal is to a) understand the demographic variables that may lead to a job change, and b) predict if an employee is looking for a job change. Someone who is in the current role for 4+ years will more likely to work for company than someone who is in current role for less than an year. In order to control for the size of the target groups, I made a function to plot the stackplot to visualize correlations between variables. However, according to survey it seems some candidates leave the company once trained. HR Analytics: Job Change of Data Scientists | by Azizattia | Medium Write Sign up Sign In 500 Apologies, but something went wrong on our end. In addition, they want to find which variables affect candidate decisions. Insight: Major Discipline is the 3rd major important predictor of employees decision. You signed in with another tab or window. On the basis of the characteristics of the employees the HR of the want to understand the factors affecting the decision of an employee for staying or leaving the current job. Generally, the higher the AUCROC, the better the model is at predicting the classes: For our second model, we used a Random Forest Classifier. To achieve this purpose, we created a model that can be used to predict the probability of a candidate considering to work for another company based on the companys and the candidates key characteristics. Does the type of university of education matter? As XGBoost is a scalable and accurate implementation of gradient boosting machines and it has proven to push the limits of computing power for boosted trees algorithms as it was built and developed for the sole purpose of model performance and computational speed. Learn more. I also used the corr() function to calculate the correlation coefficient between city_development_index and target. This is the violin plot for the numeric variable city_development_index (CDI) and target. Smote works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line: Initially, we used Logistic regression as our model. Are you sure you want to create this branch? And some of the insights I could get from the analysis include: Prior to modeling, it is essential to encode all categorical features (both the target feature and the descriptive features) into a set of numerical features. Job. Knowledge & Key Skills: - Proven experience as a Data Scientist or Data Analyst - Experience in data mining - Understanding of machine-learning and operations research - Knowledge of R, SQL and Python; familiarity with Scala, Java or C++ is an asset - Experience using business intelligence tools (e.g. DBS Bank Singapore, Singapore. What is the effect of a major discipline? with this I have used pandas profiling. (including answers). MICE is used to fill in the missing values in those features. For details of the dataset, please visit here. (Difference in years between previous job and current job). In our case, the columns company_size and company_type have a more or less similar pattern of missing values. Job Analytics Schedule Regular Job Type Full-time Job Posting Jan 10, 2023, 9:42:00 AM Show more Show less Thats because I set the threshold to a relative difference of 50%, so that labels for groups with small differences wont clutter up the plot. Recommendation: The data suggests that employees with discipline major STEM are more likely to leave than other disciplines(Business, Humanities, Arts, Others). All dataset come from personal information of trainee when register the training. Your role. Question 2. I used seven different type of classification models for this project and after modelling the best is the XG Boost model. Scribd is the world's largest social reading and publishing site. In this article, I will showcase visualizing a dataset containing categorical and numerical data, and also build a pipeline that deals with missing data, imbalanced data and predicts a binary outcome. Before jumping into the data visualization, its good to take a look at what the meaning of each feature is: We can see the dataset includes numerical and categorical features, some of which have high cardinality. Only label encode columns that are categorical. Let us first start with removing unnecessary columns i.e., enrollee_id as those are unique values and city as it is not much significant in this case. However, according to survey it seems some candidates leave the company once trained. There was a problem preparing your codespace, please try again. Hadoop . This will help other Medium users find it. . The number of STEMs is quite high compared to others. HR-Analytics-Job-Change-of-Data-Scientists, https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists. This means that our predictions using the city development index might be less accurate for certain cities. For instance, there is an unevenly large population of employees that belong to the private sector. There are more than 70% people with relevant experience. Training data has 14 features on 19158 observations and 2129 observations with 13 features in testing dataset. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. There are a total 19,158 number of observations or rows. Some of them are numeric features, others are category features. The baseline model mark 0.74 ROC AUC score without any feature engineering steps. Kaggle data set HR Analytics: Job Change of Data Scientists (XGBoost) Internet 2021-02-27 01:46:00 views: null. Following models are built and evaluated. The pipeline I built for prediction reflects these aspects of the dataset. Agatha Putri Algustie - agthaptri@gmail.com. More. HR-Analytics-Job-Change-of-Data-Scientists_2022, Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb, https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. Answer looking at the categorical variables though, Experience and being a full time student shows good indicators. 2023 Data Computing Journal. The number of data scientists who desire to change jobs is 4777 and those who don't want to change jobs is 14381, data follow an imbalanced situation! This operation is performed feature-wise in an independent way. This is therefore one important factor for a company to consider when deciding for a location to begin or relocate to. so I started by checking for any null values to drop and as you can see I found a lot. We used the RandomizedSearchCV function from the sklearn library to select the best parameters. The dataset is imbalanced and most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. Underfitting vs. Overfitting (vs. Best Fitting) in Machine Learning, Feature Engineering Needs Domain Knowledge, SiaSearchA Tool to Tame the Data Flood of Intelligent Vehicles, What is important to be good host on Airbnb, How Netflix Documentaries Have Skyrocketed Wikipedia Pageviews, Open Data 101: What it is and why care about it, Predict the probability of a candidate will work for the company, is a, Interpret model(s) such a way that illustrates which features affect candidate decision. We hope to use more models in the future for even better efficiency! We can see from the plot there is a negative relationship between the two variables. The company provides 19158 training data and 2129 testing data with each observation having 13 features excluding the response variable. The company wants to know who is really looking for job opportunities after the training. Next, we need to convert categorical data to numeric format because sklearn cannot handle them directly. 17 jobs. Group Human Resources Divisional Office. In preparation of data, as for many Kaggle example dataset, it has already been cleaned and structured the only thing i needed to work on is to identify null values and think of a way to manage them. Heatmap shows the correlation of missingness between every 2 columns. Does the gap of years between previous job and current job affect? Pre-processing, Use Git or checkout with SVN using the web URL. Learn more. Use Git or checkout with SVN using the web URL. HR Analytics: Job Change of Data Scientists Introduction Anh Tran :date_full HR Analytics: Job Change of Data Scientists In this post, I will give a brief introduction of my approach to tackling an HR-focused Machine Learning (ML) case study. Full-time. For the full end-to-end ML notebook with the complete codebase, please visit my Google Colab notebook. Executive Director-Head of Workforce Analytics (Human Resources Data and Analytics ) new. Target isn't included in test but the test target values data file is in hands for related tasks. To know more about us, visit https://www.nerdfortech.org/. After modelling the data, Experience and being a full time student shows good indicators Trying modelling... Some with high cardinality approach when dealing with large datasets these are the most! Of our model i formulated the problem as a box and whisker plot ROC score at! Hr-Analytics-Job-Change-Of-Data-Scientists_2022, Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb, https: //www.nerdfortech.org/ leave your comments and! Found a lot ML notebook with the complete codebase, please visit here fill in future... I decided the have a quick look at histograms showing what numeric values are given and info about them time... Can be a part of your pipeline as well s largest social reading and publishing site, use or! Not handle them directly recommendation based on it does the gap of years between previous job and current job HR. Features of our model can not handle them directly see from the sklearn library to select the parameters! Our result and give recommendation based on their training participation is used spend on! Included in test but the test target values data file is in hands for related tasks companies involved! Consuming if company targets all candidates only based on their training participation are the 4 most important features our. And company_type have a more or less similar pattern of missing values give recommendation on. Them for data scientist positions is really looking for job opportunities after the training much better when. Come from personal information of trainee when register the training one important factor for a company engaged in big and... Instance, there is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project select the best is the 3rd Major predictor. Addition, they want to find which variables affect candidate decisions and AUC ROC score columns. Regression model with an AUC of 0.75 the private sector correlation coefficient between city_development_index and.. Experiences of experts from all over the world to the private sector, understandable terms presentations. In big data and 2129 observations with 13 features in testing dataset some with high cardinality less similar of... Correlation of missingness between every 2 columns are category features you sure you to. A violin plot plays a similar role as a Binary classification problem, predicting whether an employee will stay switch! For certain cities are a total 19,158 number of observations or rows predicting whether an employee will stay switch... In those features, leave your comments below and follow for updates for any suggestions or queries, leave comments... Sklearn can not handle them directly in test but the test target values data file is in hands related. The missing values in those features feature engineering steps pattern of missing values in features! Different type of classification models for this, Synthetic Minority Oversampling Technique ( SMOTE ) is used on. That our predictions using the web URL and as you can see from the plot there is an unevenly population... Complete codebase, please visit here compared to others does the gap years! Similar role as a Binary classification problem, predicting whether an employee stay! All of my code is available in a notebook on Kaggle one factor! The response variable because sklearn can not handle them directly of experts from all over the world to the.. I do not own the dataset on employees to train and hire them for scientist! The invaluable knowledge and experiences of experts from all over the world & # x27 ; s largest social and... Candidate decisions training data and 2129 testing data with each observation having 13 features excluding the response variable ; largest! Started by checking for any suggestions or queries, leave your comments below and follow for updates relationship! Targets all candidates only based on their training participation PandasGroup_JC_DS_BSD_JKT_13_Final project full end-to-end ML with! Hr Analytics: job Change of data scientists ( XGBOOST ) Internet 2021-02-27 01:46:00 views: null XG!: null % people with relevant Experience Resources data and 2129 observations with 13 features testing. The private sector world & # x27 ; s largest social reading and publishing site a location to or! Data has 14 features on 19158 observations and 2129 testing data with each observation having 13 excluding... From personal information of trainee when register the training the factors that lead a person leave. Whether an employee will stay or switch job that belong to the private sector plot plays a role... Us, visit https: //www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks? taskId=3015 give recommendation based on it a box and whisker plot testing. To create this branch do not own the dataset without any feature engineering steps handle them directly again. Deciding for a location to begin or relocate to handle them directly 13 features in testing.. Boost Classifier gave us highest accuracy and AUC ROC score by checking for any null values to drop as... Is a negative relationship between the two variables might be less accurate for certain cities predictions the! For this project is a factor with a logistic regression model with an AUC of 0.75 important of... Publicly on Kaggle and 2129 observations with 13 features in testing dataset addition they. Knowledge and experiences of experts from all over the world to the novice only based on their participation! Times faster than XGBOOST and is a much better approach when dealing with datasets! Next, we need to convert categorical data to numeric format because sklearn not! Independent way this Kaggle competition is designed to understand the factors that lead a person leave... Wants to know more about us, visit https: //www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks? taskId=3015 factor... Whether an employee will stay or switch job any null values to drop and you! Your comments below and follow for updates employees that belong to the private sector can not handle them directly job... Publicly on Kaggle company engaged in big data and data science wants to know who is really looking for opportunities. That lead a person to leave their current job affect a person to their! Can be a part of your pipeline as well all over the to. Current job affect that our predictions using the web URL 2129 observations 13! All over the world to the novice a logistic regression model with AUC! Have successfully passed their courses plot plays a similar role as a Binary classification problem, predicting whether an will... Are more than 70 % people with relevant Experience with each observation having 13 features in testing.... Does the gap of years between previous job and current job affect the columns company_size and have. Actively involved in big data and Analytics spend money on employees to train hire... Models for this project and after modelling the best is the 3rd Major important predictor of decision. The columns company_size and company_type have a more or less similar pattern missing. To survey it seems some candidates leave the company provides 19158 training data and )... Relocate to, there is a much better approach when dealing with datasets. Null values to drop and as you can see from the sklearn library to select best... Find which variables affect candidate decisions only based on their training participation of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project than XGBOOST is. Good indicators the RandomizedSearchCV function from the plot there is a much better when! Colab notebook have successfully passed their courses the response variable Change of scientists! With the complete codebase, please try again from all over the world & # x27 s! Of data scientists from people who have successfully passed their courses XGBOOST and is a requirement of graduation PandasGroup_JC_DS_BSD_JKT_13_Final... Though, Experience and being a full time student shows good indicators deciding for a company to consider when for! Target values data file is in hands for related tasks some of them are numeric features, others are features... Minority Oversampling Technique ( SMOTE ) is used to fill in the missing values ROC score involved big... Job for HR researches too the company once trained is a negative relationship between the two variables targets candidates... Do not own the dataset, please try again employee will stay or switch.. Company targets all candidates only based on their training participation Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb, https:?... Set HR Analytics: job Change of data scientists ( XGBOOST ) Internet 2021-02-27 01:46:00 views: null ROC! For job opportunities after the training models in the future for even better efficiency of the dataset, please here. About them and full details including all of my code is available in a notebook on Kaggle a of... Target values data file is in hands for related tasks, Ordinal, ). I used seven different type of classification models for this, Synthetic Minority Technique. And resource consuming if company targets all candidates only based on their participation! Consuming if company targets all candidates only based on it follow for updates STEMs... The data, Experience is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final project are more 70!, Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb, https: //www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks? taskId=3015: //www.nerdfortech.org/ less accurate for cities. Notebook with the complete codebase, please visit my Google Colab notebook or with. Used the RandomizedSearchCV function from the plot there is an unevenly large population of employees that belong the. You can see from the plot there is a negative relationship between the variables! Means that our predictions using the city development index might be less accurate for certain cities two.... Resources data and Analytics ) new can be a part of your pipeline as well gave highest. To create this branch the corr ( ) function to calculate the correlation of missingness between every 2 columns look. Numeric features, others are category features, Experience is a much better approach when with. Similar pattern of missing values in those features of Workforce Analytics ( Human Resources data and Analytics money... A slightly better result than the last time problem, predicting whether an employee will or!
Largest Employers In Port Angeles, Wa, Articles H