HR-Analytics-Job-Change-of-Data-Scientists.

The company wants to know which candidates really want to work for it after training and which are looking for new employment. Knowing this helps reduce cost and time, and improves the quality of training, course planning, and the categorization of candidates. Note first that the data is highly imbalanced, so it needs to be balanced before modeling. Several features are categorical, so I performed label encoding to convert them into numeric form. One question to explore: does the gap in years between the previous job and the current job affect the decision?

The company provides 19,158 training records and 2,129 testing records, each with 13 features excluding the response variable. The task is described at https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. There are 3 things that I looked at. Using model(s) built on the current credentials, demographics, and experience data, you need to predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision. Insight: Acc.
Exploring the numerical features in the data: what is the correlation between the city development index and training hours? This is the violin plot for the numeric variable city_development_index (CDI) and target.

HR Analytics: Job Change of Data Scientist, by Lim Jie-Ying.

The relatively small gap in accuracy and AUC scores suggests that the model did not significantly overfit. It contains the following 14 columns: Note: in the train data, there is one human error in the column company_size, i.e. Second, some of the features are similarly imbalanced, such as gender. After splitting the data into train and validation sets, we get the following distribution of class labels, which shows that the data no longer meets the imbalance criterion. I used Random Forest to build the baseline model. But just to conclude this specific iteration.

Furthermore, after splitting our dataset into a training set (75%) and a testing set (25%) using train_test_split from sklearn, we noticed an imbalance in our label which could have led to bias in the model. Consequently, we used the SMOTE method to over-sample the minority class. We achieved an accuracy of 66% and an AUC-ROC score of 0.69.

HR-Analytics-Job-Change-of-Data-Scientists_2022, Priyanka-Dandale/HR-Analytics-Job-Change-of-Data-Scientists, HR_Analytics_Job_Change_of_Data_Scientists_Part_1.ipynb, HR_Analytics_Job_Change_of_Data_Scientists_Part_2.ipynb.

I am pretty new to the KNIME Analytics Platform and have completed the self-paced basics course. This dataset comes from Kaggle.
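The Random Forest baseline and the 75%/25% split mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification as a stand-in for the encoded HR features), and the hyperparameters are arbitrary choices, not the ones used in the original notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded HR data (assumption): about 75% class 0, 25% class 1.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=5,
                           weights=[0.75, 0.25], random_state=42)

# 75% training, 25% validation; stratify keeps the class ratio similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Baseline model with arbitrary, untuned settings.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Accuracy uses hard predictions; ROC AUC uses the predicted probability of class 1.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation accuracy: {val_accuracy:.3f}, ROC AUC: {val_auc:.3f}")
```

Comparing the same two metrics on the training split is one simple way to check for the overfitting gap discussed above.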
Job Change of Data Scientists Using Raw, Encode, and PCA Data, by M Aji Pangestu.

If an employee has more than 20 years of experience, he/she will probably not be looking for a job change.

city_development_index: Development index of the city (scaled)
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Education major discipline of candidate
experience: Candidate total experience in years
company_size: Number of employees in current employer's company
last_new_job: Difference in years between previous job and current job

Resampling to tackle the unbalanced data issue
Numerical feature normalization between 0 and 1
Principal Component Analysis (PCA) to reduce data dimensionality

We can see from the plot that people who are looking for a job change (target 1) are at least 50% more likely to be enrolled in a full-time course than those who are not looking for a job change (target 0). Maybe job satisfaction? When creating our model, the dominant major discipline may override the others because it accounts for 88% of all major discipline values. Recommendation: the data suggests that employees with a STEM major are more likely to leave than those from other disciplines (Business, Humanities, Arts, Others). As we can see here, highly experienced candidates are looking to change their jobs the most. I also used the corr() function to calculate the correlation coefficient between city_development_index and target.
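As a rough illustration of the corr() step, the snippet below computes the Pearson correlation between city_development_index and target with pandas. The eight rows are made-up toy values chosen only to show the call; they are not taken from the Kaggle data, so the resulting number says nothing about the real dataset.

```python
import pandas as pd

# Toy values (assumption): invented for illustration, not real rows from the dataset.
df = pd.DataFrame({
    "city_development_index": [0.92, 0.91, 0.62, 0.55, 0.90, 0.60, 0.78, 0.62],
    "target":                 [0,    0,    1,    1,    0,    1,    0,    1],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = df["city_development_index"].corr(df["target"])
print(f"correlation between CDI and target: {r:.3f}")

# DataFrame.corr gives the full pairwise matrix, which is what a heatmap would plot.
print(df.corr())
```

With a binary target, the Pearson coefficient is equivalent to the point-biserial correlation, so its sign indicates whether higher CDI goes with target 1 or target 0.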
In this project I want to explore people who join the company's data science training, and whether they intend to change jobs or to become data scientists at the company. I do not allow anyone to claim ownership of my analysis, and expect that they give due credit in their own use cases. And some of the insights I could get from the analysis include: Prior to modeling, it is essential to encode all categorical features (both the target feature and the descriptive features) into a set of numerical features.

Group 19 - HR Analytics: Job Change of Data Scientists, by Tan Wee Kiat.

The dataset is imbalanced and most features are categorical (nominal, ordinal, binary), some with high cardinality. Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. Many people sign up for the training. Which to me, as a baseline, looks alright :).

Abdul Hamid - abdulhamidwinoto@gmail.com

Information related to demographics, education, and experience is available from the candidates' signup and enrollment. To improve candidate selection in its recruitment process, a company collects data and builds a model to predict whether a candidate will continue to work at the company or not. This is in line with our deduction above. StandardScaler removes the mean and scales each feature/variable to unit variance.
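To make the StandardScaler remark concrete, here is a minimal sketch comparing scikit-learn's StandardScaler with the same computation done by hand. The training_hours values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented training_hours values (assumption), one feature in a single column.
training_hours = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# StandardScaler learns the per-column mean and standard deviation, then applies
# z = (x - mean) / std to every value.
scaler = StandardScaler()
scaled = scaler.fit_transform(training_hours)

# The same thing by hand; np.std uses the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (training_hours - training_hours.mean(axis=0)) / training_hours.std(axis=0)

print(scaled.ravel())
print("mean:", scaled.mean(), "std:", scaled.std())
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the validation and test splits with transform.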
First, I'd like to take a look at how the categorical features are correlated with the target variable. I used seven different types of classification models for this project, and after modelling, the best was the XGBoost model. Since our purpose is to determine whether a data scientist will change their job or not, we set the 'looking for job' variable as the label and the remaining data as training data. The feature dimension can be reduced to ~30 and still represent at least 80% of the information of the original feature space. As this is only an initial baseline model, I opted to simply remove the nulls, which still leaves a decent volume of the imbalanced dataset: 80% not looking, 20% looking.

Our model could be used to reduce the screening cost and increase the profit of institutions by minimizing investment in employees who are in for the short run by: Upon an initial analysis, the number of null values for each of the columns was as follows: Besides missing values, our data also contained entries which had categorical data in certain columns only. The classification models (CART, RandomForest, LASSO, RIDGE) identified the following three variables as significant for an employee's decision to leave or to work for the company.

HR Analytics Job Change of Data Scientists, by Priyanka Dandale, Nerd For Tech.

HR-Analytics-Job-Change-of-Data-Scientists, https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists.

For the third model, we used a Gradient Boosting classifier. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The Random Forest classifier performs much better than the Logistic Regression classifier, albeit being more memory-intensive and time-consuming to train. Each employee is described with various demographic features.
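The remark that the feature dimension can be reduced while keeping at least 80% of the information can be illustrated with scikit-learn's PCA. This is a sketch under assumptions: the matrix below is synthetic (60 correlated columns built from 10 hidden factors), standing in for the encoded feature space, so the number of components it keeps will differ from the ~30 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 500 rows, 60 correlated columns generated
# from 10 latent factors plus noise.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 60))
X = latent @ mixing + 0.5 * rng.normal(size=(500, 60))

# PCA is sensitive to scale, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X)

# A float between 0 and 1 asks PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Here "information" is interpreted as explained variance, which is the usual reading of such statements but is an assumption about what the original author measured.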
Next, we converted the city attribute to numerical values using the ordinal encode function: The dataset was obtained from Kaggle. A company which is active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass some courses conducted by the company. This dataset is designed to understand the factors that lead a person to leave their current job, for HR research too, and involves using model(s) to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors that affect the employee's decision.

Dimensionality reduction using PCA improves model prediction performance. Even after feature engineering, the result is still not efficient, because fewer people want to change jobs than do not. I made a stackplot for each categorical feature and target, but for the clarity of the post I am only showing the stackplot for enrolled_course and target. Refer to my notebook for all of the other stackplots. This distribution shows that the dataset contains a majority of highly and intermediately experienced employees. The training dataset with 20,133 observations is used for model building, and the built model is validated on the validation dataset of 8,629 observations. The pipeline I built for prediction reflects these aspects of the dataset.

https://github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb
A company engaged in big data and data science wants to hire data scientists from among the people who have successfully passed its courses. If the company uses the old method, it needs to make an offer to every candidate, which costs more money. The HR department also has limited time, so it cannot interview all candidates one by one and will usually pick candidates at random. Not at all, I guess!

The following features and predictor are included in our dataset:
So far, the following challenges regarding the dataset are known to us:
In my end-to-end ML pipeline, I performed the following steps:
From my analysis, I derived the following insights:

In this project, I performed an exploratory analysis on the HR Analytics dataset to understand what the data contains, developed an ML pipeline to predict the possibility of an employee changing their job, and visualized my model predictions using a Streamlit web app hosted on Heroku.

There are many people who sign up. Simple countplots and histogram plots of features can give us a general idea of how each feature is distributed. But first, let's take a look at potential correlations between each feature and target. Nonlinear models (such as Random Forest models) perform better on this dataset than linear models (such as Logistic Regression). Here is the link: https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists. In order to control for the size of the target groups, I made a function to plot the stackplot to visualize correlations between variables. As seen above, there are 8 features with missing values. We believe that our analysis will pave the way for further research on the subject, given its massive significance to employers around the world. So I finished by making a quick heatmap, which led me to conclude that the actual relationship between these variables is weak. That is why I always end up getting weak results.
For more on performance metrics, check https://medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92.

This dataset is designed to understand the factors that lead a person to leave their current job, for HR research too. Variable 2: Last.new.job. 75% of people's current employer are Pvt. The model I created shows an AUC (area under the curve) of 0.75. What I wanted to see, though, are the coefficients produced by the model, found below: this gives me a sense, and intuitively shows, that years of experience are one of the indicators of job movement as a data scientist.

SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line. Initially, we used Logistic Regression as our model. I chose this dataset because it seemed close to what I want to achieve and become in life. StandardScaler can be influenced by outliers (if they exist in the dataset), since it involves the estimation of the empirical mean and standard deviation of each feature.

Juan Antonio Suwardi - antonio.juan.suwardi@gmail.com

Because the project objective is data modeling, we begin by building a baseline model with the existing features. Around 73% of people have no university enrollment. A company is interested in understanding the factors that may influence a data scientist's decision to stay with a company or switch jobs.
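The SMOTE description above (pick a minority example, pick one of its near neighbours, and draw a new sample on the line between them) can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name smote_like_oversample, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances; a point must not count as its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest rows

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                         # random minority example
        neighbour = neighbours[base, rng.integers(k)]  # one of its k nearest neighbours
        gap = rng.random()                             # position along the line, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical minority-class points in a 2-D feature space.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
new_points = smote_like_oversample(X_min, n_new=10)
print(new_points)
```

Because every synthetic row lies on a segment between two existing minority rows, the new points stay inside the region already covered by the minority class. Plain interpolation like this only makes sense for numeric features, which is why a categorical-aware variant such as SMOTENC is needed for mixed data.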
Three of our columns (experience, last_new_job and company_size) had mostly numerical values, but some values which contained, The relevant_experience column, which had only two kinds of entries (Has relevant experience and No relevant experience), was under debate as to whether it should be dropped, since the experience column contained more detailed information regarding experience.

Question 3. So I moved on to using other variables to try to predict education_level, but first I had to make some changes to the data used: as you can see, I changed the gender and education level columns.

Goals: has features that are mostly categorical (nominal, ordinal, binary), some with high cardinality. Information regarding how the data was collected is currently unavailable.

Note that after imputing, I round the imputed label-encoded categories so they can be decoded as valid categories. Since SMOTENC, used for data augmentation, accepts non-label-encoded data, I need to save the fitted label encoders to use for decoding categories after KNN imputation. Using ROC AUC score to evaluate model performance. Don't label encode null values, since I want to keep missing data marked as null for imputing later.

After a final check of remaining null values, we moved on to visualization. We see an imbalanced dataset: most people are not job-seeking. In terms of the individual cities, 56% of our data was collected from only 5 cities. I also wanted to see how the categorical features related to the target variable.

Answer: trying out modelling the data, experience is a factor, with a logistic regression model reaching an AUC of 0.75. Note: 8 features have missing values. MICE is used to fill in the missing values in those features.
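The notes above on label encoding (skip nulls when encoding, impute, round, then decode with the saved encoder) can be sketched as follows. This is an illustration under assumptions: the two-column frame is invented, KNNImputer with n_neighbors=2 is an arbitrary choice, and the new column name education_level_imputed is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Invented rows (assumption) with missing education_level values.
df = pd.DataFrame({
    "education_level": ["Graduate", "Masters", None, "Graduate", "Phd", None],
    "training_hours": [36.0, 47.0, 40.0, 30.0, 90.0, 85.0],
})

# Fit the encoder on non-null values only, and keep it for decoding later.
not_null = df["education_level"].notna()
encoder = LabelEncoder()
encoder.fit(df.loc[not_null, "education_level"].tolist())

# Encode observed values; missing ones stay NaN so the imputer can find them.
encoded = pd.Series(np.nan, index=df.index, dtype=float)
encoded.loc[not_null] = encoder.transform(df.loc[not_null, "education_level"].tolist())

features = pd.DataFrame({"education_level": encoded,
                         "training_hours": df["training_hours"]})
imputed = KNNImputer(n_neighbors=2).fit_transform(features)

# Imputed codes are averages and may be fractional: round to the nearest valid
# code, clip to the known range, then decode back to category names.
codes = np.rint(imputed[:, 0]).astype(int)
codes = np.clip(codes, 0, len(encoder.classes_) - 1)
df["education_level_imputed"] = encoder.inverse_transform(codes)
print(df)
```

Rounding an averaged label code treats the categories as if they were ordered, which is only a reasonable approximation for ordinal features; for nominal features a mode-based imputation may be a safer choice.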
Kaggle data set HR Analytics: Job Change of Data Scientists (XGBoost), 2021-02-27.

Hence, to reduce the cost of training, the company wants to predict which candidates are really interested in working for the company and which candidates may look for new employment once trained. Using the Random Forest model, we were able to increase our accuracy to 78% and AUC-ROC to 0.785. Then I decided to have a quick look at histograms showing what numeric values are given, and info about them. Take a shot at building a baseline model that would show basic metrics.

Variable 1: Experience. Of course, there is a lot of work left to further drive this analysis if time permits. All of the data comes from the personal information trainees provide when they register for the training. Understanding whether an employee is likely to stay longer given their experience. Through the above graph, we were able to determine that most people who were satisfied with their job belonged to more developed cities. For the full end-to-end ML notebook with the complete codebase, please visit my Google Colab notebook.
Synthetically sampling the data using the Synthetic Minority Oversampling Technique (SMOTE) results in the best-performing Logistic Regression model, as seen from the highest F1 and recall scores above. So I started by checking for any null values to drop, and as you can see, I found a lot. A more detailed and quantified exploration shows an inverse relationship between experience (in number of years) and the perpetual job dissatisfaction that leads to job hunting. This dataset contains a typical example of class imbalance. This problem is handled using SMOTE (Synthetic Minority Oversampling Technique). However, according to a survey, it seems some candidates leave the company once trained.