Problem Statement:
The pivotal hurdle in this project was transforming diverse and varied data into a format suitable for machine learning analysis. Challenges included merging disparate data frames, handling categorical variables effectively, and deciding how best to represent offer-related information. Balancing the use of offer IDs (more direct) against offer details (more predictive potential) posed a significant dilemma. Resolving the split between transactional and offer-related data required merging and categorizing the information for streamlined analysis. The ultimate objective was a predictive model that leverages customer demographics to forecast offer completion, facilitating informed decisions about targeted offer distribution.
Metrics:
We will use accuracy to decide whether a machine learning model is effective for this problem. After selecting the best model, we will also evaluate its accuracy, F-beta, recall, and precision scores.
Data Exploration:
After importing the data, I started with basic exploration to find the simple relationships in the raw data. There are about 17K customers and 10 offers. Each customer has age, gender, and income recorded, though some customers have missing data. Each offer has an offer type and a broadcasting method. The transaction file is the most complicated: it contains 4 types of events (transaction, offer received, offer viewed, and offer completed), and transactions are not linked to offers by an offer ID, which makes them hard to connect.
Check the full code on the GitHub repository.
Exploratory Visualization:
Data Processing:
After data exploration, I worked on merging the 3 data frames into one containing all the important information. First I merged transactions with offer-completed events, since they should share the same customer and the same time and appear in order. Then I wrote a for loop that checks, per customer, whether each received offer was: received only; received and viewed but not completed; received, viewed, and completed; received and completed but not viewed; or received and completed but viewed only after the transaction was done. This was the longest step in the processing. After that I dropped unneeded rows, so the machine learning model would work only on relevant events. I then created dummy variables for all categorical variables, standardized the continuous variables, and dropped all unneeded columns to avoid redundant information. Finally, I started applying and trying the machine learning techniques to find the best one.
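The dummy-variable and standardization steps can be sketched as below. This is a minimal illustration on a toy data frame; the column names (`gender`, `offer_type`, `age`, `income`, `completed`) are assumptions standing in for the real merged data, not the project's actual schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the merged customer/offer data
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "offer_type": ["bogo", "discount", "bogo", "informational"],
    "age": [25, 40, 31, 58],
    "income": [40000, 72000, 55000, 90000],
    "completed": [0, 1, 1, 0],   # target: did the customer complete the offer
})

# One-hot encode the categorical variables
df = pd.get_dummies(df, columns=["gender", "offer_type"])

# Standardize the continuous variables to zero mean, unit variance
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])

print(df.columns.tolist())
```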
Algorithms and Techniques:
This is a supervised classification problem, so I needed to find suitable models, and I chose the 4 models below.
AdaBoostClassifier, GaussianNB, DecisionTreeClassifier, RandomForestClassifier
After trying each of them with default parameters, the training and testing results were as below.
Training with AdaBoostClassifier
Training Accuracy: 0.8778
Testing Accuracy : 0.8753
Training with GaussianNB
Training Accuracy: 0.72981
Testing Accuracy : 0.7286
Training with DecisionTreeClassifier
Training Accuracy: 1.0
Testing Accuracy : 0.8266
Training with RandomForestClassifier
Training Accuracy: 0.9999
Testing Accuracy : 0.8677
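The comparison above can be reproduced with a simple loop over the four classifiers. This sketch uses a synthetic dataset from `make_classification` in place of the project's real data, so the numbers it prints will differ from those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed offer data
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [AdaBoostClassifier(), GaussianNB(),
          DecisionTreeClassifier(), RandomForestClassifier()]

# Fit each model with default parameters and report both accuracies
for model in models:
    model.fit(X_train, y_train)
    print(f"Training with {type(model).__name__}")
    print(f"Training Accuracy: {model.score(X_train, y_train):.4f}")
    print(f"Testing Accuracy : {model.score(X_test, y_test):.4f}")
```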
The random forest classifier was chosen as the one to continue with; the AdaBoost classifier was another good option. I picked random forest because the testing accuracies are very close while its training accuracy is much higher.
For hyperparameter tuning, GridSearchCV would take hours and too much computational power, so instead I ran RandomizedSearchCV.
These are the hyperparameters:
params = {'n_estimators': [75, 100, 125, 150], 'max_features': ['sqrt', 'log2', None], 'max_depth': [10, 20, 30, 40, 50], 'min_samples_split': [2, 10, 15, 20], 'min_samples_leaf': [1, 2, 5, 10]}
I set the number of search iterations to 200. After running the randomized search, it returned the best parameters:
{'n_estimators': 75, 'min_samples_split': 15, 'min_samples_leaf': 10, 'max_features': None, 'max_depth': 10}
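The randomized search can be sketched as below, using the same parameter grid. A synthetic dataset stands in for the real one, and `n_iter` is reduced from the 200 used in the project so the sketch runs quickly; the best parameters found here will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the processed offer data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

params = {
    "n_estimators": [75, 100, 125, 150],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [10, 20, 30, 40, 50],
    "min_samples_split": [2, 10, 15, 20],
    "min_samples_leaf": [1, 2, 5, 10],
}

# The project used n_iter=200; a small n_iter keeps this sketch fast
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=params,
                            n_iter=10, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_)
```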
After retraining the model with these parameters, the new accuracies are:
Training Accuracy: 0.8854
Testing Accuracy : 0.8791
Benchmark:
To have a result to benchmark against, I also ran hyperparameter tuning on the AdaBoost classifier.
The tuning grid was:
params = {'n_estimators': [40, 45, 50, 55, 60], 'learning_rate': [0.8, 0.9, 1, 1.1, 1.2], 'algorithm': ['SAMME', 'SAMME.R']}
The best parameters are {'n_estimators': 60, 'learning_rate': 1.1, 'algorithm': 'SAMME.R'}
And the results are
Training Accuracy: 0.8772
Testing Accuracy : 0.8745
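The AdaBoost tuning can be sketched the same way. Since the grid is small, GridSearchCV is affordable here; this sketch uses a reduced grid on synthetic data, and omits the `algorithm` parameter because the `SAMME.R` option is deprecated (and later removed) in recent scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the processed offer data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Reduced grid; 'algorithm' omitted (SAMME.R deprecated in recent scikit-learn)
params = {"n_estimators": [40, 50, 60],
          "learning_rate": [0.8, 1.0, 1.2]}

search = GridSearchCV(AdaBoostClassifier(random_state=42), params, cv=3)
search.fit(X, y)
print(search.best_params_)
```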
The random forest classifier is slightly better than the AdaBoost classifier, by about 0.5%.
Implementing process:
To implement the model, the data needed to be in a precise shape for machine learning. Separate rows for different events, like offer received and offer viewed, don't provide a clear output for the model to work on; there should be clear features to train on and a clear target to predict.
That's why merging the data into one row per offer and dropping unrelated transactions was a good choice for the final form of the data.
After that I needed a good model for supervised classification; as mentioned in the previous step, I tried several of them, and random forest provided the best result.
Improving process:
Improving the results has two parts: finding the best model, and finding the best hyperparameters through tuning. First I tried 4 different models and kept the best 2 of them. To determine which of the two was better, I ran hyperparameter tuning for both models. With RandomizedSearchCV, I got the best hyperparameters for each, as mentioned before, and then trained a model with each set and compared their accuracy scores.
Results:
After I had the two tuned models, I ran another test to get the accuracy, F-beta, recall, and precision scores and compared the two models against each other.
The results were:
For random forest
Testing Accuracy : 0.8789
Testing Fbeta : 0.8749
Testing Recall : 0.9263
Testing Precision : 0.8064
For ada boost
Testing Accuracy : 0.8745
Testing Fbeta : 0.8698
Testing Recall : 0.8922
Testing Precision : 0.8176
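Computing these four scores can be sketched as below. The true and predicted labels here are made-up toy values, and the beta value for the F-beta score is an assumption (the write-up does not state which beta was used).

```python
from sklearn.metrics import (accuracy_score, fbeta_score,
                             precision_score, recall_score)

# Toy labels standing in for the held-out test set predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# beta=0.5 weights precision more than recall; an assumed choice
print(f"Testing Accuracy : {accuracy_score(y_true, y_pred):.4f}")
print(f"Testing Fbeta    : {fbeta_score(y_true, y_pred, beta=0.5):.4f}")
print(f"Testing Recall   : {recall_score(y_true, y_pred):.4f}")
print(f"Testing Precision: {precision_score(y_true, y_pred):.4f}")
```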
Random forest is better overall, winning 3 of the 4 scores; the only score where AdaBoost leads is precision, and there the difference is only about 1%.
To verify the results, I ran the same model with different random seeds and got the results below.
Seed : 42
Testing Accuracy : 0.8789
Testing Fbeta : 0.8748
Testing Recall : 0.9283
Testing Precision : 0.8045
Seed :789
Testing Accuracy : 0.8775
Testing Fbeta : 0.8739
Testing Recall : 0.9311
Testing Precision : 0.8022
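The seed-stability check can be sketched as a loop that re-splits and retrains with each seed. This version uses synthetic data and only the accuracy score, so its printed numbers will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed offer data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Re-split and retrain with each seed to check stability
for seed in (42, 789):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    print(f"Seed : {seed}")
    print(f"Testing Accuracy : {accuracy_score(y_te, model.predict(X_te)):.4f}")
```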
All 4 scores are similar across the different random seeds, so the results of the two tested models are stable.