Starbucks capstone project

 



Problem Statement:

The central challenge in this project was transforming diverse and varied data into a format suitable for machine learning analysis. This included merging disparate data frames, handling categorical variables effectively, and deciding how best to represent offer-related information: using offer IDs is direct, while using offer attributes carries more predictive potential, and balancing the two posed a significant dilemma. Because transactional and offer-related events arrived as separate records, they had to be merged and categorized for streamlined analysis. The ultimate objective was a predictive model that leverages customer demographics to forecast offer completion, facilitating informed decision-making in targeted offer distribution.

Metrics:

We will use accuracy to decide whether a machine learning model is effective. After selecting the best model, we will also evaluate it with accuracy, F-beta, recall, and precision.

Data Exploration:

After importing the data, I started with basic exploration to find simple relationships in the raw data. There are 17K customers and 10 offers. Each customer record includes age, gender, and income, though some customers have missing data; each offer record includes the offer type and the broadcasting method. The transcript file is the most complicated: it has four event types (transaction, offer received, offer viewed, and offer completed), and transactions are not linked to offers by an offer ID, which makes them hard to connect.
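The event structure described above can be sketched with a tiny, hypothetical slice of the transcript (the real file has hundreds of thousands of events; the column names and dictionary keys below mirror the common layout of this dataset, but are shown here only as an assumption):

```python
import pandas as pd

# Miniature, made-up stand-in for the transcript file.
transcript = pd.DataFrame({
    "person": ["a", "a", "a", "b", "b"],
    "event": ["offer received", "offer viewed", "transaction",
              "offer received", "offer completed"],
    "time": [0, 6, 132, 0, 168],
    "value": [{"offer id": "x1"}, {"offer id": "x1"}, {"amount": 9.5},
              {"offer id": "x2"}, {"offer_id": "x2", "reward": 5}],
})

# Tally the four event types. Note that transactions carry an "amount"
# while offer events carry an offer id, so they must be unpacked separately
# before offers and transactions can be connected.
counts = transcript["event"].value_counts()
print(counts.to_dict())
```

This is exactly the difficulty mentioned above: the "value" payload differs per event type, so transactions have no offer ID to join on directly.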


Check the full code in the GitHub repository.



Exploratory Visualization:

Data Processing:

After data exploration, I merged the three data frames into a single frame containing all the important information. I then merged transactions with offer-completed events, since both should share the same customer and the same time and appear in order. Next, a for loop checked each customer and labeled every received offer as one of: received only; received and viewed but not completed; received, viewed, and completed; received and completed but not viewed; or received and completed but viewed only after the transaction. This was the longest step in the processing. After that I dropped unneeded rows, so the machine learning model would work only on relevant events. Finally, I created dummy variables for all categorical variables, standardized the continuous variables, and dropped all unneeded columns to avoid redundant information. I then started applying machine learning techniques to find the best one.
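The last two preprocessing steps (dummy variables for categoricals, standardization for continuous features) can be sketched as follows. The column names are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Toy slice of a merged customer/offer frame (made-up values).
df = pd.DataFrame({
    "gender": ["F", "M", "M", "O"],
    "offer_type": ["bogo", "discount", "bogo", "informational"],
    "age": [25, 40, 55, 70],
    "income": [40000.0, 60000.0, 80000.0, 100000.0],
})

# One-hot encode the categorical variables...
df = pd.get_dummies(df, columns=["gender", "offer_type"])

# ...and standardize the continuous ones to zero mean and unit variance.
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df.columns.tolist())
```

Dropping one dummy per category (`drop_first=True`) is another common choice to avoid redundant columns, which matches the note above about removing redundant information.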

 

 

Algorithms and Techniques:

This is a supervised classification problem, so I needed to find suitable models. I chose the four models below.

AdaBoostClassifier, GaussianNB, DecisionTreeClassifier, RandomForestClassifier

After trying each of them with default parameters the results for training and testing were as below

Training with AdaBoostClassifier

Training Accuracy: 0.8778

Testing Accuracy : 0.8753

Training with GaussianNB

Training Accuracy: 0.72981

Testing Accuracy : 0.7286

Training with DecisionTreeClassifier

Training Accuracy: 1.0

Testing Accuracy : 0.8266

Training with RandomForestClassifier

Training Accuracy: 0.9999

Testing Accuracy : 0.8677
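The comparison above can be sketched with a loop over the four classifiers using their default parameters. Synthetic data stands in for the processed offer frame, so the printed scores are illustrative, not the project's numbers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed data; shape and labels are made up.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for clf in (AdaBoostClassifier(), GaussianNB(),
            DecisionTreeClassifier(), RandomForestClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__,
          "train:", round(clf.score(X_train, y_train), 4),
          "test:", round(clf.score(X_test, y_test), 4))
```

The same pattern appears here as in the results above: tree-based models can hit near-perfect training accuracy while generalizing less well, which is why the testing score is the number to compare.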

The random forest classifier was chosen to continue with; another good option would be the AdaBoost classifier. I chose random forest because the testing accuracies are very close, while its training accuracy is much higher.

For hyperparameter tuning, GridSearchCV would take hours and too much computational power, so I ran RandomizedSearchCV instead.

These are the hyperparameters:

params = {'n_estimators': [75, 100, 125, 150],
          'max_features': ['sqrt', 'log2', None],
          'max_depth': [10, 20, 30, 40, 50],
          'min_samples_split': [2, 10, 15, 20],
          'min_samples_leaf': [1, 2, 5, 10]}

Then I chose the number of iterations to be 200.
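The randomized search can be sketched as below. Synthetic data replaces the real frame, and n_iter is reduced to a handful of draws so the sketch runs quickly; the project used n_iter=200:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the processed data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

params = {'n_estimators': [75, 100, 125, 150],
          'max_features': ['sqrt', 'log2', None],
          'max_depth': [10, 20, 30, 40, 50],
          'min_samples_split': [2, 10, 15, 20],
          'min_samples_leaf': [1, 2, 5, 10]}

# RandomizedSearchCV samples parameter combinations instead of trying
# all of them (here 4*3*5*4*4 = 960 combinations), which is why it is
# far cheaper than GridSearchCV.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            params, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```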

After running the randomized search, the best parameters were:

{'n_estimators': 75, 'min_samples_split': 15, 'min_samples_leaf': 10, 'max_features': None, 'max_depth': 10}

After running the model again, the new accuracy is:

Training Accuracy: 0.8854

Testing Accuracy : 0.8791

Benchmark:

To be able to compare the result, I also ran hyperparameter tuning on the AdaBoost classifier to get another result to benchmark against.

After running hyperparameter tuning with

params = {'n_estimators': [40, 45, 50, 55, 60],
          'learning_rate': [0.8, 0.9, 1, 1.1, 1.2],
          'algorithm': ['SAMME', 'SAMME.R']}

The best parameters are {'n_estimators': 60, 'learning_rate': 1.1, 'algorithm': 'SAMME.R'}

And the results are

Training Accuracy: 0.8772

Testing Accuracy : 0.8745

The random forest classifier is slightly better than the AdaBoost classifier, by about 0.5%.

Implementation Process:

To implement the model, the data needed to be precise for the machine learning model. Having separate rows for different events, such as offer received and offer viewed, does not provide a clear output for the model to work on; there should be clear features to test and a clear target to predict.

That is why merging the data into one line per offer and dropping unrelated transactions was a good option for the final form of the data.

After that I needed to find a good model for supervised classification. As mentioned in the previous step, I tried several of them, and random forest provided a good result.

Improvement Process:

Improving the results has two parts: finding the best model, and finding the best hyperparameters through hyperparameter tuning. First I tried 4 different models and kept the best 2 of them. To determine which was better, I ran hyperparameter tuning for both models. With RandomizedSearchCV, I got the best hyperparameters for both models, as mentioned before, and then built a model with each and got the accuracy scores.

Results:

After I got the two models, I ran another test to get the accuracy, F-beta, recall, and precision scores, and compared the two models to each other.

Results were

For random forest

Testing Accuracy : 0.8789

Testing Fbeta : 0.8749

Testing Recall : 0.9263

Testing Precision : 0.8064

For ada boost

Testing Accuracy : 0.8745

Testing Fbeta : 0.8698

Testing Recall : 0.8922

Testing Precision : 0.8176

Random forest scores better overall, winning 3 out of the 4 metrics; the only exception is precision, and there the difference is only about 1%.
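These four metrics come straight from sklearn.metrics. The sketch below uses made-up predictions (the project computed these on the held-out test set), and beta=0.5 is an assumption since the post does not state which beta was used:

```python
from sklearn.metrics import (accuracy_score, fbeta_score,
                             precision_score, recall_score)

# Illustrative labels and predictions, not the project's test set.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# TP=4, FP=1, FN=1, TN=2 for this toy example.
print("accuracy :", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("fbeta    :", fbeta_score(y_true, y_pred, beta=0.5))
print("recall   :", recall_score(y_true, y_pred))     # 4/5 = 0.8
print("precision:", precision_score(y_true, y_pred))  # 4/5 = 0.8
```

Reporting recall and precision alongside accuracy matters here because the two models trade them off: random forest wins on recall while AdaBoost is slightly ahead on precision.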

To verify the results, I tried the same model with different seeds and got the results below.

Seed : 42

Testing Accuracy : 0.8789

Testing Fbeta : 0.8748

Testing Recall : 0.9283

Testing Precision : 0.8045

Seed :789

Testing Accuracy : 0.8775

Testing Fbeta : 0.8739

Testing Recall : 0.9311

Testing Precision : 0.8022

Scores are similar across different random seeds for all 4 metrics, after testing on the 2 models.
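The seed check above can be sketched by re-splitting and re-fitting with each seed; again the data here is synthetic, so the printed scores only illustrate the procedure:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed data.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

scores = {}
for seed in (42, 789):
    # The seed controls both the train/test split and the forest itself.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores[seed] = clf.score(X_te, y_te)

print(scores)
```

If the scores stayed far apart across seeds, that would suggest the result depends on a lucky split rather than on the model.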

