Customer Segmentation Report for Arvato Financial Services

In this project, we solve a real-life data science problem in customer segmentation and targeted advertising. Demographics data for both the customers and the general population of a mail-order sales company in Germany are provided by Bertelsmann Arvato Analytics. First, unsupervised learning techniques are applied to perform customer segmentation and to analyze which parts of the population best describe the company's core customer base. Then, supervised learning models are trained on a third dataset containing demographics information for the targets of one of the company's marketing campaigns, and the optimized model is used to predict which individuals are most likely to become customers. Finally, the predictions on a fourth dataset are submitted to a Kaggle competition.

There are four data files associated with this project:

- Azdias: demographics data for the general population of Germany
- Customers: demographics data for customers of the mail-order company
- Mailout-Train: demographics data for individuals who were targets of a marketing campaign, together with their response labels
- Mailout-Test: demographics data for campaign targets whose responses are to be predicted and submitted to Kaggle

A feature-to-missing-value map is provided that lists, for each feature, the codes that stand for missing values. After mapping those codes to NaN, the percentage of missing values in each feature is plotted; while most features have less than 20% missing values, all features with more than 20% missing values are dropped.

We then checked the number of missing features in each row. The plot shows that over 90% of the rows have fewer than 10 missing features, so all rows with more than 10 missing features are dropped as well.

With the above processing, a data-cleaning function can be defined and applied to both datasets. After cleaning, the shape of the Azdias dataset is (146323, 369), and the shape of the Customers dataset is (131203, 369).
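A minimal sketch of this cleaning step, assuming the missing-value codes have already been mapped to NaN; the function name clean_data and the parameterized thresholds are illustrative:

```python
import pandas as pd

def clean_data(df: pd.DataFrame, col_thresh: float = 0.2, row_thresh: int = 10) -> pd.DataFrame:
    """Drop features with > 20% missing values, then rows with > 10 missing features."""
    # Keep only columns whose fraction of NaNs is at or below the threshold.
    df = df.loc[:, df.isnull().mean() <= col_thresh]
    # Keep only rows with at most `row_thresh` missing features.
    return df[df.isnull().sum(axis=1) <= row_thresh]
```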

In this section, principal component analysis (PCA) and k-means clustering are applied to compare the structure of the general population dataset with that of the customer dataset.

Prior to the analysis, additional processing is required to transform the dataframe into a purely numeric matrix. The first step is imputing all NaNs with a value; here we use the Imputer class from sklearn.preprocessing with the mean strategy. The second step is normalizing all the data with the StandardScaler class, also from sklearn.preprocessing.
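A sketch of these two steps; note that in recent scikit-learn versions the Imputer class mentioned above lives in sklearn.impute as SimpleImputer, and azdias_clean stands for the cleaned dataframe from the previous step:

```python
from sklearn.impute import SimpleImputer  # called `Imputer` in older scikit-learn
from sklearn.preprocessing import StandardScaler

# Replace every NaN with the column mean, then standardize each feature
# to zero mean and unit variance.
imputer = SimpleImputer(strategy="mean")
azdias_imputed = imputer.fit_transform(azdias_clean)

scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias_imputed)
```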

After this preprocessing, we can apply dimensionality reduction to the datasets. PCA is used, and we initialize the process by checking how much variance each component explains. The figure of cumulative explained variance by the first N principal components shows that about 70% of the information is retained by only 125 components, so we apply PCA with 125 components.
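A sketch of both passes, first inspecting the cumulative explained variance and then re-fitting at the chosen size (azdias_scaled comes from the preprocessing above):

```python
from sklearn.decomposition import PCA

# First pass: fit with all components to inspect the variance curve.
pca = PCA()
pca.fit(azdias_scaled)
cumulative_variance = pca.explained_variance_ratio_.cumsum()

# Per the figure, ~70% of the variance is kept by the first 125 components,
# so re-fit with that fixed number and transform the data.
pca_125 = PCA(n_components=125)
azdias_pca = pca_125.fit_transform(azdias_scaled)
```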

In this substep, we apply k-means clustering to the dataset and use the average within-cluster distance from each point to its assigned cluster's centroid to decide how many clusters to keep. Sklearn's KMeans class is used to perform k-means clustering on the PCA-transformed data.

The average distance score is plotted for a range of centroid counts, and the elbow method is used to choose the number of clusters; from the curve, 28 clusters looks like a reasonable value. We can then apply the same transformations to the customer dataset.
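A sketch of the elbow search; KMeans.score returns the negative sum of squared distances to the centroids, so negating it and dividing by the number of points gives the average (squared) distance used here. The range of k values is illustrative:

```python
from sklearn.cluster import KMeans

# Average within-cluster distance for a range of cluster counts.
scores = []
for k in range(2, 40, 2):
    km = KMeans(n_clusters=k, random_state=42).fit(azdias_pca)
    scores.append(-km.score(azdias_pca) / len(azdias_pca))

# The elbow in the plotted curve suggested k = 28, so fit the final model there.
kmeans = KMeans(n_clusters=28, random_state=42).fit(azdias_pca)
```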

Now that we have clusters and cluster centers for the general population, it's time to see how the customer data maps onto those clusters and how it compares to the general demographics data.

The figures above show the value counts in each cluster and the percentage difference between the customer dataset and the general dataset. Clusters 7 and 10 are the two most over-represented clusters in the customer data, while Cluster 2 is the most under-represented.
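A sketch of this comparison, assuming customers_pca is the customer data pushed through the same imputer, scaler, and PCA that were fitted on the general population:

```python
import pandas as pd

# Assign every individual in both datasets to one of the 28 clusters.
general_labels = kmeans.predict(azdias_pca)
customer_labels = kmeans.predict(customers_pca)

# Share of each dataset falling into each cluster.
general_pct = pd.Series(general_labels).value_counts(normalize=True).sort_index()
customer_pct = pd.Series(customer_labels).value_counts(normalize=True).sort_index()

# Positive differences mark clusters over-represented among customers
# (clusters 7 and 10 here); negative ones are under-represented (cluster 2).
difference = customer_pct - general_pct
```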

The features that correlate with each cluster are further investigated through the inverse transformation of PCA.
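One way to sketch that inspection: map a cluster centroid from PCA space back to the original (scaled) feature space and look at the largest weights. Here feature_names, the column list of the cleaned dataframe, is assumed to have been kept from the cleaning step:

```python
import pandas as pd

# Centroid of an over-represented cluster, mapped back to feature space.
centroid = kmeans.cluster_centers_[7].reshape(1, -1)
weights = pca_125.inverse_transform(centroid)[0]

# Features with the largest absolute weights characterize the cluster.
top_features = (
    pd.Series(weights, index=feature_names).abs().sort_values(ascending=False).head(10)
)
```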

Now that we've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each row in the "MAILOUT" data files represents an individual who was targeted for a mailout campaign. The "MAILOUT" data has been split into two approximately equal parts, each with almost 43,000 rows.

Figure: class distribution of the output label (positive responses are a small minority).

Similar to the unsupervised learning part, we clean the data by dropping features with a high percentage of missing values. One big difference, however, is that we cannot drop rows, since doing so might discard some of the already scarce positive-labeled examples.

To start, RandomForestClassifier, GradientBoostingClassifier, and XGBClassifier are used to train models. RandomForestClassifier performs poorly with an AUC of only 0.51, while XGBClassifier and GradientBoostingClassifier show similar performance with AUC values of 0.756 and 0.754 respectively, so XGBClassifier is adopted for optimization and training.
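A sketch of this comparison, assuming X_train and y_train hold the cleaned MAILOUT training features and response labels; stratified folds keep the very low positive rate consistent across splits, and ROC AUC is the metric throughout:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare the three candidates by mean cross-validated AUC.
for model in (RandomForestClassifier(), GradientBoostingClassifier(), XGBClassifier()):
    auc = cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=cv).mean()
    print(f"{type(model).__name__}: {auc:.3f}")
```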

The parameter tuning and best-predictor information are listed in the figures above.
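The exact search space is not shown in this post, so the grid below is only an illustrative sketch of tuning the chosen XGBClassifier with GridSearchCV over a few common knobs:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative placeholder ranges; the actual grid used may differ.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```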

With the first entry, I got a rank of №13, and the AUC score is only 0.0035 below the leader's. I believe that with more hyperparameter tuning and preprocessing, the result could be even better.

In this post, a real-life data science project on customer segmentation and targeted advertising is completed. On datasets of demographic information for both the general population and customers, both unsupervised and supervised learning methods are investigated and analyzed.

With the unsupervised methods, PCA and k-means clustering are used to create the segmentation report comparing customer data with general population data. For the supervised learning, because the data is highly imbalanced, the StratifiedKFold method is applied to split the train and test sets, and a boosting-tree classifier is then used to predict customers' responses from demographic data. In the Kaggle competition, a rank of 13 is achieved using the output of the optimized classifier.

In the future, to further increase the performance of the prediction, we can try other ensemble methods like AdaBoost, train longer with more folds in cross-validation, search and fine-tune more hyperparameters, or preprocess the dataset using information from the unsupervised learning step.

Finally, I want to give special thanks to Udacity for providing the Data Science training material, and to Bertelsmann Arvato Analytics for providing the real-life data.
