My first hands-on data science project during my MSc was a Kaggle competition in which we had to predict whether an e-commerce fashion product would be returned by the customer. We used supervised predictive analytics to estimate the probability of return, since the business planned to display warnings for the items most likely to be returned. My submission reached an AUC score of 0.72, while the top leaderboard position sat at 0.76. The assignment was as follows:
Assignment
This assignment will allow you to apply your skills in business analytics on real-world data from the field of e-commerce and customer targeting and practice the scientific methods for rigorous testing and documentation. Passing the assignment is required to participate in the exam for the class Business Analytics and Data Science.
For the assignment, you are highly encouraged to go beyond the standard methods taught in class, make use of the scientific literature and conduct and document your own experiments with the data. Make sure to consider all stages of a typical modeling process:
- research the relevant technical and task-related knowledge in the literature
- gather, clean and preprocess the relevant data
- select the best model and model parameters
- deploy and assess the model in terms of performance and plausibility
with possible revision of any step or of the whole process.
Setting
Customers send back a substantial part of the products that they purchase online. Return shipping is expensive for online platforms, and return rates are said to reach 50% for certain industries and products. Nevertheless, free or inexpensive return shipping has become a customer expectation and de-facto standard in the fiercely competitive online clothing market, but shops have indirect ways to influence customer purchase behavior. For purchases where a return seems likely, a shop could, for example, restrict payment options or display additional marketing communication.
For this assignment, you are provided with real-world data by an online retailer. Your task is to identify the items that are likely to be returned. When a customer is about to purchase an item that is likely to be returned, the shop plans to show a warning message. Your task is to build a targeting model that balances potential sales and return risk in order to optimize shop revenue. The data you receive is artificially balanced (1:1 ratio between returns and non-returns).
Evaluation
The evaluation metric for this competition is the Area-under-the-ROC-curve (AUC) based on the probability scores of your model. The model will be evaluated on the class dataset. To pass the assignment and qualify for the exam, a final AUC score of at least 0.68 on the public leaderboard is required. You can submit multiple predictions and improve your score until the end date of the competition.
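To make the metric concrete, here is a minimal sketch of how such probability scores are evaluated with the pROC package used later in this write-up (the labels and scores below are made up for illustration):
library(pROC)
labels <- c(1, 0, 1, 1, 0, 0)             #made-up labels, 1 = returned
scores <- c(0.9, 0.2, 0.6, 0.4, 0.3, 0.7) #made-up predicted return probabilities
auc(labels, scores)  #AUC rewards ranking returns above non-returns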
For every order-item combination in the dataset, submission files should contain two and only two named columns: order_item_id and return. A higher probability score denotes a higher chance of return (1).
You can open .csv files in a text editor (not Excel) to check that everything is exactly right. The file should contain a header and have the following format:
order_item_id,return
100001,0.4
100002,0.05
etc.
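A quick programmatic check of that format (a sketch; the scores and file name here are hypothetical):
submission <- data.frame(order_item_id = c(100001, 100002),
                         return = c(0.4, 0.05))
stopifnot(identical(names(submission), c("order_item_id", "return")),
          all(submission$return >= 0 & submission$return <= 1))
write.csv(submission, "format_check.csv", row.names = FALSE)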
The code used for my submission was as follows:
#This is the submission of Alex Maarouf
#First, set the working directory
getwd()
setwd("C:/Users/Alex/Desktop/Kaggle Project")
df1 <- read.csv("BADS_known.csv")
df2 <- read.csv("BADS_unknown.csv")
str(df2)
summary(df1)
####Data cleaning of df1
{
#Note that df1 is the only set cleaned for now; after modeling and testing
#(and deciding that this cleaning sequence is the most appropriate for my model),
#df2 will be cleaned in the same way
#First, we start changing variables into their correct format
df1$return <- as.factor(df1$return)
df1$delivery_date <- as.Date(df1$delivery_date)
df1$order_date <- as.Date(df1$order_date)
df1$user_dob <- as.Date(df1$user_dob)
df1$user_reg_date <- as.Date(df1$user_reg_date)
df1$user_id <- as.factor(df1$user_id)
df1$item_id <- as.factor(df1$item_id)
df1$brand_id <- as.factor(df1$brand_id)
str(df1)
#it seems everything is ready now to continue
#I would like to take a look at the data by graphing some of the variables;
#after some time with the data, I realized that the unknown set covers a
#different period of time, so I decided to look at the data as a whole
#I created a dummy copy of df1 and removed the return variable from it
df1_p <- df1
df1_p$return <- NULL
dfmix <- rbind(df1_p, df2)
rm(df1_p)
#Data visualization
library(dplyr)
library(ggplot2)
# dfmix %>%
# ggplot(aes(x= order_date, y= item_id)) +
# geom_point()
#I tried many variations of ggplot just to take a look at the data;
#I only kept this example, which clearly shows how the data splits into the
#known and unknown sets when sorted by order date, with a big chunk of missing
#observations in the middle
#I wouldn't expect it to be easy for a model to accurately predict returns on
#clothing sold in one season (late autumn / early winter) when it was trained
#on data from a different season (summer and early autumn)
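#a sketch of another way to see the gap directly (it assumes dfmix as built
#above, with the known set occupying the first nrow(df1) rows):
# dfmix$set <- c(rep("known", nrow(df1)), rep("unknown", nrow(df2)))
# ggplot(dfmix, aes(x = order_date, fill = set)) +
#   geom_histogram(binwidth = 7, position = "identity", alpha = 0.5)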
#time to create some variables
#delivery_time is created first, as it seems likely to be relevant to the model
df1$delivery_time <- df1$delivery_date - df1$order_date
#because impossible delivery dates exist (delivery before the order date),
#this is cleaned for both delivery date and delivery time
df1$delivery_time <- as.numeric(df1$delivery_time)
summary(df1$delivery_time)
df1$delivery_time <- sapply(df1$delivery_time, function(x) if(is.na(x) || x<0) {NA} else {x})
summary(df1$delivery_time)
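#a vectorized equivalent of the sapply call above (same result, just idiomatic;
#which() drops the NA positions so the subscripted assignment is safe):
# df1$delivery_time[which(df1$delivery_time < 0)] <- NA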
#to clean delivery date, the implausible values are replaced with the order
#date plus seven days, roughly the mean delivery time
df1$delivery_date <- mapply(function(x,y) if(is.na(x) || x<y) {y+7} else {x}, df1$delivery_date, df1$order_date)
df1$delivery_date <- as.Date(df1$delivery_date, origin="1970-01-01")
#now to replace our NA values of delivery time
df1$delivery_time <- df1$delivery_date - df1$order_date
#now, user age is created
df1$user_age <- df1$order_date - df1$user_dob
df1$user_age <- as.numeric(df1$user_age)
df1$user_age_years <- df1$user_age%/%365
df1$user_age <- NULL
#also, it would be interesting to find users who registered after ordering
#(there weren't any, so this part was commented out)
# df1$error_reg <- if((df1$user_reg_date - df1$order_date) < 0) {TRUE} else {FALSE}
# summary(df1)
#
# #after finding out there are none, this variable is deleted
#
# df1$error_reg <- NULL
#back to age: I wouldn't consider ages above 100 or below 13 to be reliable,
#so they're replaced with NA
df1$user_age_years <- as.numeric(df1$user_age_years)
summary(df1$user_age_years)
df1$user_age_years <- sapply(df1$user_age_years, function(x) if(is.na(x) || x < 13 || x > 100) {NA} else {x})
summary(df1$user_age_years)
#I have a theory that if the same user ordered the same item more than once,
#there would be a higher chance of a return (it suggests a replacement order
#for one of various possible reasons)
df1_dup_cols <- df1[, c("user_id", "item_id")]
df1$is_duplicated <- duplicated(df1_dup_cols) | duplicated(df1_dup_cols, fromLast = TRUE)
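#for illustration, the duplicated(x) | duplicated(x, fromLast = TRUE) idiom
#flags every member of a duplicate group, not just the later repeats:
# v <- c("a", "b", "a")
# duplicated(v)                                  #FALSE FALSE  TRUE
# duplicated(v) | duplicated(v, fromLast = TRUE) # TRUE FALSE  TRUE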
#Because some item prices are zero, and a zero shouldn't simply be replaced
#with the mean or median price, the following code builds an index of all items
#by item id with their respective non-zero prices, and uses it to fill in the
#price of any item with a zero price value
#
#note that some observations still came out with zeros after running this,
#but they are left as is, since such an item might genuinely be free
#this wasn't assumed for all zero-price items because some of them were
#returned, and there is no (logical) reason to return a free item
item_prices_df1 <- df1[df1$item_price > 0, c("item_id", "item_price")]
item_prices_df1 <- item_prices_df1[!duplicated(item_prices_df1[, c("item_id")], fromLast = T),]
df1 <- merge(x = df1, y = item_prices_df1, by = "item_id", all.x = T)
df1$item_price <- mapply(function (price_x, price_y) if (price_x == 0 && !(is.na(price_y))) {price_y} else {price_x}, df1$item_price.x, df1$item_price.y)
df1$item_price.x <- NULL
df1$item_price.y <- NULL
rm(item_prices_df1)
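#for comparison, a dplyr sketch of the same group-wise imputation (not the code
#used for the submission; it fills zeros with the group maximum rather than the
#last non-zero price):
# df1 <- df1 %>%
#   group_by(item_id) %>%
#   mutate(item_price = ifelse(item_price == 0 & any(item_price > 0),
#                              max(item_price), item_price)) %>%
#   ungroup()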
#the following code uses the same logic as the previous block, but for user age, keyed on user id
users_age_df1 <- df1[!is.na(df1$user_age_years), c("user_id", "user_age_years")]
users_age_df1 <- users_age_df1[!duplicated(users_age_df1[, c("user_id")], fromLast = T),]
df1 <- merge(x = df1, y = users_age_df1, by = "user_id", all.x = T)
df1$user_age_years <- mapply(function (age_x, age_y) if (is.na(age_x) && !(is.na(age_y))) {age_y} else {age_x}, df1$user_age_years.x, df1$user_age_years.y)
df1$user_age_years.x <- NULL
df1$user_age_years.y <- NULL
rm(users_age_df1)
df1$delivery_date <- as.numeric(df1$delivery_date)
df1$order_date <- as.numeric(df1$order_date)
df1$user_dob <- as.numeric(df1$user_dob)
df1$user_reg_date <- as.numeric(df1$user_reg_date)
df1$delivery_time <- as.numeric(df1$delivery_time)
df1$is_duplicated <- as.integer(df1$is_duplicated)
}
####Data cleaning of df2
{
#First, we start changing variables into their correct format
df2$delivery_date <- as.Date(df2$delivery_date)
df2$order_date <- as.Date(df2$order_date)
df2$user_dob <- as.Date(df2$user_dob)
df2$user_reg_date <- as.Date(df2$user_reg_date)
df2$user_id <- as.factor(df2$user_id)
df2$item_id <- as.factor(df2$item_id)
df2$brand_id <- as.factor(df2$brand_id)
str(df2)
#cool, it seems everything is ready now to continue
#the feature engineering below mirrors the df1 block above, applied to df2
#time to create some variables
#delivery_time is created first, as it seems likely to be relevant to the model
df2$delivery_time <- df2$delivery_date - df2$order_date
#because impossible delivery dates exist (delivery before the order date),
#this is cleaned for both delivery date and delivery time
df2$delivery_time <- as.numeric(df2$delivery_time)
summary(df2$delivery_time)
df2$delivery_time <- sapply(df2$delivery_time, function(x) if(is.na(x) || x<0) {NA} else {x})
summary(df2$delivery_time)
#to clean delivery date, the implausible values are replaced with the order
#date plus seven days, roughly the mean delivery time
df2$delivery_date <- mapply(function(x,y) if(is.na(x) || x<y) {y+7} else {x}, df2$delivery_date, df2$order_date)
df2$delivery_date <- as.Date(df2$delivery_date, origin="1970-01-01")
#now to replace our NA values of delivery time
df2$delivery_time <- df2$delivery_date - df2$order_date
#now, user age is created
df2$user_age <- df2$order_date - df2$user_dob
df2$user_age <- as.numeric(df2$user_age)
df2$user_age_years <- df2$user_age%/%365
df2$user_age <- NULL
#also, it would be interesting to find users who registered after ordering
#(there weren't any, so this part was commented out)
# df2$error_reg <- if((df2$user_reg_date - df2$order_date) < 0) {TRUE} else {FALSE}
# summary(df2)
#
# #after finding out there are none, this variable is deleted
#
# df2$error_reg <- NULL
#back to age: I wouldn't consider ages above 100 or below 13 to be reliable,
#so they're replaced with NA
df2$user_age_years <- as.numeric(df2$user_age_years)
summary(df2$user_age_years)
df2$user_age_years <- sapply(df2$user_age_years, function(x) if(is.na(x) || x < 13 || x > 100) {NA} else {x})
summary(df2$user_age_years)
#I have a theory that if the same user ordered the same item more than once,
#there would be a higher chance of a return (it suggests a replacement order
#for one of various possible reasons)
df2_dup_cols <- df2[, c("user_id", "item_id")]
df2$is_duplicated <- duplicated(df2_dup_cols) | duplicated(df2_dup_cols, fromLast = TRUE)
#Because some item prices are zero, and a zero shouldn't simply be replaced
#with the mean or median price, the following code builds an index of all items
#by item id with their respective non-zero prices, and uses it to fill in the
#price of any item with a zero price value
#
#note that some observations still came out with zeros after running this,
#but they are left as is, since such an item might genuinely be free
#this wasn't assumed for all zero-price items because some of them were
#returned, and there is no (logical) reason to return a free item
item_prices_df2 <- df2[df2$item_price > 0, c("item_id", "item_price")]
item_prices_df2 <- item_prices_df2[!duplicated(item_prices_df2[, c("item_id")], fromLast = T),]
df2 <- merge(x = df2, y = item_prices_df2, by = "item_id", all.x = T)
df2$item_price <- mapply(function (price_x, price_y) if (price_x == 0 && !(is.na(price_y))) {price_y} else {price_x}, df2$item_price.x, df2$item_price.y)
df2$item_price.x <- NULL
df2$item_price.y <- NULL
rm(item_prices_df2)
#the following code uses the same logic as the previous block, but for user age, keyed on user id
users_age_df2 <- df2[!is.na(df2$user_age_years), c("user_id", "user_age_years")]
users_age_df2 <- users_age_df2[!duplicated(users_age_df2[, c("user_id")], fromLast = T),]
df2 <- merge(x = df2, y = users_age_df2, by = "user_id", all.x = T)
df2$user_age_years <- mapply(function (age_x, age_y) if (is.na(age_x) && !(is.na(age_y))) {age_y} else {age_x}, df2$user_age_years.x, df2$user_age_years.y)
df2$user_age_years.x <- NULL
df2$user_age_years.y <- NULL
rm(users_age_df2)
df2$delivery_date <- as.numeric(df2$delivery_date)
df2$order_date <- as.numeric(df2$order_date)
df2$user_dob <- as.numeric(df2$user_dob)
df2$user_reg_date <- as.numeric(df2$user_reg_date)
df2$delivery_time <- as.numeric(df2$delivery_time)
df2$is_duplicated <- as.integer(df2$is_duplicated)
}
df1_p <- df1
df1_p$return <- NULL
dfmix <- rbind(df1_p,df2)
dfmix <- dfmix[order(dfmix$order_item_id),]
#user_id and item_id come first after the merges; both are high-cardinality
#factors, so they are dropped before one-hot encoding the remaining factors
datam <- mlr::createDummyFeatures(dfmix[,c(-1,-2)])
idx2 <- c(1:100000)
idx3 <- c(100001:150000)
data <- datam[idx2,]
data2 <- datam[idx3,]
df1 <- df1[order(df1$order_item_id),]
data$return <- df1$return
####Modeling
##
#At this point, I had tried many iterations of decision trees that failed to
#predict returns on a randomly selected test set, so I moved on to random forests
##
#The function below evaluates my models using AUC
library(pROC)
print_accuracy <- function (target, prediction, name = "") {
acc_string <- paste(name, "AUC:", auc(target, prediction), sep = " ")
return(acc_string)
}
#Splitting my data into test and training sets
#First, for comparability, I set a seed
set.seed(1234)
#Randomly splitting my data
library(caret)
# idx <- createDataPartition(df1$return, p=0.8, list = FALSE)
# df1_train <- df1[idx,]
# df1_test <- df1[-idx,]
#After lots of trials, it seems that when my data is split randomly,
#the model scores better on the test set; however, those results are far
#from the scores displayed on submission.
#I think this is because the unknown set was split off sorted by order date,
#so I split my data non-randomly in the same way to test my models,
#and got results similar to my submission scores on Kaggle.
#This makes sense: a random split lets the training and test sets share the
#same time period, so seasonal information leaks into the test set and inflates
#the validation AUC, while a time-based split mimics the leaderboard setup.
idx1 <- c(1:80000)
idx <- c(80001:100000)
df1_train <- data[idx1,]
df1_test <- data[idx,]
#Because random forest doesn't accept NA values, and I'm not sure
#whether a future trial would use a model that does accept them,
#NA values are replaced with the column mean using a loop
#this is done on a third pair of data frames in case a new model is used
df3_train <- df1_train
df3_test <- df1_test
#(the is.numeric() guard skips the factor target column, where mean() is undefined)
for(i in 1:ncol(df3_train)){
  if(is.numeric(df3_train[,i])) {
    df3_train[is.na(df3_train[,i]), i] <- mean(df3_train[,i], na.rm = TRUE)
  }
}
for(i in 1:ncol(df3_test)){
  if(is.numeric(df3_test[,i])) {
    df3_test[is.na(df3_test[,i]), i] <- mean(df3_test[,i], na.rm = TRUE)
  }
}
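#note: the loops above impute the test set with its *own* column means; a
#leakage-free variant (assuming, as here, that every predictor is numeric after
#createDummyFeatures) would reuse the training means instead:
# num_cols <- names(df3_train)[sapply(df3_train, is.numeric)]
# train_means <- sapply(df3_train[num_cols], mean, na.rm = TRUE)
# for (col in num_cols) {
#   df3_test[is.na(df3_test[[col]]), col] <- train_means[[col]]
# }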
#ranger is a powerful (and fast) random forest implementation that can handle
#factor variables with more than 53 levels, so I tried it first; the run below
#is kept commented out because I eventually moved on to xgboost
library(ranger)
str(df3_train)
#
# rf <- ranger(return~order_date+user_title+user_state+item_id+brand_id+item_size+delivery_time+user_age_years+item_price+is_duplicated,
# data = df3_train, num.trees = 500, mtry=3, probability = TRUE, importance = "impurity")
# prediction <- predict(rf, df3_test)$predictions[,2]
# print_accuracy(df3_test$return, prediction, name = "Ranger RF # 1")
# importance(rf)
library(mlr)
task <- makeClassifTask(data = df3_train, target = "return")
xgb.mlr <- makeLearner("classif.xgboost", predict.type = "prob",
par.vals = list("nrounds" = 100, "verbose" = 0, "max_depth" = 4, "eta" = 0.15,
"gamma" = 0, "colsample_bytree" = 0.8,"min_child_weight" = 1, "subsample" = 0.8))
xgb <- mlr::train(xgb.mlr, task = task)
task2 <- makeClassifTask(data = df3_test, target = "return")
prediction <- predict(xgb, task = task2)
prediction$data$prob.1
print_accuracy(df3_test$return, prediction$data$prob.1, name = "XGBoost # 1")
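#the xgboost hyperparameters above were picked by hand; a cross-validated
#sanity check with mlr could look like this (a sketch, not part of the submission):
# rdesc <- makeResampleDesc("CV", iters = 5)
# cv <- resample(xgb.mlr, task, rdesc, measures = mlr::auc)
# cv$aggr  #mean AUC across the five folds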
# data2 <- mlr::createDummyFeatures(df2[,c(-1,-2, -8, -6,-7, -5,-10)])
task3 <- makeClassifTask(data = data, target = "return")
xgb.mlr <- makeLearner("classif.xgboost", predict.type = "prob",
par.vals = list("nrounds" = 100, "verbose" = 0, "max_depth" = 4, "eta" = 0.15,
"gamma" = 0, "colsample_bytree" = 0.8,"min_child_weight" = 1, "subsample" = 0.8))
xgb <- mlr::train(xgb.mlr, task = task3)
# task4 <- makeClassifTask(data = data2, target = "return")
prediction <- predict(xgb, newdata = data2)
str(prediction)
#data2 rows are sorted by order_item_id (dfmix was ordered above), so df2 must
#be sorted the same way before the predicted probabilities are attached
df2 <- df2[order(df2$order_item_id),]
df2$return <- prediction$data$prob.1
submission <- df2[,c("order_item_id", "return")]
write.csv(submission, "submission6.csv", row.names = FALSE)
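#a final sanity check before uploading (assuming, per the indices above, that
#the unknown set has 50,000 rows):
# check <- read.csv("submission6.csv")
# stopifnot(identical(names(check), c("order_item_id", "return")),
#           nrow(check) == 50000,
#           all(check$return >= 0 & check$return <= 1))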