Can promotional offers buy you a coffee?

We always receive a lot of promotional offers, which we may think have no impact on our purchasing decisions. Is that true?
Can Data Science optimize the strategy in order to sell more Frappuccinos?


We are going to analyze a simulated Starbucks data set, where promotional offers are sent to users, and then transactions and offer completions are recorded. Offers can be of different types: purely informational, discount, buy one get one free, ...

Some users might not receive any offer during certain weeks, and not all users receive the same offer; that is the challenge to solve with this data set.

The objective of the analysis is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

In this article, we will start by cleaning and preparing the data set, and performing some statistical and graphical analysis to understand the subject.

After that, we will try to find a strategy for selecting the best offer for a specific user. The problem will be approached with machine learning, applying some models to classify a specific user & offer pair as working or not. The idea is that every day we can decide whether or not to send an offer to a user by running the model and following its prediction.

Considering how the problem has been approached, the best and easiest to understand performance metric is the percentage of correct predictions made by the model.

To look at the data and my analysis in detail, feel free to click here!

Part I — Data cleaning

Looking at the data (for more detail, click here), we’ve got:

portfolio.json with the specification of 10 offers, such as the duration of the offer or the reward, …

profile.json with 17k users, with the following distribution for age and gender. We can see that some processing is needed: there is a group of users recorded as 118 years old, whose age has been replaced with the average of the distribution.
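As a rough illustration of this cleaning step, here is a minimal sketch using only the Python standard library. It assumes that the average is computed over the remaining, plausible ages and then rounded; the article does not say exactly how the average was taken, and the function name `clean_ages` is hypothetical.

```python
from statistics import mean

PLACEHOLDER_AGE = 118  # the implausible age found in profile.json


def clean_ages(ages, placeholder=PLACEHOLDER_AGE):
    """Replace placeholder ages with the rounded mean of the other ages.

    Assumption: the mean is taken over the valid ages only.
    """
    valid = [a for a in ages if a != placeholder]
    if not valid:
        raise ValueError("no valid ages to compute a mean from")
    fill = round(mean(valid))
    return [fill if a == placeholder else a for a in ages]


# Toy data, not taken from the real data set
ages = [25, 118, 40, 55, 118]
cleaned_ages = clean_ages(ages)
print(cleaned_ages)  # [25, 40, 40, 55, 40]
```

Excluding the placeholder before averaging keeps the fill value from being pulled up by the 118s themselves.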

transcript.json contains:

  • the offers sent to users at specific times
  • the offers seen by users
  • the offers completed by users
  • the transactions made by users

A big part of the cleaning has been to process this information, for example to avoid counting the completion of an offer as successful when the user hasn’t seen it! We don’t want to send offers to users who will complete them anyway!
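The rule "a completion only counts if the offer was viewed first" can be sketched as follows. This is a simplified illustration, not the actual cleaning code: the event labels (`"received"`, `"viewed"`, `"completed"`), the dictionary keys and the function name are all hypothetical, and the sketch ignores offer duration and offers sent more than once to the same user.

```python
from collections import defaultdict


def count_valid_completions(events):
    """Count completions per (user, offer) that happen after a view.

    Assumption: each event is a dict with hypothetical keys
    'user', 'offer', 'event' and 'time'.
    """
    first_view = {}           # (user, offer) -> time of first view
    valid = defaultdict(int)  # (user, offer) -> completions after a view
    for ev in sorted(events, key=lambda e: e["time"]):
        key = (ev["user"], ev["offer"])
        if ev["event"] == "viewed":
            first_view.setdefault(key, ev["time"])
        elif ev["event"] == "completed" and key in first_view:
            valid[key] += 1
    return dict(valid)


# Toy events: u1 views then completes; u2 completes before viewing
events = [
    {"user": "u1", "offer": "o1", "event": "received", "time": 0},
    {"user": "u1", "offer": "o1", "event": "viewed", "time": 6},
    {"user": "u1", "offer": "o1", "event": "completed", "time": 12},
    {"user": "u2", "offer": "o1", "event": "received", "time": 0},
    {"user": "u2", "offer": "o1", "event": "completed", "time": 10},
    {"user": "u2", "offer": "o1", "event": "viewed", "time": 20},
]
valid_completions = count_valid_completions(events)
print(valid_completions)  # {('u1', 'o1'): 1}
```

Here the completion by `u2` is discarded because it happened before the offer was seen, so the offer cannot be credited for that purchase.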

To go on with the selected strategy, “promotional offers” are not suitable, because they can’t be classified as working or not, and so they are removed from the data set.

The result of the data cleaning has been:

  • a dataframe with one row for every available offer and user pair (with related data), indicating whether the offer has been viewed, whether it has been completed, how many times it was completed after being viewed and, to summarize, whether the offer is working and the amount of money brought in by the offer.
  • Some variations of the first dataframe, with users as rows and all offers as columns, indicating either whether each offer is working or the amount of money it brought in.

A lot of effort has been spent on getting the data into the form described, so that the following steps can rely on a trusted and reliable parameter.

Cleaning the data set is fundamental in this project

Part II — Statistical analysis

Are there differences in the population that can be used to optimize the offer strategy?

It seems so: looking at how the density varies with age, member days and income, the two groups of people (yes: people who successfully complete offers; no: people who don’t complete offers) behave differently.

For example, users are more interested in offers right after they register their profile, but then their interest decreases.

Some differences can also be seen by looking at the same graph generated only for a particular type of offer (BOGO: buy one get one free).

Part III — Forecast

Can we build a model that can help us in selecting the best offer?

I tried to solve this problem with machine learning, classifying a specific user paired with a specific offer as “working” or “not working” (binary classification). The idea is that we can use the model to decide whether a specific offer will be successful if presented to a specific user.

The input of the model is the user information combined with the offer information; running the model with both inputs (a specific user & a specific offer) returns a prediction of True or False, which answers the question of whether the offer will work for that user.
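To make the shape of the model input concrete, here is a minimal sketch of how a user record and an offer record could be merged into a single feature row. The field names (`age`, `income`, `member_days`, `duration`, `reward`) and the `user_`/`offer_` prefixes are assumptions chosen for illustration; the actual feature set used in the analysis may differ.

```python
def build_features(user, offer):
    """Merge user and offer attributes into one flat feature row.

    Prefixes keep the two sources apart if they share a field name.
    """
    row = {f"user_{name}": value for name, value in user.items()}
    row.update({f"offer_{name}": value for name, value in offer.items()})
    return row


# Toy records, not taken from the real data set
user = {"age": 40, "income": 60000, "member_days": 120}
offer = {"duration": 7, "reward": 5}
features = build_features(user, offer)
print(features)
```

One such row per user & offer pair, together with its True/False "working" label, is what a classifier would be trained on.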

The performance is evaluated as the percentage of correct classifications made by the model. Training has been performed on 70% of the data set, leaving the remaining 30% for testing, in order to detect over-fitting and have more reliable results.
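The split and the metric can be sketched with the standard library alone. This is only an illustration of the idea: the function names and the fixed seed are assumptions, and in practice a library helper (for example from scikit-learn) would probably be used instead.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle a copy of the samples and split it 70% / 30%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


samples = list(range(10))
train, test = split_70_30(samples)
print(len(train), len(test))  # 7 3

# Three of the four toy predictions are correct
acc = accuracy([True, False, True, True], [True, False, False, True])
print(acc)  # 0.75
```

The model only ever sees `train` during fitting, so the accuracy measured on `test` reflects how it behaves on user & offer pairs it has not seen before.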

The first attempt returned a performance of 67.5% with a Random Forest.

Part IV — Improve performances

How can we do better?

First we can try to improve the performance with other models, like AdaBoost or a Gradient Boosting classifier.

The best result was given by a Gradient Boosting classifier, which returned the correct result in 70.4% of cases.

Then I tried to tune the models’ parameters with grid search:

  • Random Forest performance increased to 70.5%
  • Gradient Boosting instead was already near its maximum, because the final result was 70.6%

The improvement of Random Forest is interesting, with a gain of 3 percentage points (from 67.5% to 70.5%); for Gradient Boosting, instead, the improvement is marginal (from 70.4% to 70.6%).
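For readers who have not used grid search before, here is a minimal sketch of what tuning a Random Forest can look like. It assumes scikit-learn, runs on synthetic data generated with `make_classification` rather than on the Starbucks data, and uses a small hypothetical parameter grid; the parameters actually searched in the analysis are not listed in this article, so the numbers it prints are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the user & offer feature table
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical grid: every combination is evaluated with cross-validation
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best combination is refit on the whole training set by default
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

The cross-validation happens inside the training set only, so the 30% test set stays untouched until the final score.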

Best result: over 70% of user & offer pairs correctly classified as working or not!


This project has been interesting, because I’ve seen that there are a lot of possibilities in analyzing the Starbucks data set.

I decided to approach it in steps:

  • cleaning and preparing the data set
  • some statistical and graphical analysis
  • classification of offer&user as working or not

The obtained performance of more than 70% of correct classification is a starting point.

I would spend more time on cleaning and preparing the data, which I found fundamental for having reliable results, because the problem is complex and there is a lot of information in the data set that needs pre-processing.
Other approaches can also be tested, focusing more on the transaction data and also on “promotional offers”: a strategy with higher performance could be a combination of different models.

Cleaning the data set and combination of strategy…

In fact, I also tried some regressions on the amount of money brought in by offers, but low performance (probably due to the limited data available about people) stopped me.

A recommendation engine can also be a good approach: trying to select the best offer, maybe followed by a confirmation step performed with the described classification model, can lead to good results.

To complete the development of the strategy, a good A/B test would be required, to confirm in the real world the performance obtained with back-testing.

Have you worked on a similar subject? What’s your approach?

I am a biomedical engineer. I like technology and software, from data science to web development, but also comics and board games, walking, swimming, friends!