IMB 621
Kiran R, Doctoral Student, Indian Institute of Management Lucknow, Arunabha Mukhopadhyay, Associate Professor, Indian Institute of
Management Lucknow, and U. Dinesh Kumar, Professor of DSIS, Indian Institute of Management Bangalore prepared this case for class
discussion. This case is not intended to serve as an endorsement, source of primary data, or to show effective or inefficient handling of decision
or business processes.
Copyright © 2017 by the Indian Institute of Management Bangalore. No part of the publication may be reproduced or transmitted in any form or
by any means – electronic, mechanical, photocopying, recording, or otherwise (including internet) – without the permission of Indian Institute of
Management Bangalore.
MACHINE LEARNING ALGORITHMS TO DRIVE CRM
IN THE ONLINE E-COMMERCE SITE AT VMWARE
KIRAN R, ARUNABHA MUKHOPADHYAY AND U DINESH KUMAR
For the exclusive use of M. Abouzahra, 2019.
This document is authorized for use only by Mohamed Abouzahra in 2019.
On February 25, 2016, winter had just ended at the VMWare (VMW) office in Silicon Valley, on the
sprawling 100+ acre green campus in Palo Alto, California, next to Stanford University, and the weather
was as warm as it could be in February. In his office cabin in building Hilltop E, Michael
Butler, the global head of the store business of VMW was in discussion with Parag Girish Chitalia, the
global leader for advanced analytics and data sciences. Michael and Parag were discussing how to drive
more revenues from Workstation business in the VMW store. The VMW store was the online portal of
VMW (store.vmware.com), where end-customers could purchase certain products of VMW such as
Fusion and Workstation online. The store was similar to any e-commerce site, with a home page, category
pages, product detail pages, add-to-cart pages, a checkout page, and an order confirmation page.
Fusion helped end-customers and businesses run Windows on top of Mac machines, whereas Workstation
helped customers run Mac on top of Windows machines. Since many customers would like to have both
Windows and Mac operating systems on their computers, VMW store received many visitors to its
website. Data on customers’ usage of the VMW store was collected to understand consumer behavior. With
rich behavioral data of the VMW website, Michael Butler was keen to see how the data sciences and
analytics team could be leveraged to drive further Workstation sales as it was a key product in the
competitive business environment.
ABOUT VMWare
VMware (VMW) is a software company headquartered in Palo Alto that reported revenues of USD 6.57 billion in
2015, up 9% from 2014. VMW has been one of the most profitable software companies in history, with
GAAP net income of approximately USD 1 billion in 2015. Cash flows were healthy as well, with free
cash flow of USD 1.56 billion generated in 2015 (Exhibit 1). Founded in 1998 by Stanford Professors Diane
Greene and Mendel Rosenblum, the company was headed by Pat Gelsinger in 2016 and had more than
18,000 employees worldwide. VMW has been the industry leader in virtualization business with more
than 80% market share. Virtualization is about using software to virtualize hardware – for example, the
same central processing unit (CPU) can be shared by multiple users using the VMW software.
Virtualization brings about great savings in costs to IT departments of companies and VMW has been the
industry leader by a distance in this space with market share several times that of its nearest competitors.
VMW garnered its revenues from three streams namely software defined data center (vSphere – for
computing virtualization, NSX – for software defined networking & security, VSAN – for storage
virtualization), end-user computing (Airwatch – for mobile computing, Horizon – enterprise desktop,
Fusion, Workstation), and cloud (Private cloud vCloud Air).
Michael Butler was in charge of the store (Exhibit 13) business powered by Fusion and Workstation
products. Parag had joined VMW in 2014 to set up the advanced analytics and data sciences team called
Analytics Community of Excellence under the Information Innovation Center/Enterprise Information
Management organization. The team comprised data scientists and analysts hired from premier institutes
in India such as the Indian Institute of Technology (IITs) and Indian Institute of Management (IIMs) and
from around the world such as Georgia Tech and Stanford. Ravi Kondapalli was the lead data scientist in
the data sciences innovations team powering Parag’s team. Ravi, an NIT Warangal graduate with two master’s
degrees, from Georgia Tech and IIM Bangalore, had more than 15 years of experience in the industry.
Driving Higher Workstation Revenues from the Store
The primary objective of the meeting between Parag and Michael was to discuss how Parag’s newly
formed data sciences group could assist in increasing store revenues with focus on key products starting
with Workstation. Michael started the meeting by saying:
Workstation forms the bulk of the purchases for our online store/e-commerce business for
which we have both individual consumers and businesses as our customers. Growing
revenues this year will be a challenge as there is no new version of Workstation planned.
In a software business, renewals via upgrade to a latest software version form a major
portion of the revenue and this year will be a challenge. I would like to understand how
we can leverage data sciences and advanced analytics to target new workstation
customers, up-sell to existing customers, cross-sell to customers that do not have
Workstation.
Parag shared some macro-level data on Workstation sales that Ravi, his lead data scientist in the data
sciences innovations team, had compiled. Workstation revenues had doubled in the last 8 years
(Exhibit 2) and formed a significant portion of the Overall Store Bookings (Exhibit 3). Different
versions of Workstation had been launched over the years. Workstation 6 was launched in 2007 and the
latest versions of the Workstation product were Workstation 12 and Workstation 12 Player. A significant
portion of VMW Workstation customers upgraded to higher versions of Workstation. In Exhibit 4, each
cell x_ij in the table denotes the number of customers that upgraded from the Workstation version in row i
to the Workstation version in column j. There was an opportunity in the sense that a large part of
the customer base had not yet upgraded to the latest versions of Workstation. The store was visited by
approximately 7 million visitors annually of whom approximately 2 million viewed some page related to
Workstation products. However, only around 1.6 million visitors out of the 7 million were identifiable
with an e-mail id (Exhibit 5). The visitor data contained rich clickstream/digital data that was housed in a
Hadoop big data environment that the analytics team leveraged continually for their analysis. Apart from
this visitor behavior, all previous purchases (if any) by the e-mail ids were stored in the Greenplum data
warehouse. Greenplum is a massively parallel database, owned by Pivotal, that had proven to be better
than Teradata, Oracle, and other data warehouses. Online–offline integration for the de-anonymized
visitors was possible with “e-mail id” as the common inter-linking key.
Parag made the following key points to lay the ground for a discussion on analytics engagement.
• Workstation was going to be an important driver of the overall store revenues.
• There was untapped opportunity in the form of old Workstation customers that had not yet
upgraded to the latest version of Workstation, presenting an opportunity for up-sell.
• There were a large number of visitors to the online store, including those that had bought other
store products, presenting an opportunity for cross-sell.
• The data sciences and analytics team had access to rich sets of information about customers
and potential customers, including their digital footprint (online) and their purchase
history (offline).
Being a sales leader, Michael liked the key points. He got straight to the point:
We definitely have a great Customer Relationship Management (CRM) opportunity here
in the form of up-sell, cross-sell and targeting. These present multiple challenges that I
and my management team will go into in detail. For example: While I can drive
incremental sales with coupons, I would want to give the coupons only to those customers
that are most likely to buy and not indiscriminately to all.
Can your team provide me with the list of the email ids most likely to purchase our latest
products Workstation 12 or Workstation 12 Player in the next 3 months so that my team
can target these email ids?
Parag immediately proposed a propensity model as a quick win. A propensity model rank-orders e-mail
ids or customers in decreasing order of their likelihood to purchase. His advanced analytics team, powered
by the data sciences innovations team, had delivered great results in the past using these models.
This propensity model could leverage the online and offline data for the e-mail ids and rank order them
using machine learning techniques.
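The rank-ordering step that Parag describes can be sketched as follows. This is a minimal illustration and not VMW's implementation: the scores are made-up numbers standing in for the output of a fitted model, and the function name `rank_by_propensity` and the e-mail ids are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_by_propensity(scores: Dict[str, float], top_n: int) -> List[Tuple[str, float]]:
    """Return the top_n (email_id, score) pairs, highest purchase likelihood first."""
    # Sort by descending score; break ties on the e-mail id so the order is reproducible.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical scores: predicted probability of buying Workstation in the next 3 months.
example_scores = {
    "a@example.com": 0.12,
    "b@example.com": 0.71,
    "c@example.com": 0.05,
    "d@example.com": 0.43,
}

target_list = rank_by_propensity(example_scores, top_n=2)
print(target_list)
```

The list handed to the sales team is simply the head of this ordering; how long that head should be is a separate business decision.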
Michael said:
That’s awesome Parag! I only believe things that cause an increase in my sales! If you
can create such a list, I will be happy to execute via one of the digital marketing channels
(email with coupon, re-targeting on other websites, social targeting) or by targeting on
our website.
I will believe your list only when the cash machine rings up Workstation sales and when I
can measure the upside scientifically.
Michael was a technology geek and would only believe things once they were scientifically proven. Parag
said he would get back to Michael with a propensity scored list within a couple of weeks. The presence of
Ravi in the team gave Parag the confidence to suggest two weeks.
PROPENSITY MODEL DEVELOPMENT
Through e-mail, Parag asked Ravi, the lead data scientist, to be ready with what it would take to build a
propensity model and to brainstorm the overall data sciences plan that was to be
presented to Michael. They had a detailed telephone conversation the next day.
Ravi set the baseline for the discussion:
This is an example of a binary classification problem, where the visitor either buys or
does not buy Workstation. The target variable will be if a visitor who visited the site buys
Workstation in the next few months. The value of that target can be either 0 or 1, making
it a classical binary classification problem.
Ravi went through a deck which highlighted the following challenges.
• What should be the entity on which we should build a propensity model? As Exhibit 5 shows,
only about 1.6 million out of about 7 million visitors had an e-mail id.
• What should the sampling strategy be: random sampling, time-based
sampling, or stratified random sampling?
• What data sciences and machine learning techniques should we try out in this instance?
• What cross-validation or training-validation technique should we use in order to have an estimate
of how the model would perform in the real world?
Ravi’s recommendations to Parag were as follows:
• Given that we need a quick win, we should model at the e-mail id level for the first cut. Longer term, we have to
think of analytical approaches to target those without an e-mail id.
• The validation scheme should mirror how the model will be used. Because we will be predicting future
purchases from past behavior, we should do time-based cross-validation in this instance rather than a random
split. In this method, we simulate the real world by aggregating data up to a cutoff and
then predicting for the next period.
  o For example: Say we need to predict who will buy during April–June 2016. In this instance:
    - For training, we could aggregate data up to September 2015 and predict the Workstation
      buyers during October–December 2015.
    - For validation, we could aggregate data up to December 2015 and compare the
      predictions against actual Workstation buyers during January–March 2016.
    - For scoring, we could aggregate the data up to March 2016.
• We could try any two-class classifier, from simpler models such as Naïve Bayes, Logistic Regression, or
Decision Tree to ensemble methods such as Random Forest and Gradient Boosting. We could compare
the lift curves of different models to see which one would work best.
• We could use the lift numbers on the validation set to obtain an estimate of real-world performance.
Ravi further explained the time-based cross-validation using the following conceptual diagram.
Ravi said he could build the model in a couple of weeks.
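The time-based scheme Ravi recommends can be sketched in code. The windows below use the dates from his example; everything else is an assumption made for illustration: the event tuples, the event type label "workstation_purchase", the helper names, and the single toy feature (a count of events before the cutoff), which stands in for the much richer feature set described later in the case.

```python
from datetime import date
from typing import Dict, List, Set, Tuple

Event = Tuple[str, date, str]  # (email_id, event_date, event_type), a hypothetical layout

# For each window: (feature cutoff, first day of label period, last day of label period).
WINDOWS = {
    "train": (date(2015, 9, 30), date(2015, 10, 1), date(2015, 12, 31)),
    "validate": (date(2015, 12, 31), date(2016, 1, 1), date(2016, 3, 31)),
    "score": (date(2016, 3, 31), date(2016, 4, 1), date(2016, 6, 30)),
}


def features_as_of(events: List[Event], cutoff: date) -> Dict[str, int]:
    """Toy feature: number of events per e-mail id on or before the cutoff."""
    counts: Dict[str, int] = {}
    for email_id, event_date, _event_type in events:
        if event_date <= cutoff:
            counts[email_id] = counts.get(email_id, 0) + 1
    return counts


def buyers_between(events: List[Event], start: date, end: date) -> Set[str]:
    """E-mail ids with a Workstation purchase inside the label period."""
    return {
        email_id
        for email_id, event_date, event_type in events
        if event_type == "workstation_purchase" and start <= event_date <= end
    }


def build_set(events: List[Event], window: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Features use only data up to the cutoff; labels come from the following period."""
    cutoff, start, end = WINDOWS[window]
    x = features_as_of(events, cutoff)
    buyers = buyers_between(events, start, end)
    y = {email_id: int(email_id in buyers) for email_id in x}
    return x, y


events = [
    ("a@example.com", date(2015, 8, 1), "page_view"),
    ("a@example.com", date(2015, 11, 15), "workstation_purchase"),
    ("b@example.com", date(2015, 9, 10), "page_view"),
    ("b@example.com", date(2016, 2, 2), "workstation_purchase"),
    ("c@example.com", date(2015, 12, 20), "page_view"),
]

train_x, train_y = build_set(events, "train")
valid_x, valid_y = build_set(events, "validate")
# At scoring time the label period lies in the future, so only the features are usable.
score_x = features_as_of(events, WINDOWS["score"][0])
```

Keeping the label period strictly after the feature cutoff is what prevents information from the future leaking into the features.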
DATA DESCRIPTION
In order to build a detailed propensity model, Ravi collected data from 2008 to 2016. A stratified sample
of 100,000 de-anonymized customers was used (provided in a separate spreadsheet). He aggregated data
at an e-mail id level to come up with a set of features across online and offline (Exhibit 6), which could
be used for model building. Sample training data is shown in Exhibit 7 with the variable names in
Exhibit 8.
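Aggregating online and offline records to the e-mail id level might look like the sketch below. The output feature names are taken from Exhibit 8, but the input record layouts (one clickstream row per visit, one order row per order) and the function name `aggregate_features` are assumptions made for illustration; the case does not describe the actual pipeline.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_features(clickstream: List[dict], orders: List[dict]) -> Dict[str, dict]:
    """Roll up visit-level (online) and order-level (offline) rows to one row per e-mail id."""
    features = defaultdict(
        lambda: {
            "tot_visits": 0,
            "tot_page_views": 0,
            "num_orders": 0,
            "total_bookings_amount": 0.0,
        }
    )
    # Online data: assume each clickstream row is one visit with a page-view count.
    for row in clickstream:
        f = features[row["email_id"]]
        f["tot_visits"] += 1
        f["tot_page_views"] += row["page_views"]
    # Offline data: assume each order row carries a bookings amount.
    for row in orders:
        f = features[row["email_id"]]
        f["num_orders"] += 1
        f["total_bookings_amount"] += row["amount"]
    return dict(features)


clickstream = [
    {"email_id": "a@example.com", "page_views": 5},
    {"email_id": "a@example.com", "page_views": 3},
    {"email_id": "b@example.com", "page_views": 1},
]
orders = [
    {"email_id": "a@example.com", "amount": 100.0},
    {"email_id": "a@example.com", "amount": 50.0},
]

feature_table = aggregate_features(clickstream, orders)
```

The e-mail id plays the role of the join key here, as it does for the online–offline integration described earlier in the case.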
DATA ANALYSIS
To understand which features were important, Ravi’s team examined odds ratios of the target variable
against each of the features. Odds ratio is explained in Exhibit 9. The key findings are shown in
Exhibit 10. Odds ratio greater than 1 indicates that the feature is favorable towards purchase and odds
ratio less than 1 indicates the opposite. A higher odds ratio would indicate a higher degree of favorability.
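The odds ratio can be computed directly from a 2×2 table of counts. The sketch below follows the cell labels used in Exhibit 9 (a and b for feature = 0, c and d for feature = 1, with the second cell in each pair being target = 1); the counts in the example are invented for illustration.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for a binary feature against a binary target.

    a: feature = 0, target = 0    b: feature = 0, target = 1
    c: feature = 1, target = 0    d: feature = 1, target = 1
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    # (d / c) / (b / a) rearranged to avoid intermediate divisions.
    return (d * a) / (b * c)


# Hypothetical counts: 100 of 1,000 customers without the feature bought,
# versus 200 of 500 customers with the feature.
example_or = odds_ratio(a=900, b=100, c=300, d=200)
print(example_or)
```

A value above 1, as in this example, would mark the feature as favorable towards purchase.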
OBJECTIVES
The final objective was to leverage data sciences and analytics for targeting, up-sell and cross-sell to
customers in the online store, thereby increasing customer value. The immediate need was a
propensity-to-buy model that could produce the set of top customers that Michael and his team should target.
At this point, Ravi had the following questions in mind.
• What feature selection techniques could he use?
• If he were to use the standard techniques – logistic regression or decision tree and any one
advanced technique (random forest or neural network or support vector machine or gradient
boosting…), how would the lift curves appear?
• Based on the lift curve, how should he communicate the potential opportunity from the model to
Michael?
• Could there be incremental lift or other approaches that he could adopt – for example, clustering
before classification?
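The points of a lift curve can be computed from validation-set scores and actual outcomes as sketched below. This is one common construction (equal-sized bins ordered by score), shown under assumptions: the function name `lift_table` is hypothetical, and the ten scores and labels are made up rather than drawn from the case data.

```python
from typing import Dict, List


def lift_table(scores: List[float], labels: List[int], n_bins: int = 10) -> List[Dict[str, float]]:
    """Split customers into n_bins by descending score and report lift per bin."""
    n = len(scores)
    total_buyers = sum(labels)
    if n < n_bins or total_buyers == 0:
        raise ValueError("need at least n_bins customers and at least one buyer")
    base_rate = total_buyers / n
    # Indices of customers from highest to lowest score.
    order = sorted(range(n), key=lambda i: -scores[i])
    rows = []
    captured = 0
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        buyers = sum(labels[i] for i in members)
        captured += buyers
        rate = buyers / len(members)
        rows.append({
            "bin": b + 1,
            "response_rate": rate,
            "lift": rate / base_rate,                 # bin rate relative to the overall rate
            "cum_capture": captured / total_buyers,   # share of all buyers found so far
        })
    return rows


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
table = lift_table(scores, labels, n_bins=5)
for row in table:
    print(row)
```

In this toy example the top 20% of the list has a lift above 3 and the top 40% captures every buyer, which is the kind of statement a lift curve lets an analyst make to a business audience.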
Having built several propensity models at VMW, Ravi knew that sales teams liked Whitebox models.
Whitebox models are models whose workings can be explained to the sales teams. For example:
Customer X is more likely to upgrade if the support for the older version is coming to an end OR if a
compelling newer version is being launched. Sales leaders are not comfortable with just getting a list that
works. They also want to know why the list worked. The question on Ravi’s mind was also how best to
explain the characteristics of a Workstation buyer to the business.
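One way to make a model explainable in this sense, assuming logistic regression is the model chosen, is to convert each fitted coefficient into an odds ratio: exponentiating a coefficient gives the multiplicative change in the odds of purchase for a one-unit increase in that feature, holding the other features fixed. The feature names below come from Exhibit 8, but the coefficient values are invented for illustration and do not come from any model in the case.

```python
import math
from typing import Dict

# Hypothetical fitted coefficients (log-odds per unit increase in the feature).
coefficients = {
    "tot_page_views_l90d": 0.40,
    "num_any_campaign_responses": 0.15,
    "tot_osx_visits": -0.25,
}


def coefficients_to_odds_ratios(coefs: Dict[str, float]) -> Dict[str, float]:
    """exp(beta) is the odds ratio associated with a one-unit increase in the feature."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


odds_ratios = coefficients_to_odds_ratios(coefficients)
for name, value in sorted(odds_ratios.items(), key=lambda item: -item[1]):
    direction = "raises" if value > 1 else "lowers"
    print(f"{name}: odds ratio {value:.2f} ({direction} the odds of purchase)")
```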
At the same point, Parag had a further list of questions to discuss with Ravi once the model was fully
built.
• How should Parag and Ravi arrive at the number of e-mail ids to which Michael should send e-mails?
  o Remember the e-mails were to be sent with a coupon. Sending too many could impact the
    margins.
  o Should this list be different for different marketing channels?
• How do we interpret the results for business decision making?
• While lift is an analytics or internal validation measure, what marketing intervention should he
suggest to Michael so that there can be a scientific measurement of the return on investment to
the store business from the exercise?
  o Can we conduct some form of Control–Test experiment to quantify the upside? If yes, how
    should the experiment be set up?
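A Control–Test experiment of the kind Parag has in mind could be set up along the following lines. This is a sketch under assumptions, not a prescribed design: the 20% control share, the random seed, the function names, and the buyer counts in the example are all hypothetical.

```python
import random
from typing import Dict, List


def assign_groups(email_ids: List[str], control_share: float = 0.2, seed: int = 42) -> Dict[str, List[str]]:
    """Randomly hold out a control group that receives no coupon e-mail."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(email_ids)
    rng.shuffle(shuffled)
    n_control = int(round(len(shuffled) * control_share))
    return {"control": shuffled[:n_control], "test": shuffled[n_control:]}


def incremental_effect(test_buyers: int, test_size: int,
                       control_buyers: int, control_size: int) -> Dict[str, float]:
    """Compare conversion rates; the difference is attributed to the campaign."""
    test_rate = test_buyers / test_size
    control_rate = control_buyers / control_size
    uplift = test_rate - control_rate
    return {
        "test_rate": test_rate,
        "control_rate": control_rate,
        "absolute_uplift": uplift,
        "incremental_buyers": uplift * test_size,
    }


# Hypothetical campaign: the top-scored list is split before any e-mail is sent.
top_list = [f"user{i}@example.com" for i in range(100)]
groups = assign_groups(top_list)

# Hypothetical outcome after the campaign period.
result = incremental_effect(test_buyers=240, test_size=8000,
                            control_buyers=40, control_size=2000)
print(result)
```

Because both groups come from the same scored list and differ only in whether they were contacted, the gap in conversion rates measures the campaign's effect rather than the model's ability to find likely buyers. A significance test on the two rates would normally accompany such a comparison.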
Parag was also thinking about how he should set up an executive deck to summarize the results and
measurement plan to Michael. At the same time, he was wondering about the overall value proposition
that he could drive for the VMW store using analytics and data sciences.
Exhibit 1
VMW Financials
Source: VMW Publicly available annual report: http://d1lge852tjjqow.cloudfront.net/CIK-0001124610/67b316e9-d82e-4848-ade6-e046775865be.pdf
VMW Q4 2015 Earnings Call: http://s2.q4cdn.com/112802898/files/doc_financials/2015/q4/Q4-15_earnings_w_tables_final.pdf
Exhibit 2
Workstation Revenues over the Years¹
Source: Bookings Data (masked)
¹ All numbers are directional and for illustration purposes only. The data shared is masked and only illustrative of real data. This has been done
to maintain confidentiality.
Exhibit 3
Workstation as a Proportion of Store Bookings* (masked)
Source: Bookings Data (masked)
Exhibit 4
Cross-Sell Behavior of Workstation (masked)
Source: Bookings Data (masked)
From \ To | Workstation 6 | Workstation 7 | Workstation 8 | Workstation 9 | Workstation 10 | Workstation 11 | Workstation 12 | Workstation 12 Player
Workstation 6 | 97593 | 10842 | 6604 | 5213 | 4420 | 2602 | 2179 | 109
Workstation 7 | - | 97431 | 24005 | 19858 | 15939 | 9376 | 8050 | 293
Workstation 8 | - | - | 67588 | 24326 | 21903 | 12319 | 10408 | 311
Workstation 9 | - | - | - | 65935 | 23648 | 15326 | 11683 | 373
Workstation 10 | - | - | - | - | 68998 | 18294 | 16665 | 508
Workstation 11 | - | - | - | - | - | 45851 | 13535 | 485
Workstation 12 | - | - | - | - | - | - | 41650 | 623
Workstation 12 Player | - | - | - | - | - | - | - | 5139
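The masked figures in Exhibit 4 can be loaded into a small structure to explore the up-sell opportunity, as in the sketch below. Two assumptions are made here and should be checked against the exhibit's definitions: the values in each row are read as an upper-triangular matrix starting at the row's own version, and the row total is used as a rough base even though the exhibit does not say whether a customer can appear in more than one cell.

```python
VERSIONS = [
    "Workstation 6", "Workstation 7", "Workstation 8", "Workstation 9",
    "Workstation 10", "Workstation 11", "Workstation 12", "Workstation 12 Player",
]

# Row i lists the counts for columns i..end (upper-triangular layout, assumed).
UPPER_ROWS = [
    [97593, 10842, 6604, 5213, 4420, 2602, 2179, 109],
    [97431, 24005, 19858, 15939, 9376, 8050, 293],
    [67588, 24326, 21903, 12319, 10408, 311],
    [65935, 23648, 15326, 11683, 373],
    [68998, 18294, 16665, 508],
    [45851, 13535, 485],
    [41650, 623],
    [5139],
]

matrix = {
    VERSIONS[i]: dict(zip(VERSIONS[i:], row))
    for i, row in enumerate(UPPER_ROWS)
}

LATEST = ("Workstation 12", "Workstation 12 Player")


def row_total(from_version: str) -> int:
    """Sum of all cells in the row for from_version."""
    return sum(matrix[from_version].values())


def moved_to_latest(from_version: str) -> int:
    """Customers in the row who reached Workstation 12 or Workstation 12 Player."""
    return sum(matrix[from_version].get(version, 0) for version in LATEST)


# Only the older versions are meaningful as up-sell sources.
for version in VERSIONS[:6]:
    share = moved_to_latest(version) / row_total(version)
    print(f"{version}: {share:.1%} of the row reached a latest version")
```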
Exhibit 5
De-anonymized/Anonymized Store Visitor Funnel (masked)
Source: Bookings Data (masked)
17 MM (# of Visitors to VMW Store)
5 MM (# of Visitors to Workstation in Store)
~4 MM (store.vmware.com visitors with email id)
~1.7 MM (Unique emails of Personal Desktop Buyers)
~500K (Unique emails of Workstation buyers)
Exhibit 6
List of Feature Buckets
Digital & Non-Digital Feature Engineering (Offline + Online)
Inputs: Dimension | Metrics for the Dimension
Digital Data
Store Products | Workstation, Fusion, vSphere, vCenter, vSOM, Horizon, vRealize
Event Wise | Activation, Download, Registration, Page Views, Cart Add/Remove/View, Checkout, Purchase, Form Success, Form Abandon, Buy Now etc.
Referrer Type / Marketing Channel | Internal, Paid Search, Email, Social Network, Search Engines etc.; Paid/Organic Vehicle Data
Search Engine Wise | Google, Bing, Yahoo, MSN, YOL etc.
OS/Browser Wise | OS like Android, iOS, Linux, Mobile iOS, OS X, Windows OS, Windows Mobile and Browser like Apple, Blackberry, Google, Dolphin, Microsoft, AOL etc.
De-anonymization Features | DemandBase Data, IDM Data
Non Digital Data
Revenue History
Responses/Campaign Features
Marketing Channel Share
Other Products Bought
Source: Bookings Data (masked)
Exhibit 7
Training Data
Data with 100,000 rows can be downloaded from the following link:
http://hrm.iimb.ernet.in/iimb/download/IMB_621.htm
Source: Bookings Data (masked)
Exhibit 8
Sample Feature Names
Variable | Meaning
Train_period_workstation_purchase_flag | Outcome variable (Whether the customer purchased workstation (coded as 1) or not (coded as 0))
fswk_booking_pct | Share of Fusion and Workstation bookings
total_bookings_amount | Total bookings from this customer
personal_desktop_booking_pct | Share of Personal Desktop Bookings
tot_windows_visits | Total no. of visits to vmware.com webpage from Windows OS
days_since_first_personal_desktop_purchase_date | Length of Relationship with VMW w.r.t Personal Desktop products
ftr_growth_personal_desktop_13_14 | Growth in 'Personal Desktop' product bookings from 2013 to 2014
num_orders | Total no. of lifetime orders this customer placed with VMW
num_order_lines | Total no. of lifetime order lines this customer placed with VMW
ftr_growth_personal_desktop_14_15 | Growth in 'Personal Desktop' product bookings from 2014 to 2015
idm_total_no_of_day_visits_to | Total no. of visits to MyVMware Portal (required for customers to interact with VMWare support)
ftr_growth_personal_desktop_12_13 | Growth in 'Personal Desktop' product bookings from 2012 to 2013
tot_osx_visits | Total no. of visits to vmware.com webpage from OSX OS
tot_apple_browser_visits | Total no. of visits to vmware.com webpage from Apple Safari Browser
idm_no_of_day_visits_to_home_page | Total no. of visits to MyVMware Portal Home page
tot_microsoft_browser_visits | Total no. of visits to vmware.com webpage from Microsoft Internet Explorer Browser
tot_store_page_views | Total no. of views to VMW Store Page
idm_no_of_day_visits_to_download_page | Total no. of visits to MyVMware Portal Download Page
tot_page_views | Total vmware.com page views
tot_first_touch_direct_views | Total no. of page views by marketing channel
idm_no_of_day_visits_to_info_page | Total no. of visits to MyVMware Portal Info Page
idm_no_of_day_visits_to_license_page | Total no. of visits to MyVMware Portal License Page
tot_first_touch_natural_search_views | Total no. of page views by marketing channel
gu_num_of_employees | Total no. of employees in the customer company as per DNB data
tot_google_browser_visits | Total no. of visits to vmware.com webpage from Google Chrome Browser
idm_no_of_day_visits_to_eval_page | Total no. of visits to MyVMware Portal Eval Page
tot_visits | Total vmware.com page visits
purchase_events | Total vmware.com purchase events
tot_mozilla_browser_visits | Total no. of visits to vmware.com webpage from Mozilla Firefox Browser
tot_last_touch_direct_views | Total no. of page views by marketing channel
tot_first_touch_internal_views | Total no. of page views by marketing channel
tot_page_views_l90d | Total vmware.com page views in last 90 days
ftr_growth_vsom_14_15 | Growth in 'vSOM' Bookings from 2014 to 2015
tot_last_touch_natural_search_views | Total no. of page views by marketing channel
num_any_campaign_responses | No. of responses from this customer for all VMW campaigns
tot_last_touch_internal_views | Total no. of page views by marketing channel
tot_visits_l90d | Total vmware.com visits in last 90 days
ftr_growth_enterprise_desktop_13_14 | Growth in 'Enterprise Desktop' product bookings from 2013 to 2014
Source: Data Analysis
Exhibit 9
Odds Ratio Explanation
            | Target = 0 | Target = 1
Feature = 0 | a          | b
Feature = 1 | c          | d
Odds for feature = 1 is defined as d/c
Odds for feature = 0 is defined as b/a
Odds ratio = (d/c)/(b/a) = da/bc
Source: Data Analysis
Exhibit 10
Sample Averages of Features versus Target Variable
Source: Data Analysis
Exhibit 11
Purchasers of Workstation as of End of Each Quarter from 2013 (data masked)
Quarter No. of Workstation Buyers
13Q1 2784
13Q2 2300
13Q3 3020
13Q4 4198
14Q1 2480
14Q2 2530
14Q3 1878
14Q4 3808
15Q1 2988
15Q2 2582
15Q3 3370
15Q4 4164
16Q1 2726
16Q2 2264
16Q3 2340
16Q4 1194
Source: Bookings Data (masked)
Exhibit 12
Sample Purchase Paths on E-commerce
Home Page → Product Detail Page → Cart → Checkout → Purchase
Source: VMWare
Exhibit 13
About the VMW Store
The store sells many products of which Fusion and Workstation are key to helping run Windows
on Mac and Mac on Windows, respectively. It is an e-commerce site in the truest sense and is
frequented for purchases both by consumers and businesses. The link to the store is provided
here: http://store.vmware.com/store/vmware/en_US/home
The store is a collection of pages. A sample purchase path for a user is indicated in Exhibit 12.
This is by no means the only path and there could be several paths but is shown to indicate how
the visitors purchase on the site.
Source: VMWare