🎉 Bedrock Data is now Formstack Sync! · Learn More 🎉

Jeremy Martin | July 27, 2018

Using R to Analyze Your Customer Data Warehouse

While SQL is one of the most widely used programming languages for querying datasets from PostgresSQL to SQL Server, R is quickly rising the top for analysts and data scientists looking to enrich their statistical repertoires. Once the language of tweedy academics, pollster pundits and scholars, the rise of data-driven marketing and sales operations has swiftly forayed R into the startup and enterprise worlds. Owing to its massive suite of libraries and visualizations, boasting over 10,000 packages on CRAN alone, R is just as powerful, if not more, as any leading BI tool when it comes to exploring correlations and causal relationships in your customer datasets. As an IDE, RStudio's GUI is also perfect for querying datasets in a syntax not dissimilar to SQL and Python.

And with a unified dataset like Fusion, in R you can analyze:

  • How to send the right message to the right customers at the right time
  • Breakdown active and non-active users, demographics, and geography
  • Train models, cluster, and extrapolate predictive patterns with machine learning algorithms like “KNN” (k-nearest neighbor) through packages like caret.

rstudio-workbench

In this article, we’ll go over how to connect your Fusion MySQL database to R/RStudio. Then we'll give you a few pointers on what kinds of commands to run on your MySQL database to get started with data splitting, model tuning, regressions, classification, and more. Note that this post assumes you've:

If you haven’t, go ahead and do so. Now let's get started!

 

Connecting R to MySQL

Fusion is a relational SQL data warehouse hosted on AWS on Amazon RDS. Thanks to its MySQL architecture, Fusion provides total relational database support for structured columnar stores and comprehensive support for the SQL language. But because Fusion is an acceptable data source for any service that accepts MySQL, it also works great for data scientists who love working with R.

To connect any MySQL database to R, you'll first want to install the RMySQL package with the install.packages() function and load the RMySQL library using the library() function, as shown:

 

These two commands allow R to speak to your relational database management system (R/DBMS) through the DBI dependency.

 

Connecting Fusion to R

Now that you have the RMySQL library installed, the next step is to connect to your Fusion MySQL database. If you’ve yet to create a Fusion account, you can do so here and add any connectors by watching this video.

 

If you have a Fusion account already, just login to Fusion and copy or download your credentials, which should look something like this:

 

Then, as you would in any BI tool, you would enter your credentials, only using the dbConnect() function. For example: 

> mydb = dbConnect(MySQL(), user='e4bab23d80487eb8', password='x', dbname='e4bab23d80487eb8', host='prod-freedom-mysql.csxkpi7oalo6.us-east-1.rds.amazonaws.com') 

To check that you're connected, I like to list the tables in my database with dbListTables(), like so: 

> dbListTables(mydb) 

Which should return a list of all the tables — fused, raw, and links — in your Fusion MySQL database:

  

In my case, this is 500+ tables, so I know everything is connected. And likewise, if this is your first time navigating Fusion, you'll likely be interested to see the fields for a specific table, which I've done using dbListFields(mydb, 'fused_contact'):

 

Now that we know how to connect Fusion to R and explore our tables and fields, the first order of business for a marketing or sales analyst is to use their data to conjure a good communication strategy that will increase Long Term Value (LTV) in specific customer segments.

 

Customer Segmentation in R with a Unified Dataset

Customer segmentation is the act of grouping customers by characteristics so you may better serve their needs.

One way to segment customers in R by using your Fusion MySQL database is by looking at a set of customers':

  • Time since last purchase
  • Purchase density
  • Product preference
  • Lifetime monetary value

So if, for example, I may want to measure customers whose last purchase was 365 days ago or less before I analyze their purchasing behavior and partitioning customers into the types, number, and monetary value of products they purchased within this time frame.

Let's have a look at each in detail.

 

Time Since Last Purchase

How long has it been since a specific cohort of customers last bought a product, placed an order, or subscribed to services within a given time frame?

The fused_order table is a great source of data for purchase information, as it gives you access to fields like order_id, currency, amount, and occurred_at.  

fused_order

The fused_activity table will also key me into fields like type and "deal_id", which may help me segment customers by what they purchased.

Likewise, the fused_opportunity is going to give me insight into statuses like "is_won", "is_closed", "stage", "last_activity_date" , and  "product_family".    

The power of a unified dataset is that you can now segment customers in categories such as New, Active, Lapsed, or Lost based on contact, activity, and opportunity data — from multiple data sources, but only one dataset.

 

Purchase density using K-means:

How many orders have they placed within this given period of time? Are they a repeat customer?

The flexclust function's framework for k-centroids, a class of algorithms used for partitioning cluster analysis, supports distance measures and K-points/centroid computation. To visualize your cluster results, you may use neighborhood graphs, convex cluster hulls, barcharts of centroids, or any bootstrap methods for the analysis of cluster stability and cluster by K-Means, K-Medians, or Neural Gas.

While there are thousands packages in the R library, the tidyverse, cluster and factoextra packages are useful for data manipulation, clustering algorithms, and visualization.

K-means

The tidyverse package works in harmony because they share common data representations and API designs; cluster contains methods for cluster analysis; factoextra lets you extract and visualize the results of multivariate data analyses, including 'PCA' (Principal Component Analysis), 'CA' (Correspondence Analysis), 'MCA' (Multiple Correspondence Analysis), 'FAMD' (Factor Analysis of Mixed Data), 'MFA' (Multiple Factor Analysis) and 'HMFA' (Hierarchical Multiple Factor Analysis) functions from different R packages. It contains also functions for simplifying some clustering analysis steps and provides some truly remarkable ggplot2 visualizations.

ggplot_masterlist_49

Product Preference & Machine Learning

Cluster analysis can also identify a competitive sets of products, or groups of assets whose prices co-move, or for geo-demographic segmentation like the percentages of voters for a certain candidate in any given precinct, and their relation to surrounding precincts.

Cluster Analysis

For customer segmentation, you can analyze which products certain customers did or did not buy in a given time frame, either based on SKU, product_id, or any other such measure.

To find a relationship between these segments, you could use cor.test() function, which computes the correlation between two continuous variables — say, customer type and product_id — in order test if y is dependent on x, or if the null hypothesis is null.

Because the k-means clustering algorithm works by finding like groups based on a measure of distance or similarity, you could select k groups such as customer segments to cluster and let the algorithm find the best centroids for the k groups to determine which factors group members relate. For customers, these would be their buying preferences. 

This machine learning algorithm for partitioning your MySQL dataset classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). Because Fusion already sorts your data into Fused Objects — Activity, Opportunity, Order, Contact, Company, Ticket, and so forth — it's a great candidate for measuring product preference in tandem with another independent variable.

Product tech graphics-09

Lifetime Monetary Value & Repurchase Rate

Customer Lifetime Value (CLV) is the present value of the future cash flows attributed to the customer during his/her entire relationship with the company. In other words, you would be seeking to compare the total dollar amount of all orders during a given time period, such as how long they've been a customer with you, and predicting future purchases predicated on past behaviors.

Probability Purchase

Exploring the relationships between Repurchase Rate and Recency, Frequency, and Monetary Value, you'll want to calculate the number of customers grouped by Recency values, then sub-categorize them into Buy / No Buy buckets according to purchasing cycle time to get the percentage of customers who repurchased with a certain Recency value. 

The ddply() function will help you split your data frame into these segments and return results. Just be sure to specify a time frame, with a start and end date, to bookend your Recency values.

Screen Shot 2018-07-27 at 3.20.11 PM-1

Using R with your fused dataset is a great way to segment customers into profitable industries, verticals, events, and more. The good thing is Fusion is free and easy to try. We invite you to start your free trial today.

 

 

Subscribe Here!