Pivot tables with R
For the data analyst, Excel is undoubtedly the tool for doing quick and dirty calculations. However, recently I’ve been running into some ugly situations with analyzing large files in Excel. Trying to open a 1GB file (~1 million rows) on my 32-bit Excel leads to extremely sluggish performance and after Ctrl + Shift + Del-ing out a couple times, I was starting to lose my sanity.
Thankfully, we have R.
This post is to demonstrate how to quickly create pivot tables in R with the
tidyr packages. The content is designed for anyone who wants to learn R and for me to reference later on.
For the examples below, we will use data on a telecom’s customers. The dataset can be found on IBM Watson Analytics Sample Data. The dataset contains:
- A Unique customer ID
- Churn status - Customers who left within the last month
- Services that each customer has signed up for - phone, internet, tv, etc…
- Accounting details - tenure, contract, payment method, invoice amounts, etc…
- Demographic info - gender, age range, and if they have partners/dependents
We start off by loading the data
df <- read.csv(url("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv"))
Thankfully this data is very clean, so we can start analyzing right away. A common question from the telecom business might be about the breakdown of customers by the products/services they have.
To answer this, we pivot by the PhoneService column and count the rows.
library(dplyr) df %>% group_by(PhoneService) %>% summarise(counts = n())
First we have to load the
dplyr package, which gives us functions like
n. If you haven’t installed it already, you can run
install.packages("dplyr"). Once loaded, we can use the functions that allow us to group by and sum up the counts for our dataset.
A natural follow-up question might be about the monthly revenue from phone subscribers/non-subscribers.
df %>% group_by(PhoneService) %>% summarise(counts = n(), monthly.revenue = sum(MonthlyCharges))
We can sum up any quantitative column by our grouping column by using the
sum() function within the
summarise() function as shown above. Notice that I named the new column as
monthly.revenue when calculating the sum of monthly charges; Otherwise it will just show up as
sum(MonthlyCharges). The same concept applies to the
Counts are useful, but a proportions are obviously easier to interpret.
df %>% group_by(PhoneService) %>% summarise(counts = n(), monthly.revenue = sum(MonthlyCharges)) %>% mutate(count.prop = counts/sum(counts), mr.prop = monthly.revenue/sum(monthly.revenue))
mutate allows us to create new calculations from the data that is being passed down by the
%>%. The pipe funnels down data by grouping it, summarising (summing counts and revenue) to our mutate function. The equation for proportion is from division of counts in that row by the column total of counts.
We can group by multiple columns too. Let’s see the breakdown of customers with multiple lines as well. We’ll round the proportions and multiply by 100 for a cleaner look.
df %>% group_by(PhoneService, MultipleLines) %>% summarise(counts = n(), monthly.revenue = sum(MonthlyCharges)) %>% mutate(count.prop = counts/sum(counts), mr.prop = monthly.revenue/sum(monthly.revenue)) %>% mutate(count.prop = round(count.prop*100), mr.prop = round(mr.prop*100))
|No||No phone service||682||28663.5||100||100|
The secondary breakdown shows that out of phone service customers, 47% have multiple lines. We can also use filters to only see relevant information. Filtering for
PhoneServce == "Yes" passes down only rows with that meet the condition to the
df %>% filter(PhoneService == "Yes") %>% group_by(PhoneService, MultipleLines) %>% summarise(counts = n(), monthly.revenue = sum(MonthlyCharges)) %>% mutate(count.prop = counts/sum(counts), mr.prop = monthly.revenue/sum(monthly.revenue))
Pivotting on another variable via the columns is a common exercise in excel. Let’s breakdown the sum of monthly revenue by contract type
library(tidyr) df %>% filter(PhoneService == "Yes") %>% group_by(PhoneService, MultipleLines, Contract) %>% summarise(monthly.revenue = sum(MonthlyCharges)) %>% spread(key = Contract, value = monthly.revenue)
|PhoneService||MultipleLines||Month-to-month||One year||Two year|
The average revenue from each customer may be of interest too. We simply replace the
sum() function with the
mean() to compute monthly revenue averages.
df %>% filter(PhoneService == "Yes") %>% group_by(PhoneService, MultipleLines, Contract) %>% summarise(avg.monthly.revenue = round(mean(MonthlyCharges),1))%>% spread(key = Contract, value = avg.monthly.revenue)
|PhoneService||MultipleLines||Month-to-month||One year||Two year|
With the functions above, pivot tables with data too big for Excel should be a breeze. You can read more on
dplyr functionality in the documentation.
Of course R has more capabilities than just slicing and dicing data.
What else can we do with R?
One is wrangling the data in the desired format. Since our services (Phone, Internet, TV) are encoded in separate dummy columns for each service, we can’t a breakdown for all services.
|Yes||DSL||No||Phone / Internet|
|Yes||Fiber optic||No||Phone / Internet|
|Yes||Fiber optic||Yes||Phone / Internet / TV|
By using the
ifelse functions we can create a new column called
Services which gives us one column for all three subscriptions.
df %>% mutate(Services = ifelse(PhoneService == "Yes", "Phone/ ", "")) %>% mutate(Services = ifelse(InternetService %in% c("DSL", "Fiber optic"), paste(Services, "Internet ", "/"), Services)) %>% mutate(Services = ifelse(StreamingTV == "Yes", paste(Services, "TV", "/"), Services)) %>% group_by(Services, Contract) %>% summarise(avg.monthly.revenue = round(mean(MonthlyCharges), 1)) %>% spread(key = Contract, value = avg.monthly.revenue)
|Services||Month-to-month||One year||Two year|
|Internet / TV /||47.1||52.3||57.4|
|Phone/ Internet /||69.6||71.3||74.7|
|Phone/ Internet / TV /||91.4||93.6||95.7|
mutate function checks the
PhoneService column and fills the
Services column with Phone when appropriate. The following
mutate functions does the same with
InternetService, but pastes on to the
Services which already has phone in it. The last funtion works the same way for
StreamingTV and we end up with one column that breaks down the customer base by all services. As expected, the highest value customers are those with all three services on 2-year contracts at $95.7/month.
At this point, you may be wondering how do I easily export these numbers into Excel or Powerpoint? The standard way of exporting data out of R is using
PS_Breakdown <- df %>% group_by(PhoneService) %>% summarise(counts = n()) write.csv(PS_Breakdown, "PS_Breakdown.csv")
However, if you have excel already open, you can just CTRL + V into a open workbook by copying the table on to your clipboard like this.
write.table(PS_Breakdown, "clipboard", sep = "\t", row.names = FALSE)
Another tip to share when working with large files in R is to save a
Rda copy of the data especially when you are coming back to the file to do more analysis in R because loading csvs can be slow for large (+1GB) files. For example, the 1GB
csv I was working with was no more than 35MB in
Of course the above is just the tip of the iceberg with R’s capabilities. A major strengths of R that I didn’t go into are the statistical modelling and machine learning models that can be readily used.
Too big for R?
Since running R on your computer locally processes data in memory, the amount of RAM on your computer is a limiting factor. I have 8GB on my laptop and loading anything more than 5GB slows things down significantly.