Justin SJ Lee

Pivot tables with R

slowcel

For the data analyst, Excel is undoubtedly the tool for doing quick and dirty calculations. However, recently I’ve been running into some ugly situations with analyzing large files in Excel. Trying to open a 1GB file (~1 million rows) on my 32-bit Excel leads to extremely sluggish performance and after Ctrl + Shift + Del-ing out a couple times, I was starting to lose my sanity.

Thankfully, we have R.

This post is to demonstrate how to quickly create pivot tables in R with the dplyr and tidyr packages. The content is designed for anyone who wants to learn R and for me to reference later on.

For the examples below, we will use data on a telecom’s customers. The dataset can be found on IBM Watson Analytics Sample Data. The dataset contains:

We start off by loading the data

df <- read.csv(url("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv"))

Thankfully this data is very clean, so we can start analyzing right away. A common question from the telecom business might be about the breakdown of customers by the products/services they have.

To answer this, we pivot by the PhoneService column and count the rows.

library(dplyr)

df %>%
  group_by(PhoneService) %>%
  summarise(counts = n()) 
PhoneService counts
No 682
Yes 6361

First we have to load the dplyr package, which gives us functions like group_by, summarise and n. If you haven’t installed it already, you can run install.packages("dplyr"). Once loaded, we can use the functions that allow us to group by and sum up the counts for our dataset.

A natural follow-up question might be about the monthly revenue from phone subscribers/non-subscribers.

df %>%
  group_by(PhoneService) %>%
  summarise(counts = n(), monthly.revenue = sum(MonthlyCharges))   
PhoneService counts monthly.revenue
No 682 28663.5
Yes 6361 427453.1

We can sum up any quantitative column by our grouping column by using the sum() function within the summarise() function as shown above. Notice that I named the new column as monthly.revenue when calculating the sum of monthly charges; Otherwise it will just show up as sum(MonthlyCharges). The same concept applies to the counts column.

Counts are useful, but a proportions are obviously easier to interpret.

df %>%
  group_by(PhoneService) %>%
  summarise(counts = n(),
            monthly.revenue = sum(MonthlyCharges)) %>%
  mutate(count.prop = counts/sum(counts),
         mr.prop = monthly.revenue/sum(monthly.revenue))   
PhoneService counts monthly.revenue count.prop mr.prop
No 682 28663 0.0968337 0.0628425
Yes 6361 427453 0.9031663 0.9371575

mutate allows us to create new calculations from the data that is being passed down by the %>%. The pipe funnels down data by grouping it, summarising (summing counts and revenue) to our mutate function. The equation for proportion is from division of counts in that row by the column total of counts.

We can group by multiple columns too. Let’s see the breakdown of customers with multiple lines as well. We’ll round the proportions and multiply by 100 for a cleaner look.

df %>%
  group_by(PhoneService,
           MultipleLines) %>%
  summarise(counts = n(),
            monthly.revenue = sum(MonthlyCharges)) %>%
  mutate(count.prop = counts/sum(counts),
         mr.prop = monthly.revenue/sum(monthly.revenue)) %>%
  mutate(count.prop = round(count.prop*100),
         mr.prop = round(mr.prop*100))  
PhoneService MultipleLines counts monthly.revenue count.prop mr.prop
No No phone service 682 28663.5 100 100
Yes No 3390 183721.2 53 43
Yes Yes 2971 243731.9 47 57

The secondary breakdown shows that out of phone service customers, 47% have multiple lines. We can also use filters to only see relevant information. Filtering for PhoneServce == "Yes" passes down only rows with that meet the condition to the group_by function.

df %>%
  filter(PhoneService == "Yes") %>%
  group_by(PhoneService,
           MultipleLines) %>%
  summarise(counts = n(),
            monthly.revenue = sum(MonthlyCharges)) %>%
  mutate(count.prop = counts/sum(counts),
         mr.prop = monthly.revenue/sum(monthly.revenue))       
PhoneService MultipleLines counts monthly.revenue count.prop mr.prop
Yes No 3390 183721.2 0.5329351 0.4298045
Yes Yes 2971 243731.9 0.4670649 0.5701955

Pivotting on another variable via the columns is a common exercise in excel. Let’s breakdown the sum of monthly revenue by contract type

library(tidyr)

df %>%
  filter(PhoneService == "Yes") %>%
  group_by(PhoneService, MultipleLines, Contract) %>%
  summarise(monthly.revenue = sum(MonthlyCharges)) %>%
  spread(key = Contract, value = monthly.revenue)
PhoneService MultipleLines Month-to-month One year Two year
Yes No 117968.9 37266.85 28485.55
Yes Yes 125480.6 52018.55 66232.65

The average revenue from each customer may be of interest too. We simply replace the sum() function with the mean() to compute monthly revenue averages.

df %>%
  filter(PhoneService == "Yes") %>%
  group_by(PhoneService, MultipleLines, Contract) %>%
  summarise(avg.monthly.revenue = round(mean(MonthlyCharges),1))%>%
  spread(key = Contract, value = avg.monthly.revenue)
PhoneService MultipleLines Month-to-month One year Two year
Yes No 58.5 52.3 43.1
Yes Yes 84.7 84.4 75.9

With the functions above, pivot tables with data too big for Excel should be a breeze. You can read more on dplyr functionality in the documentation.

Of course R has more capabilities than just slicing and dicing data.

GiS

What else can we do with R?

One is wrangling the data in the desired format. Since our services (Phone, Internet, TV) are encoded in separate dummy columns for each service, we can’t a breakdown for all services.

PhoneService InternetService StreamingTV Services
No DSL No Internet
Yes No No Phone
Yes DSL No Phone / Internet
No DSL No Internet
Yes Fiber optic No Phone / Internet
Yes Fiber optic Yes Phone / Internet / TV

By using the paste and ifelse functions we can create a new column called Services which gives us one column for all three subscriptions.

df %>%
  mutate(Services = ifelse(PhoneService == "Yes",
                           "Phone/ ", "")) %>%
  mutate(Services = ifelse(InternetService %in% c("DSL",
  "Fiber optic"), paste(Services, "Internet ", "/"), Services)) %>%
  mutate(Services = ifelse(StreamingTV == "Yes",
                       paste(Services, "TV", "/"), Services)) %>%
  group_by(Services, Contract) %>%
  summarise(avg.monthly.revenue = round(mean(MonthlyCharges), 1)) %>%
  spread(key = Contract, value = avg.monthly.revenue)
Services Month-to-month One year Two year
Internet / 32.0 38.2 43.0
Internet / TV / 47.1 52.3 57.4
Phone/ 20.4 20.8 21.8
Phone/ Internet / 69.6 71.3 74.7
Phone/ Internet / TV / 91.4 93.6 95.7

The first mutate function checks the PhoneService column and fills the Services column with Phone when appropriate. The following mutate functions does the same with InternetService, but pastes on to the Services which already has phone in it. The last funtion works the same way for StreamingTV and we end up with one column that breaks down the customer base by all services. As expected, the highest value customers are those with all three services on 2-year contracts at $95.7/month.

At this point, you may be wondering how do I easily export these numbers into Excel or Powerpoint? The standard way of exporting data out of R is using write.csv

PS_Breakdown <- df %>%
  group_by(PhoneService) %>%
  summarise(counts = n())
  
write.csv(PS_Breakdown, "PS_Breakdown.csv") 

However, if you have excel already open, you can just CTRL + V into a open workbook by copying the table on to your clipboard like this.

write.table(PS_Breakdown, "clipboard", sep = "\t", row.names = FALSE)

Another tip to share when working with large files in R is to save a Rda copy of the data especially when you are coming back to the file to do more analysis in R because loading csvs can be slow for large (+1GB) files. For example, the 1GB csv I was working with was no more than 35MB in Rda.

saveRDS(PS_Breakdown, "PS_Breakdown.Rda")

Too big for R?

Since running R on your computer locally processes data in memory, the amount of RAM on your computer is a limiting factor. I have 8GB on my laptop and loading anything more than 5GB slows things down significantly.

My solution so far has been to use VMs with more cores and RAM. RStudio Server Pro is available on all the major cloud platforms (GCP, AWS, Azure).