How to Do Exploratory Data Analysis (EDA) in R (With Examples)

Exploratory data analysis (EDA) uses a variety of statistical and visualization techniques to understand the underlying patterns and trends in a dataset. Thorough EDA is a crucial first step that provides the foundation for any subsequent data analysis or modeling tasks.

In this comprehensive guide, you will learn how to effectively carry out exploratory data analysis in R by leveraging various packages and functions to summarize, visualize, and analyze both small and large datasets.

Prerequisites

To follow along with the examples demonstrated in this guide, you will need:

  • R and RStudio installed on your system
  • Basic knowledge of R syntax for working with data frames
  • Comfort loading external R packages

While not essential, some prior experience with R will make the concepts covered easier to follow.

Packages for EDA

R has many excellent open source packages purpose-built for exploratory data analysis tasks. We will utilize some of the most popular and useful ones:

tidyverse – Includes core data manipulation (dplyr) and plotting (ggplot2) packages
skimr – Concise statistical data summaries
forecast – Time series analysis and forecasting
tsibble – Temporal data structures for tidy time series analysis
lubridate – Working with dates and date-times

Let's load these packages to get started:

library(tidyverse)
library(skimr) 
library(forecast)
library(tsibble)
library(lubridate)

Descriptive Statistics and Summaries

We always like to start EDA by getting a broad overview of the data we are working with. A few helpful functions provide this:

str() – structure of the dataset (base R)
summary() – summary statistics such as the mean and percentiles (base R)
glimpse() – concise overview of column names and data types (from dplyr, loaded with the tidyverse)

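For example, we might run these on R's built-in mtcars dataset:

str(mtcars)
summary(mtcars)
glimpse(mtcars)
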
The skimr package offers an improved summary displaying key statistics in a neatly formatted data frame:

skim(data)

This quickly allows us to check:

  • Total rows and columns
  • Data types of each column
  • Any missing values
  • Summary metrics like mean, SD, quantiles

With this initial glimpse, we can also create tables and basic visualizations that further summarize features of the data.

Histograms, density plots, and boxplots are common summaries for numeric columns. For categorical data, bar charts showing the frequency distribution can be insightful.
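
As a quick sketch with the mpg dataset bundled with ggplot2:

# Histogram of a numeric column
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 30)

# Bar chart of a categorical column's frequency distribution
ggplot(mpg, aes(x = class)) +
  geom_bar()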

Data Visualization

Visualizing data through plots is an integral piece of EDA that enables us to more easily identify patterns and trends. R offers many options through the ggplot2 package to create both basic and customizable publication-quality graphics.

Useful plots for initial investigations include:

Scatterplots – Assess relationships between two continuous variables. Can identify positive, negative, or more complex correlations in the data.

Histograms/Density Plots – Visualize distribution of a single continuous variable. Assess center, spread, skewness.

Boxplots – Great for comparing distribution differences across groups and spotting outliers.

Time Series Plots – Line plots to assess trends over time. Check for seasonality.

Correlation Plot – Heatmap summary of correlations between all numeric variables. Quickly identify strongly related variables in the dataset.

Faceting – Subset plot into panels based on a categorical variable. Helps find group differences.

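To sketch a few of these with ggplot2's bundled mpg dataset:

# Scatterplot of two continuous variables
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Boxplots comparing a distribution across groups
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Facet the scatterplot into panels by drive type
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv)
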
In addition to tailoring plots to different data types, iterating on each plot by customizing aesthetics, changing axis scales, highlighting points, zooming in on areas, and filtering to subsets allows new insights to emerge.
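
For instance, we might iterate on a basic scatterplot by switching to a log scale, zooming in, and highlighting a subset (again using mpg):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "grey60") +
  geom_point(data = filter(mpg, class == "2seater"), colour = "red") +  # highlight a subset
  scale_y_log10() +                      # change the axis scale
  coord_cartesian(xlim = c(1.5, 5))      # zoom in without dropping data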

Time Series Analysis

For datasets involving observations over time, we can extend our EDA process with time series specific analysis techniques.

Useful time-oriented plots include (see the sketch after this list):

  • Line charts with trends and smoothing
  • Time-based faceting into periods
  • Overlaying meaningful events

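A minimal sketch using the economics dataset bundled with ggplot2:

# Line chart with a smoothed trend overlaid
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  geom_smooth(se = FALSE)

# Facet into yearly panels to compare recent periods
economics %>%
  mutate(year = year(date)) %>%
  filter(year >= 2010) %>%
  ggplot(aes(x = date, y = unemploy)) +
  geom_line() +
  facet_wrap(~ year, scales = "free_x")
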
We may want to check whether the time series is stationary, meaning its statistical properties such as the mean and variance stay constant over time. The ndiffs() function from forecast estimates how many differences are needed to reach stationarity (0 suggests the series already appears stationary under the chosen unit-root test):

ndiffs(data$var1, test = "adf")

Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots help uncover daily, weekly, monthly or other cyclical patterns:

Acf(data$var1, lag.max = 100) 
Pacf(data$var1)

Grouping data into time windows with functions like tsibble::index_by() or lubridate::floor_date() can facilitate lag analysis.
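
For example, rolling the monthly economics series up to yearly averages:

# With tsibble: convert, then group on a coarser index
economics %>%
  as_tsibble(index = date) %>%
  index_by(year = year(date)) %>%
  summarise(mean_unemploy = mean(unemploy))

# The same idea with lubridate::floor_date() and dplyr
economics %>%
  mutate(year = floor_date(date, "year")) %>%
  group_by(year) %>%
  summarise(mean_unemploy = mean(unemploy))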

Advanced Analysis Methods

In addition to traditional descriptive techniques, we may incorporate some more advanced, flexible data mining procedures for deeper EDA:

Dimensionality Reduction – Methods like principal component analysis (PCA) and t-SNE analyze multivariate data and output simplified 2D projections capturing most of its variance structure. Can reveal clustered groups and outliers.

Anomaly Detection – Unsupervised techniques to identify unusual or fraudulent behavior using distance/density metrics.

Association Rules – Discover interesting connections and relationships between variables in transactional data.

Let's demonstrate PCA in R:

# PCA on a numeric data frame, scaling each variable to unit variance
pca_result <- prcomp(df, scale. = TRUE)

# Plot the observations on the first two principal components
as.data.frame(pca_result$x) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point()

We can then visualize data points projected onto the first two principal components.
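
For anomaly detection, one simple distance-based sketch flags observations far from the multivariate centre using the Mahalanobis distance (df here is the same hypothetical data frame used above):

# Flag rows whose Mahalanobis distance exceeds a chi-squared cutoff
num_df <- df %>% select(where(is.numeric)) %>% drop_na()
md <- mahalanobis(num_df, center = colMeans(num_df), cov = cov(num_df))
cutoff <- qchisq(0.99, df = ncol(num_df))   # 99th-percentile threshold
outliers <- num_df[md > cutoff, ]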

The appropriate approach depends on the data characteristics and analysis objectives. But applying both traditional and modern methods expands our exploratory toolkit.

Putting It All Together

To tie together the wide variety of EDA techniques discussed, let's walk through an example analysis flow with the nycflights13::flights dataset, which contains on-time data for all flights departing New York City airports in 2013:

# 1) Load data (the nycflights13 package must be installed)
flights <- nycflights13::flights

# 2) High-level overview
str(flights)
summary(flights)
skim(flights)

# 3) Visualize key relationships
plot1 <- flights %>%
  ggplot(aes(x = air_time, y = distance)) +
  geom_point()
plot1

# 4) Time-specific handling: aggregate to a daily series
daily_delay <- flights %>%
  mutate(date = as_date(time_hour)) %>%
  group_by(date) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  as_tsibble(index = date)

# 5) Statistical modeling: linear trend with weekly seasonality, then forecast
delay_ts <- ts(daily_delay$mean_arr_delay, frequency = 7)
fit <- tslm(delay_ts ~ trend + season)
forecast(fit, h = 14)

By combining both simple and sophisticated techniques that analyze different aspects of a dataset, exploratory data analysis ultimately enables us to develop an intimate understanding of the data that then seeds downstream analysis and modeling work.

Conclusion

Exploratory data analysis represents a crucial initial step in the data science lifecycle. Leveraging statistical summaries, customizable plots, and sophisticated analysis procedures allows us to rigorously investigate, visualize, transform, and model the patterns within small to extremely large datasets.

This guide covered key packages, essential methods, and best practices for EDA in R. With the analytical toolkit developed here, you should feel equipped to start asking and answering interesting questions for your own data!