3 Importing data
3.1 Packages for data import
R has packages for importing almost any data format. R also has its own data format, “.rds”: datasets are written with saveRDS() and read with readRDS(). However, you will rarely find data saved in rds format on a webpage where you can download it. If you are in doubt about which package to use, ask your favorite chatbot for help. The sections below go through the most common data formats.
3.1.1 Delimited files (readr)
The readr package, which is part of the tidyverse, provides functions for reading CSV files and other delimited files. The following code shows the most commonly used functions from the readr package. Note the difference between read_csv() (comma delimiter) and read_csv2() (semicolon delimiter, with a comma as the decimal mark).
library(readr)
# Read CSV files, comma separation
data <- read_csv("file.csv")
# Read files with semicolon separation (common in European data)
data <- read_csv2("file.csv")
# Read tab-separated files
data <- read_tsv("file.tsv")
# Read files with custom delimiters, here a pipe: "|"
data <- read_delim("file.txt", delim = "|")
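As a small illustration of the difference between read_csv() and read_csv2(), the sketch below writes the same two rows in both conventions and reads them back. The file contents and the object names (comma_file, semicolon_file) are made up for this example; the files live in a temporary folder.

```r
library(readr)

# Hypothetical sample files, written to a temporary folder for illustration
comma_file <- tempfile(fileext = ".csv")
semicolon_file <- tempfile(fileext = ".csv")
writeLines(c("id,price", "1,2.5", "2,3.75"), comma_file)
writeLines(c("id;price", "1;2,5", "2;3,75"), semicolon_file)

# Comma as delimiter, period as decimal mark
data_comma <- read_csv(comma_file)

# Semicolon as delimiter, comma as decimal mark
data_semicolon <- read_csv2(semicolon_file)

# Compare the numeric columns read from the two files
all.equal(data_comma$price, data_semicolon$price)
```

If a semicolon-separated file is read with read_csv() by mistake, each row typically ends up in a single column, which is a quick way to spot the wrong function.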
The readr functions automatically detect column types and handle common issues like missing values. You can specify column types explicitly using the col_types argument:
data <- read_csv("file.csv",
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    date = col_date(),
    value = col_double()
  ))
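To try the col_types argument without a real file, the following sketch creates a small CSV in a temporary folder and reads it with an explicit specification. The column names mirror the example above, while the rows themselves are invented. It also shows the compact string form of col_types, where each letter stands for one column.

```r
library(readr)

# Hypothetical sample file with the same columns as the example above
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,name,date,value",
             "1,Anna,2024-01-15,10.5",
             "2,Ben,2024-02-01,NA"), csv_file)

# Explicit specification, one entry per column
data <- read_csv(csv_file,
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    date = col_date(),
    value = col_double()
  ))

# Show the column specification that was used
spec(data)

# Compact form: i = integer, c = character, D = date, d = double
data_compact <- read_csv(csv_file, col_types = "icDd")
```

Spelling out the types is mostly useful when the automatic guess would be wrong, for example an id column that should stay an integer or a code column that should stay text.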
3.1.2 Excel files (readxl)
For Excel files (.xlsx and .xls), use the readxl package, which is also part of the tidyverse ecosystem. There are a few alternatives as well, such as the xlsx and openxlsx packages. If you have to write data to Excel or load formatted sheets, openxlsx in particular is worth a look, but for reading simple Excel files, readxl should do the trick.
library(readxl)
# Read the first sheet
data <- read_excel("file.xlsx")
# Read a specific sheet by name or number
data <- read_excel("file.xlsx", sheet = "Sheet2")
data <- read_excel("file.xlsx", sheet = 2)
# Read a specific range
data <- read_excel("file.xlsx", range = "A1:D10")
# Skip rows (useful for files with headers or metadata)
data <- read_excel("file.xlsx", skip = 3)
You can also list all sheet names in an Excel file:
excel_sheets("file.xlsx")
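Listing the sheet names is handy when you want to read every sheet of a workbook in one go. The sketch below does this for datasets.xlsx, an example workbook that ships with readxl and is located with readxl_example(). The object names sheet_names and all_sheets are chosen for this illustration.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# Names of all sheets in the workbook
sheet_names <- excel_sheets(path)

# Read every sheet into a list of data frames, one element per sheet
all_sheets <- lapply(sheet_names, function(s) read_excel(path, sheet = s))
names(all_sheets) <- sheet_names

# Access a single sheet by its name
first_sheet <- all_sheets[[sheet_names[1]]]
```

Keeping the sheets in a named list avoids creating one object per sheet by hand.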
3.1.3 Statistical software formats (haven)
The haven package imports data from SPSS, Stata, and SAS.
library(haven)
# Stata files
data <- read_dta("file.dta")
# SAS files
data <- read_sas("file.sas7bdat")
# SPSS files
data <- read_sav("file.sav")
These functions preserve variable labels and value labels from the original statistical software. Datasets in these formats can also be time-consuming to import. To save time, first use the n_max argument to import only a small part of the dataset and inspect its columns. Afterwards, use the col_select argument to read only the necessary columns. The following code shows an example of this workflow.
# Reading only first observation to inspect columns of the data
data_obs1 <- read_dta("file.dta", n_max = 1)
# Reading the necessary columns
data <- read_dta("file.dta", col_select = c("id", "age", "sex", "education", "income", "district"))
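The same two-step workflow can be tried end to end on a small, made-up Stata file. In the sketch below, example_data and its values are invented, and write_dta() from haven is used only to create a temporary .dta file to read back.

```r
library(haven)

# Hypothetical dataset, saved as a temporary Stata file for illustration
dta_file <- tempfile(fileext = ".dta")
example_data <- data.frame(
  id = 1:3,
  age = c(34, 51, 27),
  income = c(42000, 58000, 31000),
  district = c("North", "South", "North")
)
write_dta(example_data, dta_file)

# Step 1: read a single row to see which columns exist
data_obs1 <- read_dta(dta_file, n_max = 1)
names(data_obs1)

# Step 2: read all rows, but only the columns that are needed
data <- read_dta(dta_file, col_select = c("id", "income"))
```

On a file this small the gain is negligible; the pattern pays off when the file has many columns and only a few are needed.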
3.1.4 JSON files (jsonlite)
JSON (JavaScript Object Notation) is increasingly common for data exchange, especially from APIs. The jsonlite package handles JSON data efficiently.
library(jsonlite)
# Read JSON from file
data <- fromJSON("file.json")
# Read JSON from URL
data <- fromJSON("https://api.example.com/data.json")
# Convert R object to JSON
json_string <- toJSON(data, pretty = TRUE)
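fromJSON() also accepts a JSON string directly, which makes it easy to see how JSON is turned into R objects. The sketch below uses an invented JSON array (json_text) and shows the default simplification to a data frame, the unsimplified list form, and a round trip through toJSON().

```r
library(jsonlite)

# Hypothetical JSON text: an array of records with the same fields
json_text <- '[
  {"id": 1, "name": "Anna", "score": 8.5},
  {"id": 2, "name": "Ben",  "score": 7.0}
]'

# By default, an array of records is simplified to a data frame
data <- fromJSON(json_text)

# simplifyVector = FALSE keeps the nested list structure instead
data_list <- fromJSON(json_text, simplifyVector = FALSE)

# Convert back to JSON and read it again
json_string <- toJSON(data, pretty = TRUE)
data_roundtrip <- fromJSON(json_string)
```

The list form is often easier to work with when the records are deeply nested or do not share the same fields.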
3.1.5 Large files (data.table and vroom)
For very large files, specialized packages offer better performance, especially the data.table and vroom packages.
library(data.table)
library(vroom)
# data.table's fast reader
data <- fread("large_file.csv")
# vroom for very fast reading
data <- vroom("large_file.csv")
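With large files it often helps to read only the columns you need. Both readers support this: fread() through its select argument and vroom() through col_select. The sketch below uses a generated temporary file in place of a real large file, with invented column names (id, value, group).

```r
library(data.table)
library(vroom)

# Hypothetical stand-in for a large file, written to a temporary folder
large_file <- tempfile(fileext = ".csv")
example_data <- data.frame(
  id = 1:1000,
  value = rnorm(1000),
  group = sample(letters, 1000, replace = TRUE)
)
write.csv(example_data, large_file, row.names = FALSE)

# data.table: keep only the listed columns
data_dt <- fread(large_file, select = c("id", "value"))

# vroom: same idea, using the col_select argument
data_vroom <- vroom(large_file, delim = ",", col_select = c(id, value))
```

Note that fread() returns a data.table while vroom() returns a tibble, so downstream code may differ slightly depending on which reader is used.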
3.2 Reproducible paths and R projects
One of the most common issues when sharing R code or moving projects between computers is broken file paths. Using R Projects and the here package creates reproducible workflows that work across different operating systems and directory structures. For this reason, it is recommended to have an R project when working on your code.
3.2.1 R Projects
R Projects (.Rproj files) create a self-contained workspace for your analysis. When opening an R Project, RStudio automatically sets the working directory to the project folder. This means you can use relative paths that work regardless of where the project is stored on your computer. Furthermore, you can export your folder structure and send it to another person, who will be able to run the same code through the R Project.
To create an R Project:
- File → New Project → New Directory → New Project
- Give your project a name and choose a location
- RStudio will create a .Rproj file and set up the folder structure
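The effect of the project's working directory can be imitated in a few lines of base R. In this sketch a temporary folder plays the role of the project folder, and setwd() stands in for what RStudio does when the project is opened. The folder name my_project and the file contents are made up.

```r
# A temporary folder stands in for the project folder (hypothetical layout)
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:3),
          file.path(project_dir, "data", "file.csv"),
          row.names = FALSE)

# Opening an R Project sets the working directory to the project folder;
# setwd() imitates that step here
old_wd <- setwd(project_dir)

# The relative path contains nothing that is specific to this computer
data <- read.csv(file.path("data", "file.csv"))

# Restore the previous working directory
setwd(old_wd)
```

In a real project you would not call setwd() yourself; opening the .Rproj file takes care of it.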
3.2.2 The here package
The here package builds file paths relative to your project root, so it fits naturally with an R Project. This makes your code more portable and robust.
library(here)
library(readr)
# Instead of this (brittle, won't work on other computers):
data <- read_csv("C:/Users/YourName/Documents/my_project/data/file.csv")
# Use this (portable, works anywhere):
data <- read_csv(here("data", "file.csv"))
# The here() function builds the complete path
here("data", "file.csv")
# Returns: "/path/to/your/project/data/file.csv"
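One reason here() is more robust than plain relative paths is that it does not follow later changes of the working directory. The sketch below builds the same path before and after a call to setwd() and compares the two. The names path_before and path_after are chosen for this illustration, and the sketch assumes here has not been re-anchored with here::i_am() in between.

```r
library(here)

# Path built from the project root that here detected when it was loaded
path_before <- here("data", "file.csv")

# Change the working directory, as a script or a colleague's setup might
old_wd <- setwd(tempdir())

# here() still points at the same project root
path_after <- here("data", "file.csv")

# Restore the previous working directory
setwd(old_wd)

identical(path_before, path_after)
```

A plain relative path such as "data/file.csv" would resolve to a different location after the setwd() call, which is the situation here is designed to avoid.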