PX-files can contain a lot of data, which is great. However, it also means that at some point in the process you will have to handle this large amount of data. The chapter Importing data gives some tips on handling large datasets via functions in the haven package with the n_max and col_select arguments. The same arguments are also available for csv-files, e.g. in the function readr::read_csv().
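As a minimal sketch, the calls below read only the first rows, or only selected columns, of a file. The file names and column names are hypothetical:

library(haven)
library(readr)

# Read only the first 1,000 rows of a (hypothetical) Stata file
first_rows <- read_dta("large_set.dta", n_max = 1000)

# Read only two (hypothetical) columns
some_cols <- read_dta("large_set.dta", col_select = c(Kon, Tid))

# The same arguments work for csv-files
first_csv <- read_csv("large_file.csv", n_max = 1000)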
11.1.1 Parquet files and rds-files
An efficient file format in terms of file size is the parquet format. To exemplify this we retrieve a large PX-table from Statistics Sweden’s Statistical Database. We found a large demography table and imported it through the functions in the pxweb package, which helps with the PX-Web API. Afterwards, we use pxmake’s px() function to turn it into a PX-object in R.
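A minimal sketch of that retrieval step is shown below. The table URL and query are placeholders, not the exact ones used for the table above:

library(pxweb)
library(pxmake)

# Hypothetical table URL and query -- replace with a real table from
# Statistics Sweden's PX-Web API
px_data <- pxweb_get(
  url   = "https://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0101/HushallT08",
  query = pxweb_query(list(
    "ContentsCode" = c("*"),
    "Tid"          = c("*")
  ))
)

df <- as.data.frame(px_data)  # flatten the API response to a data frame
x  <- px(df)                  # create a PX-object from the data frame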
# A tibble: 275,184 × 8
   Hushallsstallning Fodelseregion Vistelsetid Kon   Alder ContentsCode Tid
   <chr>             <chr>         <chr>       <chr> <chr> <chr>        <chr>
 1 ABarn             200           TOT         TOT2  -4    000004SM     2011
 2 ABarn             200           TOT         TOT2  -4    000004SM     2012
 3 ABarn             200           TOT         TOT2  -4    000004SM     2013
 4 ABarn             200           TOT         TOT2  -4    000004SM     2014
 5 ABarn             200           TOT         TOT2  -4    000004SM     2015
 6 ABarn             200           TOT         TOT2  -4    000004SM     2016
 7 ABarn             200           TOT         TOT2  -4    000004SM     2017
 8 ABarn             200           TOT         TOT2  5-9   000004SM     2011
 9 ABarn             200           TOT         TOT2  5-9   000004SM     2012
10 ABarn             200           TOT         TOT2  5-9   000004SM     2013
# ℹ 275,174 more rows
# ℹ 1 more variable: figures_ <dbl>
The dataset has 275,184 rows, seven explanatory variables, and one frequency variable (figures_).
The px_save() function also allows saving the data as Excel, rds or parquet, which can be useful in different situations. The Excel format is not very efficient for storing large datasets in terms of file size, so it is not the focus of this section. It can, however, be useful in other cases; see for example the chapter about Multiple languages in PX-files.
Using the px_save() function, we can easily save the data in both rds and parquet format via the data_path argument.
Note that when using the data_path argument to save to parquet or rds, we also save an R-script. The R-script contains the code to create the PX-object associated with the dataset.
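A minimal sketch, assuming the PX-object x created above and hypothetical file names. The exact combination of path and data_path may depend on the pxmake version; here we assume the metadata goes to an Excel workbook while the data itself is stored separately via data_path:

library(pxmake)

# Metadata in an Excel workbook, data in a separate rds-file
px_save(x, path = "demography.xlsx", data_path = "demography.rds")

# Same, but with the data stored as parquet
px_save(x, path = "demography.xlsx", data_path = "demography.parquet")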
The rds format is R’s own data format and is fairly quick for reading and writing files from R. However, the alternative, parquet, can sometimes be even better in terms of read/write speed and file size.
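A quick way to compare the two, assuming the (hypothetical) files saved above exist on disk:

library(arrow)

# Read the two files back and compare their sizes on disk
df_rds     <- readRDS("demography.rds")
df_parquet <- read_parquet("demography.parquet")

file.size("demography.rds")
file.size("demography.parquet")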
In the case above the dataset, despite its many rows, is not necessarily large. A dataset can reach millions of rows if it, for example, covers population data for an entire country.
11.2 Splitting data and efficient packages
When working with datasets containing millions or tens of millions of observations, traditional R approaches can quickly become memory-intensive and computationally slow.
Here we can instead use two strategies: splitting the dataset into smaller chunks and using packages specifically designed for working with large datasets in R. The chunks could, for instance, correspond to individual years or a similar meaningful division of the data, so that the code runs on one manageable piece at a time, as in the sketch below.
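A minimal sketch, assuming the data is available as a data frame df with a year column Tid, as in the table above:

# Split the data by year and process one chunk at a time
chunks <- split(df, df$Tid)

results <- lapply(chunks, function(chunk) {
  # Placeholder computation: replace with the actual per-year processing
  nrow(chunk)
})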
Furthermore, it is advisable to take advantage of R packages with a specific focus on working with large data. Some of the most common ones are data.table, dplyr together with dtplyr, arrow, and vroom.
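For example, a data.table aggregation over the demography data might look like this. This is a sketch assuming the data frame df and the column names from the table above:

library(data.table)

dt <- as.data.table(df)

# Sum the frequency variable by year and sex
dt[, .(total = sum(figures_)), by = .(Tid, Kon)]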
Sometimes we have a large external file, such as a big Stata dataset, which can be time-consuming for R to read. We can then use the arguments skip and n_max to read the data in chunks, combine them, and write the result to an rds-file so that the data is easier to read into R in the future.
The code below does just that: it reads a large Stata file in chunks of 100,000 observations (the chunk size can be adjusted) and saves a combined rds-file.
library(haven)
library(data.table)

# Set chunks to write to and initial values
chunks <- list()
skip_rows <- 0
i <- 1

repeat {
  chunk <- read_stata("large_set.dta", skip = skip_rows, n_max = 100000)
  if (nrow(chunk) == 0) break
  chunks[[i]] <- chunk
  skip_rows <- skip_rows + 100000
  i <- i + 1
  if (nrow(chunk) < 100000) break
}

# Combine and save
combined_data <- rbindlist(chunks)
saveRDS(combined_data, "data_test.rds")
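Reading the combined dataset back into R later is then a single fast call:

combined_data <- readRDS("data_test.rds")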