11  Large PX-files and data

11.1 Size matters

PX-files can contain a lot of data, which is great. However, it also means that at some point in the process you will have to handle that large amount of data. The chapter Importing data gives some tips on handling large datasets with the haven package's n_max and col_select arguments. The same arguments are also available for csv-files, e.g. in readr::read_csv().
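As a minimal sketch, assuming a hypothetical file large_file.csv with columns region, year and value, reading only a subset of rows and columns could look like this:

library(readr)

# Read only the first 1,000 rows and three selected columns
small_part <- read_csv("large_file.csv",
                       n_max = 1000,
                       col_select = c(region, year, value))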

11.1.1 Parquet files and rds-files

An efficient file format in terms of file size is the parquet format. To exemplify this, we retrieve a large PX-table from Statistics Sweden's Statistical Database. We have found a large demography table and import it through the functions in the pxweb package, which helps with the PXWeb API. Afterwards, we use pxmake's px() function to turn it into a PX-object in R.

Get PX-table from API
library(pxmake)
library(tidyverse)
library(pxweb)

file_api <- pxweb_get_data(url = "https://api.scb.se/OV0104/v1/doris/en/ssd/START/LE/LE0105/LE0105C/LE0105Demogr02", query = '{
  "query": [
    {
      "code": "Hushallsstallning",
      "selection": {
        "filter": "item",
        "values": [
          "ABarn",
          "aensf",
          "Aensm",
          "Asamm",
          "Agift",
          "Abild",
          "Aovri",
          "Apers"
        ]
      }
    },
    {
      "code": "Fodelseregion",
      "selection": {
        "filter": "item",
        "values": [
          "200",
          "020",
          "030",
          "040",
          "050",
          "010",
          "100"
        ]
      }
    },
    {
      "code": "Vistelsetid",
      "selection": {
        "filter": "item",
        "values": [
          "TOT",
          "INRF",
          "0-3",
          "4-9",
          "10-",
          "US"
        ]
      }
    },
    {
      "code": "Alder",
      "selection": {
        "filter": "vs:ÅlderLE0105fem",
        "values": [
          "-4",
          "5-9",
          "10-14",
          "15-19",
          "20-24",
          "25-29",
          "30-34",
          "35-39",
          "40-44",
          "45-49",
          "50-54",
          "55-59",
          "60-64",
          "65-69",
          "70-74",
          "75-79",
          "80-84",
          "85-89",
          "90-94",
          "95-99",
          "100+"
        ]
      }
    },
    {
      "code": "ContentsCode",
      "selection": {
        "filter": "item",
        "values": [
          "000004SM"
        ]
      }
    },
    {
      "code": "Tid",
      "selection": {
        "filter": "item",
        "values": [
          "2011",
          "2012",
          "2013",
          "2014",
          "2015",
          "2016",
          "2017",
          "2018",
          "2019",
          "2020",
          "2021",
          "2022",
          "2023"
        ]
      }
    }
  ],
  "response": {
    "format": "px"
  }
}')

file_api2 <- file_api %>% 
  pivot_longer(cols = everything())

# getting one of the two files collected in batches as px
px_api <- px(file_api2$value[[1]])

# combine the data from all batch files into one PX-object
px_from_api <- map(file_api2$value, px) %>% 
  map(px_data) %>% 
  bind_rows() %>% 
  px_data(px_api, .)

Let’s just view the data.

px_from_api %>% 
  px_data()
# A tibble: 275,184 × 8
   Hushallsstallning Fodelseregion Vistelsetid Kon   Alder ContentsCode Tid  
   <chr>             <chr>         <chr>       <chr> <chr> <chr>        <chr>
 1 ABarn             200           TOT         TOT2  -4    000004SM     2011 
 2 ABarn             200           TOT         TOT2  -4    000004SM     2012 
 3 ABarn             200           TOT         TOT2  -4    000004SM     2013 
 4 ABarn             200           TOT         TOT2  -4    000004SM     2014 
 5 ABarn             200           TOT         TOT2  -4    000004SM     2015 
 6 ABarn             200           TOT         TOT2  -4    000004SM     2016 
 7 ABarn             200           TOT         TOT2  -4    000004SM     2017 
 8 ABarn             200           TOT         TOT2  5-9   000004SM     2011 
 9 ABarn             200           TOT         TOT2  5-9   000004SM     2012 
10 ABarn             200           TOT         TOT2  5-9   000004SM     2013 
# ℹ 275,174 more rows
# ℹ 1 more variable: figures_ <dbl>

The dataset has 275,184 rows, seven explanatory variables, and one frequency variable.

The px_save() function also allows saving the data as Excel, rds or parquet. This can be useful in different situations. The Excel format is not very efficient for storing large datasets in terms of file size, so it is not the focus of this section. However, it can be useful in other cases; see for example the chapter about Multiple languages in PX-files.

Using the px_save() function, we can easily save the data in both rds and parquet format via the data_path argument.

px_save(px_from_api, "px_api.R", 
        data_path = "px_api.rds")

px_save(px_from_api, "px_api2.R",
        data_path = "px_api.parquet")

Note that when using the data_path argument to save to parquet or rds, we also save an R-script. The R-script contains the code to create the PX-object associated with the dataset.

The rds-format is R's own data format and is fairly quick to read and write from R. However, the alternative, parquet files, can sometimes be even better in terms of read/write speed and file size.
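As a small illustration, assuming the px_api.rds and px_api.parquet files written above exist in the working directory, we can read the data back and compare the file sizes:

library(arrow)

# Read the data back from the two formats
data_rds     <- readRDS("px_api.rds")
data_parquet <- read_parquet("px_api.parquet")

# Compare file sizes (in bytes)
file.size("px_api.rds")
file.size("px_api.parquet")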

In the case above, the dataset, even though it has many rows, is not necessarily a large dataset. A dataset can easily have millions of rows if it, for example, covers population data for a whole country.

11.2 Splitting data and efficient packages

When working with datasets containing millions or tens of millions of observations, traditional R approaches can quickly become memory-intensive and computationally slow.

Here we can instead use two strategies: splitting the dataset into smaller chunks and using packages specifically designed for working with large datasets in R. The chunks could for instance be one per year, or a similar meaningful division of the data, so that your code can be run chunk by chunk, as sketched below.
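A minimal sketch of the chunking idea, assuming a data frame big_data with the year variable Tid and a hypothetical function process_chunk() standing in for your actual computations:

library(dplyr)
library(purrr)

# Split the data into one chunk per year and process each chunk separately
# process_chunk() is a placeholder for your own computations
result <- big_data %>%
  group_split(Tid) %>%
  map(process_chunk) %>%
  bind_rows()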

Furthermore, it is advisable to take advantage of R packages with a specific focus on working with large data. Some of the most common ones are data.table, dplyr with dtplyr, arrow, and vroom.
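As an example of such a package, here is a minimal sketch using arrow, assuming the px_api.parquet file saved earlier and the column names from the dataset above; the data is only read into memory when collect() is called.

library(arrow)
library(dplyr)

# Open the parquet file as an arrow Dataset without loading it into memory
ds <- open_dataset("px_api.parquet")

# Filter and aggregate before collecting the result into R
ds %>%
  filter(Tid == "2023") %>%
  group_by(Alder) %>%
  summarise(total = sum(figures_)) %>%
  collect()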

Sometimes we have a large external file, like a big Stata dataset, which can be time-consuming for R to read. We can then use the arguments skip and n_max to read the data in chunks, combine them, and write the result to an rds-file so the data is easier to read into R in the future.

The code below does just that: it reads a large Stata file in chunks of 100,000 observations (the chunk size can be adjusted) and saves a combined rds-file.

library(haven)
library(data.table)

# Set chunks to write to and initial values
chunks <- list()
skip_rows <- 0
i <- 1

# Read the file in chunks until an empty or short chunk signals the end
repeat {
  chunk <- read_stata("large_set.dta", 
                      skip = skip_rows, 
                      n_max = 100000)
  if (nrow(chunk) == 0) break
  
  chunks[[i]] <- chunk
  skip_rows <- skip_rows + 100000
  i <- i + 1
  
  if (nrow(chunk) < 100000) break
}

# Combine and save
combined_data <- rbindlist(chunks)
saveRDS(combined_data, "data_test.rds")
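If we would rather have the combined data as a parquet file, a sketch could be (assuming the arrow package is installed):

library(arrow)

# Save the combined data as a parquet file instead of rds
write_parquet(combined_data, "data_test.parquet")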