10  Validating PX-files via API

10.1 The pxweb package

The pxweb package has functions to work with PX-web APIs. It works with any PX-web database that has the API feature enabled. Below we collect a table from the Ghanaian StatsBank.

library(pxweb)
library(tidyverse)

from_api <- pxweb_get(url = "https://statsbank.statsghana.gov.gh:443/api/v1/en/PHC 2021 StatsBank/Education and Literacy/attended_table.px",
                      query = list(
                        Highest_level_of_education = c("*"),
                        Geographic_Area = c("Ghana"),
                        Locality = c("*"),
                        Sex = c("*"),
                        Age = c("All ages", "15-17 years", "18 years & older"))) %>% 
  as.data.frame() %>% 
  as_tibble() # as_tibble for nicer presentation

head(from_api) %>% 
  kableExtra::kable()
Highest_level_of_education Geographic_Area Locality Sex Age Population (15 years and older)
Total Ghana All Locality Types Both sexes All ages 11289655
Total Ghana All Locality Types Both sexes 15-17 years 187272
Total Ghana All Locality Types Both sexes 18 years & older 11102383
Total Ghana All Locality Types Male All ages 5834212
Total Ghana All Locality Types Male 15-17 years 88760
Total Ghana All Locality Types Male 18 years & older 5745452

If we want to update this table with data for a new year, we can use data from the API to validate that our data for the new year looks as expected. Now we create some new random data from the dataset for a new year.

generate random data
set.seed(42); new_year <- from_api %>% 
  mutate(`Population (15 years and older)_new_year` = sample(`Population (15 years and older)`),
         .keep = "unused")

Now we can join the data together and check for developments.

pct_change <- from_api %>% 
  left_join(new_year,
            by = join_by(Highest_level_of_education, 
                         Geographic_Area, Locality, 
                         Sex, Age)) %>% 
  mutate(pct_change = (`Population (15 years and older)_new_year` - `Population (15 years and older)`)/
           `Population (15 years and older)`*100)

Now we can use this new column pct_change to see if we have some unexpectedly high changes (above 10%), which might indicate problems in our code generating the data.

if(max(abs(pct_change$pct_change), na.rm = TRUE) > 10){
    warning("There are changes above 10% for the new year. This may indicate problems with the data")
  pct_change %>% 
    filter(abs(pct_change) > 10) %>% 
    head(n = 20) %>% 
    kableExtra::kable()
} else {
  cat("No changes above 10%")
}
Warning: There are changes above 10% for the new year. This may indicate
problems with the data
Highest_level_of_education Geographic_Area Locality Sex Age Population (15 years and older) Population (15 years and older)_new_year pct_change
Total Ghana All Locality Types Both sexes All ages 11289655 569 -99.99496
Total Ghana All Locality Types Both sexes 15-17 years 187272 153601 -17.97973
Total Ghana All Locality Types Both sexes 18 years & older 11102383 173400 -98.43817
Total Ghana All Locality Types Male All ages 5834212 195 -99.99666
Total Ghana All Locality Types Male 15-17 years 88760 49698 -44.00856
Total Ghana All Locality Types Male 18 years & older 5745452 0 -100.00000
Total Ghana All Locality Types Female All ages 5455443 19340 -99.64549
Total Ghana All Locality Types Female 15-17 years 98512 4527 -95.40462
Total Ghana All Locality Types Female 18 years & older 5356931 43518 -99.18763
Total Ghana Rural Both sexes All ages 3876215 184216 -95.24753
Total Ghana Rural Both sexes 15-17 years 100607 3707120 3584.75355
Total Ghana Rural Both sexes 18 years & older 3775608 815026 -78.41338
Total Ghana Rural Male All ages 2087334 0 -100.00000
Total Ghana Rural Male 15-17 years 49002 39338 -19.72164
Total Ghana Rural Male 18 years & older 2038332 2487408 22.03154
Total Ghana Rural Female All ages 1788881 88854 -95.03298
Total Ghana Rural Female 15-17 years 51605 86665 67.93915
Total Ghana Rural Female 18 years & older 1737276 0 -100.00000
Total Ghana Urban Both sexes All ages 7413440 117606 -98.41361
Total Ghana Urban Both sexes 15-17 years 86665 16277 -81.21849

We see that we have some extremely high changes. For example, the value for the total across the variables has gone from 11,289,655 to 569. This clearly indicates a mistake. In this case, the mistake stems from our random data generation. We can instead do it in a more controlled process.

Controlled data generation
set.seed(42); new_year <- from_api %>% 
  mutate(`Population (15 years and older)_new_year` = 
           `Population (15 years and older)` * runif(n(), 0.9, 1.1),
         .keep = "unused")
pct_change <- from_api %>% 
  left_join(new_year,
            by = join_by(Highest_level_of_education, 
                         Geographic_Area, Locality, 
                         Sex, Age)) %>% 
  mutate(pct_change = (`Population (15 years and older)_new_year` - `Population (15 years and older)`)/
           `Population (15 years and older)`*100)

if(max(abs(pct_change$pct_change), na.rm = TRUE) > 10){
  pct_change %>% 
    filter(abs(pct_change) > 10) %>% 
    head(n = 20) %>% 
    kableExtra::kable()
  stop("There are changes above 10% for the new year. This may indicate problems with the data")
} else {
  cat("No changes above 10%")
}
No changes above 10%

Now we have no year-to-year changes above 10% (of course, because our data generation process set a limit of 10%). This limit of 10% is not set in stone and should be adjusted according to the content of the data. For example, some countries have seen inflation rates above 10%, so in that case the relevant number might be different.

Here we checked using data from the PX-web API, also to showcase the API. If the PX-file is available on disk, it may be easier to just load that and then do some validation checks.

The R package validate is specifically focused on tools and functions for validating data. For more information on data validation, see The Data Validation Cookbook, which introduces the R package validate and data validation concepts and implementations in R in general.

10.2 Check cross-sums

Usually in a Statbank/PX-web database, we have multiple tables covering one topic, e.g., multiple tables to cover the theme of population statistics. Often these tables will share some variables (sex and age would be prime examples), and therefore we expect values to be the same across tables.

We have our data collected from the Ghanaian StatsBank about education statistics for persons 15 years and older who attended school in the past.

We have another table about Population (3 years and older) by School Attendance Status, District, Region, Type of Locality, Age and Sex, and here we would expect that it shares some values with our table from earlier if we filter school attendance status to “attended in past” and age to over 15 years. Now we collect this data using the pxweb package.

check_table <- pxweb_get(url = "https://statsbank.statsghana.gov.gh:443/api/v1/en/PHC 2021 StatsBank/Education and Literacy/sch_attend_stat_table.px",
                      query = list(
                        Schoolattendancestatus = c("Attended in the past"),
                        Geographic_Area = c("Ghana"),
                        Locality = c("*"),
                        Sex = c("*"),
                        Age = c("15-19", "20-24", "25-29", "30-34",
                                "35-39", "40-44", "45-49", "50-54",
                                "55-59", "60-64", "65-69", "70-74",
                                "75-79", "80-84", "85-89", "90-94",
                                "95-99", "100+"))) %>% 
  as.data.frame() %>% 
  mutate(Locality = if_else(Locality == "All locality types", 
                            "All Locality Types",
                            Locality)) %>% 
  as_tibble() # as_tibble for nicer presentation

head(check_table) %>% 
  kableExtra::kable()
Schoolattendancestatus Geographic_Area Locality Sex Age Population (3 years and older)
Attended in the past Ghana All Locality Types Both sexes 15-19 542151
Attended in the past Ghana All Locality Types Both sexes 20-24 1661537
Attended in the past Ghana All Locality Types Both sexes 25-29 1923249
Attended in the past Ghana All Locality Types Both sexes 30-34 1725033
Attended in the past Ghana All Locality Types Both sexes 35-39 1448599
Attended in the past Ghana All Locality Types Both sexes 40-44 1059990

This table we want to check uses another age definition, so we just need to sum up all the age groups to have persons over the age of 15.

check_table_above_15 <- check_table %>% 
  summarise(`Population (3 years and older)` = sum(`Population (3 years and older)`),
            .by = c(Schoolattendancestatus, Geographic_Area, Locality, Sex))

Now we have the values for persons aged 15 and over, which would correspond to the value for all ages in our initial table. Let’s select the corresponding variables between the tables and join them together.

cross_check <- from_api %>%
  filter(Age == "All ages" & Highest_level_of_education == "Total") %>% 
  select(Geographic_Area, Locality, Sex, `Population (15 years and older)`) %>% 
  ## Joining data together
  left_join(check_table_above_15 %>% 
              select(Geographic_Area, Locality, Sex, `Population (3 years and older)`),
            by = join_by(Geographic_Area, Locality, Sex)) %>% 
  ## Calculating difference
  mutate(diff = `Population (15 years and older)` - `Population (3 years and older)`)

Now we have joined the two tables together and calculated a difference variable that hopefully should be 0.

unique(cross_check$diff)
[1] 0

Which it is. The values shared across the two tables are the same, which is good as it indicates a consistent methodology in the two tables. This could also be a way to check a newly generated PX-table if you know that you have a PX-table in your database that shares some of the values.