The pxweb package has functions to work with PX-web APIs. It works with any PX-web database that has the API feature enabled. Below we collect a table from the Ghanaian StatsBank.
library(pxweb)library(tidyverse)from_api <-pxweb_get(url ="https://statsbank.statsghana.gov.gh:443/api/v1/en/PHC 2021 StatsBank/Education and Literacy/attended_table.px",query =list(Highest_level_of_education =c("*"),Geographic_Area =c("Ghana"),Locality =c("*"),Sex =c("*"),Age =c("All ages", "15-17 years", "18 years & older"))) %>%as.data.frame() %>%as_tibble() # as_tibble for nicer presentationhead(from_api) %>% kableExtra::kable()
Highest_level_of_education
Geographic_Area
Locality
Sex
Age
Population (15 years and older)
Total
Ghana
All Locality Types
Both sexes
All ages
11289655
Total
Ghana
All Locality Types
Both sexes
15-17 years
187272
Total
Ghana
All Locality Types
Both sexes
18 years & older
11102383
Total
Ghana
All Locality Types
Male
All ages
5834212
Total
Ghana
All Locality Types
Male
15-17 years
88760
Total
Ghana
All Locality Types
Male
18 years & older
5745452
If we want to update this table with data for a new year, we can use data from the API to validate that our data for the new year looks as expected. Now we create some new random data from the dataset for a new year.
generate random data
set.seed(42); new_year <- from_api %>%mutate(`Population (15 years and older)_new_year`=sample(`Population (15 years and older)`),.keep ="unused")
Now we can join the data together and check for developments.
pct_change <- from_api %>%left_join(new_year,by =join_by(Highest_level_of_education, Geographic_Area, Locality, Sex, Age)) %>%mutate(pct_change = (`Population (15 years and older)_new_year`-`Population (15 years and older)`)/`Population (15 years and older)`*100)
Now we can use this new column pct_change to see if we have some unexpectedly high changes (above 10%), which might indicate problems in our code generating the data.
if(max(abs(pct_change$pct_change), na.rm =TRUE) >10){warning("There are changes above 10% for the new year. This may indicate problems with the data") pct_change %>%filter(abs(pct_change) >10) %>%head(n =20) %>% kableExtra::kable()} else {cat("No changes above 10%")}
Warning: There are changes above 10% for the new year. This may indicate
problems with the data
Highest_level_of_education
Geographic_Area
Locality
Sex
Age
Population (15 years and older)
Population (15 years and older)_new_year
pct_change
Total
Ghana
All Locality Types
Both sexes
All ages
11289655
569
-99.99496
Total
Ghana
All Locality Types
Both sexes
15-17 years
187272
153601
-17.97973
Total
Ghana
All Locality Types
Both sexes
18 years & older
11102383
173400
-98.43817
Total
Ghana
All Locality Types
Male
All ages
5834212
195
-99.99666
Total
Ghana
All Locality Types
Male
15-17 years
88760
49698
-44.00856
Total
Ghana
All Locality Types
Male
18 years & older
5745452
0
-100.00000
Total
Ghana
All Locality Types
Female
All ages
5455443
19340
-99.64549
Total
Ghana
All Locality Types
Female
15-17 years
98512
4527
-95.40462
Total
Ghana
All Locality Types
Female
18 years & older
5356931
43518
-99.18763
Total
Ghana
Rural
Both sexes
All ages
3876215
184216
-95.24753
Total
Ghana
Rural
Both sexes
15-17 years
100607
3707120
3584.75355
Total
Ghana
Rural
Both sexes
18 years & older
3775608
815026
-78.41338
Total
Ghana
Rural
Male
All ages
2087334
0
-100.00000
Total
Ghana
Rural
Male
15-17 years
49002
39338
-19.72164
Total
Ghana
Rural
Male
18 years & older
2038332
2487408
22.03154
Total
Ghana
Rural
Female
All ages
1788881
88854
-95.03298
Total
Ghana
Rural
Female
15-17 years
51605
86665
67.93915
Total
Ghana
Rural
Female
18 years & older
1737276
0
-100.00000
Total
Ghana
Urban
Both sexes
All ages
7413440
117606
-98.41361
Total
Ghana
Urban
Both sexes
15-17 years
86665
16277
-81.21849
We see that we have some extremely high changes. For example, the value for the total across the variables has gone from 11,289,655 to 569. This clearly indicates a mistake. In this case, the mistake stems from our random data generation. We can instead do it in a more controlled process.
Controlled data generation
set.seed(42); new_year <- from_api %>%mutate(`Population (15 years and older)_new_year`=`Population (15 years and older)`*runif(n(), 0.9, 1.1),.keep ="unused")
pct_change <- from_api %>%left_join(new_year,by =join_by(Highest_level_of_education, Geographic_Area, Locality, Sex, Age)) %>%mutate(pct_change = (`Population (15 years and older)_new_year`-`Population (15 years and older)`)/`Population (15 years and older)`*100)if(max(abs(pct_change$pct_change), na.rm =TRUE) >10){ pct_change %>%filter(abs(pct_change) >10) %>%head(n =20) %>% kableExtra::kable()stop("There are changes above 10% for the new year. This may indicate problems with the data")} else {cat("No changes above 10%")}
No changes above 10%
Now we have no year-to-year changes above 10% (of course, because our data generation process set a limit of 10%). This limit of 10% is not set in stone and should be adjusted according to the content of the data. For example, some countries have seen inflation rates above 10%, so in that case the relevant number might be different.
Here we checked using data from the PX-web API, also to showcase the API. If the PX-file is available on disk, it may be easier to just load that and then do some validation checks.
The R package validate is specifically focused on tools and functions for validating data. For more information on data validation, see The Data Validation Cookbook, which introduces the R package validate and data validation concepts and implementations in R in general.
10.2 Check cross-sums
Usually in a Statbank/PX-web database, we have multiple tables covering one topic, e.g., multiple tables to cover the theme of population statistics. Often these tables will share some variables (sex and age would be prime examples), and therefore we expect values to be the same across tables.
We have our data collected from the Ghanaian StatsBank about education statistics for persons 15 years and older who attended school in the past.
check_table <-pxweb_get(url ="https://statsbank.statsghana.gov.gh:443/api/v1/en/PHC 2021 StatsBank/Education and Literacy/sch_attend_stat_table.px",query =list(Schoolattendancestatus =c("Attended in the past"),Geographic_Area =c("Ghana"),Locality =c("*"),Sex =c("*"),Age =c("15-19", "20-24", "25-29", "30-34","35-39", "40-44", "45-49", "50-54","55-59", "60-64", "65-69", "70-74","75-79", "80-84", "85-89", "90-94","95-99", "100+"))) %>%as.data.frame() %>%mutate(Locality =if_else(Locality =="All locality types", "All Locality Types", Locality)) %>%as_tibble() # as_tibble for nicer presentationhead(check_table) %>% kableExtra::kable()
Schoolattendancestatus
Geographic_Area
Locality
Sex
Age
Population (3 years and older)
Attended in the past
Ghana
All Locality Types
Both sexes
15-19
542151
Attended in the past
Ghana
All Locality Types
Both sexes
20-24
1661537
Attended in the past
Ghana
All Locality Types
Both sexes
25-29
1923249
Attended in the past
Ghana
All Locality Types
Both sexes
30-34
1725033
Attended in the past
Ghana
All Locality Types
Both sexes
35-39
1448599
Attended in the past
Ghana
All Locality Types
Both sexes
40-44
1059990
This table we want to check uses another age definition, so we just need to sum up all the age groups to have persons over the age of 15.
check_table_above_15 <- check_table %>%summarise(`Population (3 years and older)`=sum(`Population (3 years and older)`),.by =c(Schoolattendancestatus, Geographic_Area, Locality, Sex))
Now we have the values for persons aged 15 and over, which would correspond to the value for all ages in our initial table. Let’s select the corresponding variables between the tables and join them together.
cross_check <- from_api %>%filter(Age =="All ages"& Highest_level_of_education =="Total") %>%select(Geographic_Area, Locality, Sex, `Population (15 years and older)`) %>%## Joining data togetherleft_join(check_table_above_15 %>%select(Geographic_Area, Locality, Sex, `Population (3 years and older)`),by =join_by(Geographic_Area, Locality, Sex)) %>%## Calculating differencemutate(diff =`Population (15 years and older)`-`Population (3 years and older)`)
Now we have joined the two tables together and calculated a difference variable that hopefully should be 0.
unique(cross_check$diff)
[1] 0
Which it is. The values shared across the two tables are the same, which is good as it indicates a consistent methodology in the two tables. This could also be a way to check a newly generated PX-table if you know that you have a PX-table in your database that shares some of the values.