5 PX-file from scratch

5.1 Small PX-file from scratch

This guide walks through how to turn microdata into a PX-file. Some steps depend on the types of variables in your data. Microdata here means data in which each row is one observation, so for population statistics each row is a person.

5.1.1 Inspecting data

We use the test dataset greenlanders from the pxmake package. It is a small dataset about the Greenlandic population with the variables cohort, gender, age and municipality.

library(pxmake)

str(greenlanders)

tibble [100 × 4] (S3: tbl_df/tbl/data.frame)
 $ cohort      : chr [1:100] "A" "A" "A" "A" ...
 $ gender      : chr [1:100] "female" "female" "female" "female" ...
 $ age         : int [1:100] 21 21 24 31 31 36 37 38 38 52 ...
 $ municipality: chr [1:100] "Sermersooq" "Sermersooq" "Kujalleq" "Kujalleq" ...

head(greenlanders)

# A tibble: 6 × 4
  cohort gender   age municipality
  <chr>  <chr>  <int> <chr>       
1 A      female    21 Sermersooq  
2 A      female    21 Sermersooq  
3 A      female    24 Kujalleq    
4 A      female    31 Kujalleq    
5 A      female    31 Sermersooq  
6 A      female    36 Kujalleq

5.1.2 Modifying and saving the file

We also want to use some data manipulation functions, so we load the tidyverse package, where the pipe operator comes in especially handy.

First we calculate a frequency variable for the data:

library(tidyverse)

gl_freq <- greenlanders %>% 
  count(across(everything()), name = "freq", .drop = FALSE)

head(gl_freq)

# A tibble: 6 × 5
  cohort gender   age municipality  freq
  <chr>  <chr>  <int> <chr>        <int>
1 A      female    21 Sermersooq       2
2 A      female    24 Kujalleq         1
3 A      female    31 Kujalleq         1
4 A      female    31 Sermersooq       1
5 A      female    36 Kujalleq         1
6 A      female    37 Kujalleq         1
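
A quick sanity check on the counting step, sketched here on a small made-up data frame (toy and toy_freq are hypothetical names, not part of pxmake): the frequencies should add up to the number of rows in the microdata, and each combination of values should appear in only one row.

```r
library(dplyr)

# Hypothetical toy microdata: one row per person
toy <- tibble(
  gender = c("female", "female", "male", "male", "male"),
  age    = c(21L, 21L, 21L, 30L, 30L)
)

# Same counting step as above, applied to the toy data
toy_freq <- toy %>%
  count(across(everything()), name = "freq")

# Every person is counted exactly once
stopifnot(sum(toy_freq$freq) == nrow(toy))

# Each combination of gender and age appears in exactly one row
stopifnot(!any(duplicated(toy_freq[c("gender", "age")])))
```

The same two checks can be run on gl_freq and greenlanders by swapping in those objects and their column names.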

Now we can start making our PX-file, adding a title, contact information and more:

gl_freq %>% 
  # converting to px
  px() %>% 
  # adding title
  px_title("Test file with Greenland sample population") %>% 
  # adding a matrix-name
  px_matrix("POP1") %>% 
  # information about units
  px_units("Persons") %>% 
  # Subject area
  px_subject_area("Population") %>% 
  # Subject code
  px_subject_code("POP") %>% 
  # data source
  px_source("Statistics Greenland") %>% 
  # contact 
  px_contact("Emil Thranholm") %>% 
  # add a total level for variables
  px_add_totals(c("cohort", "gender", "age", "municipality")) %>% 
  # variable labels, changing names to be displayed in px-web
  px_variable_label(tribble(~`variable-code`, ~`variable-label`,
                            "cohort", "Cohort",
                            "gender", "Gender",
                            "age", "Age",
                            "municipality", "Location")) %>% 
  # saving px-file
  px_save("px_greenland_testfile.px")

This showed how to use a dataset that is already in R. The next chapter, Updating PX-files, shows how to import data from Stata and convert it to a PX-file. The procedure is the same as above, just with a different data source.
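
To confirm that metadata survives the save, a file can be read back with px() and inspected. The sketch below is self-contained and writes to a temporary path (tmp_px is an illustrative name). It assumes that calling px_title() without a value returns the stored title, which is how the pxmake documentation describes the modifying functions; check the reference for your installed version.

```r
library(pxmake)
library(dplyr)

# Temporary output path, used only for this illustration
tmp_px <- tempfile(fileext = ".px")

# Build and save a minimal PX-file from the test data
greenlanders %>%
  count(across(everything()), name = "freq") %>%
  px() %>%
  px_title("Test file with Greenland sample population") %>%
  px_save(tmp_px)

# Read the file back into a px object
x <- px(tmp_px)

# Assumption: without a value argument, px_title() returns the current title
px_title(x)
```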

5.2 Hierarchies in PX-files

Above we made a simple PX-file from microdata. However, sometimes a variable has a hierarchical structure. Here the variable age serves as the example, as it can be grouped in multiple ways, e.g. into 5-year groups, 10-year groups or other groups.

5.2.1 Selection lists

Currently the age variable is in one-year groups. It could make sense to group these into 5-year and 10-year groups. Therefore, we add 5-year and 10-year groups to our data using the age_groups() function from the AMR package.

# If not installed
# install.packages("AMR")

age_classification <- gl_freq %>% 
  mutate(age5 = AMR::age_groups(age, split_at = "fives"),
         age10 = AMR::age_groups(age, split_at = "tens")) %>% 
  distinct(age, age5, age10) %>% 
  arrange(age5) %>% 
  select(valuecode = age, valuetext = age,
         `5 years classes` = age5,
         `10 years classes` = age10)

head(age_classification)

# A tibble: 6 × 4
  valuecode valuetext `5 years classes` `10 years classes`
      <int>     <int> <ord>             <ord>             
1        18        18 15-19             10-19             
2        19        19 15-19             10-19             
3        21        21 20-24             20-29             
4        24        24 20-24             20-29             
5        22        22 20-24             20-29             
6        20        20 20-24             20-29

The code above creates a data frame that can be used as input to px_classification(). The column valuecode is mandatory and holds the codes used in the data for the given variable. The column valuetext is also included here, but it is only required if the codes have associated texts in the PX-file, e.g. codes for economic activity (ISIC or similar). Every additional column represents an aggregation, and its column name is the aggregation name shown in PX-web. In this case there are two aggregations, 5 years classes and 10 years classes. Because the column names are displayed in PX-web, they are enclosed in backticks (`) so that they can contain spaces.
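
For illustration, a data frame of the same shape can be sketched without AMR, using cut() from base R. This is a swapped-in technique, not the approach used in this guide. The ages vector and the name age_classification_sketch are made up, and with these breaks any age of 100 or more would become NA.

```r
# Hypothetical ages standing in for the distinct values of the age variable
ages <- c(18L, 19L, 20L, 21L, 22L, 24L, 31L, 52L)

# Left-closed intervals: [15, 20) is labelled "15-19", [10, 20) is "10-19"
age5 <- cut(ages, breaks = seq(0, 100, by = 5), right = FALSE,
            labels = paste(seq(0, 95, by = 5), seq(4, 99, by = 5), sep = "-"),
            ordered_result = TRUE)
age10 <- cut(ages, breaks = seq(0, 100, by = 10), right = FALSE,
             labels = paste(seq(0, 90, by = 10), seq(9, 99, by = 10), sep = "-"),
             ordered_result = TRUE)

# check.names = FALSE keeps the spaces in the aggregation names
age_classification_sketch <- data.frame(
  valuecode = ages,
  valuetext = ages,
  `5 years classes` = age5,
  `10 years classes` = age10,
  check.names = FALSE
)

# Sanity check: each code occurs once, so it maps to one group per aggregation
stopifnot(!any(duplicated(age_classification_sketch$valuecode)))
```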

Now we use px_classification() to create the selection lists in R. Afterwards, we use px_save_classification() to export them as files: one “.vs” file and one or more “.agg” files. In simple terms, the “.vs” file tells PX-web which selection/aggregation lists to use, while the “.agg” files contain the actual aggregations, in this case the five-year and ten-year classes. More information can be found in the documentation for the PX-file format.

age_class <- px_classification(name = "age_select", prestext = "Age classification",
                               domain = "age_class", df = age_classification)

px_save_classification(age_class, getwd())

The name argument sets the name of the “.vs” file, so here the file is saved as “age_select.vs”. The df argument is the data frame with the classification/selection list created above. The domain is the name we need to put in our PX-file to indicate that it should use this particular selection list. The code below reads the PX-file into R and sets the domain for age to age_class, so that the selection lists with the domain age_class are used for the variable age.

px("px_greenland_testfile.px") %>% 
  px_domain(tribble(~`variable-code`, ~domain,
                    "age", "age_class")) %>% 
  px_save("px_greenland_testfile.px")

The modified PX-file is then saved again. It is ready to be loaded into PX-web once it is placed in the same folder as the “.vs” and “.agg” files.
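
Since the PX-file and the classification files need to sit together, a small base R check along these lines can catch a misplaced file before loading into PX-web. It is only a sketch: it assumes everything was written to the working directory, as in the code above, and it makes no assumption about how the “.agg” files are named.

```r
# Folder to check; the working directory is where the files were saved above
folder <- getwd()

px_file   <- file.path(folder, "px_greenland_testfile.px")
vs_files  <- list.files(folder, pattern = "\\.vs$")
agg_files <- list.files(folder, pattern = "\\.agg$")

# Report anything that is missing from the folder
if (!file.exists(px_file)) message("PX-file not found in ", folder)
if (length(vs_files) == 0) message("No .vs file found in ", folder)
if (length(agg_files) == 0) message("No .agg files found in ", folder)

# Show what was found
print(vs_files)
print(agg_files)
```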

5.2.2 Tree-like hierarchy

Instead of selection lists, it is also possible to have a hierarchy in a tree-like structure. However, this feature is not yet supported in pxmake, hence it is recommended to use the selection lists presented above for now.

If a tree-like hierarchy is still needed, it is a bit more complicated, but it can be done through the HIERARCHIES keyword in the PX-file format.