| Title: | The Shell Game - Audit Geographic Data Transformations |
|---|---|
| Description: | Reveals how data quality silently degrades during geographic transformations while variable labels remain unchanged. Demonstrates that transformation error is agnostic to both the variable (population, income, etc.) and the tool (R, Python, etc.). Provides a reproducible audit framework for quantifying the shift from observed to imputed data at each transformation hop. |
| Authors: | Phinn Markson [aut, cre] (ORCID: <https://orcid.org/0000-0002-9169-6095>) |
| Maintainer: | Phinn Markson <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-28 11:09:45 UTC |
| Source: | https://github.com/phinnphace/shellgame |
Main function to audit a complete geographic transformation pipeline. Quantifies the perturbation introduced at each hop and reveals the shell game.

Usage:

    audit_transformation(
      baseline_data,
      zip_zcta_map,
      hud_crosswalk,
      county_fips,
      variable_name = "value",
      value_col = "estimate"
    )

| Argument | Description |
|---|---|
| baseline_data | Data frame with baseline data at source geography |
| zip_zcta_map | ZIP-ZCTA association crosswalk |
| hud_crosswalk | HUD ZIP-County crosswalk |
| county_fips | Target county FIPS code |
| variable_name | Name of the variable being tracked (for reporting) |
| value_col | Name of the value column in baseline_data |

Value: An object of class "shellgame_audit" with audit results

Examples:

    ## Not run:
    result <- audit_transformation(
      baseline_data = hennepin_zcta_baseline,
      zip_zcta_map = hennepin_zip_zcta_map,
      hud_crosswalk = hennepin_hud_crosswalk,
      county_fips = "27053",
      variable_name = "population"
    )
    summary(result)
    ## End(Not run)

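
To make "perturbation" concrete, the sketch below compares a baseline total with the total recovered after the transformation and reports the absolute and percentage difference. The helper `perturbation_summary()` and the numbers are hypothetical, chosen only for illustration; the metrics actually stored in a "shellgame_audit" object may differ.

```r
# Hypothetical helper: NOT part of the shellgame package.
# Compares the value observed at the source geography with the value
# recovered at the target geography after all transformation hops.
perturbation_summary <- function(baseline_total, recovered_total) {
  stopifnot(is.numeric(baseline_total), is.numeric(recovered_total),
            baseline_total != 0)
  diff <- recovered_total - baseline_total
  data.frame(
    baseline  = baseline_total,
    recovered = recovered_total,
    abs_diff  = diff,
    pct_diff  = 100 * diff / baseline_total
  )
}

# Toy numbers, invented for illustration only
res <- perturbation_summary(baseline_total = 1000, recovered_total = 940)
print(res)
```

A negative `pct_diff` here means the target county ends up with less than the baseline, i.e. some of the value was reallocated elsewhere along the way.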
Validates that a Census API key is available for tidycensus.

Usage:

    check_census_key(install = FALSE)

| Argument | Description |
|---|---|
| install | Logical, whether to install the key for future sessions |

Value: Invisible TRUE if key exists, stops with error if not

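
A check of this kind can be sketched in base R. tidycensus stores the key in the `CENSUS_API_KEY` environment variable; whether `check_census_key()` inspects that variable in exactly this way is an assumption, and `has_census_key()` below is a hypothetical stand-in rather than the package's code.

```r
# Hypothetical stand-in for a key check; the package's own logic may differ.
has_census_key <- function() {
  key <- Sys.getenv("CENSUS_API_KEY", unset = "")
  if (!nzchar(key)) {
    stop("No Census API key found. Set CENSUS_API_KEY or run ",
         "tidycensus::census_api_key().", call. = FALSE)
  }
  invisible(TRUE)
}

# Convert the error into FALSE so a script can branch on the result
key_ok <- tryCatch(has_census_key(), error = function(e) FALSE)
```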
Generates all visualizations for an audit.

Usage:

    create_audit_report(
      audit_result,
      zcta_baseline_sf = NULL,
      zcta_geometric_sf = NULL,
      county_sf = NULL
    )

| Argument | Description |
|---|---|
| audit_result | A shellgame_audit object |
| zcta_baseline_sf | Optional: SF object with baseline ZCTAs |
| zcta_geometric_sf | Optional: SF object with geometric ZCTAs |
| county_sf | Optional: SF object with county boundary |

Value: List of ggplot2 objects

Returns a data frame of counties that received population redistributed from the target county during the transformation, ordered by magnitude.

Usage:

    extract_perturbed_population(audit_result, top_n = 10)

| Argument | Description |
|---|---|
| audit_result | A shellgame_audit object |
| top_n | Number of top counties to return (default: 10) |

Value: Data frame with columns: county, value

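
The ranking described above can be illustrated on a plain county-level data frame. This is a sketch under stated assumptions: `top_leakage()` is a hypothetical helper, the values are invented, and the real function reads its input from a shellgame_audit object whose internal layout is not documented here.

```r
# Toy county-level output of a transformation; values are invented.
county_values <- data.frame(
  county = c("27053", "27003", "27123", "27139"),
  value  = c(9400, 250, 300, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: NOT the package's extract_perturbed_population().
# Drops the target county, then ranks the remaining counties by the
# amount they received, largest first.
top_leakage <- function(df, target_fips, top_n = 10) {
  other <- df[df$county != target_fips, , drop = FALSE]
  other <- other[order(-other$value), , drop = FALSE]
  head(other, top_n)
}

leak <- top_leakage(county_values, target_fips = "27053", top_n = 2)
print(leak)
```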
Fetches ACS 5-year estimates for a specified variable at the ZCTA level using the Census API via the tidycensus package. Requires a Census API key (see https://api.census.gov/data/key_signup.html) and the tidycensus package to be installed.

Usage:

    get_zcta_baseline(variable, year = 2022, zctas = NULL)

| Argument | Description |
|---|---|
| variable | ACS variable code (e.g., "B01001_001" for total population) |
| year | ACS year (default: 2022) |
| zctas | Optional character vector of ZCTAs to filter to |

Value: Data frame with columns: zcta, estimate, moe

Examples:

    ## Not run:
    # Get population for all ZCTAs
    pop_data <- get_zcta_baseline("B01001_001", year = 2022)

    # Get population for specific ZCTAs
    hennepin_zctas <- c("55401", "55402", "55403")
    pop_data <- get_zcta_baseline("B01001_001", zctas = hennepin_zctas)
    ## End(Not run)

Ensures geographic identifiers are zero-padded to 5 digits.

Usage:

    pad_geoid(geoid)

| Argument | Description |
|---|---|
| geoid | Character or numeric vector of geographic identifiers |

Value: Character vector of 5-digit zero-padded GEOIDs

Examples:

    pad_geoid(c("123", "45678", 789))
    #> [1] "00123" "45678" "00789"

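
Zero-padding matters because ZIP and FIPS codes read from CSV files as numbers lose their leading zeros and then fail to join. A minimal base R sketch of the idea follows; `pad5()` is a hypothetical name and not necessarily how `pad_geoid()` is implemented.

```r
# Hypothetical helper illustrating left-padding to width 5.
pad5 <- function(geoid) {
  x <- trimws(as.character(geoid))
  # Number of zeros each element still needs (never negative)
  n_missing <- pmax(0L, 5L - nchar(x))
  paste0(strrep("0", n_missing), x)
}

pad5(c("123", "45678", 789))
#> [1] "00123" "45678" "00789"
```

Building the padding with `strrep()` avoids the platform-dependent behaviour of `formatC(..., flag = "0")` on character input.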
Creates a map showing the baseline ZCTAs used in the analysis.

Usage:

    plot_baseline_zctas(zcta_sf, county_sf, title = "Baseline ZCTAs")

| Argument | Description |
|---|---|
| zcta_sf | SF object with ZCTA geometries |
| county_sf | SF object with county boundary |
| title | Plot title |

Value: A ggplot2 object

Visualizes the discrepancy between geometric intersection and relationship-based membership.

Usage:

    plot_geometric_vs_relationship(
      zcta_baseline_sf,
      zcta_geometric_sf,
      county_sf,
      title = "Geometric vs Relationship Membership"
    )

| Argument | Description |
|---|---|
| zcta_baseline_sf | SF object with baseline ZCTAs (relationship-based) |
| zcta_geometric_sf | SF object with all geometrically intersecting ZCTAs |
| county_sf | SF object with county boundary |
| title | Plot title |

Value: A ggplot2 object

Creates a simple bar chart showing baseline vs recovered values.

Usage:

    plot_transformation_perturbation(audit_result)

| Argument | Description |
|---|---|
| audit_result | A shellgame_audit object |

Value: A ggplot2 object

Standardizes HUD crosswalk data with proper column names and formatting.

Usage:

    prep_hud_crosswalk(data, ratio_col = "TOT_RATIO")

| Argument | Description |
|---|---|
| data | Raw HUD crosswalk data frame |
| ratio_col | Name of the ratio column to use (default: "TOT_RATIO") |

Value: Data frame with standardized columns: zip, county, tot_ratio

Examples:

    ## Not run:
    hud_raw <- read.csv("HUD_ZIP_COUNTY.csv")
    hud <- prep_hud_crosswalk(hud_raw)
    ## End(Not run)

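
The standardization step can be sketched as follows. The raw column names `ZIP` and `COUNTY` are assumptions about the downloaded HUD file, the rows are invented, and `standardize_hud()` is a hypothetical helper rather than the package's implementation.

```r
# Toy raw crosswalk as read.csv() might return it: codes parsed as numbers,
# so the leading zero of ZIP 02134 is lost. Ratios are invented.
hud_raw <- data.frame(
  ZIP       = c(55401, 55401, 2134),
  COUNTY    = c(27053, 27123, 25025),
  TOT_RATIO = c(0.9, 0.1, 1.0)
)

# Hypothetical helper: selects, renames, and re-pads the key columns.
standardize_hud <- function(data, ratio_col = "TOT_RATIO") {
  needed <- c("ZIP", "COUNTY", ratio_col)
  missing_cols <- setdiff(needed, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  data.frame(
    zip       = sprintf("%05d", as.integer(data$ZIP)),
    county    = sprintf("%05d", as.integer(data$COUNTY)),
    tot_ratio = as.numeric(data[[ratio_col]]),
    stringsAsFactors = FALSE
  )
}

hud <- standardize_hud(hud_raw)
print(hud)
```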
Standardizes ZIP-ZCTA crosswalk data with proper column names and formatting.

Usage:

    prep_zip_zcta(data, zip_col = NULL, zcta_col = "zcta")

| Argument | Description |
|---|---|
| data | Raw ZIP-ZCTA crosswalk data frame |
| zip_col | Name of the ZIP code column. The signature default is NULL; the documented fallback is "ZIP_CODE" or "zip" |
| zcta_col | Name of the ZCTA column (default: "zcta") |

Value: Data frame with standardized columns: zcta, zip

Examples:

    ## Not run:
    zip_zcta_raw <- read.csv("ZiptoZCTA-Table 1.csv")
    zip_zcta <- prep_zip_zcta(zip_zcta_raw)
    ## End(Not run)

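
One way a NULL `zip_col` could resolve to "ZIP_CODE" or "zip" is to look for the first of those names present in the data. The sketch below shows that reading; `standardize_zip_zcta()` is hypothetical, the rows are invented, and the package may resolve the column differently.

```r
# Hypothetical helper: NOT the package's prep_zip_zcta().
standardize_zip_zcta <- function(data, zip_col = NULL, zcta_col = "zcta") {
  if (is.null(zip_col)) {
    # Assumed fallback order when no column name is supplied
    candidates <- intersect(c("ZIP_CODE", "zip"), names(data))
    if (length(candidates) == 0) {
      stop("No ZIP column found; pass zip_col explicitly.", call. = FALSE)
    }
    zip_col <- candidates[1]
  }
  stopifnot(zcta_col %in% names(data))
  out <- data.frame(
    zcta = as.character(data[[zcta_col]]),
    zip  = as.character(data[[zip_col]]),
    stringsAsFactors = FALSE
  )
  unique(out)  # drop repeated ZIP-ZCTA pairs
}

# Toy association table with one duplicated pair (invented rows)
raw <- data.frame(
  ZIP_CODE = c("55401", "55440", "55401"),
  zcta     = c("55401", "55401", "55401"),
  stringsAsFactors = FALSE
)
zz <- standardize_zip_zcta(raw)
print(zz)
```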
Print method for shellgame_audit

Usage:

    ## S3 method for class 'shellgame_audit'
    print(x, ...)

| Argument | Description |
|---|---|
| x | A shellgame_audit object |
| ... | Additional arguments (ignored) |

Executes both hops: ZCTA → ZIP → County. Tracks the complete swap from observed to imputed data.

Usage:

    run_full_transformation(
      baseline_data,
      zip_zcta_map,
      hud_crosswalk,
      value_col = "estimate",
      county_fips = NULL
    )

| Argument | Description |
|---|---|
| baseline_data | Data frame with ZCTA-level baseline data |
| zip_zcta_map | ZIP-ZCTA association table |
| hud_crosswalk | HUD ZIP-County crosswalk |
| value_col | Name of value column in baseline_data |
| county_fips | Optional county FIPS to filter final result |

Value: List with intermediate and final results

Examples:

    ## Not run:
    result <- run_full_transformation(
      baseline_data = zcta_pop,
      zip_zcta_map = zip_zcta,
      hud_crosswalk = hud,
      value_col = "pop",
      county_fips = "27053"
    )
    ## End(Not run)

Summary method for shellgame_audit

Usage:

    ## S3 method for class 'shellgame_audit'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | A shellgame_audit object |
| ... | Additional arguments (ignored) |

Performs the first hop: ZCTA → ZIP using association-based allocation. This is where the first swap occurs: observed data → imputed data.

Usage:

    transform_zcta_to_zip(baseline_data, zip_zcta_map, value_col = "estimate")

| Argument | Description |
|---|---|
| baseline_data | Data frame with columns: zcta, and a value column |
| zip_zcta_map | Data frame with columns: zcta, zip |
| value_col | Name of the value column in baseline_data (default: "estimate") |

Value: Data frame with columns: zip, value (allocated to ZIP level)

Examples:

    ## Not run:
    zip_data <- transform_zcta_to_zip(
      baseline_data = zcta_pop,
      zip_zcta_map = zip_zcta_assoc,
      value_col = "pop"
    )
    ## End(Not run)

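
The documentation does not spell out the allocation rule behind "association-based allocation". The sketch below shows one possible reading, an equal split of each ZCTA's value across the ZIP codes associated with it. That rule, the helper name `allocate_equal_split()`, and the toy rows are all assumptions made for illustration, not the package's confirmed method.

```r
# Toy inputs (invented): two ZCTAs, one of which maps to two ZIP codes.
zcta_pop <- data.frame(
  zcta     = c("55401", "55402"),
  estimate = c(1000, 600),
  stringsAsFactors = FALSE
)
zip_zcta <- data.frame(
  zcta = c("55401", "55401", "55402"),
  zip  = c("55401", "55440", "55402"),
  stringsAsFactors = FALSE
)

# Hypothetical allocation rule: split each ZCTA value equally among its ZIPs.
allocate_equal_split <- function(baseline, assoc, value_col = "estimate") {
  joined <- merge(assoc, baseline[, c("zcta", value_col)], by = "zcta")
  # Count how many ZIPs share each ZCTA
  n_zips <- ave(rep(1, nrow(joined)), joined$zcta, FUN = sum)
  joined$value <- joined[[value_col]] / n_zips
  # A ZIP linked to several ZCTAs accumulates a share from each
  aggregate(value ~ zip, data = joined, FUN = sum)
}

zip_vals <- allocate_equal_split(zcta_pop, zip_zcta)
print(zip_vals)
```

Under this rule the total is conserved, but each ZIP value is now an imputed share rather than an observed count, which is the swap the description refers to.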
Performs the second hop: ZIP → County using HUD TOT_RATIO allocation. This is where the second swap occurs: further imputation via proxy.

Usage:

    transform_zip_to_county(zip_data, hud_crosswalk, county_fips = NULL)

| Argument | Description |
|---|---|
| zip_data | Data frame with columns: zip, value |
| hud_crosswalk | Data frame with columns: zip, county, tot_ratio |
| county_fips | Optional FIPS code to filter to specific county |

Value: Data frame with columns: county, value (allocated to county level)

Examples:

    ## Not run:
    county_data <- transform_zip_to_county(
      zip_data = zip_pop,
      hud_crosswalk = hud,
      county_fips = "27053"
    )
    ## End(Not run)

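
Ratio-weighted allocation of this kind can be sketched in base R: each ZIP value is multiplied by the share of the ZIP that the crosswalk assigns to a county, and the products are summed by county. `allocate_by_ratio()` is a hypothetical helper and the rows are invented; the package's own handling of unmatched ZIPs or missing ratios is not covered.

```r
# Toy inputs (invented): ZIP 55440 is split 80/20 across two counties.
zip_vals <- data.frame(
  zip   = c("55401", "55440"),
  value = c(500, 500),
  stringsAsFactors = FALSE
)
hud <- data.frame(
  zip       = c("55401", "55440", "55440"),
  county    = c("27053", "27053", "27123"),
  tot_ratio = c(1.0, 0.8, 0.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: weight each ZIP value by tot_ratio, sum per county.
allocate_by_ratio <- function(zip_data, crosswalk, county_fips = NULL) {
  joined <- merge(zip_data, crosswalk, by = "zip")
  joined$value <- joined$value * joined$tot_ratio
  out <- aggregate(value ~ county, data = joined, FUN = sum)
  if (!is.null(county_fips)) {
    out <- out[out$county == county_fips, , drop = FALSE]
  }
  out
}

county_vals <- allocate_by_ratio(zip_vals, hud)
print(county_vals)  # 27053 receives 900, 27123 receives 100
```

Passing `county_fips = "27053"` keeps only the target county's row, so the 100 units assigned to the neighbouring county drop out of view unless they are inspected separately.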