--- title: "Comparing Rurality Classification Schemes with rurality_spec()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Comparing Rurality Classification Schemes with rurality_spec()} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) set.seed(42) ``` ## The problem Research on "rural" outcomes in the United States faces an awkward fact: there is no single definition of rural. At least four classification schemes are in widespread use, each maintained by a different agency with a different purpose: - **USDA RUCC** (Rural-Urban Continuum Codes, 2023) — nine categories built from county-level metro adjacency and population. - **USDA RUCA** (Rural-Urban Commuting Area, 2020) — ten categories defined at the ZCTA level from commuting flows, available here aggregated to counties via modal ZCTA. - **NCHS** (Urban-Rural Classification Scheme for Counties, 2023) — six categories developed by the National Center for Health Statistics for health-outcome research. - **OMB** (Metropolitan/Micropolitan Statistical Areas) — a three-way metro/micro/noncore split based on core urban populations and commuting. A researcher asking "does X differ between rural and urban counties?" implicitly picks one scheme, one way of collapsing it (ordinal vs. binary metro/nonmetro), and one set of covariates. Different choices often produce different answers, and the literature is full of conflicting claims about rural–urban gaps that are really claims about *which scheme was used*. `rurality_spec()` addresses this by running a **specification curve**: the same outcome model is refit across every reasonable combination of scheme, form, and covariate adjustment, and the distribution of estimates is reported. This is a *sensitivity* tool, not a model-selection tool. The point is not to find the specification with the largest coefficient or smallest p-value — that would be p-hacking. The point is to show how much your substantive conclusion depends on the (often arbitrary) choices an analyst must make. ```{r setup} library(rurality) library(dplyr) ``` ## The `county_crosswalk` dataset All four schemes, plus ACS 2022 covariates, are harmonized onto a single county backbone in `county_crosswalk`: ```{r} county_crosswalk ``` The key columns: | Column | Scheme / source | |---|---| | `rucc_2023` | USDA RUCC (1–9) | | `ruca_2020_county` | USDA RUCA modal ZCTA (1–10) | | `nchs_2023` | NCHS (1–6) | | `omb_class` | OMB (`metro`, `micro`, `noncore`) | | `pct_urban_2020` | 2020 Census percent urban | | `pop_total`, `median_inc`, `pct_ba_plus`, `pct_nh_white` | ACS 2022 5-year | You can use this directly for any analysis that needs a single harmonized rurality crosswalk. ## Running `rurality_spec()` Bring your own county-level outcome data, keyed by 5-digit FIPS. Here we simulate one for illustration: ```{r} df <- county_crosswalk |> transmute( fips, # Simulated outcome: mild negative association with RUCC + noise y = 0.60 - 0.02 * rucc_2023 + rnorm(n(), 0, 0.10) ) head(df) ``` The basic call: ```{r, fig.cap = "Specification curve across 24 models."} results <- rurality_spec(df, outcome = "y") ``` The returned tibble has one row per specification: ```{r} results |> select(scheme, form, covars, n, estimate, p.value, r.squared) ``` By default, `rurality_spec()` fits 4 schemes × 2 forms × 3 covariate sets = 24 OLS models. You can narrow the space if you only care about a subset: ```{r} rurality_spec( df, outcome = "y", schemes = c("rucc", "nchs"), forms = "ordinal", covar_sets = c("minimal", "full"), plot = FALSE ) |> select(scheme, form, covars, estimate, p.value) ``` ## Reading the specification curve The plot produced by default shows: - **x-axis**: specifications ordered by coefficient estimate. - **y-axis**: the coefficient on the rurality predictor, with a 95% confidence interval as a vertical line. - **color**: blue if significant at p < 0.05, grey otherwise. - **facets**: one panel per outcome (when multiple outcomes are supplied). Three patterns to look for: 1. **Robust effect** — the dots cluster tightly around a nonzero value and most are significant. Your substantive finding does not depend on analyst choices. 2. **Fragile effect** — the dots span zero, some positive, some negative, few significant. What looks like "an effect of rurality" in any single model is largely an artifact of the scheme and cutpoint chosen. 3. **Systematic structure** — estimates vary with the *type* of choice (e.g., RUCA consistently larger than OMB, or binary forms larger than ordinal). That is a substantive finding about measurement: different schemes capture different things. ## Multiple outcomes at once You can pass a character vector of outcome columns: ```{r} df2 <- df |> mutate(y2 = 0.40 + rnorm(n(), 0, 0.10)) # no real rurality signal results2 <- rurality_spec(df2, outcome = c("y", "y2")) ``` Each outcome gets its own facet in the plot, and each row in `results2` is tagged with its `outcome`. ## Reporting results When writing up a spec-curve analysis, report the *distribution*, not the single "best" cell: ```{r} results |> summarise( n_specs = n(), median_est = median(estimate, na.rm = TRUE), q25 = quantile(estimate, 0.25, na.rm = TRUE), q75 = quantile(estimate, 0.75, na.rm = TRUE), pct_signif = mean(p.value < 0.05, na.rm = TRUE), pct_same_sign = mean(sign(estimate) == sign(median(estimate, na.rm = TRUE)), na.rm = TRUE) ) ``` A defensible write-up might read: *"Across 24 specifications varying classification scheme (RUCC, RUCA, NCHS, OMB), functional form (ordinal, binary), and covariate set, the estimated association between rurality and Y had a median of −0.021 (IQR: −0.025 to −0.017); 88% of specifications were significant at p < 0.05 and all recovered the same sign."* ## Further reading The specification-curve approach draws on two closely related literatures: - Simonsohn, Simmons, and Nelson (2020), *Nature Human Behaviour* — the canonical reference for specification curve analysis. - Steegen, Tuerlinckx, Gelman, and Vanpaemel (2016), *Perspectives on Psychological Science* — "Increasing Transparency Through a Multiverse Analysis," the broader multiverse framework. ## Caveats - `rurality_spec()` currently fits OLS only. For binary or count outcomes, fit the specifications manually using `county_crosswalk` as the backbone. - Standard errors are classical; clustered/robust SEs are not built in yet. - The three covariate sets are opinionated defaults. Pass your own via the `covariates` argument to append them to every specification.