Plumbing the Data:
Mapping U.S. Water Insecurity
Final project proposal - INFO 526
This project explores county-level water insecurity in the U.S. for 2022 and 2023, focusing on access to complete indoor plumbing. Using data from the American Community Survey (ACS), we analyze geographic and demographic patterns across roughly 850 counties each year (848 in 2022 and 854 in 2023). The goal is to identify regional disparities and year-over-year trends to better understand the social factors linked to plumbing access.
Packages Setup
Installed Packages
Dataset
# Load water insecurity data from the TidyTuesday project (2025-01-28)
# This dataset explores social vulnerability and access to complete indoor
# plumbing across U.S. counties,
# curated by Niha Pereira and featured in the blog post:
# "Mapping water insecurity in R with tidycensus"
# Original data sources include the U.S. Census Bureau (ACS) and the USGS Vizlab’s
# “Unequal Access to Water” visualization.
# Repo:
# https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-01-28
water_insecurity_2022 <- read_csv(
  here("water_insecurity", "water_insecurity_2022.csv")
)
water_insecurity_2023 <- read_csv(
  here("water_insecurity", "water_insecurity_2023.csv")
)
# Inspect the structure of both years
glimpse(water_insecurity_2022)
glimpse(water_insecurity_2023)
Rows: 848
Columns: 7
$ geoid <chr> "01069", "04001", "06037", "06097", "06001", …
$ name <chr> "Houston County, Alabama", "Apache County, Ar…
$ year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 202…
$ geometry <chr> "list(list(c(975267.980555021, 975512.9445474…
$ total_pop <dbl> 108079, 65432, 9721138, 482650, 1628997, 8978…
$ plumbing <dbl> 93, 2440, 6195, 148, 808, 18, 128, 0, 123, 13…
$ percent_lacking_plumbing <dbl> 0.08604817, 3.72906223, 0.06372711, 0.0306640…
Rows: 854
Columns: 7
$ geoid <chr> "01003", "01069", "06037", "06087", "06097", …
$ name <chr> "Baldwin County, Alabama", "Houston County, A…
$ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ geometry <chr> "list(list(c(765297.99052762, 765703.76567671…
$ total_pop <dbl> 253507, 108462, 9663345, 261547, 481812, 1155…
$ plumbing <dbl> 271, 30, 5248, 187, 308, 517, 4, 198, 1269, 8…
$ percent_lacking_plumbing <dbl> 0.106900401, 0.027659457, 0.054308317, 0.0714…
About the ACS Plumbing Data
For the final project, we selected the U.S. Water Insecurity dataset from TidyTuesday (2025-01-28). The 2022 and 2023 datasets were collected by the U.S. Census Bureau’s American Community Survey (ACS). Each dataset provides comprehensive county-level information, including geographic boundaries, population size, the number of households lacking plumbing, and the percentage of the population lacking plumbing facilities.
The two datasets include the following variables:
geoid
: The U.S. Census Bureau ACS county id.
name
: The U.S. Census Bureau ACS county name.
year
: The year of U.S. Census Bureau ACS sample.
geometry
: The county geographic boundaries.
total_pop
: The total population.
plumbing
: The total owner occupied households lacking plumbing facilities.
percent_lacking_plumbing
: The percent of population lacking plumbing facilities.
water_insecurity_2022_data
Code: Water Insecurity 2022 data analysis
# Create summary table (types and missing counts are entered by hand)
table1 <- tibble(
  variable = c(
    "geoid",
    "name",
    "geometry",
    "year",
    "total_pop",
    "plumbing",
    "percent_lacking_plumbing"
  ),
  types = c(
    "character",
    "character",
    "character",
    "numeric",
    "numeric",
    "numeric",
    "numeric"
  ),
  missing_count = c(0, 0, 0, 0, 0, 2, 2),
  missing_percent = c(0, 0, 0, 0, 0, 0.23585, 0.23585)
)
# Format the table with the gt library
gt_table_1 <- table1 |>
  gt() |>
  tab_header(
    # md() renders the title as markdown
    title = md("Table 1: Variable Summary `water_insecurity_2022_dat`")
  ) |>
  fmt(
    columns = missing_percent,
    fns = function(x) paste0(formatC(x, format = "f", digits = 3), " %")
  ) |>
  cols_label(
    variable = "Variable Name",
    types = "Type",
    missing_count = "Missing Count",
    missing_percent = "Missing (%)"
  ) |>
  tab_style(
    style = cell_text(weight = "bold", align = "center"),
    locations = cells_column_labels(everything())
  ) |>
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_body(everything())
  ) |>
  tab_options(
    table.font.size = "small",
    column_labels.font.size = "medium",
    table.width = pct(100) # full width of container
  )
Table 1: Variable Summary water_insecurity_2022_dat

| Variable Name | Type | Missing Count | Missing (%) |
|---|---|---|---|
| geoid | character | 0 | 0.000 % |
| name | character | 0 | 0.000 % |
| geometry | character | 0 | 0.000 % |
| year | numeric | 0 | 0.000 % |
| total_pop | numeric | 0 | 0.000 % |
| plumbing | numeric | 2 | 0.236 % |
| percent_lacking_plumbing | numeric | 2 | 0.236 % |
water_insecurity_2023_data
Code: Water Insecurity 2023 data analysis
# Create summary table (types and missing counts are entered by hand)
table2 <- tibble(
  variable = c(
    "geoid",
    "name",
    "geometry",
    "year",
    "total_pop",
    "plumbing",
    "percent_lacking_plumbing"
  ),
  types = c(
    "character",
    "character",
    "character",
    "numeric",
    "numeric",
    "numeric",
    "numeric"
  ),
  missing_count = c(0, 0, 0, 0, 0, 1, 1),
  missing_percent = c(0, 0, 0, 0, 0, 0.1171, 0.1171)
)
# Format the table with the gt library
gt_table_2 <- table2 |>
  gt() |>
  tab_header(
    # md() renders the title as markdown
    title = md("Table 2: Variable Summary `water_insecurity_2023_dat`")
  ) |>
  fmt(
    columns = missing_percent,
    fns = function(x) paste0(formatC(x, format = "f", digits = 3), " %")
  ) |>
  cols_label(
    variable = "Variable Name",
    types = "Type",
    missing_count = "Missing Count",
    missing_percent = "Missing (%)"
  ) |>
  tab_style(
    style = cell_text(weight = "bold", align = "center"),
    locations = cells_column_labels(everything())
  ) |>
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_body(everything())
  ) |>
  tab_options(
    table.font.size = "small",
    column_labels.font.size = "medium",
    table.width = pct(100) # full width of container
  )
Table 2: Variable Summary water_insecurity_2023_dat

| Variable Name | Type | Missing Count | Missing (%) |
|---|---|---|---|
| geoid | character | 0 | 0.000 % |
| name | character | 0 | 0.000 % |
| geometry | character | 0 | 0.000 % |
| year | numeric | 0 | 0.000 % |
| total_pop | numeric | 0 | 0.000 % |
| plumbing | numeric | 1 | 0.117 % |
| percent_lacking_plumbing | numeric | 1 | 0.117 % |
Synopsis of data quality:
These two datasets contain data on the lack of complete indoor plumbing in U.S. counties for the years 2022 and 2023. The 2022 dataset covers 848 counties, while the 2023 dataset covers 854. The difference in county counts appears to reflect changes in which counties were reported in each year: the 2023 data include counties that are absent from the 2022 data. Each dataset contains seven variables: geoid, name, geometry, year, total_pop, plumbing, and percent_lacking_plumbing. The columns plumbing and percent_lacking_plumbing have a small number of missing values (2 of 848 rows in 2022 and 1 of 854 rows in 2023, under 0.3% in both years), which should not significantly affect the overall analysis.
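The missing-value counts in Tables 1 and 2 are entered by hand. As a minimal sketch, they could instead be computed from the data. The helper `summarise_missing()` and the toy data below are hypothetical and are not part of the original analysis.

```r
library(tibble)

# Hypothetical helper (an assumption, not from the proposal): report the type
# and missingness of every column in a data frame.
summarise_missing <- function(df) {
  tibble(
    variable = names(df),
    types = unname(vapply(df, function(col) class(col)[1], character(1))),
    missing_count = unname(vapply(df, function(col) sum(is.na(col)), integer(1))),
    missing_percent = 100 * missing_count / nrow(df)
  )
}

# Toy stand-in with invented values; in the project this would be
# water_insecurity_2022 or water_insecurity_2023.
toy_data <- tibble(
  geoid = c("00001", "00002", "00003", "00004"),
  total_pop = c(1000, 2000, 3000, 4000),
  plumbing = c(10, NA, 5, 0)
)

missing_summary <- summarise_missing(toy_data)
missing_summary
```

Passing a summary like this to `gt()` in place of a hand-built table would keep the reported counts in step with the data if the source files change.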
Why we chose this dataset:
We chose this dataset because water insecurity is a critical public health issue that may be influenced by various social vulnerability indicators. Water insecurity is a multifaceted issue that encompasses not only access to water but also its quality, affordability, reliability, and infrastructure support. In this project, we focus specifically on plumbing insecurity, which remains a critical and measurable dimension of basic water access. The data, sourced from the American Community Survey (ACS), provide county-level information on indoor plumbing insecurity for both 2022 and 2023. Since they cover many counties and span two years, they are well-suited for analyzing year-over-year trends and regional disparities across the U.S. Improved water security can lead to reduced disease transmission, better hygiene, and overall improvements in quality of life. Additionally, the dataset is clean, nearly complete (only a handful of missing values), well-documented, and well-suited to collaborative analysis and visualization.
Team roles
Each team member will be assigned one specific question to focus on (Yashi – Question 1, Nathan – Question 2). They will be responsible for independently cleaning, preparing, and analyzing the data related to their assigned question to ensure accuracy and reliability. Throughout the process, team members will collaborate closely during the problem analysis phase, sharing insights and providing constructive feedback to improve problem-solving strategies and ensure their approaches align with the overall objectives of the project.
In addition to their individual tasks, both team members will actively contribute to the collective development of the project by co-authoring the proposal, designing and delivering the presentation, and building and maintaining the project website.
Questions
Note:
While water insecurity encompasses a range of issues, our analysis specifically focuses on plumbing insecurity, as reported in 2022 and 2023.
The two questions to be answered are:
Question 1: Which counties had the highest plumbing insecurity in 2022 and 2023?
Question 2: How did the plumbing insecurity change from 2022 to 2023 across U.S. counties, and where do significant spatial clusters (hotspots and coldspots) of worsening or improving plumbing access exist?
Analysis plan
The following are the approaches we will be using for each question.
Approach for question 1
We want to visualize which U.S. counties experienced the highest level of plumbing insecurity in 2022 and 2023. To do so, we will create bar graphs showing the 10 counties with the highest percentages of households lacking complete indoor plumbing in each year. Each year will be plotted separately to highlight the most affected regions in the U.S.
We will clean the data using the dplyr package, filtering rows and selecting the variables required for the analysis: name, percent_lacking_plumbing, and year. We will sort the counties in descending order of percent_lacking_plumbing within each year and keep the top 10. These subsets will then be combined into a single dataset for visualization.
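A minimal sketch of this preparation step, assuming the two years are stacked before ranking. The toy data frames stand in for `water_insecurity_2022` and `water_insecurity_2023`, and their values are invented for illustration.

```r
library(dplyr)

# Toy stand-ins with invented values (12 counties, one with a missing value)
toy_2022 <- tibble::tibble(
  name = paste("County", LETTERS[1:12]),
  year = 2022,
  percent_lacking_plumbing = c(0.1, 3.7, 0.06, 0.03, 0.5, NA,
                               1.2, 0.9, 0.2, 0.4, 2.1, 0.7)
)
toy_2023 <- toy_2022 |>
  mutate(year = 2023, percent_lacking_plumbing = percent_lacking_plumbing * 0.9)

top_counties <- bind_rows(toy_2022, toy_2023) |>
  select(name, year, percent_lacking_plumbing) |>
  filter(!is.na(percent_lacking_plumbing)) |>   # drop counties with no estimate
  group_by(year) |>
  slice_max(percent_lacking_plumbing, n = 10, with_ties = FALSE) |>
  ungroup()

top_counties
```

Setting `with_ties = FALSE` keeps exactly 10 rows per year. Whether ties should instead be retained is a choice the sketch leaves open.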
To visualize the results, we will create bar graphs and a choropleth map, faceted by year, to compare 2022 with 2023 and explore year-over-year changes in plumbing access. These plots will address our first research question by identifying the 10 counties with the highest plumbing insecurity in the U.S. in 2022 and 2023, along with any counties that shifted notably between the two years.
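The faceted bar graph could look roughly like the sketch below, which uses ggplot2 and a small invented data frame (`toy_top`) in place of the real top-10 subset. Labels and styling are placeholders, not the final design.

```r
library(ggplot2)

# Toy stand-in for the combined top-counties data (invented values)
toy_top <- data.frame(
  name = rep(c("County A", "County B", "County C"), times = 2),
  year = rep(c(2022, 2023), each = 3),
  percent_lacking_plumbing = c(3.7, 2.1, 1.2, 3.1, 2.4, 0.9)
)

top_plot <- ggplot(
  toy_top,
  # reorder() sorts counties by their mean value across both years
  aes(x = percent_lacking_plumbing, y = reorder(name, percent_lacking_plumbing))
) +
  geom_col() +
  facet_wrap(~year) +   # one panel per year
  labs(
    x = "Lacking complete plumbing (%)",
    y = NULL,
    title = "Counties with the highest plumbing insecurity"
  )

# print(top_plot) draws the chart
```

Because the top-10 lists may differ between years, the shared y-axis will show every county that appears in either year. `facet_wrap(~year, scales = "free_y")` is one way to show each year's list on its own axis.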
Approach for question 2
We aim to identify and visualize spatial clusters of U.S. counties that experienced significant improvement or worsening in plumbing access between 2022 and 2023. Our goal is to reveal where changes in infrastructure equity are concentrated, rather than occurring randomly across the country. To do this, we will create a color-coded choropleth map representing hotspots (regions of worsening) and coldspots (regions of improvement) in the percentage of households lacking complete plumbing facilities.
Before analysis, we will clean and filter the data using the dplyr package to ensure consistency between the two years. Rows with missing percent_lacking_plumbing values will be removed to ensure accurate change calculations. Since the 2023 dataset includes additional counties not present in 2022, we will exclude these newly added counties, as we cannot calculate change for them. Only counties with valid data for both 2022 and 2023 will be retained, and we will compute the difference in percent_lacking_plumbing to measure change.
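The matching and differencing described above can be sketched as follows. The toy tables and the column names `pct_2022`, `pct_2023`, and `change` are illustrative assumptions. An inner join on `geoid` keeps only counties present in both years.

```r
library(dplyr)

# Toy stand-ins with invented values
toy_2022 <- tibble::tibble(
  geoid = c("00001", "00002", "00003"),
  name = c("County A", "County B", "County C"),
  percent_lacking_plumbing = c(0.50, 1.20, NA)
)
toy_2023 <- tibble::tibble(
  geoid = c("00001", "00002", "00003", "00004"),   # "00004" is new in 2023
  percent_lacking_plumbing = c(0.40, 1.50, 0.30, 0.10)
)

plumbing_change <- inner_join(
  toy_2022 |> select(geoid, name, pct_2022 = percent_lacking_plumbing),
  toy_2023 |> select(geoid, pct_2023 = percent_lacking_plumbing),
  by = "geoid"
) |>
  filter(!is.na(pct_2022), !is.na(pct_2023)) |>   # need both years to difference
  mutate(change = pct_2023 - pct_2022)            # positive = worsening

plumbing_change
```

With this sign convention, a positive `change` means a larger share lacking plumbing in 2023 than in 2022, which matches the "worsening" label used for hotspots.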
To detect spatial clusters of change, we will use the Getis-Ord Gi* (Local G*) statistic via the spdep package. This analysis will classify each county based on whether it is part of a statistically significant cluster of worsening or improvement, depending on the spatial pattern of its change values. Counties will be labeled as:
- Hotspots (worsening)
- Coldspots (improving)
- Not significant
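As a rough sketch of how the local G* statistic can be computed with spdep, the example below uses a toy 5 x 5 grid of square "counties" with invented change values. Queen contiguity and row-standardised weights are assumptions made for illustration; the proposal does not specify a neighbour definition.

```r
library(sf)
library(spdep)

# Toy 5 x 5 grid standing in for county polygons (not real geography)
grid <- st_make_grid(
  st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 5, ymax = 5))),
  n = c(5, 5)
)

# Invented change values: a 2 x 2 block of worsening in one corner
change <- rep(0, 25)
change[c(1, 2, 6, 7)] <- 1
toy_counties <- st_sf(change = change, geometry = grid)

# Queen contiguity: polygons sharing an edge or a corner are neighbours
nb <- poly2nb(toy_counties, queen = TRUE)

# The "star" variant (Gi*) counts each unit as its own neighbour
nb_self <- include.self(nb)
lw <- nb2listw(nb_self, style = "W")

# localG() returns one z-score per unit
toy_counties$gi_z <- as.numeric(localG(toy_counties$change, lw))

round(toy_counties$gi_z[c(1, 13, 25)], 2)
```

Two caveats apply to the real data. The `geometry` column is read from CSV as character, so it would need to be rebuilt as an sf geometry column before `poly2nb()` can be used. The roughly 850 counties in each file may also not all border one another, in which case a distance-based neighbour definition might be more suitable than contiguity.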
To frame the spatial hotspot mapping for Question 2 and strengthen the analysis, we introduce the following hypotheses:
Hypothesis: “Changes in plumbing access over 2022–2023 are spatially clustered, with neighboring counties showing similar directions of change.”
Null Hypothesis: “There is no spatial clustering; changes in plumbing access are randomly distributed geographically.”
This hypothesis supports the goal of detecting localized improvements or declines, and justifies the use of spatial statistical techniques (e.g., Local G*) to evaluate clustering significance.
In this context, a “significant” spatial cluster refers to a geographic pattern where counties with similar levels of plumbing access change are more spatially concentrated than would be expected by chance, judged at the two-sided 5% level (|z| ≥ 1.96):
- Hotspots: z ≥ 1.96 (upper tail)
- Coldspots: z ≤ -1.96 (lower tail)
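The thresholds above translate directly into a labeling rule. The function below is a hypothetical helper, shown only to make the rule concrete. It also reports the two-sided p-value implied by each z-score under a standard normal reference distribution.

```r
# Hypothetical helper: turn Gi* z-scores into cluster labels
classify_cluster <- function(z, threshold = 1.96) {
  label <- rep("Not significant", length(z))
  label[which(z >= threshold)] <- "Hotspot (worsening)"
  label[which(z <= -threshold)] <- "Coldspot (improving)"
  label[is.na(z)] <- NA_character_
  label
}

# Invented z-scores for illustration
z_scores <- c(2.50, -2.10, 0.30, 1.96, NA)

cluster_labels <- classify_cluster(z_scores)
p_two_sided <- 2 * pnorm(-abs(z_scores))   # two-sided normal p-value

data.frame(z = z_scores, label = cluster_labels, p = round(p_two_sided, 4))
```

Because one test is run per county, many tests are performed at once. Whether to adjust the p-values for multiple comparisons (for example with `p.adjust()`) is a decision this sketch leaves open.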
The results will be visualized using tmap to create a choropleth map, allowing us to identify where in the U.S. plumbing access is improving or deteriorating in a geographically meaningful way.
Note:
If the analysis of statistically significant changes over the 1-year timeframe for question two does not reveal clear spatial patterns or meaningful differences, the team will consider an alternative approach. This would involve examining the data based on relative percent changes—for example, identifying counties with at least a 10% or 25% increase or decrease in plumbing access. This method may highlight meaningful shifts that are not captured by strict significance testing, providing additional insights into spatial trends.
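The fallback could be sketched as below. Relative change is computed here on `percent_lacking_plumbing`, so an increase means worsening access. The function name, labels, and toy values are assumptions. Counties with a 2022 value of zero are left as `NA` because a relative change is undefined for them.

```r
# Hypothetical helper: flag counties by relative change in the share lacking plumbing
flag_relative_change <- function(pct_before, pct_after, threshold = 10) {
  rel <- ifelse(
    pct_before > 0,
    100 * (pct_after - pct_before) / pct_before,
    NA_real_   # undefined when the baseline is zero
  )
  label <- ifelse(
    rel >= threshold, "Worsened",
    ifelse(rel <= -threshold, "Improved", "Below threshold")
  )
  data.frame(relative_change = rel, label = label)
}

# Invented values: +30%, -5%, -40%, and a zero baseline
toy_flags <- flag_relative_change(
  pct_before = c(2.0, 1.0, 0.5, 0.0),
  pct_after  = c(2.6, 0.95, 0.3, 0.1),
  threshold  = 10
)
toy_flags
```

Relative changes can look large when the baseline percentage is very small, so a minimum baseline or an absolute-change check may be worth pairing with this rule.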
Variables examined per question
Table 3: Summary of Variables Examined by Research Question

| Variables | Q1 | Q2 |
|---|---|---|
| geoid | - | X |
| name | X | X |
| year | X | X |
| geometry | - | X |
| total_pop | - | - |
| plumbing | - | - |
| percent_lacking_plumbing | X | X |
Note: Current Proposal 2.0.0. Subject to change.
(version.feature.patch notation)