Let’s get back roughly to where we were with the R session from California cities, reading in a copy of the data pulled from Wikipedia.
> library(dplyr)
> library(fillpattern)
> library(sf)
> library(ggplot2)
> cities <- read_sf("~/Downloads/California_Cities_and_Identifiers_Blue_Version_view_-6943741225906831761.gpkg")
> stats <- read.csv(url("https://raw.githubusercontent.com/fadend/ca_cities_data/refs/heads/main/ca_cities.csv"), stringsAsFactors=FALSE)
Oops. There’s some disagreement on names:
> setdiff(stats$city_name, cities$city_name)
[1] "Angels Camp" "California City" "City of Industry"
> setdiff(cities$city_name, stats$city_name)
[1] "Angels" "California" "Industry"
Let’s standardize with Wikipedia’s choices:
> mapping <- list(`Angels`="Angels Camp", `California`="California City", `Industry`="City of Industry")
# Do you have a nicer way to do this string replacement?
replaceStrings <- function(x, mapping) {
replacement <- mapping[x]
x[!is.na(names(replacement))] <- unlist(mapping, use.names=FALSE)
return(x)
}
> cities$city_name <- replaceStrings(cities$city_name, mapping)
> setequal(cities$city_name, stats$city_name)
[1] TRUE
Merge using dplyr‘s inner_join:
> d <- inner_join(cities, stats, by="city_name")
> nrow(d)
[1] 483
> class(d)
[1] "sf" "tbl_df" "tbl" "data.frame"
How do the areas provided on Wikipedia compare with what we can compute ourselves?
areas <- st_area(d)
units(areas) <- "mi2"
d$computed_area_mi2 <- as.numeric(areas)
d$area_abs_diff <- abs(d$area_mi2 - d$computed_area_mi2)
d$area_rel_abs_diff <- d$area_abs_diff / d$computed_area_mi2
The top cities in terms of absolute difference in area between that provided on Wikipedia vs what we computed based on the geometries:
| city_name | area_abs_diff (mi^2) |
| Whittier | 36.10904026 |
| San Diego | 14.54519154 |
| South Lake Tahoe | 7.293364377 |
| Eureka | 6.507250615 |
| Rancho Cucamonga | 6.401620618 |
| Coronado | 6.09243771 |
| Lake Elsinore | 5.313489877 |
| Stockton | 4.373923467 |
| Los Angeles | 3.757396269 |
| McFarland | 3.673710795 |
Looking at a related plot:
> ggplot(d, aes(x=computed_area_mi2, y=area_mi2, label=city_name)) + geom_point() + geom_abline(slope=1) + geom_label(data=subset(d, area_abs_diff >= head(sort(area_abs_diff, decreasing=TRUE), n=2)[2]), hjust="left", nudge_x=5)

Probably (?) the relative difference is more important in detecting problems with the geometries. Whittier stands out again:
| city_name | area_rel_abs_diff |
| Whittier | 71.1% |
| McFarland | 57.8% |
| Coronado | 43.8% |
| South Lake Tahoe | 41.7% |
| Eureka | 40.6% |
| Mountain House | 34.0% |
| Ukiah | 29.2% |
| Patterson | 23.2% |
| National City | 21.6% |
| Woodlake | 20.0% |
Here’s what it looks like on OpenStreetMap:

Comparing that with the geometry we have, something seems off:
ggplot(subset(d, city_name == "Whittier")) + geom_sf()

It turns out, the city of Whittier also provides geometry for the city: https://city-of-whittier-open-data-portal-whittier.hub.arcgis.com/datasets/307f25443bcb4cbcad0cd4bcaf3ca945_0/explore?location=33.979039%2C-118.018727%2C12.80. Let’s compare that:
# Pull out just "Whittier" from the state provided data.
whittierWeird <- subset(d, city_name == "Whittier")
# Read in the data from the city.
whittier <- read_sf("~/Downloads/Whittier_City_Boundary.geojson")
# Ensure the coordinate reference systems match.
# Also drop some data related to water district by requiring
# CITY_TYPE == "City"
whittierTransformed <- st_transform(subset(whittier, CITY_TYPE == "City"), st_crs(whittierWeird))
> whittierArea <- st_area(whittierTransformed)
> units(whittierArea) <- "mi2"
> whittierArea
14.74093 [mi^2]
# Compare with what we computed earlier
> whittierWeird$computed_area_mi2
[1] 50.75904
# Make sure the name for the geometries match
st_geometry(whittierTransformed) <- "SHAPE"
# Get ready to combine them for plotting
> whittierWeird$source = "State"
> whittierTransformed$source = "City"
> common <- intersect(names(whittierWeird), names(whittierTransformed))
# See what they look like together
ggplot(rbind(whittierWeird[, common], whittierTransformed[, common]), aes(fill=source)) + geom_sf() + scale_fill_pattern(c("stripe45", "stripe135"), width=unit(1/100, 'npc'), alpha=0.5)
The area in square miles for the geometry provided by the city matches the area on Wikipedia. From the plot, we see that the geometry provided in the file from the state encompasses the whole area of the city and that the geometry provided by the city matches OpenStreetMap:

The area to the northeast of Whittier is Hacienda Heights, “an unincorporated suburban community” (Wikipedia). So, perhaps some unincorporated areas got included in “Whittier” in the state version of the map?
Wikipedia says that the following lat-lng are in Whittier: 33.965556, -118.024444. Double checking that:
> whittierPoint <- st_transform(st_sf(st_sfc(st_point(c(-118.024444, 33.965556)), crs=4326)), st_crs(d))
> d$city_name[sapply(st_contains(d, whittierPoint), length) > 0]
[1] "Whittier"
(I got some help from Gemini figuring out how to get that intersection code right. The last bit with sapply though, I had to figure out on my own.)
I’ll try sending an e-mail to the data steward now to ask about Whittier.
Leave a Reply