California cities (continued)

Let’s get back roughly to where we were with the R session from California cities, reading in a copy of the data pulled from Wikipedia.

> library(dplyr)
> library(fillpattern)
> library(sf)
> library(ggplot2)
> cities <- read_sf("~/Downloads/California_Cities_and_Identifiers_Blue_Version_view_-6943741225906831761.gpkg")
> stats <- read.csv(url("https://raw.githubusercontent.com/fadend/ca_cities_data/refs/heads/main/ca_cities.csv"), stringsAsFactors=FALSE)

Oops. There’s some disagreement on names:

> setdiff(stats$city_name, cities$city_name)
[1] "Angels Camp"      "California City"  "City of Industry"
> setdiff(cities$city_name, stats$city_name)
[1] "Angels"     "California" "Industry"

Let’s standardize with Wikipedia’s choices:

> mapping <- list(`Angels`="Angels Camp", `California`="California City", `Industry`="City of Industry")

# Do you have a nicer way to do this string replacement?
replaceStrings <- function(x, mapping) {
  replacement <- mapping[x]
  x[!is.na(names(replacement))] <- unlist(mapping, use.names=FALSE)
  return(x)
}

> cities$city_name <- replaceStrings(cities$city_name, mapping)
> setequal(cities$city_name, stats$city_name)
[1] TRUE

Merge using dplyr‘s inner_join:

> d <- inner_join(cities, stats, by="city_name")
> nrow(d)
[1] 483
> class(d)
[1] "sf"         "tbl_df"     "tbl"        "data.frame"

How do the areas provided on Wikipedia compare with what we can compute ourselves?

areas <- st_area(d)
units(areas) <- "mi2"
d$computed_area_mi2 <- as.numeric(areas)
d$area_abs_diff <- abs(d$area_mi2 - d$computed_area_mi2)
d$area_rel_abs_diff <- d$area_abs_diff / d$computed_area_mi2

The top cities in terms of absolute difference in area between that provided on Wikipedia vs what we computed based on the geometries:

city_name	area_abs_diff (mi^2)
Whittier	36.10904026
San Diego	14.54519154
South Lake Tahoe	7.293364377
Eureka	6.507250615
Rancho Cucamonga	6.401620618
Coronado	6.09243771
Lake Elsinore	5.313489877
Stockton	4.373923467
Los Angeles	3.757396269
McFarland	3.673710795

Looking at a related plot:

> ggplot(d, aes(x=computed_area_mi2, y=area_mi2, label=city_name)) + geom_point() + geom_abline(slope=1) + geom_label(data=subset(d, area_abs_diff >= head(sort(area_abs_diff, decreasing=TRUE), n=2)[2]), hjust="left", nudge_x=5)

Probably (?) the relative difference is more important in detecting problems with the geometries. Whittier stands out again:

city_name	area_rel_abs_diff
Whittier	71.1%
McFarland	57.8%
Coronado	43.8%
South Lake Tahoe	41.7%
Eureka	40.6%
Mountain House	34.0%
Ukiah	29.2%
Patterson	23.2%
National City	21.6%
Woodlake	20.0%

Here’s what it looks like on OpenStreetMap:

Comparing that with the geometry we have, something seems off:

ggplot(subset(d, city_name == "Whittier")) + geom_sf()

It turns out, the city of Whittier also provides geometry for the city: https://city-of-whittier-open-data-portal-whittier.hub.arcgis.com/datasets/307f25443bcb4cbcad0cd4bcaf3ca945_0/explore?location=33.979039%2C-118.018727%2C12.80. Let’s compare that:

# Pull out just "Whittier" from the state provided data.
whittierWeird <- subset(d, city_name == "Whittier")

# Read in the data from the city.
whittier <- read_sf("~/Downloads/Whittier_City_Boundary.geojson")

# Ensure the coordinate reference systems match.
# Also drop some data related to water district by requiring
# CITY_TYPE == "City"
whittierTransformed <- st_transform(subset(whittier, CITY_TYPE == "City"), st_crs(whittierWeird))

> whittierArea <- st_area(whittierTransformed)
> units(whittierArea) <- "mi2"
> whittierArea
14.74093 [mi^2]

# Compare with what we computed earlier
> whittierWeird$computed_area_mi2
[1] 50.75904

# Make sure the name for the geometries match
st_geometry(whittierTransformed) <- "SHAPE"

# Get ready to combine them for plotting
> whittierWeird$source = "State"
> whittierTransformed$source = "City"
> common <- intersect(names(whittierWeird), names(whittierTransformed))

# See what they look like together
ggplot(rbind(whittierWeird[, common], whittierTransformed[, common]), aes(fill=source)) + geom_sf() + scale_fill_pattern(c("stripe45", "stripe135"), width=unit(1/100, 'npc'), alpha=0.5)

The area in square miles for the geometry provided by the city matches the area on Wikipedia. From the plot, we see that the geometry provided in the file from the state encompasses the whole area of the city and that the geometry provided by the city matches OpenStreetMap:

The area to the northeast of Whittier is Hacienda Heights, “an unincorporated suburban community” (Wikipedia). So, perhaps some unincorporated areas got included in “Whittier” in the state version of the map?

Wikipedia says that the following lat-lng are in Whittier: 33.965556, -118.024444. Double checking that:

> whittierPoint <- st_transform(st_sf(st_sfc(st_point(c(-118.024444, 33.965556)), crs=4326)), st_crs(d))
> d$city_name[sapply(st_contains(d, whittierPoint), length) > 0]
[1] "Whittier"

(I got some help from Gemini figuring out how to get that intersection code right. The last bit with sapply though, I had to figure out on my own.)

I’ll try sending an e-mail to the data steward now to ask about Whittier.

California cities (continued)

Comments

Leave a Reply Cancel reply