import numpy as np
import polars as pl
from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())
metro = pl.read_csv("data/acs_cbsa.csv")16 Spatial Data
16.1 Setup
Load all of the modules and datasets needed for the chapter.
16.2 Introduction
Every piece of data we have worked with so far in this book has existed in an abstract space defined only by the relationships between variables. A country’s GDP, a patient’s blood pressure, or a product’s sales figures are all numbers that live in a purely numerical realm. But much of the world’s most valuable data is inherently tied to physical locations on Earth. Where did this crime occur? Which neighborhoods have the highest rates of asthma? How far is the nearest hospital from each school? These questions require a fundamentally different approach to data analysis, one that treats location not as just another variable but as a geometric object with shape, area, and spatial relationships to other objects.
Spatial data analysis bridges the gap between traditional tabular data and the physical world. At its core, spatial data associates each observation with a geometry: a mathematical representation of a location or region on Earth’s surface. A geometry might be as simple as a single point (the location of a coffee shop) or as complex as a polygon with thousands of vertices (the boundary of a congressional district). The key insight is that once we have geometries, we can ask questions that would be impossible with ordinary data: Which points fall inside which polygons? Which polygons share a border? What is the nearest neighbor to each observation?
In this chapter, we introduce tools for working with spatial data in Python. Because the Polars library we have used throughout this book does not natively support spatial operations, we provide a wrapper class called DSGeo that allows us to leverage the powerful GeoPandas library while keeping our data in Polars DataFrames. This approach gives us the best of both worlds: the speed and ergonomics of Polars for standard data manipulation, combined with the mature spatial functionality of GeoPandas when we need it.
16.3 Spatial Points
The simplest type of spatial data consists of points: individual locations specified by their coordinates. Points are used to represent discrete features like cities, sensor stations, retail locations, or any other phenomenon that can be meaningfully reduced to a single location. Each point is defined by two numbers: a longitude (the east-west position) and a latitude (the north-south position).
We begin with a dataset containing information about metropolitan statistical areas (MSAs) in the United States. Each row represents a single metro region, with columns for population, economic indicators, and geographic coordinates.
metro| name | geoid | quad | lon | lat | pop | density | age_median | hh_income_median | percent_own | rent_1br_median | rent_perc_income | division |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | i64 | f64 | str |
| "New York" | 35620 | "NE" | -74.101056 | 40.76877 | 20.011812 | 1051.306467 | 42.9 | 86445 | 55.3 | 1430 | 31.0 | "Middle Atlantic" |
| "Los Angeles" | 31080 | "W" | -118.148722 | 34.219406 | 13.202558 | 1040.647281 | 41.6 | 81652 | 51.3 | 1468 | 33.6 | "Pacific" |
| "Chicago" | 16980 | "NC" | -87.95882 | 41.700605 | 9.607711 | 508.629406 | 41.9 | 78790 | 68.9 | 1060 | 29.0 | "East North Central" |
| "Dallas" | 19100 | "S" | -96.970508 | 32.84948 | 7.54334 | 323.181404 | 41.3 | 76916 | 64.1 | 1106 | 29.1 | "West South Central" |
| "Houston" | 26420 | "S" | -95.401574 | 29.787083 | 7.048954 | 316.543514 | 41.0 | 72551 | 65.1 | 997 | 30.0 | "West South Central" |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Zapata" | 49820 | "S" | -99.168533 | 27.000573 | 0.013945 | 5.081383 | 41.4 | 34406 | 77.1 | 321 | 34.6 | "West South Central" |
| "Ketchikan" | 28540 | "O" | -130.927217 | 55.582484 | 0.013939 | 1.046556 | 44.6 | 77820 | 67.5 | 946 | 31.2 | "Pacific" |
| "Craig" | 18780 | "W" | -108.207523 | 40.618749 | 0.01324 | 1.077471 | 44.5 | 58583 | 72.8 | 485 | 29.0 | "Mountain" |
| "Vernon" | 46900 | "S" | -99.240853 | 34.080611 | 0.012887 | 5.086411 | 41.5 | 45262 | 62.0 | 503 | 22.0 | "West South Central" |
| "Lamesa" | 29500 | "S" | -101.947637 | 32.742488 | 0.012371 | 5.291467 | 41.1 | 42778 | 71.5 | 611 | 34.5 | "West South Central" |
Because longitude and latitude are just numbers, we can create a basic visualization using plotnine without any special spatial tools. The longitude values go on the x-axis and latitude values on the y-axis.
This scatter plot is functional but has several limitations. The most obvious is that it treats longitude and latitude as if they were ordinary Cartesian coordinates, when in fact they represent positions on a curved surface. The distortion is particularly noticeable at higher latitudes, where equal increments in longitude correspond to shorter physical distances. Additionally, we have no context for these points: no coastlines, state boundaries, or other geographic features to help orient the viewer.
To unlock more sophisticated spatial functionality, we need to convert our latitude and longitude columns into proper geometric objects. The DSGeo.from_latlon() method creates a new column called geometry that stores each point as a binary blob in a standardized format. This format encodes not just the coordinates but also metadata about the coordinate reference system.
metro = DSGeo.from_latlon(metro)
metro| name | geoid | quad | lon | lat | pop | density | age_median | hh_income_median | percent_own | rent_1br_median | rent_perc_income | division | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | i64 | f64 | str | binary |
| "New York" | 35620 | "NE" | -74.101056 | 40.76877 | 20.011812 | 1051.306467 | 42.9 | 86445 | 55.3 | 1430 | 31.0 | "Middle Atlantic" | b"\x01\x01\x00\x00\x000\xadY\xb2w\x86R\xc0\x13\x04\xd7\x10gbD@" |
| "Los Angeles" | 31080 | "W" | -118.148722 | 34.219406 | 13.202558 | 1040.647281 | 41.6 | 81652 | 51.3 | 1468 | 33.6 | "Pacific" | b"\x01\x01\x00\x00\x00\x1d\xadx\xa7\x84\x89]\xc0[\xf9\x8c|\x15\x1cA@" |
| "Chicago" | 16980 | "NC" | -87.95882 | 41.700605 | 9.607711 | 508.629406 | 41.9 | 78790 | 68.9 | 1060 | 29.0 | "East North Central" | b"\x01\x01\x00\x00\x00\x07\x8boM]\xfdU\xc0F\xce\x0dn\xad\xd9D@" |
| "Dallas" | 19100 | "S" | -96.970508 | 32.84948 | 7.54334 | 323.181404 | 41.3 | 76916 | 64.1 | 1106 | 29.1 | "West South Central" | b"\x01\x01\x00\x00\x00\xaf\xe3\xc9\xcc\x1c>X\xc0\x82]\x16\xc0\xbbl@@" |
| "Houston" | 26420 | "S" | -95.401574 | 29.787083 | 7.048954 | 316.543514 | 41.0 | 72551 | 65.1 | 997 | 30.0 | "West South Central" | b"\x01\x01\x00\x00\x00gd\x01c\xb3\xd9W\xc0\x87\xbbJH~\xc9=@" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Zapata" | 49820 | "S" | -99.168533 | 27.000573 | 0.013945 | 5.081383 | 41.4 | 34406 | 77.1 | 321 | 34.6 | "West South Central" | b"\x01\x01\x00\x00\x00$l\xfa=\xc9\xcaX\xc0\x15L[\x93%\x00;@" |
| "Ketchikan" | 28540 | "O" | -130.927217 | 55.582484 | 0.013939 | 1.046556 | 44.6 | 77820 | 67.5 | 946 | 31.2 | "Pacific" | b"\x01\x01\x00\x00\x00\x99\xe3\xda\xc3\xab]`\xc0\xc4\x90\x8f\xd8\x8e\xcaK@" |
| "Craig" | 18780 | "W" | -108.207523 | 40.618749 | 0.01324 | 1.077471 | 44.5 | 58583 | 72.8 | 485 | 29.0 | "Mountain" | b"\x01\x01\x00\x00\x00\x92\x88u\x0dH\x0d[\xc0\xe7Q\x16+3OD@" |
| "Vernon" | 46900 | "S" | -99.240853 | 34.080611 | 0.012887 | 5.086411 | 41.5 | 45262 | 62.0 | 503 | 22.0 | "West South Central" | b"\x01\x01\x00\x00\x00\xcf\x9a\x09!j\xcfX\xc0\xd3z\xccuQ\x0aA@" |
| "Lamesa" | 29500 | "S" | -101.947637 | 32.742488 | 0.012371 | 5.291467 | 41.1 | 42778 | 71.5 | 611 | 34.5 | "West South Central" | b"\x01\x01\x00\x00\x00oF\xe3\x15\xa6|Y\xc0R\xc7~\xdc\x09_@@" |
The DataFrame now contains a geometry column that appears as a series of binary values. While these values are not human-readable, they can be manipulated by specialized spatial functions that understand their internal structure.
16.4 Coordinate Systems (CRS)
One of the most important concepts in spatial data analysis is the coordinate reference system (CRS), sometimes called a spatial reference system. A CRS defines how coordinates map to actual locations on Earth. This might seem straightforward, but Earth’s surface is curved while maps are flat. Any attempt to represent the globe on a two-dimensional surface requires making trade-offs: you can preserve shapes or areas or distances, but never all three simultaneously.
The most common CRS for storing geographic data is WGS 84 (World Geodetic System 1984), which uses latitude and longitude measured in degrees. This is the system used by GPS devices and most web mapping services. When we created our geometry column above, the coordinates were assumed to be in WGS 84. However, for visualization and certain calculations, we often want to project our data into a different CRS that is optimized for a particular region or purpose.
Each CRS is identified by a numeric code in the EPSG (European Petroleum Survey Group) registry. For example:
- EPSG:4326 is WGS 84, the standard latitude/longitude system
- EPSG:5069 is NAD83 Conus Albers, an equal-area projection suitable for the contiguous United States
- EPSG:3857 is Web Mercator, used by Google Maps and other web mapping services
- EPSG:32618 is UTM Zone 18N, appropriate for the US East Coast
- EPSG:2154 is the official projection for Metropolitan France (including Corsica), mandated by French law
- EPSG:7791 is Italy’s modern national reference system and the most accurate for Italian data
To find the appropriate EPSG code for a specific region or purpose, you can search the EPSG.io website or consult the Spatial Reference database. These resources allow you to look up projections by name, region, or properties (such as whether they preserve area or shape).
When we plot our metro data with a specific CRS, the coordinates are transformed to reflect the chosen projection. The Albers Equal Area projection (EPSG:5069) is designed to minimize distortion for the contiguous United States, making it a good choice for visualizing data across the country.
Notice how the shape of the point cloud now more closely resembles the familiar outline of the continental United States. Alaska and Hawaii, if present in the data, would appear distorted under this projection because it is optimized for the lower 48 states.
The choice of projection can have significant implications for analysis. Consider calculating the distance between two cities. In WGS 84 coordinates, the distance between two points depends on their latitude: one degree of longitude represents about 111 kilometers at the equator but only about 85 kilometers at 40° latitude (roughly the latitude of New York). If you compute distances using raw latitude/longitude values as if they were Cartesian coordinates, your results will be systematically wrong.
Similarly, if you want to calculate the area of a state or country, you need to use an equal-area projection. Using a projection that distorts areas (like Web Mercator) will give you incorrect results. This is why Greenland appears larger than Africa on many web maps, even though Africa is actually 14 times larger.
16.5 Interactive Visualization
Static maps are useful for publications and reports, but interactive maps allow for exploration and discovery. The DSGeo.explore() function creates an interactive map powered by the Folium library, which in turn uses Leaflet.js. Users can zoom, pan, and click on features to see their attributes.
DSGeo.explore(metro, tooltip=[c.name])The tooltip parameter specifies which columns should be displayed when the user hovers over a point. Try zooming in on different regions and clicking on individual metro areas to see their names. The base map provides geographic context that was missing from our static scatter plot.
16.6 Spatial Polygons
While points are useful for representing discrete locations, many spatial phenomena are better represented as polygons: closed shapes that define the boundaries of regions. States, countries, zip codes, census tracts, and watersheds are all naturally represented as polygons. Each polygon is defined by a series of coordinate pairs that trace its boundary.
Spatial data is commonly stored in specialized file formats that can encode complex geometries along with their attributes. The GeoJSON format is particularly popular for web applications because it is based on JSON (JavaScript Object Notation) and can be read by virtually any programming language. GeoJSON files have a standardized structure where each row in the equivalent table is called a feature, and all features are collected into a FeatureCollection. The official specification is maintained by the Internet Engineering Task Force as RFC 7946.
Other common spatial formats include Shapefiles (a legacy format still widely used), GeoPackage (a modern SQLite-based format), and various database formats like PostGIS. Our DSGeo.read_file() function can read most of these formats.
state = DSGeo.read_file("data/acs_state.geojson")
state| name | abb | fips | geometry |
|---|---|---|---|
| str | str | str | binary |
| "Maine" | "ME" | "23" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… |
| "New Hampshire" | "NH" | "33" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00!\x00\x00\x00\xeb\x1ag\xd3\x11\xe0Q\xc0a\xc2hV\xb6\x81F@4\x85\xb2\xf0\xf5\xd9Q\xc0#\x88f\x9e\\x99F@`\xc7G\x8b3\xd2"… |
| "Delaware" | "DE" | "10" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x10\x00\x00\x00ip[[x\xf2R\xc0\x1a\x1aO\x04q\xdcC@\x1dr3\xdc\x80\xe7R\xc0X8I\xf3\xc7\xeaC@\xfe\xf34`\x90\xda"… |
| "South Carolina" | "SC" | "45" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x009\x00\x00\x00y\x89\x94f\xf3\xc6T\xc0\x8e\xb9\x17\x98\x15\x80A@\x9cI\xb7%r\x9dT\xc0s\x81\xcbc\xcd\x96A@\xda`S\xe7Q\x8b"… |
| "Nebraska" | "NE" | "31" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00.\x00\x00\x00\x18\xe2\xc9nf\x03Z\xc0;\xefU+\x13\x80E@R\xf3U\xf2\xb1\xb2Y\xc0\xb9S:X\xff\x7fE@\xb7\x93\x16.\xabm"… |
| … | … | … | … |
| "Florida" | "FL" | "12" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00t\x00\x00\x00Z\xd1\x90\xf1(@U\xc0%\xb2\x0f\xb2,\x00?@\x90rL\x16\xf7;U\xc0\x1du\xacRz\xe2>@)7\x19U\x86:"… |
| "Alaska" | "AK" | "02" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00k\x02\x00\x009\xaf\xce1\x20\x04e\xc0\x16\xa3\xae\xb5\xf7iP@\xa1\x16-@[\xffd\xc0\xc7\x19\xdf\x17\x97nP@k\x90e\xc1D\xf1"… |
| "New Jersey" | "NJ" | "34" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00$\x00\x00\x00T\xd9\xe8\x9c\x9f\xe0R\xc0M.\x00\x8d\xd2\xd7C@\xfe\xf34`\x90\xdaR\xc0-\x85#H\xa5\xe6C@\x8ft\xb1i\xa5\xc9"… |
| "North Dakota" | "ND" | "38" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x19\x00\x00\x00X\xabw\xb8\x1d\x03Z\xc0\xab'\xd6\xa9\xf2\x7fH@\x00\xc9t\xe8\xf4\x8dY\xc01v\xa4\xfa\xce\x7fH@\xfe\x1e-\xce\x18H"… |
| "Iowa" | "IA" | "19" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
The geometry column now contains polygon geometries rather than points. Each polygon is defined by many coordinate pairs (sometimes thousands for states with complex coastlines), but they are all stored efficiently in the binary geometry format.
Unlike point data, polygon data cannot be meaningfully visualized with a simple scatter plot. We need specialized functions that understand how to draw shapes from their vertex coordinates. The effect of projection is also much more noticeable with polygons than with points.
Compare this to what the same data would look like in unprojected coordinates. The familiar shapes of states would be distorted, particularly in the northern regions where lines of longitude converge. The Albers projection stretches and compresses the coordinates to create a more faithful representation of the true shapes and relative sizes of states.
Interactive maps work with polygons just as they do with points. Users can hover over states to see their names and other attributes.
DSGeo.explore(state, tooltip=[c.name])Points and polygons are the most common geometry types, but GeoJSON and other spatial formats support several others:
- LineString: A sequence of connected points forming a path, useful for roads, rivers, or flight routes
- MultiPoint: A collection of points treated as a single feature
- MultiLineString: Multiple disconnected line segments treated as a single feature
- MultiPolygon: Multiple disconnected polygons treated as a single feature (useful for states with islands, like Hawaii)
- GeometryCollection: A heterogeneous collection containing any mix of the above types
Most of the techniques in this chapter apply to all geometry types, though some operations (like calculating area) only make sense for polygons.
16.7 Polygon Metrics
Once we have polygon geometries, we can compute various geometric properties. Two particularly useful properties are the centroid (the geometric center of a polygon) and the area (the amount of surface enclosed by the polygon’s boundary).
The centroid of a polygon is itself a point geometry. Adding centroids to our state data gives us a representative point location for each state, which can be useful for labeling maps or for spatial operations that require point geometries.
state = DSGeo.add_centroid(state)
state| name | abb | fips | geometry | lon | lat |
|---|---|---|---|---|---|
| str | str | str | binary | f64 | f64 |
| "Maine" | "ME" | "23" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… | -69.225019 | 45.368586 |
| "New Hampshire" | "NH" | "33" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00!\x00\x00\x00\xeb\x1ag\xd3\x11\xe0Q\xc0a\xc2hV\xb6\x81F@4\x85\xb2\xf0\xf5\xd9Q\xc0#\x88f\x9e\\x99F@`\xc7G\x8b3\xd2"… | -71.579258 | 43.683295 |
| "Delaware" | "DE" | "10" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x10\x00\x00\x00ip[[x\xf2R\xc0\x1a\x1aO\x04q\xdcC@\x1dr3\xdc\x80\xe7R\xc0X8I\xf3\xc7\xeaC@\xfe\xf34`\x90\xda"… | -75.501089 | 38.988027 |
| "South Carolina" | "SC" | "45" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x009\x00\x00\x00y\x89\x94f\xf3\xc6T\xc0\x8e\xb9\x17\x98\x15\x80A@\x9cI\xb7%r\x9dT\xc0s\x81\xcbc\xcd\x96A@\xda`S\xe7Q\x8b"… | -80.892889 | 33.904653 |
| "Nebraska" | "NE" | "31" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00.\x00\x00\x00\x18\xe2\xc9nf\x03Z\xc0;\xefU+\x13\x80E@R\xf3U\xf2\xb1\xb2Y\xc0\xb9S:X\xff\x7fE@\xb7\x93\x16.\xabm"… | -99.810823 | 41.527493 |
| … | … | … | … | … | … |
| "Florida" | "FL" | "12" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00t\x00\x00\x00Z\xd1\x90\xf1(@U\xc0%\xb2\x0f\xb2,\x00?@\x90rL\x16\xf7;U\xc0\x1du\xacRz\xe2>@)7\x19U\x86:"… | -82.50424 | 28.652783 |
| "Alaska" | "AK" | "02" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00k\x02\x00\x009\xaf\xce1\x20\x04e\xc0\x16\xa3\xae\xb5\xf7iP@\xa1\x16-@[\xffd\xc0\xc7\x19\xdf\x17\x97nP@k\x90e\xc1D\xf1"… | -152.675093 | 64.489399 |
| "New Jersey" | "NJ" | "34" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00$\x00\x00\x00T\xd9\xe8\x9c\x9f\xe0R\xc0M.\x00\x8d\xd2\xd7C@\xfe\xf34`\x90\xdaR\xc0-\x85#H\xa5\xe6C@\x8ft\xb1i\xa5\xc9"… | -74.660909 | 40.182638 |
| "North Dakota" | "ND" | "38" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x19\x00\x00\x00X\xabw\xb8\x1d\x03Z\xc0\xab'\xd6\xa9\xf2\x7fH@\x00\xc9t\xe8\xf4\x8dY\xc01v\xa4\xfa\xce\x7fH@\xfe\x1e-\xce\x18H"… | -100.467458 | 47.446063 |
| "Iowa" | "IA" | "19" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… | -93.500512 | 42.07356 |
The centroid column contains point geometries representing the center of each state. For roughly rectangular states like Colorado, the centroid falls near the geographic center. For states with irregular shapes or multiple disconnected parts, the centroid might fall outside the state’s actual boundary.
Calculating area requires a projection that preserves area relationships. If we compute areas in unprojected WGS 84 coordinates, the results would be meaningless numbers that do not correspond to actual square kilometers or square miles. By specifying a CRS, we ensure that the calculated areas reflect true surface measurements.
state = DSGeo.add_area(state, crs=5069)
state.sort(c.area).select(c.name, c.area)| name | area |
|---|---|
| str | f64 |
| "Rhode Island" | 2765.092425 |
| "Delaware" | 5169.59288 |
| "Connecticut" | 12958.532624 |
| "Hawaii" | 15215.771452 |
| "New Jersey" | 20160.360977 |
| … | … |
| "New Mexico" | 314909.256017 |
| "Montana" | 380769.784499 |
| "California" | 409190.207104 |
| "Texas" | 687901.770044 |
| "Alaska" | 1.4557e6 |
The results show area in square meters (the default unit for most projected coordinate systems). Dividing by 1,000,000 would give square kilometers. As expected, Alaska is the largest state by far, followed by Texas and California.
16.8 Choropleth Maps
A choropleth map colors geographic regions according to the value of a variable. This type of visualization is extremely common for showing how quantities vary across space: election results by county, income levels by census tract, or disease rates by state. The column parameter tells the explore function which variable to use for coloring.
DSGeo.explore(state, column=c.area, tooltip=[c.name])The resulting map shows each state colored according to its area, with a legend indicating how colors correspond to values. Larger states appear darker (or whatever color scheme is in use), making spatial patterns immediately visible.
Choropleth maps are powerful but can be misleading. Because larger regions occupy more visual space, they can dominate the viewer’s attention even if they contain fewer people or are otherwise less important. For this reason, choropleth maps are best suited for variables that are naturally area-based (like population density) rather than absolute counts (like total population).
16.9 Spatial Joins
In Chapter 4, we learned how to join two tables based on a shared key column. Spatial joins extend this concept by joining tables based on their geometric relationships. Instead of matching rows where a key value is equal, we match rows where geometries intersect, touch, contain, or are within some distance of each other.
The most common spatial join matches points to the polygons that contain them. Given our metro areas (points) and states (polygons), we can determine which state contains each metro area by checking which state polygon each metro point falls inside.
DSGeo.sjoin(metro, state)| name | geoid | quad | lon | lat | pop | density | age_median | hh_income_median | percent_own | rent_1br_median | rent_perc_income | division | index__right | name__right | abb | fips | lon__right | lat__right | area | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | i64 | f64 | str | i64 | str | str | str | f64 | f64 | f64 | binary |
| "New York" | 35620 | "NE" | -74.101056 | 40.76877 | 20.011812 | 1051.306467 | 42.9 | 86445 | 55.3 | 1430 | 31.0 | "Middle Atlantic" | 47 | "New Jersey" | "NJ" | "34" | -74.660909 | 40.182638 | 20160.360977 | b"\x01\x01\x00\x00\x000\xadY\xb2w\x86R\xc0\x13\x04\xd7\x10gbD@" |
| "Los Angeles" | 31080 | "W" | -118.148722 | 34.219406 | 13.202558 | 1040.647281 | 41.6 | 81652 | 51.3 | 1468 | 33.6 | "Pacific" | 9 | "California" | "CA" | "06" | -119.61258 | 37.252557 | 409190.207104 | b"\x01\x01\x00\x00\x00\x1d\xadx\xa7\x84\x89]\xc0[\xf9\x8c|\x15\x1cA@" |
| "Chicago" | 16980 | "NC" | -87.95882 | 41.700605 | 9.607711 | 508.629406 | 41.9 | 78790 | 68.9 | 1060 | 29.0 | "East North Central" | 28 | "Illinois" | "IL" | "17" | -89.196324 | 40.061609 | 145928.761225 | b"\x01\x01\x00\x00\x00\x07\x8boM]\xfdU\xc0F\xce\x0dn\xad\xd9D@" |
| "Dallas" | 19100 | "S" | -96.970508 | 32.84948 | 7.54334 | 323.181404 | 41.3 | 76916 | 64.1 | 1106 | 29.1 | "West South Central" | 8 | "Texas" | "TX" | "48" | -99.349969 | 31.484882 | 687901.770044 | b"\x01\x01\x00\x00\x00\xaf\xe3\xc9\xcc\x1c>X\xc0\x82]\x16\xc0\xbbl@@" |
| "Houston" | 26420 | "S" | -95.401574 | 29.787083 | 7.048954 | 316.543514 | 41.0 | 72551 | 65.1 | 997 | 30.0 | "West South Central" | 8 | "Texas" | "TX" | "48" | -99.349969 | 31.484882 | 687901.770044 | b"\x01\x01\x00\x00\x00gd\x01c\xb3\xd9W\xc0\x87\xbbJH~\xc9=@" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Zapata" | 49820 | "S" | -99.168533 | 27.000573 | 0.013945 | 5.081383 | 41.4 | 34406 | 77.1 | 321 | 34.6 | "West South Central" | 8 | "Texas" | "TX" | "48" | -99.349969 | 31.484882 | 687901.770044 | b"\x01\x01\x00\x00\x00$l\xfa=\xc9\xcaX\xc0\x15L[\x93%\x00;@" |
| "Ketchikan" | 28540 | "O" | -130.927217 | 55.582484 | 0.013939 | 1.046556 | 44.6 | 77820 | 67.5 | 946 | 31.2 | "Pacific" | 46 | "Alaska" | "AK" | "02" | -152.675093 | 64.489399 | 1.4557e6 | b"\x01\x01\x00\x00\x00\x99\xe3\xda\xc3\xab]`\xc0\xc4\x90\x8f\xd8\x8e\xcaK@" |
| "Craig" | 18780 | "W" | -108.207523 | 40.618749 | 0.01324 | 1.077471 | 44.5 | 58583 | 72.8 | 485 | 29.0 | "Mountain" | 19 | "Colorado" | "CO" | "08" | -105.54769 | 38.997652 | 269556.293547 | b"\x01\x01\x00\x00\x00\x92\x88u\x0dH\x0d[\xc0\xe7Q\x16+3OD@" |
| "Vernon" | 46900 | "S" | -99.240853 | 34.080611 | 0.012887 | 5.086411 | 41.5 | 45262 | 62.0 | 503 | 22.0 | "West South Central" | 8 | "Texas" | "TX" | "48" | -99.349969 | 31.484882 | 687901.770044 | b"\x01\x01\x00\x00\x00\xcf\x9a\x09!j\xcfX\xc0\xd3z\xccuQ\x0aA@" |
| "Lamesa" | 29500 | "S" | -101.947637 | 32.742488 | 0.012371 | 5.291467 | 41.1 | 42778 | 71.5 | 611 | 34.5 | "West South Central" | 8 | "Texas" | "TX" | "48" | -99.349969 | 31.484882 | 687901.770044 | b"\x01\x01\x00\x00\x00oF\xe3\x15\xa6|Y\xc0R\xc7~\xdc\x09_@@" |
The result contains all columns from both DataFrames, with each metro area matched to the state it falls within. Metro areas that span state boundaries will be assigned to whichever state contains their center point (the coordinate we used when creating the point geometry). Notice that the geometry in the output is the point geometry from the first (left) DataFrame.
We can also reverse the order of the join to get state geometry in the result. When we put the states first, the output contains polygon geometries, with each state matched to the metro areas it contains.
DSGeo.sjoin(state, metro)| name | abb | fips | lon | lat | area | index__right | name__right | geoid | quad | lon__right | lat__right | pop | density | age_median | hh_income_median | percent_own | rent_1br_median | rent_perc_income | division | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | i64 | str | i64 | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | i64 | f64 | str | binary |
| "Maine" | "ME" | "23" | -69.225019 | 45.368586 | 85262.523591 | 104 | "Portland" | 38860 | "NE" | -70.470321 | 43.695543 | 0.547792 | 93.194683 | 43.8 | 77759 | 76.8 | 968 | 28.9 | "New England" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… |
| "Maine" | "ME" | "23" | -69.225019 | 45.368586 | 85262.523591 | 365 | "Lewiston" | 30340 | "NE" | -70.206499 | 44.16555 | 0.110378 | 85.911812 | 43.4 | 59287 | 70.1 | 674 | 28.0 | "New England" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… |
| "Maine" | "ME" | "23" | -69.225019 | 45.368586 | 85262.523591 | 337 | "Augusta" | 12300 | "NE" | -69.767669 | 44.408937 | 0.123293 | 50.14378 | 44.9 | 58097 | 76.4 | 663 | 29.1 | "New England" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… |
| "Maine" | "ME" | "23" | -69.225019 | 45.368586 | 85262.523591 | 287 | "Bangor" | 12620 | "NE" | -68.650949 | 45.397272 | 0.152211 | 16.55587 | 42.7 | 55125 | 73.8 | 720 | 29.0 | "New England" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… |
| "New Hampshire" | "NH" | "33" | -71.579258 | 43.683295 | 24037.263031 | 130 | "Manchester" | 31700 | "NE" | -71.715762 | 42.915227 | 0.420504 | 182.271376 | 44.1 | 86930 | 70.8 | 1009 | 29.0 | "New England" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00!\x00\x00\x00\xeb\x1ag\xd3\x11\xe0Q\xc0a\xc2hV\xb6\x81F@4\x85\xb2\xf0\xf5\xd9Q\xc0#\x88f\x9e\\x99F@`\xc7G\x8b3\xd2"… |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 900 | "Storm Lake" | 44740 | "NC" | -95.151132 | 42.735419 | 0.020723 | 13.813099 | 41.9 | 53645 | 74.4 | 555 | 23.6 | "West North Central" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 923 | "Spencer" | 43980 | "NC" | -95.150936 | 43.082492 | 0.01641 | 11.083706 | 46.4 | 52307 | 73.1 | 571 | 27.2 | "West North Central" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 596 | "Mason City" | 32380 | "NC" | -93.26083 | 43.203198 | 0.050635 | 20.045526 | 43.9 | 58753 | 74.7 | 576 | 23.5 | "West North Central" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 917 | "Spirit Lake" | 44020 | "NC" | -95.150806 | 43.377979 | 0.017536 | 16.791337 | 44.6 | 65215 | 83.1 | 770 | 30.0 | "West North Central" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 260 | "Waterloo" | 47940 | "NC" | -92.471855 | 42.536019 | 0.168595 | 43.066768 | 39.2 | 61632 | 71.2 | 625 | 27.9 | "West North Central" | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
This produces more rows than we have states because each state can contain multiple metro areas. The logic is analogous to a one-to-many relational join: the state geometry is duplicated for each metro area it contains.
Spatial joins can use various predicates (spatial relationships) beyond simple containment. The predicate parameter specifies which relationship to test. For example, we can find which states share a border with which other states using the touches predicate.
DSGeo.sjoin(state, state, predicate="touches")| name | abb | fips | lon | lat | area | index__right | name__right | abb__right | fips__right | lon__right | lat__right | area__right | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | i64 | str | str | str | f64 | f64 | f64 | binary |
| "Maine" | "ME" | "23" | -69.225019 | 45.368586 | 85262.523591 | 1 | "New Hampshire" | "NH" | "33" | -71.579258 | 43.683295 | 24037.263031 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00\x93\x15\xa7Z\x0b\xadQ\xc0\x10>yX\xa8\x87E@y\x15\x8cJ\xea\xb4Q\xc0\xb4\xe3\xa2ZD\x90E@\xda\xe2p\xe6W\xb4"… |
| "New Hampshire" | "NH" | "33" | -71.579258 | 43.683295 | 24037.263031 | 43 | "Massachusetts" | "MA" | "25" | -71.844911 | 42.278432 | 20631.817699 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00!\x00\x00\x00\xeb\x1ag\xd3\x11\xe0Q\xc0a\xc2hV\xb6\x81F@4\x85\xb2\xf0\xf5\xd9Q\xc0#\x88f\x9e\\x99F@`\xc7G\x8b3\xd2"… |
| "New Hampshire" | "NH" | "33" | -71.579258 | 43.683295 | 24037.263031 | 29 | "Vermont" | "VT" | "50" | -72.66144 | 44.078034 | 24839.475007 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00!\x00\x00\x00\xeb\x1ag\xd3\x11\xe0Q\xc0a\xc2hV\xb6\x81F@4\x85\xb2\xf0\xf5\xd9Q\xc0#\x88f\x9e\\x99F@`\xc7G\x8b3\xd2"… |
| "New Hampshire" | "NH" | "33" | -71.579258 | 43.683295 | 24037.263031 | 0 | "Maine" | "ME" | "23" | -69.225019 | 45.368586 | 85262.523591 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00!\x00\x00\x00\xeb\x1ag\xd3\x11\xe0Q\xc0a\xc2hV\xb6\x81F@4\x85\xb2\xf0\xf5\xd9Q\xc0#\x88f\x9e\\x99F@`\xc7G\x8b3\xd2"… |
| "Delaware" | "DE" | "10" | -75.501089 | 38.988027 | 5169.59288 | 33 | "Maryland" | "MD" | "24" | -76.773391 | 39.040321 | 26772.630334 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x10\x00\x00\x00ip[[x\xf2R\xc0\x1a\x1aO\x04q\xdcC@\x1dr3\xdc\x80\xe7R\xc0X8I\xf3\xc7\xeaC@\xfe\xf34`\x90\xda"… |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 37 | "Missouri" | "MO" | "29" | -92.477487 | 38.367533 | 180621.339618 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 28 | "Illinois" | "IL" | "17" | -89.196324 | 40.061609 | 145928.761225 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 14 | "Wisconsin" | "WI" | "55" | -90.012344 | 44.630868 | 145016.280762 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 32 | "Minnesota" | "MN" | "27" | -94.309123 | 46.315559 | 218497.340682 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
| "Iowa" | "IA" | "19" | -93.500512 | 42.07356 | 145800.681715 | 7 | "South Dakota" | "SD" | "46" | -100.232438 | 44.436377 | 199692.46709 | b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00<\x00\x00\x00O%;6\x02\x1dX\xc0\x7f\x9f\x8e\xc7\x0c\xc0E@F\xb5\xa5\x0e\xf2\xe0W\xc0]\xb5\x89\x93\xfb\xbfE@\xee\x10\x8d\xee\x20~"… |
This self-join matches each state to every other state that shares at least one point on its boundary. The result shows all pairs of neighboring states. Notice that each border appears twice: California touches Oregon, and Oregon touches California. If you only want each pair once, you would need to filter or deduplicate the results.
Other useful predicates include:
- intersects: Geometries share any space (including boundaries)
- contains: The first geometry completely encloses the second
- within: The first geometry is completely enclosed by the second
- crosses: Geometries share some but not all interior points
- overlaps: Geometries share some space but neither contains the other
16.10 Nearest Neighbor Joins
Sometimes we want to match features not based on overlap or containment but based on proximity. The sjoin_nearest function finds, for each feature in the first DataFrame, the closest feature in the second DataFrame. This is useful for questions like “What is the nearest hospital to each school?” or “Which weather station is closest to each measurement location?”
DSGeo.sjoin_nearest(metro, metro, exclusive=True)| name | geoid | quad | lon | lat | pop | density | age_median | hh_income_median | percent_own | rent_1br_median | rent_perc_income | division | index_right | name_right | geoid_right | quad_right | lon_right | lat_right | pop_right | density_right | age_median_right | hh_income_median_right | percent_own_right | rent_1br_median_right | rent_perc_income_right | division_right | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | i64 | f64 | str | i64 | str | i64 | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | i64 | f64 | str | binary |
| "New York" | 35620 | "NE" | -74.101056 | 40.76877 | 20.011812 | 1051.306467 | 42.9 | 86445 | 55.3 | 1430 | 31.0 | "Middle Atlantic" | 141 | "Trenton" | 45940 | "NE" | -74.701703 | 40.283417 | 0.384951 | 650.172055 | 44.7 | 85687 | 65.7 | 1126 | 30.2 | "Middle Atlantic" | b"\x01\x01\x00\x00\x000\xadY\xb2w\x86R\xc0\x13\x04\xd7\x10gbD@" |
| "Los Angeles" | 31080 | "W" | -118.148722 | 34.219406 | 13.202558 | 1040.647281 | 41.6 | 81652 | 51.3 | 1468 | 33.6 | "Pacific" | 70 | "Oxnard" | 37100 | "W" | -119.083405 | 34.45556 | 0.845255 | 175.81324 | 42.1 | 94150 | 63.3 | 1582 | 34.2 | "Pacific" | b"\x01\x01\x00\x00\x00\x1d\xadx\xa7\x84\x89]\xc0[\xf9\x8c|\x15\x1cA@" |
| "Chicago" | 16980 | "NC" | -87.95882 | 41.700605 | 9.607711 | 508.629406 | 41.9 | 78790 | 68.9 | 1060 | 29.0 | "East North Central" | 368 | "Kankakee" | 28100 | "NC" | -87.861944 | 41.137788 | 0.108104 | 61.34992 | 41.8 | 61664 | 70.4 | 701 | 30.5 | "East North Central" | b"\x01\x01\x00\x00\x00\x07\x8boM]\xfdU\xc0F\xce\x0dn\xad\xd9D@" |
| "Dallas" | 19100 | "S" | -96.970508 | 32.84948 | 7.54334 | 323.181404 | 41.3 | 76916 | 64.1 | 1106 | 29.1 | "West South Central" | 685 | "Gainesville" | 23620 | "S" | -97.212584 | 33.639178 | 0.041215 | 17.700303 | 42.5 | 63338 | 69.6 | 666 | 25.0 | "West South Central" | b"\x01\x01\x00\x00\x00\xaf\xe3\xc9\xcc\x1c>X\xc0\x82]\x16\xc0\xbbl@@" |
| "Houston" | 26420 | "S" | -95.401574 | 29.787083 | 7.048954 | 316.543514 | 41.0 | 72551 | 65.1 | 997 | 30.0 | "West South Central" | 683 | "El Campo" | 20900 | "S" | -96.222207 | 29.277782 | 0.041602 | 14.65937 | 41.8 | 53963 | 69.2 | 600 | 26.3 | "West South Central" | b"\x01\x01\x00\x00\x00gd\x01c\xb3\xd9W\xc0\x87\xbbJH~\xc9=@" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "Zapata" | 49820 | "S" | -99.168533 | 27.000573 | 0.013945 | 5.081383 | 41.4 | 34406 | 77.1 | 321 | 34.6 | "West South Central" | 501 | "Rio Grande City" | 40100 | "S" | -98.738748 | 26.562072 | 0.065568 | 20.57443 | 40.5 | 33334 | 75.3 | 390 | 32.5 | "West South Central" | b"\x01\x01\x00\x00\x00$l\xfa=\xc9\xcaX\xc0\x15L[\x93%\x00;@" |
| "Ketchikan" | 28540 | "O" | -130.927217 | 55.582484 | 0.013939 | 1.046556 | 44.6 | 77820 | 67.5 | 946 | 31.2 | "Pacific" | 799 | "Juneau" | 27940 | "O" | -134.169542 | 58.45342 | 0.03224 | 4.445802 | 41.8 | 90126 | 68.5 | 1095 | 25.0 | "Pacific" | b"\x01\x01\x00\x00\x00\x99\xe3\xda\xc3\xab]`\xc0\xc4\x90\x8f\xd8\x8e\xcaK@" |
| "Craig" | 18780 | "W" | -108.207523 | 40.618749 | 0.01324 | 1.077471 | 44.5 | 58583 | 72.8 | 485 | 29.0 | "Mountain" | 864 | "Steamboat Springs" | 44460 | "W" | -106.990829 | 40.484142 | 0.024899 | 4.063905 | 41.4 | 83725 | 76.2 | 1371 | 29.7 | "Mountain" | b"\x01\x01\x00\x00\x00\x92\x88u\x0dH\x0d[\xc0\xe7Q\x16+3OD@" |
| "Vernon" | 46900 | "S" | -99.240853 | 34.080611 | 0.012887 | 5.086411 | 41.5 | 45262 | 62.0 | 503 | 22.0 | "West South Central" | 861 | "Altus" | 11060 | "S" | -99.414999 | 34.587933 | 0.02496 | 11.981281 | 40.8 | 55551 | 62.7 | 489 | 23.4 | "West South Central" | b"\x01\x01\x00\x00\x00\xcf\x9a\x09!j\xcfX\xc0\xd3z\xccuQ\x0aA@" |
| "Lamesa" | 29500 | "S" | -101.947637 | 32.742488 | 0.012371 | 5.291467 | 41.1 | 42778 | 71.5 | 611 | 34.5 | "West South Central" | 253 | "Midland" | 33260 | "S" | -101.991236 | 32.08914 | 0.172177 | 36.548597 | 38.9 | 87812 | 70.7 | 1128 | 26.8 | "West South Central" | b"\x01\x01\x00\x00\x00oF\xe3\x15\xa6|Y\xc0R\xc7~\xdc\x09_@@" |
Here we join the metro dataset to itself to find each metro area’s nearest neighbor. The exclusive=True parameter prevents each metro area from matching with itself (which would always be distance zero). The result shows pairs of metro areas that are nearest neighbors, along with the distance between them.
Nearest neighbor joins are computationally expensive because they potentially require comparing every feature in the first DataFrame to every feature in the second. For large datasets, this can be slow. Spatial libraries use sophisticated indexing structures (like R-trees) to speed up these queries, but they still scale less favorably than simple predicate-based joins.
16.11 Best Practices
Working with spatial data introduces several potential sources of error that do not arise with ordinary tabular data.
CRS mismatches: If two datasets use different coordinate reference systems, spatial operations will give incorrect results. Points might appear to be in the wrong location, or spatial joins might fail to find matches. Always verify that your datasets are in compatible CRSs before performing spatial operations, and transform them to a common CRS if necessary.
Unprojected coordinates for calculations: Never calculate distances or areas using raw latitude/longitude coordinates as if they were Cartesian. The errors can be substantial, especially for features that span large areas or are located far from the equator. Always project your data to an appropriate CRS first.
Polygon validity: Complex polygons can sometimes become geometrically invalid due to self-intersections or other topological errors. Many spatial functions will fail or give incorrect results on invalid geometries. If you encounter strange errors, check whether your geometries are valid and repair them if necessary.
Boundary effects: Features that span boundaries between regions can cause problems for spatial joins. A metro area that straddles a state line might be assigned to one state or the other depending on where its center point falls. Be aware of these edge cases and handle them appropriately for your analysis.


