22 Datasets
22.1 Setup
Load all of the modules and datasets needed for the chapter. In each of the sections below we briefly present the datasets used in this text and the supplemental materials. The glimpse method is used to show all of the column names, data types, and the first few rows of each dataset.
import numpy as np
import polars as pl
from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())
22.2 Countries
Sourced from GapMinder and WikiData, the countries dataset provides a snapshot of 135 nations, identifying each by its full standard name and three-letter ISO code. Geographically, entries are categorized into broad regions and specific subregions, accompanied by precise latitude and longitude coordinates. The data captures essential socioeconomic health through metrics such as total population (in millions), life expectancy, and the Human Development Index (HDI). Economic conditions are represented by GDP figures and the Gini coefficient, which measures income inequality, while broader well-being is gauged via a happiness index. Additionally, the dataset includes infrastructure and cultural details, specifically tracking cellphone adoption rates, the percentage of the population with access to improved water sources, and the primary languages spoken.
country = pl.read_csv("data/countries.csv")
country.glimpse()
Rows: 135
Columns: 15
$ iso <str> 'SEN', 'VEN', 'FIN', 'USA', 'LKA', 'DOM', 'SGP', 'GAB', 'BGR', 'TZA'
$ full_name <str> 'Senegal', 'Venezuela, Bolivarian Republic of', 'Finland', 'United States of America', 'Sri Lanka', 'Dominican Republic', 'Singapore', 'Gabon', 'Bulgaria', 'Tanzania, United Republic of'
$ region <str> 'Africa', 'Americas', 'Europe', 'Americas', 'Asia', 'Americas', 'Asia', 'Africa', 'Europe', 'Africa'
$ subregion <str> 'Western Africa', 'South America', 'Northern Europe', 'Northern America', 'Southern Asia', 'Caribbean', 'South-eastern Asia', 'Middle Africa', 'Eastern Europe', 'Eastern Africa'
$ pop <f64> 18.932, 28.517, 5.623, 347.276, 23.229, 11.52, 5.871, 2.593, 6.715, 70.546
$ lexp <f64> 70.43, 76.18, 82.84, 79.83, 78.51, 74.35, 85.63, 68.68, 74.33, 68.59
$ lat <f64> 14.366667, 8.0, 65.0, 39.828175, 7.0, 18.8, 1.3, -0.683330555, 42.75, -6.306944444
$ lon <f64> -14.283333, -67.0, 27.0, -98.5795, 81.0, -70.2, 103.8, 11.5, 25.5, 34.853888888
$ hdi <f64> 0.53, 0.709, 0.948, 0.938, 0.776, 0.776, 0.946, 0.733, 0.845, 0.555
$ gdp <i64> 4871, 8899, 57574, 78389, 14380, 25663, 137906, 19543, 36211, 3924
$ gini <f64> 38.1, 44.8, 27.7, 47.7, 39.3, 39.6, null, 38.0, 40.3, 40.5
$ happy <f64> 50.93, 57.65, 76.99, 65.21, 36.02, 59.21, 66.54, 51.04, 55.9, 40.42
$ cellphone <f64> 66.0, 96.8, 156.4, 91.7, 83.1, 90.6, 145.5, 93.6, 137.1, 46.9
$ water_access <f64> 54.93987, 95.66913, 99.44798, 99.72235, 90.77437, 86.1939, 100.0, 49.20331, 86.00395, 26.78297
$ lang <str> 'pbp|fra|wol', 'spa|vsl', 'fin|swe', 'eng', 'sin|sin|tam|tam', 'spa', 'eng|msa|cmn|tam', 'fra', 'bul', 'eng|swa'
Also sourced from GapMinder, the cellphone dataset is a longitudinal record containing 3,480 observations that track the adoption of mobile technology over time. Unlike the previous cross-sectional dataset, this table uses a time-series format, recording data for specific nations identified by their three-letter iso codes across multiple years. The primary metric, cell, quantifies mobile phone subscriptions (expressed per 100 people), allowing for the analysis of growth trends and technological saturation within different countries over the recorded period.
cellphone = pl.read_csv("data/countries_cellphone.csv")
cellphone.glimpse()
Rows: 3480
Columns: 3
$ iso <str> 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG'
$ year <i64> 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012
$ cell <f64> 0.87978, 2.54662, 4.91711, 9.9133, 18.0167, 29.8268, 38.2289, 36.1187, 47.0152, 50.1967
Sourced from Wikidata, the borders dataset provides a relational map of international boundaries, containing 829 entries that define connections between nations. Each row represents a single land border, linking a primary country (iso) to one of its adjacent neighbors (iso_neighbor) using their three-letter ISO codes. Because a single country often shares borders with multiple neighbors, the iso column contains repeated values, effectively creating an adjacency list that allows for the analysis of geographic clustering, continent connectivity, and geopolitical relationships.
border = pl.read_csv("data/countries_borders.csv")
border.glimpse()
Rows: 829
Columns: 2
$ iso <str> 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AGO', 'AGO', 'AGO', 'AGO'
$ iso_neighbor <str> 'IRN', 'PAK', 'CHN', 'TJK', 'TKM', 'UZB', 'COD', 'GAB', 'NAM', 'COG'
A spatial dataset from world-geojson provides polygon boundaries for all of the world's countries. Note that this is a complete set of all countries listed by the U.N., whereas some of the data from GapMinder is filtered to exclude countries with a large amount of missing demographic data.
country_geo = DSGeo.read_file("data/countries_polygons.geojson")
country_geo.drop(c.geometry).glimpse()
Rows: 199
Columns: 3
$ iso <str> 'AFG', 'ALB', 'DZA', 'AND', 'AGO', 'ATG', 'ARG', 'ARM', 'AUS', 'AUT'
$ iso_a2 <str> 'AF', 'AL', 'DZ', 'AD', 'AO', 'AG', 'AR', 'AM', 'AU', 'AT'
$ name <str> 'Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua And Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria'
A supplementary dataset from Simple Maps provides the locations and populations of world cities, which can be associated with the countries dataset by using spatial joins (data released under CC BY 4.0).
city = pl.read_csv("data/countries_cities.csv")
city.glimpse()
Rows: 47808
Columns: 4
$ city <str> 'Tokyo', 'Jakarta', 'Delhi', 'Guangzhou', 'Mumbai', 'Manila', 'Shanghai', 'São Paulo', 'Seoul', 'Mexico City'
$ lat <f64> 35.687, -6.175, 28.61, 23.13, 19.0761, 14.5958, 31.2286, -23.5504, 37.5667, 19.4333
$ lon <f64> 139.7495, 106.8275, 77.23, 113.26, 72.8775, 120.9772, 121.4747, -46.6339, 126.9833, -99.1333
$ pop <i64> 37785000, 33756000, 32226000, 26940000, 24973000, 24922000, 24073000, 23086000, 23016000, 21804000
22.3 Food Items
The food dataset profiles 61 common culinary items, providing a comprehensive nutritional and descriptive breakdown for each. It categorizes items into broad food_group classifications (such as fruits, vegetables, grains, and meats) and details their dietary composition through macronutrients—including total and saturated fats, carbohydrates, sugar, fiber, and protein—as well as cholesterol and calorie counts. The dataset also tracks micronutrient content, specifically sodium, iron, and vitamins A and C. Beyond nutritional metrics, the table includes metadata sourced from WikiData, such as a URL slug (wiki), a textual description defining the item, and its primary visual color.
food = pl.read_csv("data/food.csv")
food.glimpse()
Rows: 61
Columns: 17
$ item <str> 'Apple', 'Asparagus', 'Avocado', 'Banana', 'Chickpea', 'String Bean', 'Beef', 'Bell Pepper', 'Crab', 'Broccoli'
$ food_group <str> 'fruit', 'vegetable', 'fruit', 'fruit', 'grains', 'vegetable', 'meat', 'vegetable', 'fish', 'vegetable'
$ calories <i64> 52, 20, 160, 89, 180, 31, 288, 26, 87, 34
$ total_fat <f64> 0.1, 0.1, 14.6, 0.3, 2.9, 0.1, 19.5, 0.0, 1.0, 0.3
$ sat_fat <f64> 0.028, 0.046, 2.126, 0.112, 0.309, 0.026, 7.731, 0.059, 0.222, 0.039
$ cholesterol <i64> 0, 0, 0, 0, 0, 0, 87, 0, 78, 0
$ sodium <i64> 1, 2, 7, 1, 243, 6, 384, 2, 293, 33
$ carbs <f64> 13.81, 3.88, 8.53, 22.84, 29.98, 7.13, 0.0, 6.03, 0.04, 6.64
$ fiber <f64> 2.4, 2.1, 6.7, 2.6, 8.6, 3.4, 0.0, 2.0, 0.0, 2.6
$ sugar <f64> 10.39, 1.88, 0.66, 12.23, 5.29, 1.4, 0.0, 4.2, 0.0, 1.7
$ protein <f64> 0.26, 2.2, 2.0, 1.09, 9.54, 1.82, 26.33, 0.99, 18.06, 2.82
$ iron <i64> 1, 12, 3, 1, 17, 6, 15, 2, 4, 4
$ vitamin_a <i64> 1, 15, 3, 1, 0, 14, 0, 63, 0, 12
$ vitamin_c <i64> 8, 9, 17, 15, 3, 27, 0, 317, 5, 149
$ wiki <str> 'apple', 'asparagus', 'avocado', 'banana', 'chickpea', 'green_bean', 'beef', 'bell_pepper', 'callinectes_sapidus', 'broccoli'
$ description <str> 'A common, round fruit produced by the tree <i>Malus domestica</i>, cultivated in temperate climates.', 'Any of various perennial plants of the genus <i>Asparagus</i> having leaflike stems, scalelike leaves, and small flowers.', 'The large, usually yellowish-green or black, pulpy fruit of the avocado tree.', 'An elongated curved tropical fruit that grows in bunches and has a creamy flesh and a smooth skin.', 'An annual Asian plant (<i>Cicer arietinum</i>) in the pea family, widely cultivated for the edible seeds in its short inflated pods.', 'A long, slender variety of green bean.', 'The meat from a cow, bull or other bovine.', '<i>Capsicum annuum</i>, an edible spicy-sweet fruit, originating in the New World.', 'A crustacean of the infraorder <i>Brachyura</i>, having five pairs of legs, the foremost of which are in the form of claws, and a carapace.', 'A plant, <i>Brassica oleracea var. italica</i>, of the cabbage family, Brassicaceae; especially, the tree-shaped flower and stalk that are eaten as a vegetable.'
$ color <str> 'red', 'green', 'green', 'yellow', 'brown', 'green', 'red', 'green', 'red', 'green'
The diet dataset is a small reference table containing 6 rows that define dietary compliance for major food groups. It links broad food_group categories (such as fruit, vegetable, grains, meat, fish, and dairy) to specific restrictive diets. Boolean-style columns (yes/no) indicate whether each group is permissible within vegan, vegetarian, and pescatarian lifestyles, effectively serving as a lookup table for filtering food items based on dietary restrictions.
diet = pl.read_csv("data/food_diet_restrictions.csv")
diet.glimpse()
Rows: 6
Columns: 4
$ food_group <str> 'fruit', 'vegetable', 'grains', 'meat', 'fish', 'dairy'
$ vegan <str> 'yes', 'yes', 'yes', 'no', 'no', 'no'
$ vegetarian <str> 'yes', 'yes', 'yes', 'no', 'no', 'yes'
$ pescatarian <str> 'yes', 'yes', 'yes', 'no', 'yes', 'yes'
The recipe dataset provides a structural breakdown of culinary dishes, listing the specific components required to prepare them. Organized in a “long” format, each row represents a single ingredient for a given recipe, rather than a single row per dish. This means complex recipes like “Pot Roast” or “Guacamole” appear across multiple lines, each detailing a constituent item and its corresponding amount (in grams). This granular structure facilitates the aggregation of nutritional data by allowing individual ingredients to be linked back to detailed food profiles.
recipe = pl.read_csv("data/food_recipes.csv")
recipe.glimpse()
Rows: 10
Columns: 3
$ recipe <str> 'Pot Roast', 'Pot Roast', 'Pot Roast', 'Pot Roast', 'Pot Roast', 'Pot Roast', 'Guacamole', 'Guacamole', 'Guacamole', 'Guacamole'
$ ingredient <str> 'Beef', 'Carrot', 'Potato', 'Onion', 'Tomato', 'Bay Leaf', 'Avocado', 'Onion', 'Tomato', 'Lime'
$ amount <i64> 1200, 400, 1000, 500, 200, 5, 1000, 500, 500, 150
22.4 Majors and Salary
Sourced from the U.S. Bureau of Labor Statistics, the major dataset offers a high-resolution profile of the earnings distribution for various undergraduate fields of study. Unlike summary tables that report only a median income, this dataset uses a long-format structure to trace the entire salary curve, containing 8,316 rows that correspond to 99 percentile ranks for each of 84 distinct majors. For each major, the data lists the percentile (from 1 to 99) and the associated earnings value at that rank. This granular approach allows for a deeper analysis of financial outcomes, enabling comparisons of income inequality within fields and assessing the risk-reward profiles—such as the reliable “floor” versus the potential “ceiling” of wages—across different career paths.
major = pl.read_csv("data/majors.csv")
major.glimpse()
Rows: 8316
Columns: 3
$ major <str> 'Accounting', 'Accounting', 'Accounting', 'Accounting', 'Accounting', 'Accounting', 'Accounting', 'Accounting', 'Accounting', 'Accounting'
$ percentile <i64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ earnings <f64> 1.538733, 1.673857, 1.760265, 1.816639, 1.872135, 1.933803, 1.980822, 2.019701, 2.059812, 2.090352
22.5 Criterion Films
The film dataset contains 1,479 entries from the Criterion Collection, a prestigious home-video distribution company dedicated to preserving and publishing “important classic and contemporary films” from around the world. Often regarded as a canon of cinema as an art form, the collection includes technically restored and historically significant works.
The dataset identifies each film by its standard title, release year, and unique imdb_id. It captures the creative backbone of each work through columns for directors, writers, and genre classifications, alongside production details like the country of origin, primary languages, and runtime. Critical reception is well-documented with aggregated scores from IMDb (including vote counts), Rotten Tomatoes, and Metacritic. Additionally, the table is enriched with encyclopedic context via Wikipedia extracts and descriptions, and occasionally includes financial metrics like production budgets and box office returns.
film = pl.read_csv("data/criterion.csv")
film.glimpse()
Rows: 1479
Columns: 18
$ imdb_id <str> 'tt0012349', 'tt0012364', 'tt0013257', 'tt0014429', 'tt0014624', 'tt0015634', 'tt0015768', 'tt0015841', 'tt0016142', 'tt0017075'
$ title <str> 'The Kid', 'The Phantom Carriage', 'Häxan', 'Safety Last!', 'A Woman of Paris: A Drama of Fate', 'Body and Soul', 'Master of the House', 'The Freshman', 'The Mystic', 'The Lodger: A Story of the London Fog'
$ year <i64> 1921, 1921, 1922, 1923, 1923, 1925, 1925, 1925, 1925, 1927
$ language <str> 'None|English', 'None|Swedish', 'Swedish|Danish', 'English', 'English', 'English', 'Danish', 'None|English', 'None|English', 'None'
$ genre <str> 'Comedy|Drama|Family', 'Drama|Fantasy|Horror', 'Documentary|Fantasy|Horror', 'Action|Comedy|Thriller', 'Drama|Romance', 'Crime|Drama|Thriller', 'Comedy|Drama', 'Comedy|Family|Romance', 'Drama', 'Crime|Drama|Mystery'
$ director <str> 'Charles Chaplin', 'Victor Sjöström', 'Benjamin Christensen', 'Fred C. Newmeyer|Sam Taylor', 'Charles Chaplin', 'Oscar Micheaux', 'Carl Theodor Dreyer', 'Fred C. Newmeyer|Sam Taylor', 'Tod Browning', 'Alfred Hitchcock'
$ writer <str> 'Charles Chaplin', 'Selma Lagerlöf|Victor Sjöström', 'Benjamin Christensen', 'Hal Roach|Sam Taylor|Tim Whelan', 'Charles Chaplin', 'Oscar Micheaux', 'Carl Theodor Dreyer|Svend Rindom', 'Sam Taylor|Ted Wilde|John Grey', 'Tod Browning|Waldemar Young', 'Marie Belloc Lowndes|Eliot Stannard|Alfred Hitchcock'
$ country <str> 'United States', 'Sweden', 'Sweden|Denmark', 'United States', 'United States', 'United States', 'Denmark', 'United States', 'United States', 'United Kingdom'
$ imdb_votes <i64> 142797, 15311, 18391, 23503, 6548, 1221, 2506, 6373, 489, 14371
$ rating_imdb <f64> 8.2, 8.0, 7.6, 8.1, 6.9, 6.2, 7.0, 7.5, 6.7, 7.3
$ rating_rt <i64> 100, 100, 93, 97, 94, null, 100, 95, null, 96
$ rating_mc <i64> null, null, null, null, 76, null, null, null, null, 82
$ runtime_raw <i64> 68, 106, 107, 73, 84, 102, 107, 76, 70, 70
$ wikipedia_pageid <i64> 1346905, 7329426, 3644898, 76313, 546663, 1506585, 11072916, 3831825, 17325678, 287408
$ wikipedia_description <str> '1921 silent film by Charlie Chaplin', '1921 film by Victor Sjöström', 'Swedish 1922 silent horror essay film', '1923 American silent romantic comedy film', '1923 drama film by Charlie Chaplin', '1925 film directed by Oscar Micheaux', '1925 film by Carl Theodor Dreyer', '1925 film', '1925 film', '1927 silent film by Alfred Hitchcock'
$ wikipedia_extract <str> "The Kid is a 1921 American silent comedy-drama film written, produced, directed by and starring Charlie Chaplin, and features Jackie Coogan as his foundling baby, adopted son and sidekick. This was Chaplin's first full-length film as a director. It was a huge success and was the second-highest-grossing film in 1921. Now considered one of the greatest films of the silent era, it was selected for preservation in the United States National Film Registry by the Library of Congress in 2011.", "The Phantom Carriage is a 1921 Swedish silent film directed by and starring Victor Sjöström, based on the 1912 novel Thy Soul Shall Bear Witness! (Körkarlen) by Swedish author Selma Lagerlöf. In the film, Sjöström plays a drunkard named David Holm who, on the night of New Year's Eve, is compelled by the ghostly driver of Death's carriage to reflect on his past mistakes. Alongside Sjöström, the film's cast includes Hilda Borgström, Tore Svennberg, and Astrid Holm.", "Häxan is a 1922 Swedish-Danish silent horror essay film written and directed by Benjamin Christensen. Consisting partly of documentary-style storytelling as well as dramatized narrative sequences, the film purports to chart the historical roots and superstitions surrounding witchcraft, beginning in the Middle Ages through the 20th century. Based partly on Christensen's own study of the Malleus Maleficarum, a 15th-century German guide for inquisitors, Häxan proposes that such witch-hunts may have stemmed from misunderstandings of mental or neurological disorders, triggering mass hysteria.", "Safety Last! is a 1923 American silent romantic-comedy film starring Harold Lloyd. It includes one of the most famous images from the silent-film era: Lloyd clutching the hands of a large clock as he dangles from the outside of a skyscraper above moving traffic. The film was highly successful and critically hailed, and it cemented Lloyd's status as a major figure in early motion pictures. 
It is still popular at revivals, and it is viewed today as one of the great film comedies.", 'A Woman of Paris is a 1923 silent drama film written, produced, and directed by Charlie Chaplin. It stars Edna Purviance as the title character, along with Clarence Geldart, Carl Miller, Lydia Knott, Charles K. French and Adolphe Menjou. A United Artists production, the film was an atypical dramatic work for Chaplin.', 'Body and Soul is a 1925 race film produced, written, directed, and distributed by Oscar Micheaux and starring Paul Robeson in his motion picture debut. In 2019, the film was selected by the Library of Congress for inclusion in the National Film Registry for being "culturally, historically, or aesthetically significant".', 'Master of the House is a 1925 Danish silent drama film directed and written by acclaimed filmmaker Carl Theodor Dreyer. The film marked the debut of Karin Nellemose, and it is regarded by many as a classic of Danish cinema.', "The Freshman is a 1925 American silent comedy film that tells the story of a college freshman trying to become popular by joining the school football team. It stars Harold Lloyd, Jobyna Ralston, Brooks Benedict, and James Anderson. It remains one of Lloyd's most successful and enduring films. When the film opened on September 20 at the B.S. Moss Colony Theater on Broadway, Broderick & Felsen's production of Campus Capers was the opening act which was engaged for the full ten weeks of the film's run.", "The Mystic is a 1925 American MGM silent drama film directed by Tod Browning, who also co-wrote it with Waldemar Young. It is the only one of nine silent MGM films directed by Browning from 1925 to 1929 that does not star Lon Chaney. The film costars Aileen Pringle and Conway Tearle. Aileen Pringle's gowns in the film were by already famous Romain de Tirtoff. 
A print of the film exists.", "The Lodger: A Story of the London Fog is a 1927 British silent thriller film directed by Alfred Hitchcock and starring Marie Ault, Arthur Chesney, June Tripp, Malcolm Keen and Ivor Novello. Hitchcock's third feature film, it was released on 14 February 1927 in London and on 10 June 1928 in New York City. The film is based on the 1913 novel The Lodger by Marie Belloc Lowndes and the play Who Is He? co-written by Belloc Lowndes. Its plot concerns the hunt for a Jack the Ripper-like serial killer in London."
$ budget_raw <i64> 250000, null, 2000000, 121000, 351000, null, null, 301681, null, 12000
$ box_office_raw <i64> null, null, null, null, 634000, null, null, null, null, null
22.6 Fifty Years of Movies
The movie dataset serves as the central hub, containing 5,000 observations that represent the top-100 grossing U.S. films for each year from 1970 to 2020. It captures essential metadata such as the film’s title, release year, MPA rating, and runtime, alongside measures of commercial and critical success like gross revenue, IMDb user ratings/vote counts, and Metacritic scores. Uniquely, it also includes computer vision metrics derived from the film’s promotional poster, quantifying visual attributes such as poster_brightness, saturation, and edgeness (a measure of visual complexity).
movie = pl.read_csv("data/movies_50_years.csv")
movie.glimpse()
Rows: 5000
Columns: 12
$ year <i64> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970
$ title <str> 'Love Story', 'Airport', 'MASH', 'Patton', 'The AristoCats', 'Little Big Man', 'Tora! Tora! Tora!', 'Catch-22', 'The Owl and the Pussycat', 'Joe'
$ mpa <str> 'PG', 'G', 'R', 'GP', 'G', 'PG-13', 'G', 'R', 'PG', 'R'
$ runtime <i64> 100, 137, 116, 172, 78, 139, 144, 122, 95, 107
$ gross <f64> 106.4, 100.49, 81.6, 61.7, 37.68, 31.56, 29.55, 24.91, 23.68, 19.32
$ rating_count <i64> 28330, 16512, 64989, 90461, 87551, 31412, 30347, 20997, 3107, 2633
$ rating <f64> 6.9, 6.6, 7.5, 7.9, 7.1, 7.6, 7.5, 7.2, 6.5, 6.8
$ metacritic <i64> null, 42, null, null, null, null, 46, null, null, null
$ poster_brightness <f64> 79.039734052134, 70.73515993593905, 74.5400023238925, 83.12899118937443, 79.79474571281945, 67.96583791250038, 39.79528775599128, 62.28054483453459, 67.22113941912305, 31.82685812294916
$ poster_saturation <f64> 8.029792248510015, 29.28457189363516, 40.103629182765395, 17.433849565817365, 12.481991945072153, 9.016387405954426, 48.60645939534094, 35.96620937559402, 10.24255420931898, 27.578054400938452
$ poster_edgeness <f64> 4.586166444178613, 4.954734636760736, 3.5102847848915624, 3.657573618987647, 4.400358220849864, 5.359519670438313, 2.1121361608919753, 3.6546805917391145, 4.894551155673808, 4.229543347907548
$ description <str> 'A boy and a girl from different backgrounds fall in love regardless of their upbringing - and then tragedy strikes.', 'A bomber on board an airplane, an airport almost closed by snow, and various personal problems of the people involved.', 'The staff of a Korean War field hospital use humor and high jinks to keep their sanity in the face of the horror of war.', 'The World War II phase of the career of controversial American general George S. Patton.', 'With the help of a smooth talking tomcat, a family of Parisian felines set to inherit a fortune from their owner try to make it back home after a jealous butler kidnaps them and leaves them in the country.', 'Jack Crabb, looking back from extreme old age, tells of his life being raised by Native Americans and fighting with General Custer.', 'In 1941, following months of economic embargo, Japan prepares to open its war against the United States with a surprise attack on the US naval base at Pearl Harbor.', 'A man is trying desperately to be certified insane during World War II, so he can stop flying missions.', 'A stuffy author enters into an explosive relationship with his neighbor, a foul-mouthed, freewheeling prostitute.', "Two men, Bill, a wealthy conservative, and Joe, a far-right factory worker, form a dangerous bond after Bill confesses to murdering his daughter's drug dealer boyfriend to Joe."
The color dataset provides a detailed breakdown of the color palettes used in the film posters. Structured in a long format, it links each movie to multiple rows representing specific color categories—spanning hues like “red” or “blue” and greyscale tones like “black” or “white.” The percentage column quantifies the dominance of each color, enabling the analysis of visual trends in movie marketing over the last half-century (such as the rise of darker or more saturated poster designs).
color = pl.read_csv("data/movies_50_years_color.csv")
color.glimpse()
Rows: 46980
Columns: 5
$ year <i64> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970
$ title <str> 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story'
$ color_type <str> 'hue', 'hue', 'hue', 'hue', 'hue', 'hue', 'greyscale', 'greyscale', 'greyscale', 'hue'
$ color <str> 'red', 'orange', 'yellow', 'green', 'blue', 'violet', 'black', 'grey', 'white', 'other'
$ percentage <f64> 2.6356547746208077, 3.0933561204870754, 0.05420850245674002, 0.2360606707968383, 0.31937620166631064, 0.0005340739158299509, 10.963736381115147, 6.9082461012604135, 75.78882717368084, 0.0
The genre dataset acts as a mapping table to handle the one-to-many relationship between films and their narrative categories. Since a single movie often fits into multiple classifications (e.g., a film that is both “Action” and “Sci-Fi”), this table lists each genre tag on a separate row. This structure allows for precise filtering and aggregation, facilitating analysis of how genre popularity—like the decline of Westerns or the rise of Superhero films—has shifted over the 50-year period.
genre = pl.read_csv("data/movies_50_years_genre.csv")
genre.glimpse()
Rows: 11887
Columns: 3
$ year <i64> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970
$ title <str> 'Love Story', 'Love Story', 'Airport', 'Airport', 'Airport', 'MASH', 'MASH', 'MASH', 'Patton', 'Patton'
$ genre <str> 'Drama', 'Romance', 'Action', 'Drama', 'Thriller', 'Comedy', 'Drama', 'War', 'Biography', 'Drama'
The people dataset details the key creative talent behind each film, listing the director and the top four billed actors (“starring”) ranked by prominence. Beyond simply naming the individuals, this table enriches the data with demographic inference: it includes predicted gender classifications and a confidence score (gender_conf) for each name. These predictions are derived from U.S. Social Security name data, allowing for longitudinal studies of gender representation in top-tier Hollywood productions.
people = pl.read_csv("data/movies_50_years_people.csv")
people.glimpse()
Rows: 24648
Columns: 7
$ year <i64> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970
$ title <str> 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Love Story', 'Airport', 'Airport', 'Airport', 'Airport', 'Airport'
$ role <str> 'director', 'starring', 'starring', 'starring', 'starring', 'director', 'director', 'starring', 'starring', 'starring'
$ rank <i64> 1, 1, 2, 3, 4, 1, 2, 1, 2, 3
$ person <str> 'Arthur Hiller', 'Ali MacGraw', "Ryan O'Neal", 'John Marley', 'Ray Milland', 'George Seaton', 'Henry Hathaway', 'Burt Lancaster', 'Dean Martin', 'George Kennedy'
$ gender <str> 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male'
$ gender_conf <f64> 0.9937, 0.6877, 0.9768, 0.9961, 0.984, 0.9932, 0.9935, 1.0, 0.9875, 0.9932
22.7 What We Eat in America
The wweia dataset serves as a granular log of dietary intake events, containing over 173,000 observations where each row represents a specific food item consumed by a participant. It captures the “what, when, and where” of eating habits: identifying the item via a standard food_code, pinpointing the occasion with temporal markers (time, day_of_week, meal_name), and noting the origin (food_source) and location (at_home). Crucially, this transactional table details the nutritional impact of each specific portion, recording the mass in grams and providing a breakdown of energy (kcal), macronutrients (protein, carbs, fat, sugar), and other constituents like caffeine and alcohol.
wweia = pl.read_csv("data/wweia_food.csv", ignore_errors=True)
wweia.glimpse()
Rows: 173174
Columns: 15
$ id <i64> 109263, 109263, 109263, 109263, 109263, 109263, 109263, 109263, 109263, 109263
$ food_code <i64> 28320300, 91746110, 58106210, 64104010, 11710801, 54304020, 57124200, 94000100, 11710801, 94000100
$ day_of_week <i64> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
$ time <i64> 19, 18, 12, 16, 19, 14, 16, 8, 8, 14
$ meal_name <str> 'Dinner', 'Snack', 'Lunch', 'Snack', 'Dinner', 'Snack', 'Snack', 'Extended consumption', 'Breakfast', 'Snack'
$ food_source <str> 'Store - grocery/supermarket', 'Child/Adult care center', 'Child/Adult care center', 'Store - grocery/supermarket', 'Store - grocery/supermarket', 'Child/Adult care center', 'Store - grocery/supermarket', 'NA', 'Store - grocery/supermarket', 'NA'
$ at_home <str> 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No'
$ grams <f64> 199.5, 20.0, 238.0, 209.0, 124.0, 10.0, 11.67, 105.0, 130.2, 105.0
$ kcal <i64> 114, 101, 633, 99, 123, 49, 45, 0, 129, 0
$ protein <f64> 12.11, 2.2, 27.11, 0.19, 3.55, 1.09, 0.63, 0.0, 3.72, 0.0
$ carbs <f64> 5.07, 12.03, 79.33, 23.71, 13.83, 5.94, 9.45, 0.0, 14.52, 0.0
$ sugar <f64> 2.13, 10.4, 8.52, 21.17, 13.02, 0.45, 4.02, 0.0, 13.67, 0.0
$ fat <f64> 4.95, 5.09, 23.06, 0.52, 5.92, 2.27, 0.58, 0.0, 6.21, 0.0
$ caffeine <i64> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0
$ alcohol <i64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
The demo dataset provides the socioeconomic and demographic context for the 13,724 survey participants, linked to the food log by a unique id. It constructs a profile for each individual, tracking fundamental attributes such as age, gender, and race, alongside indicators of social status like education level (edu_level) and family structure. Economic wellbeing is quantified by the ratio_to_poverty (the ratio of family income to the federal poverty threshold), allowing researchers to analyze how diet quality varies across different income brackets and population segments.
demo = pl.read_csv("data/wweia_demo.csv")
demo.glimpse()
Rows: 13724
Columns: 8
$ id <i64> 109263, 109264, 109265, 109266, 109269, 109270, 109271, 109272, 109273, 109274
$ age <i64> 2, 13, 2, 29, 2, 11, 49, 0, 36, 68
$ gender <str> 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male'
$ edu_level <str> 'NA', 'NA', 'NA', '5', 'NA', 'NA', '2', 'NA', '4', '4'
$ race <str> 'Other', 'Mexican American', 'White', 'Other', 'Other Hispanic', 'Black', 'White', 'Mexican American', 'White', 'Missing'
$ family_status <str> 'NA', 'NA', 'NA', 'Other', 'NA', 'NA', 'Other', 'NA', 'Other', 'Other'
$ ratio_to_poverty <str> '4.66', '0.83', '3.06', '5', '0.96', '1.88', 'NA', '0.73', '0.83', '1.2'
$ lang_interview <str> 'English', 'English', 'English', 'English', 'English', 'English', 'English', 'English', 'English', 'English'
The meta dataset acts as the definitive taxonomy for the survey, containing 7,444 entries that map the numeric food_code found in the consumption logs to human-readable definitions. It organizes the vast array of food items into a hierarchical structure, linking specific descriptions (e.g., “Milk, low sodium, whole”) to broader category_descriptions (e.g., “Milk, whole”) and high-level food_group classifications (e.g., “Milk and Dairy”). This reference table is essential for aggregating granular food data into meaningful dietary patterns consistent with nutritional guidelines.
meta = pl.read_csv("data/wweia_meta.csv")
meta.glimpse()
Rows: 7444
Columns: 7
$ food_code <i64> 11000000, 11100000, 11111000, 11111100, 11111150, 11111160, 11111170, 11112110, 11112120, 11112130
$ food_code_description <str> 'Milk, human', 'Milk, NFS', 'Milk, whole', 'Milk, low sodium, whole', 'Milk, calcium fortified, whole', 'Milk, calcium fortified, low fat (1%)', 'Milk, calcium fortified, fat free (skim)', 'Milk, reduced fat (2%)', 'Milk, acidophilus, low fat (1%)', 'Milk, acidophilus, reduced fat (2%)'
$ category_number <i64> 9602, 1004, 1002, 1002, 1002, 1006, 1008, 1004, 1006, 1004
$ category_description <str> 'Human milk', 'Milk, reduced fat', 'Milk, whole', 'Milk, whole', 'Milk, whole', 'Milk, lowfat', 'Milk, nonfat', 'Milk, reduced fat', 'Milk, lowfat', 'Milk, reduced fat'
$ meta_number <i64> 96, 10, 10, 10, 10, 10, 10, 10, 10, 10
$ meta_name <str> 'Human Milk', 'Milk', 'Milk', 'Milk', 'Milk', 'Milk', 'Milk', 'Milk', 'Milk', 'Milk'
$ food_group <str> 'Baby Foods', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy', 'Milk and Dairy'
22.8 Inference Data
Derived from the CDC’s 2010 National Survey of Family Growth (NSFG), the marriage dataset is a focused univariate collection regarding family formation trends. It consists of a single column, age, which records the age in years at which 5,534 U.S. women entered into their first marriage. This simple numeric vector serves as a foundational sample for estimating population parameters—such as the median age of first marriage—and analyzing shifts in nuptiality over time.
marriage = pl.read_csv("data/inference_age_at_mar.csv")
marriage.glimpse()Rows: 5534
Columns: 1
$ age <i64> 32, 25, 24, 26, 32, 29, 23, 23, 29, 27
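Because the marriage data is a single numeric column, a natural first exercise is a bootstrap interval for the median age. The sketch below uses the ten ages shown in the glimpse as a stand-in sample; with the full column the procedure is identical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample standing in for the full age column
ages = np.array([32, 25, 24, 26, 32, 29, 23, 23, 29, 27])

# Bootstrap: resample with replacement, recompute the median each time
boot_medians = np.array([
    np.median(rng.choice(ages, size=ages.size, replace=True))
    for _ in range(2000)
])

# A 95% percentile interval for the median age at first marriage
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
```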
Originating from a study in rural New South Wales, Australia, the absent dataset investigates the factors influencing school attendance among 146 primary school students. The target variable, days, counts the total days a student was absent during the school year. These figures are contextualized by categorical demographic and academic indicators, including ethnicity (eth), gender (sex), age group (age), and lrn, a classification of the student’s learning status.
absent = pl.read_csv("data/inference_absenteeism.csv")
absent.glimpse()Rows: 146
Columns: 5
$ eth <str> 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'
$ sex <str> 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M'
$ age <str> 'F0', 'F0', 'F0', 'F0', 'F0', 'F0', 'F0', 'F0', 'F1', 'F1'
$ lrn <str> 'SL', 'SL', 'SL', 'AL', 'AL', 'AL', 'AL', 'AL', 'SL', 'SL'
$ days <i64> 2, 11, 14, 5, 5, 13, 20, 22, 6, 6
The sulph dataset captures the results of a controlled clinical trial testing the efficacy of the drug sulphinpyrazone in post-heart attack care. It tracks 1,475 patients, dividing them into treatment and control arms via the group column. The primary endpoint is recorded in the binary outcome column (“lived” or “died”), creating a classic contingency structure used to calculate odds ratios and determine if the drug provides a statistically significant survival benefit compared to the control.
sulph = pl.read_csv("data/inference_sulphinpyrazone.csv")
sulph.glimpse()Rows: 1475
Columns: 2
$ group <str> 'control', 'control', 'control', 'control', 'control', 'control', 'control', 'control', 'control', 'control'
$ outcome <str> 'died', 'died', 'died', 'died', 'died', 'died', 'died', 'died', 'died', 'died'
Sourced from a survey of 1,279 UCLA students, the speed dataset combines physiological metrics with self-reported risk behavior. It logs the student’s sex and height (in inches), alongside a behavioral metric: speed, representing the fastest speed the student has ever driven a vehicle (presumably in mph). This combination allows for inference tasks such as testing for gender-based differences in driving habits or exploring correlations between physical stature and risk-taking.
speed = pl.read_csv("data/inference_speed_sex_height.csv")
speed.glimpse()Rows: 1279
Columns: 3
$ speed <i64> 85, 40, 87, 110, 110, 120, 90, 90, 80, 95
$ sex <str> 'female', 'male', 'female', 'female', 'male', 'female', 'female', 'female', 'female', 'male'
$ height <f64> 69.0, 71.0, 64.0, 60.0, 70.0, 61.0, 65.0, 65.0, 61.0, 69.0
The possum dataset provides a morphometric profile of 104 brushtail possums captured across Australia and New Guinea. Aside from sex and age estimates, the data tracks geographic provenance through site codes and population regions (pop, e.g., “Vic” for Victoria). The dataset is defined by its precise biological measurements—head_l (head length) and skull_w (skull width) in millimeters, and total_l (total body length) and tail_l (tail length) in centimeters—which are often used to classify subspecies or study regional physical variations.
possum = pl.read_csv("data/inference_possum.csv")
possum.glimpse()Rows: 104
Columns: 8
$ site <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ pop <str> 'Vic', 'Vic', 'Vic', 'Vic', 'Vic', 'Vic', 'Vic', 'Vic', 'Vic', 'Vic'
$ sex <str> 'm', 'f', 'f', 'f', 'f', 'f', 'm', 'f', 'f', 'f'
$ age <str> '8', '6', '6', '6', '2', '1', '2', '6', '9', '6'
$ head_l <f64> 94.1, 92.5, 94.0, 93.2, 91.5, 93.1, 95.3, 94.8, 93.4, 91.8
$ skull_w <f64> 60.4, 57.6, 60.0, 57.1, 56.3, 54.8, 58.2, 57.6, 56.3, 58.0
$ total_l <f64> 89.0, 91.5, 95.5, 92.0, 85.5, 90.5, 89.5, 91.0, 91.5, 89.5
$ tail_l <f64> 36.0, 36.5, 39.0, 38.0, 36.0, 35.5, 36.0, 37.0, 37.0, 37.5
22.9 Keylogging
The klog dataset is a high-resolution behavioral log capturing the precise keystroke dynamics of students writing in English. With over 1.1 million observations, each row represents a single key press or input event. The data records the temporal flow of writing through timestamps (t0 for press, t1 for release) and calculated durations (dur), offering insight into motor processing and cognitive hesitation. The input and code columns differentiate between the resulting character (e.g., “I”) and the physical key actuated (e.g., “KeyI” or “Space”), allowing for the reconstruction of the text and the analysis of editing behaviors like backspacing or pausing.
klog = pl.read_csv("data/keylog.csv.gz")
klog.glimpse()Rows: 1145051
Columns: 7
$ id <str> 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP', 'R_00RbUqO7jXLDItP'
$ t0 <f64> 20914.10000000009, 21146.20000000018, 21234.30000000028, 22074.10000000009, 22306.20000000018, 23674.30000000028, 23818.39999999991, 24043.70000000018, 25066.30000000028, 25170.20000000018
$ t1 <f64> 20978.5, 21226.30000000028, 21290.10000000009, 22154.30000000028, 22394.39999999991, 23738.60000000009, 23874.5, 24090.30000000028, 25130.30000000028, 25250.0
$ dur <f64> 64.3999999999069, 80.1000000000968, 55.7999999998137, 80.2000000001863, 88.1999999997242, 64.2999999998137, 56.1000000000931, 46.6000000000968, 64.0, 79.7999999998174
$ dur_after <f64> 80.1000000000968, 55.7999999998137, 80.2000000001863, 88.1999999997242, 64.2999999998137, 56.1000000000931, 46.6000000000968, 64.0, 79.7999999998174, 72.2999999998174
$ input <str> 'I', 'f', null, 'I', null, 'c', 'o', 'u', 'l', 'd'
$ code <str> 'KeyI', 'KeyF', 'Space', 'KeyI', 'Space', 'KeyC', 'KeyO', 'KeyU', 'KeyL', 'KeyD'
The meta dataset provides the demographic and linguistic context for the 823 participants tracked in the keylogs. It links each unique session id to the writer’s age and, crucially, their native language background (lang). The dataset also includes a cefr rating (Common European Framework of Reference for Languages), which categorizes their English proficiency into standard levels such as “B1/B2” (independent user) or “C1/C2” (proficient user). This metadata enables comparative analysis of how L1 background and L2 proficiency manifest in low-level typing patterns.
meta = pl.read_csv("data/keylog-meta.csv.gz")
meta.glimpse()Rows: 823
Columns: 4
$ id <str> 'R_2EGIsZARLydD3Uc', 'R_1obCaysaZCWZXoG', 'R_3fqTek829k38iCk', 'R_brxD7Q5ZnPW8Gn7', 'R_1k1RE78cBbZyZMA', 'R_1NwuZMzRkVIR0WT', 'R_2t8LOS9nQDBQPA8', 'R_239Q0X5YLwB7U6Z', 'R_10xbkjEmnsusfb1', 'R_10CbLBzAnYKgWxB'
$ age <i64> 25, 22, 22, 43, 23, 32, 24, 28, 32, 21
$ lang <str> 'Italian', 'Spanish', 'Polish', 'English', 'Polish', 'English', 'Spanish', 'English', 'Polish', 'Polish'
$ cefr <str> 'C1/C2', 'B1/B2', 'B1/B2', 'C1/C2', 'B1/B2', 'C1/C2', 'C1/C2', 'C1/C2', 'B1/B2', 'B1/B2'
22.10 Paris Metro 
The pmetro dataset captures the geospatial layout of the Paris Métro system, containing 371 entries that represent individual station stops or track segments. Each row identifies a station by name and links it to its specific line number and official branding line_color (provided as a hex code). Uniquely, the dataset is structured to facilitate network visualization rather than just point plotting: in addition to the station’s own coordinates (lat, lon), it includes lat_end and lon_end columns. This “start-to-end” structure effectively defines the edges between stations, allowing for the reconstruction of the connected path of each subway line.
pmetro = pl.read_csv("data/paris_metro_stops.csv")
pmetro.glimpse()Rows: 371
Columns: 7
$ name <str> 'La Defense - Grande Arche', 'Esplanade de la Defense', 'Pont de Neuilly (Avenue de Madrid)', "Les Sablons (Jardin d'acclimatation)", 'Argentine', 'Charles De Gaulle-Etoile', 'George-V', 'Franklin D.Roosevelt', 'Champs-Elysees-Clemenceau', 'Concorde'
$ line <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ line_color <str> '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00', '#ffbe00'
$ lon <f64> 2.237018056395013, 2.247932435324861, 2.260515077888117, 2.271686721050983, 2.289322589613773, 2.295904906076514, 2.300560451248796, 2.30747079344783, 2.313545549946741, 2.322943412243541
$ lat <f64> 48.892187076449495, 48.88863121777117, 48.884708201322525, 48.88119152058607, 48.87559404986666, 48.87514981973562, 48.872023809500426, 48.86980822019895, 48.86790534489708, 48.86628580458387
$ lon_end <str> '2.247932435324861', '2.260515077888117', '2.271686721050983', '2.289322589613773', '2.295904906076514', '2.300560451248796', '2.30747079344783', '2.313545549946741', '2.322943412243541', '2.330129877112861'
$ lat_end <str> '48.88863121777117', '48.884708201322525', '48.88119152058607', '48.87559404986666', '48.87514981973562', '48.872023809500426', '48.86980822019895', '48.86790534489708', '48.86628580458387', '48.864343778733904'
22.11 US City Population 
The us_pop dataset traces the demographic evolution of the United States through a historical record of urban growth. Spanning from the first census in 1790 through 2010, it logs the population (measured in thousands) for distinct cities identified by name and state. The data appears to track modern cities backward in time, showing values of 0.0 for years prior to a city’s founding or incorporation (e.g., Anchorage in 1790). Enriched with geospatial coordinates (lat and lon), this longitudinal collection facilitates the analysis of urbanization patterns, capturing the country’s westward expansion and the explosive growth of metropolitan hubs over two centuries.
us_pop = pl.read_csv("data/us_city_population.csv")
us_pop.glimpse()Rows: 6900
Columns: 6
$ city <str> 'Anchorage, AK', 'Birmingham, AL', 'Huntsville, AL', 'Mobile, AL', 'Montgomery, AL', 'Little Rock, AR', 'Chandler, AZ', 'Gilbert, AZ', 'Glendale, AZ', 'Mesa, AZ'
$ state <str> 'AK', 'AL', 'AL', 'AL', 'AL', 'AR', 'AZ', 'AZ', 'AZ', 'AZ'
$ year <i64> 1790, 1790, 1790, 1790, 1790, 1790, 1790, 1790, 1790, 1790
$ population <f64> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
$ lon <f64> -149.2743541, -86.799047, -86.5389964, -88.1002261, -86.2685927, -92.358556, -111.8549429, -111.7421907, -112.1899006, -111.7173787
$ lat <f64> 61.177549, 33.5274441, 34.7842707, 30.668426, 32.3462512, 34.7254318, 33.2828736, 33.3102088, 33.5331113, 33.4019259
22.12 US Metropolitan Regions 
Sourced from the Census Bureau’s American Community Survey, this collection of datasets provides a multi-dimensional view of U.S. demographics and economics, centered on metropolitan areas identified by core-based statistical area (CBSA) codes.
The metro dataset is the primary analytic table, profiling 934 metropolitan areas identified by a unique geoid. It aggregates key socioeconomic indicators, including population size (pop), population density, and the median age of residents. Economic health is captured through median household income and housing metrics like home ownership rates (percent_own) and the median cost of a one-bedroom rental. Geographically, each metro is assigned to a broad census region (quad) and division, and is precisely located via latitude/longitude coordinates. The accompanying metro_geo dataset provides the corresponding polygon geometries for these areas, enabling choropleth mapping and spatial analysis.
metro = pl.read_csv("data/acs_cbsa.csv")
metro.glimpse()Rows: 934
Columns: 13
$ name <str> 'New York', 'Los Angeles', 'Chicago', 'Dallas', 'Houston', 'Washington', 'Philadelphia', 'Miami', 'Atlanta', 'Boston'
$ geoid <i64> 35620, 31080, 16980, 19100, 26420, 47900, 37980, 33100, 12060, 14460
$ quad <str> 'NE', 'W', 'NC', 'S', 'S', 'S', 'NE', 'S', 'S', 'NE'
$ lon <f64> -74.10105570561859, -118.1487215689162, -87.95881973164443, -96.97050780978928, -95.40157389770467, -77.51307477160977, -75.30263491667849, -80.50630736521515, -84.39956676469873, -71.0999121719376
$ lat <f64> 40.768770318020096, 34.219405716738684, 41.70060516046628, 32.84947968570761, 29.78708316635632, 38.812483836818316, 39.90521296481641, 26.15536900531196, 33.691787081105744, 42.555193833166065
$ pop <f64> 20.011812, 13.202558, 9.607711, 7.54334, 7.048954, 6.332069, 6.215222, 6.105897, 6.026734, 4.91203
$ density <f64> 1051.3064674555676, 1040.6472811000378, 508.62940568377377, 323.1814035995303, 316.5435135006062, 363.73268864646764, 506.06813036177755, 430.1031618191652, 263.27582111238434, 517.8277024367719
$ age_median <f64> 42.9, 41.6, 41.9, 41.3, 41.0, 42.4, 42.6, 43.9, 41.9, 42.2
$ hh_income_median <i64> 86445, 81652, 78790, 76916, 72551, 111252, 79070, 62870, 75267, 99039
$ percent_own <f64> 55.3, 51.3, 68.9, 64.1, 65.1, 67.4, 71.1, 60.8, 67.3, 66.4
$ rent_1br_median <i64> 1430, 1468, 1060, 1106, 997, 1601, 1083, 1230, 1181, 1390
$ rent_perc_income <f64> 31.0, 33.6, 29.0, 29.1, 30.0, 28.8, 30.0, 36.8, 30.3, 29.5
$ division <str> 'Middle Atlantic', 'Pacific', 'East North Central', 'West South Central', 'West South Central', 'South Atlantic', 'Middle Atlantic', 'South Atlantic', 'South Atlantic', 'New England'
The next several datasets handle the higher-level state geography. The state table serves as a reference for the 50 U.S. states, providing names, abbreviations, and total populations, while state_geo contains their boundaries.
state = pl.read_csv("data/acs_state.csv")
state.glimpse()Rows: 50
Columns: 3
$ state <str> 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia'
$ abb <str> 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA'
$ pop <f64> 4.997675, 0.735951, 7.079203, 3.006309, 39.455353, 5.723176, 3.60533, 0.981892, 21.339762, 10.625615
The metro_cw (crosswalk) table bridges the two geographic levels, handling the complexity of metropolitan areas that span multiple state lines (e.g., the New York metro area covering parts of NY, NJ, and PA). It uses the prop column to indicate what fraction of a metro area’s footprint or population falls within a specific state.
metro_cw = pl.read_csv("data/acs_cbsa_to_state.csv")
metro_cw.glimpse()Rows: 996
Columns: 3
$ geoid <i64> 35620, 35620, 35620, 31080, 16980, 16980, 16980, 19100, 26420, 47900
$ state <str> 'NY', 'NJ', 'PA', 'CA', 'IL', 'IN', 'WI', 'TX', 'TX', 'VA'
$ prop <f64> 0.6483270833289084, 0.3486404584192905, 0.0030324582518010926, 1.0, 0.9065905299185738, 0.07571270790662321, 0.017696762174803024, 1.0, 1.0, 0.48420377548328675
The transit dataset provides a breakdown of transportation modes, listing the percentage of the population that commutes via car, public transportation, bicycle, or other means, as well as those who work from home.
transit = pl.read_csv("data/acs_cbsa_commute_type.csv")
transit.glimpse()Rows: 939
Columns: 11
$ name <str> 'New York', 'Los Angeles', 'Chicago', 'Dallas', 'Houston', 'Washington', 'Philadelphia', 'Miami', 'Atlanta', 'Boston'
$ geoid <i64> 35620, 31080, 16980, 19100, 26420, 47900, 37980, 33100, 12060, 14460
$ pop <f64> 20.011812, 13.202558, 9.607711, 7.54334, 7.048954, 6.332069, 6.215222, 6.105897, 6.026734, 4.91203
$ car <f64> 53.51, 80.13, 74.23, 85.27, 86.49, 69.21, 75.33, 84.17, 81.38, 68.81
$ public_transportation <f64> 27.77, 4.07, 10.03, 1.04, 1.85, 10.12, 7.87, 2.62, 2.39, 10.75
$ taxicab <f64> 0.78, 0.27, 0.38, 0.11, 0.14, 0.45, 0.25, 0.47, 0.45, 0.32
$ motorcycle <f64> 0.05, 0.21, 0.05, 0.1, 0.09, 0.11, 0.06, 0.16, 0.09, 0.05
$ bicycle <f64> 0.71, 0.61, 0.61, 0.14, 0.26, 0.74, 0.6, 0.49, 0.15, 0.94
$ walked <f64> 5.53, 2.28, 2.78, 1.22, 1.18, 2.88, 3.22, 1.46, 1.17, 5.05
$ other_means <f64> 1.06, 1.18, 1.01, 0.97, 1.3, 1.09, 0.96, 1.43, 1.14, 1.07
$ worked_from_home <f64> 10.59, 11.25, 10.91, 11.15, 8.69, 15.41, 11.71, 9.21, 13.22, 13.02
Complementing this, the commute dataset uses a “long” format to capture the distribution of travel times. Instead of a single average, it breaks commute durations into specific time bins (defined by time_min and time_max), with the per column indicating the percentage of commuters falling into each interval.
commute = pl.read_csv("data/acs_cbsa_commute_time.csv")
commute.glimpse()Rows: 13146
Columns: 6
$ name <str> 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York'
$ geoid <i64> 35620, 35620, 35620, 35620, 35620, 35620, 35620, 35620, 35620, 35620
$ pop <f64> 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812
$ per <f64> 3.43, 2.97, 3.32, 7.66, 7.72, 14.4, 10.74, 15.69, 7.97, 9.56
$ time_min <f64> 0.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0
$ time_max <f64> 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 10.0
The hh dataset offers a granular look at economic disparity by mapping the full distribution of household income for each metro area. Like the commute data, it is structured in a long format, where each row represents a specific income bracket (bounded by band_min and band_max). The per column quantifies the share of households within that bracket, allowing for a more nuanced analysis of wealth distribution—such as identifying the “middle class” squeeze or poverty rates—than a simple median value could provide.
hh = pl.read_csv("data/acs_cbsa_hh_income.csv")
hh.glimpse()Rows: 15024
Columns: 6
$ name <str> 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'New York'
$ geoid <i64> 35620, 35620, 35620, 35620, 35620, 35620, 35620, 35620, 35620, 35620
$ pop <f64> 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812, 20.011812
$ per <f64> 5.66, 3.97, 3.27, 3.31, 3.2, 3.12, 2.92, 3.06, 2.66, 5.63
$ band_min <i64> 0, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000
$ band_max <str> '10000', '14999', '19999', '24999', '29999', '34999', '39999', '44999', '49999', '59999'
The state_geo file contains the polygons for each of the US states; its name, abb, and fips columns provide keys for joining to the other structured datasets above.
state_geo = DSGeo.read_file("data/acs_state.geojson")
state_geo.drop(c.geometry).glimpse()Rows: 50
Columns: 3
$ name <str> 'Maine', 'New Hampshire', 'Delaware', 'South Carolina', 'Nebraska', 'Washington', 'New Mexico', 'South Dakota', 'Texas', 'California'
$ abb <str> 'ME', 'NH', 'DE', 'SC', 'NE', 'WA', 'NM', 'SD', 'TX', 'CA'
$ fips <str> '23', '33', '10', '45', '31', '53', '35', '46', '48', '06'
A second geographic file contains the polygons for each of the metro regions, with the geoid column serving as the key for joining to the other structured datasets above.
metro_geo = DSGeo.read_file("data/acs_cbsa_geo.geojson")
metro_geo.drop(c.geometry).glimpse()Rows: 939
Columns: 3
$ geoid <f64> 35620.0, 31080.0, 16980.0, 19100.0, 26420.0, 47900.0, 37980.0, 33100.0, 12060.0, 14460.0
$ quad <str> 'NE', 'W', 'NC', 'S', 'S', 'S', 'NE', 'S', 'S', 'NE'
$ pop <f64> 20.011812, 13.202558, 9.607711, 7.54334, 7.048954, 6.332069, 6.215222, 6.105897, 6.026734, 4.91203
22.13 COVID 
The covid dataset is a longitudinal record tracking the daily impact of the COVID-19 pandemic across France’s administrative departments. It uses a time-series structure where each row represents the status of a specific department (departement) on a given date. The metrics capture the strain on the healthcare system and the severity of the outbreak, recording cumulative statistics for deceased patients and recovered cases, as well as real-time snapshots of patients currently hospitalised or in intensive care (reanimation). Additionally, some columns track daily flows, such as new hospital admissions (hospitalised_new), allowing for analysis of infection waves and healthcare capacity over time.
covid = pl.read_csv("data/france_departement_covid.csv")
covid.glimpse()Rows: 19998
Columns: 9
$ date <str> '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18', '2020-03-18'
$ departement <str> '01', '02', '03', '04', '05', '06', '07', '08', '09', '10'
$ departement_name <str> 'Ain', 'Aisne', 'Allier', 'Alpes-de-Haute-Provence', 'Hautes-Alpes', 'Alpes-Maritimes', 'Ardèche', 'Ardennes', 'Ariège', 'Aube'
$ deceased <i64> 0, 9, 0, 0, 0, 2, 0, 0, 0, 0
$ hospitalised <i64> 2, 0, 0, 3, 8, 25, 0, 0, 1, 5
$ reanimation <i64> 0, 0, 0, 1, 1, 1, 0, 0, 1, 0
$ recovered <i64> 1, 0, 0, 2, 9, 47, 0, 1, 2, 0
$ hospitalised_new <str> 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'
$ reanimation_new <str> 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'
The pop dataset is a concise demographic reference table listing the total resident population for each of the 101 French departments. Indexed by the standard two-character departement code (e.g., “01” for Ain, “75” for Paris), this table serves as a critical normalization tool. It allows researchers to convert raw counts from the COVID-19 or economic datasets into standardized rates (such as cases per 100,000 inhabitants), enabling fair comparisons between densely populated urban areas and rural regions.
pop = pl.read_csv("data/france_departement_population.csv")
pop.glimpse()Rows: 101
Columns: 2
$ departement <str> '01', '02', '03', '04', '05', '06', '07', '08', '09', '10'
$ population <i64> 643350, 534490, 337988, 163915, 141284, 1083310, 325712, 273579, 153153, 310020
The fr_city dataset focuses on the country’s major urban centers, providing geospatial and demographic details for 58 significant cities. It identifies each city by name and links it to its broader administrative region (admin_name), such as “Ile-de-France” or “Nouvelle-Aquitaine.” The data includes precise lat and lon coordinates for mapping and a population figure that represents the broader urban or metropolitan area (agglomeration) rather than just the municipal limits. This dataset allows for spatial analysis of city-level hubs distinct from the broader departmental data.
fr_city = pl.read_csv("data/france_cities.csv")
fr_city.glimpse()Rows: 58
Columns: 5
$ city <str> 'Paris', 'Lyon', 'Marseille', 'Lille', 'Nice', 'Toulouse', 'Bordeaux', 'Rouen', 'Strasbourg', 'Nantes'
$ lat <f64> 48.8667, 45.77, 43.29, 50.65, 43.715, 43.62, 44.85, 49.4304, 48.58, 47.2104
$ lon <f64> 2.3333, 4.83, 5.375, 3.08, 7.265, 1.4499, -0.595, 1.08, 7.75, -1.59
$ population <i64> 9904000, 1423000, 1400000, 1044000, 927000, 847000, 803000, 532559, 439972, 438537
$ admin_name <str> 'Ile-de-France', 'Auvergne-Rhone-Alpes', "Provence-Alpes-Cote d'Azur", 'Hauts-de-France', "Provence-Alpes-Cote d'Azur", 'Occitanie', 'Nouvelle-Aquitaine', 'Normandie', 'Grand Est', 'Pays de la Loire'
The gdp dataset provides an economic profile of the country at the departmental level. It maps each departement (identified by both code and name) to its Gross Domestic Product (GDP) per capita in Euros. This metric serves as a proxy for regional standard of living and economic productivity, allowing for correlations with health outcomes or infrastructure availability.
gdp = pl.read_csv("data/france_departement_gdp.csv")
gdp.glimpse()Rows: 96
Columns: 3
$ departement <str> '01', '02', '03', '04', '05', '06', '07', '08', '09', '10'
$ departement_name <str> 'Ain', 'Aisne', 'Allier', 'Alpes-de-Haute-Provence', 'Hautes-Alpes', 'Alpes-Maritimes', 'Ardèche', 'Ardennes', 'Ariège', 'Aube'
$ gdp_eur <i64> 28296, 25556, 27657, 28149, 28672, 38488, 24720, 26816, 23534, 31098
The dep dataset serves as the geospatial backbone for mapping French administrative divisions. It contains the boundaries for the 101 departments, linking the standard departement codes and names to a hidden geometry column (polygons). By joining this spatial file with the covid, pop, or gdp tables, users can visualize data through choropleth maps, revealing geographic patterns such as regional economic clusters or the spatial spread of the pandemic.
dep = DSGeo.read_file("data/france_departement_sml.geojson")
dep.drop(c.geometry).glimpse()Rows: 101
Columns: 2
$ departement <str> '01', '02', '03', '04', '05', '06', '07', '08', '09', '10'
$ departement_name <str> 'Ain', 'Aisne', 'Allier', 'Alpes-de-Haute-Provence', 'Hautes-Alpes', 'Alpes-Maritimes', 'Ardèche', 'Ardennes', 'Ariège', 'Aube'
We have a second covid dataset consisting of a granular time-series record tracking the spread of the virus across Italy’s administrative landscape. Unlike national or regional summaries, this table drills down to the provincial level (roughly equivalent to U.S. counties), providing a daily count of total cases for each province and its parent region. With over 68,000 observations, it allows for the analysis of local outbreaks and the specific trajectory of the pandemic within distinct geographic pockets from February 2020 onwards.
covid = pl.read_csv("data/it_province_covid.csv")
covid.glimpse()Rows: 68694
Columns: 4
$ date <str> '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24', '2020-02-24'
$ region <str> 'Abruzzo', 'Abruzzo', 'Abruzzo', 'Abruzzo', 'Basilicata', 'Basilicata', 'Calabria', 'Calabria', 'Calabria', 'Calabria'
$ province <str> "L'Aquila", 'Teramo', 'Pescara', 'Chieti', 'Potenza', 'Matera', 'Cosenza', 'Catanzaro', 'Reggio di Calabria', 'Crotone'
$ cases <i64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
The it_city dataset serves as a geospatial reference for 388 distinct Italian urban centers. It identifies cities by name—ranging from major metropolises like Rome and Milan to smaller regional hubs—and provides their precise latitude (lat) and longitude (lon) coordinates. Additionally, the pop column lists the resident population for each city, enabling analysis that correlates population density or urban size with other socioeconomic or health indicators.
it_city = pl.read_csv("data/it_cities.csv")
it_city.glimpse()Rows: 388
Columns: 4
$ city_name <str> 'Rome', 'Milan', 'Naples', 'Turin', 'Palermo', 'Genoa', 'Bologna', 'Florence', 'Catania', 'Bari'
$ lon <f64> 12.51133, 9.18951, 14.26811, 7.68682, 13.33561, 8.94439, 11.33875, 11.24626, 15.07041, 16.8554
$ lat <f64> 41.89193, 45.46427, 40.85216, 45.07049, 38.13205, 44.40478, 44.49381, 43.77925, 37.49223, 41.11148
$ pop <i64> 2318895, 1236837, 959470, 870456, 648260, 580223, 366133, 349296, 290927, 277387
The prov dataset provides the essential spatial geometry required to map the provincial data. It contains the administrative boundaries for Italy’s 107 provinces, identifying each by its standard name (e.g., “Torino”, “Firenze”). By linking this geospatial file with the covid dataset via the province column, users can construct choropleth maps to visualize the spatial distribution and evolution of case counts across the peninsula.
prov = DSGeo.read_file("data/it_province.geojson")
prov.drop(c.geometry).glimpse()Rows: 107
Columns: 1
$ province <str> 'Torino', 'Vercelli', 'Novara', 'Cuneo', 'Asti', 'Alessandria', 'Biella', 'Verbano-Cusio-Ossola', "Valle d'Aosta/Vallée d'Aoste", 'Varese'
22.14 RVA Flights

Sourced from the U.S. Bureau of Transportation Statistics, these five datasets provide a comprehensive record of commercial aviation activity departing from Richmond International Airport (RIC) during its record-breaking year of 2019.
The rva dataset is the central fact table, containing 24,808 rows that represent the complete set of commercial departures from Richmond for the year. It captures the pulse of daily operations, logging scheduling data (planned vs. actual departure/arrival times), delays, and routing information (origin to dest). It also serves as the connector for the other tables, linking to them via keys like carrier, tailnum, and time_hour.
rva = pl.read_csv("data/flightsrva_flights.csv.gz", null_values=["NA"])
rva.glimpse()Rows: 24808
Columns: 19
$ year <i64> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019
$ month <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ day <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ dep_time <i64> 548, 552, 558, 630, 639, 641, 648, 655, 717, 734
$ sched_dep_time <i64> 550, 600, 600, 630, 645, 645, 654, 700, 725, 730
$ dep_delay <i64> -2, -8, -2, 0, -6, -4, -6, -5, -8, 4
$ arr_time <i64> 728, 814, 817, 713, 748, 827, 910, 933, 828, 903
$ sched_arr_time <i64> 740, 824, 810, 729, 824, 847, 924, 945, 855, 850
$ arr_delay <i64> -12, -10, 7, -16, -36, -20, -14, -12, -27, 13
$ carrier <str> 'WN', 'B6', 'YX', 'YV', 'AA', 'DL', 'YX', 'AA', '9E', 'UA'
$ flight <i64> 25, 33, 135, 145, 58, 28, 117, 62, 79, 29
$ tailnum <str> 'N485WN', 'N624JB', 'N818MD', 'N88327', 'N680AW', 'N960DN', 'N443YX', 'N930AU', 'N8943A', 'N29717'
$ origin <str> 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC'
$ dest <str> 'ATL', 'FLL', 'MSP', 'IAD', 'CLT', 'ATL', 'MIA', 'DFW', 'JFK', 'ORD'
$ air_time <i64> 85, 115, 161, 26, 52, 85, 119, 194, 54, 121
$ distance <i64> 481, 805, 970, 100, 257, 481, 825, 1158, 288, 642
$ hour <i64> 5, 6, 6, 6, 6, 6, 6, 7, 7, 7
$ minute <i64> 50, 0, 0, 30, 45, 45, 54, 0, 25, 30
$ time_hour <str> '2019-01-01T05:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T07:00:00Z', '2019-01-01T07:00:00Z', '2019-01-01T07:00:00Z'
The weather dataset offers an hourly meteorological log for the airport, containing 8,735 observations. It tracks environmental conditions—such as wind speed, visibility, and humidity—that are critical for analyzing flight delays. The time_hour column allows this data to be precisely joined with flight departures to assess the impact of weather on airport performance.
weather = pl.read_csv("data/flightsrva_weather.csv.gz", null_values=["NA"])
weather.glimpse()Rows: 8735
Columns: 15
$ origin <str> 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC', 'RIC'
$ year <i64> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019
$ month <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ day <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ hour <i64> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
$ temp <str> null, null, null, null, null, null, null, null, null, null
$ dewp <str> null, null, null, null, null, null, null, null, null, null
$ humid <str> null, null, null, null, null, null, null, null, null, null
$ wind_dir <i64> 180, 180, 180, 180, 190, 200, 200, 210, 220, 210
$ wind_speed <f64> 8.05546, 12.658579999999999, 13.809359999999998, 13.809359999999998, 12.658579999999999, 19.56326, 17.261699999999998, 18.41248, 16.11092, 16.11092
$ wind_gust <f64> 9.2700622588, 14.567240692399997, 15.891535300799996, 15.891535300799996, 14.567240692399997, 22.5130083428, 19.864419125999994, 21.188713734399997, 18.5401245176, 18.5401245176
$ precip <f64> null, null, null, null, null, null, null, null, null, null
$ pressure <str> null, null, null, null, null, null, null, null, null, null
$ visib <f64> 1.0, 1.5, 1.5, 4.0, 9.0, 10.0, 10.0, 10.0, 10.0, 10.0
$ time_hour <str> '2019-01-01T05:00:00Z', '2019-01-01T06:00:00Z', '2019-01-01T07:00:00Z', '2019-01-01T08:00:00Z', '2019-01-01T09:00:00Z', '2019-01-01T10:00:00Z', '2019-01-01T11:00:00Z', '2019-01-01T12:00:00Z', '2019-01-01T13:00:00Z', '2019-01-01T14:00:00Z'
The airport dataset acts as a geospatial lookup table for 1,251 U.S. airports, including every destination served from Richmond. It maps three-letter FAA codes (like “ATL” or “ORD”) to their full names, time zones, and exact latitude/longitude coordinates, enabling the mapping of flight paths and the calculation of distance-based metrics.
airport = pl.read_csv("data/flightsrva_airports.csv.gz", null_values=["NA"])
airport.glimpse()Rows: 1251
Columns: 8
$ faa <str> 'AAF', 'AAP', 'ABE', 'ABI', 'ABL', 'ABQ', 'ABR', 'ABY', 'ACK', 'ACT'
$ name <str> 'Apalachicola Regional Airport', 'Andrau Airpark', 'Lehigh Valley International Airport', 'Abilene Regional Airport', 'Ambler Airport', 'Albuquerque International Sunport', 'Aberdeen Regional Airport', 'Southwest Georgia Regional Airport', 'Nantucket Memorial Airport', 'Waco Regional Airport'
$ lat <f64> 29.72750092, 29.7224998474, 40.652099609375, 32.4113006592, 67.1063, 35.040199, 45.449100494384766, 31.535499572753903, 41.25310135, 31.611299514770508
$ lon <f64> -85.02749634, -95.5883026123, -75.44080352783203, -99.6819000244, -157.856989, -106.609001, -98.42179870605467, -84.19450378417969, -70.06020355, -97.23049926757812
$ alt <i64> 20, 79, 393, 1791, 334, 5355, 1302, 197, 47, 516
$ tz <i64> -5, -6, -5, -6, -9, -7, -6, -5, -5, -6
$ dst <str> 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'
$ tzone <str> 'America/New_York', 'America/Chicago', 'America/New_York', 'America/Chicago', 'America/Anchorage', 'America/Denver', 'America/Chicago', 'America/New_York', 'America/New_York', 'America/Chicago'
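The latitude/longitude columns support the distance-based metrics mentioned above. A minimal sketch of a great-circle distance between two airports from the glimpse output (ABE and ABQ), using the standard haversine formula:

```python
import numpy as np

# Coordinates taken from the glimpse output above
abe = (40.652099609375, -75.44080352783203)   # Lehigh Valley International
abq = (35.040199, -106.609001)                # Albuquerque International Sunport

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in statute miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

d = haversine_miles(*abe, *abq)  # roughly the ABE-ABQ flight distance
```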
The airline dataset is a small reference table linking the 14 unique two-character carrier codes found in the flight logs (e.g., “AA”, “9E”) to their full corporate names (e.g., “American Airlines Inc.”, “Endeavor Air Inc.”).
airline = pl.read_csv("data/flightsrva_airlines.csv.gz", null_values=["NA"])
airline.glimpse()Rows: 14
Columns: 2
$ carrier <str> '9E', 'AA', 'B6', 'DL', 'EV', 'G4', 'MQ', 'NK', 'OH', 'OO'
$ name <str> 'Endeavor Air Inc.', 'American Airlines Inc.', 'JetBlue Airways', 'Delta Air Lines Inc.', 'ExpressJet Airlines LLC d/b/a aha!', 'Allegiant Air', 'Envoy Air', 'Spirit Air Lines', 'PSA Airlines Inc.', 'SkyWest Airlines Inc.'
The plane dataset provides a technical registry for the aircraft used in these flights. Indexed by unique tail numbers, it details the hardware specifications for 3,120 individual planes, including their manufacturer, model year, engine type, and seating capacity, allowing for analysis of fleet modernization and equipment usage.
plane = pl.read_csv("data/flightsrva_planes.csv.gz", null_values=["NA"])
plane.glimpse()Rows: 3120
Columns: 9
$ tailnum <str> 'N101HQ', 'N102HQ', 'N102UW', 'N103HQ', 'N103SY', 'N103US', 'N104HQ', 'N104UW', 'N10575', 'N105HQ'
$ year <i64> 2007, 2007, 1998, 2007, 2014, 1999, 2007, 1999, 2002, 2007
$ type <str> 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine', 'Fixed wing multi engine'
$ manufacturer <str> 'EMBRAER-EMPRESA BRASILEIRA DE', 'EMBRAER-EMPRESA BRASILEIRA DE', 'AIRBUS INDUSTRIE', 'EMBRAER-EMPRESA BRASILEIRA DE', 'EMBRAER S A', 'AIRBUS INDUSTRIE', 'EMBRAER-EMPRESA BRASILEIRA DE', 'AIRBUS INDUSTRIE', 'EMBRAER', 'EMBRAER'
$ model <str> 'ERJ 170-200 LR', 'ERJ 170-200 LR', 'A320-214', 'ERJ 170-200 LR', 'ERJ 170-200 LR', 'A320-214', 'ERJ 170-200 LR', 'A320-214', 'EMB-145LR', 'ERJ 170-200 LR'
$ engines <i64> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
$ seats <i64> 80, 80, 182, 80, 88, 182, 80, 182, 55, 88
$ speed <i64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ engine <str> 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan', 'Turbo-fan'
22.15 U.S. Storms

Sourced from NOAA’s National Hurricane Center, these three datasets provide a historical record of North Atlantic tropical cyclones. The storm dataset records the comprehensive trajectory and intensity history of tropical cyclones since 1950. It contains over 25,000 timestamped observations, logging the status of a storm every six hours. The data captures the storm’s precise geographic position (lat, lon), its maximum sustained wind speed (wind) in knots, and its Saffir-Simpson category (ranging from 0 for tropical storms/depressions to 5 for catastrophic hurricanes).
storm = pl.read_csv("data/storms.csv")
storm.glimpse()Rows: 25112
Columns: 12
$ year <i64> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950
$ month <i64> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
$ day <i64> 12, 12, 12, 12, 13, 13, 13, 13, 14, 14
$ hour <i64> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6
$ name <str> 'Able', 'Able', 'Able', 'Able', 'Able', 'Able', 'Able', 'Able', 'Able', 'Able'
$ letter <str> 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'
$ doy <i64> 224, 224, 224, 224, 225, 225, 225, 225, 226, 226
$ lon <f64> -55.5, -56.3, -57.4, -58.6, -60.0, -61.1, -62.2, -63.2, -63.8, -64.6
$ lat <f64> 17.1, 17.7, 18.2, 19.0, 20.0, 20.7, 21.3, 22.0, 22.7, 23.1
$ status <str> 'TS', 'TS', 'TS', 'TS', 'TS', 'TS', 'TS', 'TS', 'TS', 'TS'
$ category <i64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ wind <i64> 35, 40, 45, 50, 50, 50, 55, 55, 60, 60
The gender dataset classifies 262 distinct storm names by gender. It provides the name, the assigned gender (male or female), and a prob score reflecting the confidence of that assignment based on U.S. naming conventions. This table supports behavioral research, such as investigations into whether the gender of a storm’s name psychologically impacts public risk perception and preparedness levels.
gender = pl.read_csv("data/storm_gender.csv")
gender.glimpse()Rows: 262
Columns: 3
$ name <str> 'Abby', 'Able', 'Agnes', 'Alberto', 'Alex', 'Alice', 'Alicia', 'Allen', 'Allison', 'Alma'
$ gender <str> 'female', 'male', 'female', 'male', 'male', 'female', 'female', 'male', 'female', 'female'
$ prob <f64> 1.0, 1.0, 1.0, 1.0, 0.9634, 1.0, 1.0, 0.9936, 0.9987, 0.9867
The storm_type dataset acts as a lookup table for the meteorological classifications found in the main tracking data. It maps the two-letter status codes (like ‘HU’ or ‘TD’) to their full descriptions (e.g., ‘hurricane’ or ‘tropical depression’), clarifying the specific developmental stage or physical nature of the cyclone at each observation point.
storm_type = pl.read_csv("data/storm_codes.csv")
storm_type.glimpse()Rows: 9
Columns: 2
$ status <str> 'TD', 'TS', 'HU', 'EX', 'SD', 'SS', 'LO', 'WV', 'DB'
$ status_name <str> 'tropical depression', 'tropical storm', 'hurricane', 'extratropical cyclone', 'subtropical depression', 'subtropical storm', 'low', 'tropical wave', 'disturbance'
22.16 English Wikipedia 
Derived from the English-language Wikipedia, these three datasets enable network analysis and natural language processing on the encyclopedia’s most prominent articles. The article dataset contains metadata for approximately 248,000 pages, capturing each article’s title, unique id, and category classification. Structural metrics describe the article’s composition: word_count, paragraph_count, section_count, image_count, and link_count (outgoing hyperlinks). The incoming_links column records how many other articles reference each page—a measure of notability. Two-dimensional coordinates (tsne_x, tsne_y) provide a semantic embedding for visualization, placing topically similar articles near each other. For biographical entries, additional columns record wikidata_id, gender, birth_year, death_year, and occupations.
article = pl.read_parquet("data/enwiki_articles.parquet")
article.glimpse()Rows: 247551
Columns: 26
$ title <str> 'Anarchism', 'Albedo', 'A', 'Alabama', 'Achilles', 'Abraham Lincoln', 'Aristotle', 'An American in Paris', 'Academy Award for Best Production Design', 'Academy Awards'
$ id <i64> 12, 39, 290, 303, 305, 307, 308, 309, 316, 324
$ category <str> '', '', 'grapheme', 'U.S. state', 'deity', 'officeholder', 'philosopher', 'musical composition', 'award', 'award'
$ first_sentence <str> 'Anarchism is a political philosophy and movement that seeks to abolish all institutions that perpetuate authority, coercion, or hierarchy, primarily targeting the state and capitalism.', 'thumb|Albedo change in Greenland: the map shows the difference between the amount of sunlight Greenland reflected in the summer of 2011 versus the average percent it reflected between 2000 and 2006.', 'A, or a, is the first letter and the first vowel letter of the Latin alphabet, used in the modern English alphabet, and others worldwide.', 'Alabama ( ) is a state in the Southeastern and Deep Southern regions of the United States.', 'In Greek mythology, Achilles ( ) or Achilleus () was a hero of the Trojan War who was known as being the greatest of all the Greek warriors.', 'Abraham Lincoln (February 12, 1809April 15, 1865) was the 16th president of the United States, serving from 1861 until his assassination in 1865.', 'Aristotle (; 384–322 BC) was an ancient Greek philosopher and polymath.', 'An American in Paris is a jazz-influenced symphonic poem (or tone poem) for orchestra by American composer George Gershwin first performed in 1928.', 'The Academy Award for Best Production Design recognizes achievement for art direction in film.', 'The Academy Awards, commonly known as the Oscars, are awards for artistic and technical merit in film.'
$ word_count <i64> 6801, 4095, 2051, 13341, 8093, 11878, 10346, 2103, 7159, 12066
$ paragraph_count <i64> 81, 69, 36, 222, 101, 138, 185, 35, 54, 150
$ section_count <i64> 27, 25, 20, 57, 32, 44, 57, 10, 19, 49
$ image_count <i64> 16, 7, 23, 43, 23, 16, 23, 0, 0, 5
$ link_count <i64> 635, 166, 262, 890, 511, 517, 559, 105, 2788, 1094
$ incoming_links <u32> 1786, 220, 627, 18961, 1464, 10681, 6623, 164, 1875, 24786
$ tsne_x <f32> -20.596328735351562, -16.646677017211914, -10.022477149963379, -58.578121185302734, -49.26348114013672, -13.364924430847168, -50.43266677856445, 53.88010025024414, 57.93836975097656, 57.392173767089844
$ tsne_y <f32> -4.939699172973633, 30.11508560180664, 10.432233810424805, -106.40777587890625, -3.641737699508667, -76.6756820678711, -8.294107437133789, -41.70497512817383, -1.4083044528961182, -0.7244542837142944
$ id_right <i64> null, null, null, null, null, 307, 308, null, null, null
$ category_right <str> null, null, null, null, null, 'officeholder', 'philosopher', null, null, null
$ first_sentence_right <str> null, null, null, null, null, 'Abraham Lincoln (February 12, 1809April 15, 1865) was the 16th president of the United States, serving from 1861 until his assassination in 1865.', 'Aristotle (; 384–322 BC) was an ancient Greek philosopher and polymath.', null, null, null
$ word_count_right <i64> null, null, null, null, null, 11878, 10346, null, null, null
$ paragraph_count_right <i64> null, null, null, null, null, 138, 185, null, null, null
$ section_count_right <i64> null, null, null, null, null, 44, 57, null, null, null
$ image_count_right <i64> null, null, null, null, null, 16, 23, null, null, null
$ link_count_right <i64> null, null, null, null, null, 517, 559, null, null, null
$ incoming_links_right <u32> null, null, null, null, null, 10681, 6623, null, null, null
$ wikidata_id <str> null, null, null, null, null, 'Q91', 'Q868', null, null, null
$ gender <str> null, null, null, null, null, 'male', 'male', null, null, null
$ birth_year <i64> null, null, null, null, null, 1809, -383, null, null, null
$ death_year <i64> null, null, null, null, null, 1865, -321, null, null, null
$ occupations <str> null, null, null, null, null, 'writer|lawyer|politician|farmer|statesperson|military officer|postmaster', 'philosopher', null, null, null
The link dataset captures the hyperlink structure of Wikipedia as a directed graph. With 10 million rows, each record represents a single hyperlink from one article (citing_page) to another (cited_page). This edge list supports network analysis tasks such as computing PageRank, identifying hub articles, or tracing how information flows through interconnected topics.
link = pl.read_parquet("data/enwiki_links_p1.parquet")
link.glimpse()Rows: 10000000
Columns: 2
$ citing_page <str> '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke', '"Fast" Eddie Clarke'
$ cited_page <str> 'AC/DC', 'Abbey Road Studios', 'Ace of Spades (song)', 'Anvil (band)', 'Be-Bop Deluxe', 'Brian Robertson (guitarist)', 'Columbia Records', 'Don Airey', 'Download Festival', 'Girlschool'
The distance dataset provides precomputed semantic neighbors for each article. For a given base_page, it lists the most similar articles (neighbor_page) ranked by a similarity score derived from the t-SNE embedding. This table supports recommendation systems, clustering validation, and exploratory analysis of topical relationships without requiring real-time distance calculations.
distance = pl.read_parquet("data/enwiki_distance.parquet")
distance.glimpse()Rows: 7426530
Columns: 4
$ base_page <str> 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico', 'Vieques, Puerto Rico'
$ neighbor_page <str> 'Isla de la Juventud', 'Culebra, Puerto Rico', 'Los Roques Archipelago', 'Vega Alta, Puerto Rico', 'Isla Mujeres', 'Puerto Rico', 'Juana Díaz, Puerto Rico', 'Geography of Puerto Rico', 'Vega Baja, Puerto Rico', 'Federal Dependencies of Venezuela'
$ rank <u8> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ similarity <f64> 0.8046228289604187, 0.8032209277153015, 0.7991653680801392, 0.7882933020591736, 0.7814295887947083, 0.7801734209060669, 0.7792717814445496, 0.776001513004303, 0.7755575180053711, 0.7754720449447632
22.17 SCOTUS 
Sourced from the Supreme Court Database, these three datasets document over two centuries of American judicial history. The scotus dataset catalogs more than 24,000 cases dating back to 1798, identified by their official citation (e.g., ‘3 U.S. 378’) and term year. Each record includes the case_name, the presiding chief justice, and the vote breakdown (maj_votes, min_votes). A lean classification marks whether the decision’s direction was liberal or conservative, while issue codes and their descriptions categorize the legal subject matter—from federal jurisdiction questions to due process claims.
scotus = pl.read_csv("data/scotus_case.csv")
scotus.glimpse()Rows: 24071
Columns: 9
$ citation <str> '3 U.S. 378', '3 U.S. 382', '3 U.S. 386', '3 U.S. 401', '3 U.S. 409', '3 U.S. 411', '3 U.S. 425', '4 U.S. 1', '4 U.S. 6', '4 U.S. 7'
$ term <i64> 1798, 1798, 1798, 1798, 1798, 1799, 1799, 1799, 1799, 1799
$ chief <str> 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth', 'Ellsworth'
$ case_name <str> 'HOLLINGSWORTH, et al. VERSUS VIRGINIA', 'BINGHAM, PLAINTIFF IN ERROR, VERSUS CABOT, et al.', 'CALDER ET WIFE, VERSUS BULL ET WIFE', 'WILSON VERSUS DANIEL', 'DEWHURST VERSUS COULTHARD', 'FOWLER et al. v. LINDSEY, et al.', 'SIMS LESSEE VERSUS IRVINE', 'THE STATE OF NEW YORK v. THE STATE OF CONNECTICUT et al.', 'HAZLEHURST et al. VERSUS THE UNITED STATES', 'TURNER, ADMINISTRATOR, VERSUS ENRILLE'
$ lean <str> 'conservative', 'conservative', 'conservative', 'liberal', 'conservative', 'conservative', 'liberal', 'liberal', 'liberal', 'conservative'
$ maj_votes <i64> 6, 6, 4, 5, 6, 3, 6, 6, 6, 6
$ min_votes <i64> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0
$ issue <i64> 90320, 90320, 10180, 90520, 90520, 90510, 20250, 40020, 90160, 90320
$ issue_description <str> 'judicial administration: jurisdiction or authority of federal district courts or territorial courts\nNote: jurisdiction of the federal courts or of the Supreme Court (cf. no merits: dismissed for want of jurisdiction)', 'judicial administration: jurisdiction or authority of federal district courts or territorial courts\nNote: jurisdiction of the federal courts or of the Supreme Court (cf. no merits: dismissed for want of jurisdiction)', 'ex post facto (state)', 'miscellaneous judicial power, especially diversity jurisdiction', 'miscellaneous judicial power, especially diversity jurisdiction', "Supreme Court's certiorari, writ of error, or appeals jurisdiction\nNote: jurisdiction of the federal courts or of the Supreme Court", 'military: veteran\nNote: cf. conscientious objectors and comity: military', "due process: hearing or notice (other than as pertains to government employees or prisoners' rights)\nNote: hearing may be statutorily based", 'no merits: dismissed or affirmed for want of a substantial or properly presented federal question, or a nonsuit \nNote: use only if the syllabus or the summary holding specifies one of the following bases.', 'judicial administration: jurisdiction or authority of federal district courts or territorial courts\nNote: jurisdiction of the federal courts or of the Supreme Court (cf. no merits: dismissed for want of jurisdiction)'
The citation dataset captures the legal precedent network as a directed graph. With over 230,000 rows, each record represents one case (citing_case) referencing another (cited_case). This edge list enables analysis of how legal doctrine evolves—tracing influential precedents, identifying landmark cases by their citation counts, or studying how different areas of law interconnect over time.
citation = pl.read_csv("data/scotus_citation.csv")
citation.glimpse()Rows: 231775
Columns: 2
$ citing_case <str> '10 U.S. 206', '10 U.S. 267', '10 U.S. 281', '10 U.S. 281', '100 U.S. 1', '100 U.S. 100', '100 U.S. 104', '100 U.S. 104', '100 U.S. 104', '100 U.S. 110'
$ cited_case <str> '9 U.S. 100', '8 U.S. 137', '8 U.S. 241', '8 U.S. 293', '88 U.S. 17', '73 U.S. 134', '23 U.S. 146', '76 U.S. 805', '98 U.S. 176', '28 U.S. 307'
The vote dataset records individual justice behavior across nearly 205,000 case-justice pairs. Each row links a citation to a justice (identified by name_short and name_full) along with their vote_type: whether they joined the majority, dissented, or concurred. This granular voting record supports research into judicial ideology, coalition formation, and how individual justices have shaped constitutional interpretation throughout the Court’s history.
vote = pl.read_csv("data/scotus_vote.csv")
vote.glimpse()Rows: 204772
Columns: 4
$ citation <str> '3 U.S. 378', '3 U.S. 378', '3 U.S. 378', '3 U.S. 378', '3 U.S. 378', '3 U.S. 378', '3 U.S. 382', '3 U.S. 382', '3 U.S. 382', '3 U.S. 382'
$ name_short <str> 'OEllsworth', 'WCushing', 'JWilson', 'JIredell', 'WPaterson', 'SChase', 'OEllsworth', 'WCushing', 'JWilson', 'JIredell'
$ name_full <str> 'Ellsworth, Oliver', 'Cushing, William', 'Wilson, James', 'Iredell, James', 'Paterson, William', 'Chase, Samuel', 'Ellsworth, Oliver', 'Cushing, William', 'Wilson, James', 'Iredell, James'
$ vote_type <str> 'majority', 'majority', 'majority', 'majority', 'majority', 'majority', 'majority', 'majority', 'majority', 'majority'
22.18 Shakespeare Plays 
A catalog of Shakespeare’s 37 plays, capturing each work’s identity and publication metadata. This serves as the metadata reference table, with each play assigned a short identifier (like “ham” for Hamlet) that links to all other tables. The data comes from the Folger Shakespeare Library’s scholarly editions.
play = pl.read_csv("data/shakespeare_plays.csv")
play.glimpse()Rows: 37
Columns: 4
$ play_id <str> 'mnd', 'aww', 'ant', 'ayl', 'cor', 'cym', 'ham', '1h4', '2h4', 'h5'
$ title <str> 'A Midsummer Night’s Dream', 'All’s Well That Ends Well', 'Antony and Cleopatra', 'As You Like It', 'Coriolanus', 'Cymbeline', 'Hamlet', 'Henry IV, Part I', 'Henry IV, Part II', 'Henry V'
$ author <str> 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare'
$ short_title <str> 'MND', 'AWW', 'Ant', 'AYL', 'Cor', 'Cym', 'Ham', '1H4', '2H4', 'H5'
The dramatis personae across all plays—1,126 characters from Hamlet to Puck to Lady Macbeth. Each character is tied to their play and includes role descriptions where available (e.g., “duke of Athens” for Theseus, “father to Hermia” for Egeus). This table enables analysis of character networks, naming patterns, and the social structures Shakespeare constructed.
char = pl.read_csv("data/shakespeare_characters.csv")
char.glimpse()Rows: 1126
Columns: 4
$ character_id <str> 'Hermia_MND', 'Lysander_MND', 'Helena_MND', 'Demetrius_MND', 'Theseus_MND', 'Hippolyta_MND', 'Egeus_MND', 'Philostrate_MND', 'Bottom_MND', 'Quince_MND'
$ play_id <str> 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd'
$ name <str> 'Hermia', 'Lysander', 'Helena', 'Demetrius', 'Theseus', 'Hippolyta', 'Egeus', 'Philostrate', 'Nick Bottom', 'Peter Quince'
$ role_description <str> null, null, null, null, 'duke of Athens', 'queen of the Amazons', 'father to Hermia', 'master of the revels to Theseus', 'weaver', 'carpenter'
The spoken text of the plays, segmented into 80,592 individual lines of verse or prose. Each line is attributed to a specific character and located within the play’s structure (act, scene, line number). This is the core table for literary analysis—studying who speaks, how much, in what style (verse vs. prose), and how dialogue flows through each scene.
line = pl.read_csv("data/shakespeare_lines.csv.gz")
line.glimpse()Rows: 80592
Columns: 8
$ line_id <str> 'mnd_line_000001', 'mnd_line_000002', 'mnd_line_000003', 'mnd_line_000004', 'mnd_line_000005', 'mnd_line_000006', 'mnd_line_000007', 'mnd_line_000008', 'mnd_line_000009', 'mnd_line_000010'
$ play_id <str> 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd'
$ character_id <str> 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Hippolyta_MND', 'Hippolyta_MND', 'Hippolyta_MND', 'Hippolyta_MND'
$ act <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ scene <i64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ line_number <str> '1.1.1', '1.1.2', '1.1.3', '1.1.4', '1.1.5', '1.1.6', '1.1.7', '1.1.8', '1.1.9', '1.1.10'
$ line_type <str> 'verse', 'verse', 'verse', 'verse', 'verse', 'verse', 'verse', 'verse', 'verse', 'verse'
$ text <str> 'Now, fair Hippolyta, our nuptial hour', 'Draws on apace. Four happy days bring in', 'Another moon. But, O, methinks how slow', 'This old moon wanes! She lingers my desires', 'Like to a stepdame or a dowager', 'Long withering out a young man’s revenue.', 'Four days will quickly steep themselves in night;', 'Four nights will quickly dream away the time;', 'And then the moon, like to a silver bow', 'New-bent in heaven, shall behold the night'
A granular, linguistically-annotated word table with nearly 600,000 tokens. Each word is linked back to its line, character, and play, and includes its lemma (base form) and part-of-speech tag. This enables computational stylistics: vocabulary richness, word frequency analysis, grammatical patterns, and comparative studies of how different characters or plays use language.
word = pl.read_csv("data/shakespeare_words.csv.gz")
word.glimpse()Rows: 599864
Columns: 8
$ word_id <str> 'mnd_word_00000001', 'mnd_word_00000002', 'mnd_word_00000003', 'mnd_word_00000004', 'mnd_word_00000005', 'mnd_word_00000006', 'mnd_word_00000007', 'mnd_word_00000008', 'mnd_word_00000009', 'mnd_word_00000010'
$ play_id <str> 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd', 'mnd'
$ character_id <str> 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND', 'Theseus_MND'
$ line_id <str> 'mnd_line_000001', 'mnd_line_000001', 'mnd_line_000001', 'mnd_line_000001', 'mnd_line_000001', 'mnd_line_000001', 'mnd_line_000002', 'mnd_line_000002', 'mnd_line_000002', 'mnd_line_000002'
$ word <str> 'Now', 'fair', 'Hippolyta', 'our', 'nuptial', 'hour', 'Draws', 'on', 'apace', 'Four'
$ lemma <str> 'now', 'fair', 'Hippolyta', 'our', 'nuptial', 'hour', 'draw', 'on', 'apace', 'four'
$ pos <str> 'av', 'j', 'n1-nn', 'po', 'j', 'n1', 'vvz', 'acp-p', 'av', 'crd'
$ location <str> '1.1.1', '1.1.1', '1.1.1', '1.1.1', '1.1.1', '1.1.1', '1.1.2', '1.1.2', '1.1.2', '1.1.2'
There are also parallel versions covering other early-modern British playwrights: emed_plays.csv, emed_characters.csv, emed_lines.csv, and emed_words.csv.
22.20 IMDb Reviews 
The imdb5k dataset is a curated, 5,000-observation subset of the classic IMDb movie review corpus, designed for rapid prototyping and lightweight sentiment analysis experiments. Each row represents a single review, categorized by a binary sentiment label (“positive” or “negative”) and assigned to a specific data partition (index, e.g., “train” or “test”). Although these columns are hidden from the preview for brevity, the dataset is fully equipped for modern NLP tasks: it includes the raw text of the critique and e5, a pre-computed 1024-dimensional embedding vector.
imdb5k = pl.read_parquet("data/imdb5k_pca.parquet")
imdb5k.drop(c.e5, c.text).glimpse()Rows: 5000
Columns: 3
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ label <str> 'negative', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive'
$ index <str> 'test', 'test', 'test', 'train', 'test', 'train', 'train', 'test', 'train', 'train'
The imdb dataset represents the complete, 50,000-row standard benchmark for binary sentiment classification. Balanced between positive and negative polarities, it provides the robust volume of data necessary for training deep learning models. Like the smaller version, it contains metadata for experimental reproducibility (id and index). Crucially, it pairs the unstructured content of the reviews (text) with high-fidelity, 1024-dimensional vector representations (e5), enabling advanced machine learning applications—such as semantic search or clustering—straight out of the box.
imdb = pl.read_parquet("data/imdb_pca.parquet")
imdb.drop(c.e5, c.text).glimpse()Rows: 50000
Columns: 3
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ label <str> 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative', 'negative'
$ index <str> 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train'
22.21 AG News 
The agnews dataset is a widely used benchmark for multiclass topic classification, comprising 127,600 news articles harvested from more than 2,000 sources. Unlike the binary sentiment datasets, this corpus challenges models to distinguish between four distinct thematic categories: World, Sports, Business, and Sci/Tech. Each entry is assigned a specific label and partitioned into training or testing sets via the index column. While the preview displays only the metadata, the dataset is fully enriched with the raw article text and pre-computed, 1024-dimensional e5 embeddings, streamlining the workflow for developing and evaluating advanced NLP models.
agnews = pl.read_parquet("data/agnews_pca.parquet")
agnews.drop(c.e5, c.text).glimpse()Rows: 127600
Columns: 3
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ label <str> 'Business', 'Business', 'Business', 'Business', 'Business', 'Business', 'Business', 'Business', 'Business', 'Business'
$ index <str> 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train'
22.22 Amazon Reviews 
The amazon dataset is a specialized subset of the massive Amazon Product Reviews corpus, containing 10,000 observations selected for the task of product category classification. Unlike standard sentiment analysis datasets, the challenge here is to predict the correct product department (e.g., “All Beauty”) based solely on the content of the review. The dataset includes standard metadata columns—a unique id, the target label, and an index assigning rows to training or testing partitions. Like the previous NLP collections, it comes fully prepared with both the raw text and pre-computed, 1024-dimensional e5 embeddings, enabling immediate experimentation with dense vector-based classification models.
amazon = pl.read_parquet("data/amazon_pca.parquet")
amazon.drop(c.e5, c.text).glimpse()Rows: 10000
Columns: 3
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ label <str> 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty', 'All Beauty'
$ index <str> 'train', 'test', 'train', 'train', 'test', 'train', 'train', 'train', 'train', 'train'
22.23 BBC Headlines 
The bbc dataset is a concise benchmark for multiclass document classification, derived from a collection of 2,225 articles published by BBC News. It organizes content into five distinct thematic categories: business, entertainment, politics, sport, and tech. Each row represents a single article, tagged with its ground-truth label and assigned to a specific partition (index) for training or evaluation. Designed for modern NLP workflows, the dataset comes enriched with both the raw text and pre-computed, 1024-dimensional e5 embeddings, allowing researchers to immediately apply vector-based machine learning techniques like clustering or topic modeling.
bbc = pl.read_parquet("data/bbc_pca.parquet")
bbc.drop(c.e5, c.text).glimpse()Rows: 2225
Columns: 3
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ label <str> 'politics', 'entertainment', 'sport', 'entertainment', 'business', 'tech', 'sport', 'sport', 'tech', 'entertainment'
$ index <str> 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train'
22.24 Sentiment Treebank 
The sst dataset contains the Stanford Sentiment Treebank (SST-5), a premier benchmark for fine-grained sentiment analysis. Unlike binary classification tasks, this corpus of 11,855 movie review excerpts challenges models to discern subtle emotional gradations across a five-point scale: “very negative”, “negative”, “neutral”, “positive”, and “very positive”. The dataset includes standard metadata columns for row identification (id) and data partitioning (index). Consistent with the other NLP collections in this series, it is provided with both the raw text and pre-computed, 1024-dimensional e5 embeddings, enabling nuanced research into how vector models capture intensity and neutrality in language.
sst = pl.read_parquet("data/sst5_pca.parquet")
sst.drop(c.e5, c.text).glimpse()Rows: 11855
Columns: 3
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ label <str> 'very positive', 'negative', 'negative', 'neutral', 'positive', 'neutral', 'positive', 'positive', 'negative', 'very positive'
$ index <str> 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train'
22.25 GoEmotions 
The goemo dataset represents GoEmotions, a large-scale, fine-grained corpus designed to capture the complexity of human emotional expression. Sourced from Reddit comments, this collection of over 54,000 observations moves far beyond simple positive/negative sentiment, categorizing text into a rich taxonomy of 27 distinct emotions—such as “admiration,” “remorse,” “curiosity,” and “confusion”—plus a “neutral” state. The data is structured for multi-label classification, with separate columns indicating the presence or absence of each specific emotion. As with the other NLP datasets in this series, it comes enriched with both the raw text and pre-computed, 1024-dimensional e5 embeddings, facilitating advanced research into the vector-space relationships between subtle emotional concepts.
goemo = pl.read_parquet("data/goemotions_pca.parquet")
goemo.drop(c.e5, c.text).glimpse()Rows: 54263
Columns: 30
$ id <str> 'doc0001', 'doc0002', 'doc0003', 'doc0004', 'doc0005', 'doc0006', 'doc0007', 'doc0008', 'doc0009', 'doc0010'
$ index <str> 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train'
$ admiration <str> 'no admiration', 'no admiration', 'no admiration', 'no admiration', 'no admiration', 'no admiration', 'no admiration', 'no admiration', 'admiration', 'no admiration'
$ amusement <str> 'no amusement', 'no amusement', 'no amusement', 'no amusement', 'no amusement', 'no amusement', 'no amusement', 'no amusement', 'no amusement', 'no amusement'
$ anger <str> 'no anger', 'no anger', 'anger', 'no anger', 'no anger', 'no anger', 'no anger', 'no anger', 'no anger', 'no anger'
$ annoyance <str> 'no annoyance', 'no annoyance', 'no annoyance', 'no annoyance', 'annoyance', 'no annoyance', 'no annoyance', 'no annoyance', 'no annoyance', 'no annoyance'
$ approval <str> 'no approval', 'no approval', 'no approval', 'no approval', 'no approval', 'no approval', 'no approval', 'no approval', 'no approval', 'no approval'
$ caring <str> 'no caring', 'no caring', 'no caring', 'no caring', 'no caring', 'no caring', 'no caring', 'no caring', 'no caring', 'no caring'
$ confusion <str> 'no confusion', 'no confusion', 'no confusion', 'no confusion', 'no confusion', 'no confusion', 'no confusion', 'no confusion', 'no confusion', 'no confusion'
$ curiosity <str> 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity', 'no curiosity'
$ desire <str> 'no desire', 'no desire', 'no desire', 'no desire', 'no desire', 'no desire', 'no desire', 'desire', 'no desire', 'no desire'
$ disappointment <str> 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment', 'no disappointment'
$ disapproval <str> 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval', 'no disapproval'
$ disgust <str> 'no disgust', 'no disgust', 'no disgust', 'no disgust', 'no disgust', 'no disgust', 'no disgust', 'no disgust', 'no disgust', 'no disgust'
$ embarrassment <str> 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment', 'no embarrassment'
$ excitement <str> 'no excitement', 'no excitement', 'no excitement', 'no excitement', 'no excitement', 'no excitement', 'no excitement', 'no excitement', 'no excitement', 'no excitement'
$ fear <str> 'no fear', 'no fear', 'no fear', 'fear', 'no fear', 'no fear', 'no fear', 'no fear', 'no fear', 'no fear'
$ gratitude <str> 'no gratitude', 'no gratitude', 'no gratitude', 'no gratitude', 'no gratitude', 'no gratitude', 'gratitude', 'no gratitude', 'no gratitude', 'no gratitude'
$ grief <str> 'no grief', 'no grief', 'no grief', 'no grief', 'no grief', 'no grief', 'no grief', 'no grief', 'no grief', 'no grief'
$ joy <str> 'no joy', 'no joy', 'no joy', 'no joy', 'no joy', 'no joy', 'no joy', 'no joy', 'no joy', 'no joy'
$ love <str> 'no love', 'no love', 'no love', 'no love', 'no love', 'no love', 'no love', 'no love', 'no love', 'no love'
$ nervousness <str> 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness', 'no nervousness'
$ optimism <str> 'no optimism', 'no optimism', 'no optimism', 'no optimism', 'no optimism', 'no optimism', 'no optimism', 'optimism', 'no optimism', 'no optimism'
$ pride <str> 'no pride', 'no pride', 'no pride', 'no pride', 'no pride', 'no pride', 'no pride', 'no pride', 'no pride', 'no pride'
$ realization <str> 'no realization', 'no realization', 'no realization', 'no realization', 'no realization', 'no realization', 'no realization', 'no realization', 'no realization', 'no realization'
$ relief <str> 'no relief', 'no relief', 'no relief', 'no relief', 'no relief', 'no relief', 'no relief', 'no relief', 'no relief', 'no relief'
$ remorse <str> 'no remorse', 'no remorse', 'no remorse', 'no remorse', 'no remorse', 'no remorse', 'no remorse', 'no remorse', 'no remorse', 'no remorse'
$ sadness <str> 'no sadness', 'no sadness', 'no sadness', 'no sadness', 'no sadness', 'no sadness', 'no sadness', 'no sadness', 'no sadness', 'no sadness'
$ surprise <str> 'no surprise', 'no surprise', 'no surprise', 'no surprise', 'no surprise', 'surprise', 'no surprise', 'no surprise', 'no surprise', 'no surprise'
$ neutral <str> 'neutral', 'neutral', 'no neutral', 'no neutral', 'no neutral', 'no neutral', 'no neutral', 'no neutral', 'no neutral', 'neutral'
22.26 FSA-OWI Color Images 
Sourced from the Library of Congress, the fsac dataset contains a sample of 500 color photographs from the Farm Security Administration - Office of War Information (FSA-OWI) collection. This historic government project was originally established to document rural poverty during the Great Depression and later expanded to capture American mobilization for World War II. While the collection is famous for its iconic black-and-white imagery, this dataset highlights the less common, vibrant color work shot on early Kodachrome film. Each row represents a single photograph, accessible via the filepath, and includes rich provenance details such as the photographer (e.g., Russell Lee, Jack Delano) and a descriptive caption. The data is geocoded with city, state, and lat/lon coordinates.
fsac = pl.read_csv("data/fsac.csv")
fsac.glimpse()Rows: 500
Columns: 11
$ filepath <str> 'media/fsac/1a35266v.jpg', 'media/fsac/1a34940v.jpg', 'media/fsac/1a34143v.jpg', 'media/fsac/1a35375v.jpg', 'media/fsac/1a34758v.jpg', 'media/fsac/1a34718v.jpg', 'media/fsac/1a34400v.jpg', 'media/fsac/1a34371v.jpg', 'media/fsac/1a34230v.jpg', 'media/fsac/1a34505v.jpg'
$ call_number <str> 'LC-DIG-fsac-1a35266', 'LC-DIG-fsac-1a34940', 'LC-DIG-fsac-1a34143', 'LC-DIG-fsac-1a35375', 'LC-DIG-fsac-1a34758', 'LC-DIG-fsac-1a34718', 'LC-DIG-fsac-1a34400', 'LC-DIG-fsac-1a34371', 'LC-DIG-fsac-1a34230', 'LC-DIG-fsac-1a34505'
$ photographer <str> 'Alfred T. Palmer', 'Howard R. Hollem', 'Russell Lee', 'Alfred T. Palmer', 'Jack Delano', 'Jack Delano', 'Marion Post Wolcott', 'Marion Post Wolcott', 'Russell Lee', 'John Collier'
$ caption <str> 'Eight generator units in the generator room of a new addition to TVA's hydroelectric plant at Wilson Dam, Sheffield vicinity, Ala. Located 260 miles above the mouth of the Tennessee River, the dam has an authorized power installation of 288,000 kw., which can be increased to a possible ultimate of 444,000 kw. The reservoir at the dam adds 377,000 acre-feet of water to controlled storage on the Tennessee River system', 'Metal tubing at the Consolidated Aircraft Corp. plant, Fort Worth, Texas', 'The school at Pie Town, New Mexico is held in the Farm Bureau building, which was constructed by cooperative effort', 'Drilling horizontal stabilizers: operating a hand drill, this woman worker at Vultee-Nashville is shown working on the horizontal stabilizer for a Vultee "Vengeance" dive bomber, Tennessee. The "Vengeance" (A-31) was originally designed for the French. It was later adopted by the R.A.F. and still later by the U.S. Army Air Forces. It is a single-engine, low-wing plane, carrying a crew of two men and having six machine guns of varying calibers', 'Santa Fe R.R. trains going through Cajon Pass in the San Bernardino Mountains, Cajon, Calif. On the right, streamliner "Chief" going west; in the background, on the left, a freight train with a helper engine, going east. Santa Fe trip', 'General view of the city and the Atchison, Topeka, and Santa Fe Railroad, Amarillo, Texas. Santa Fe R.R. trip', 'Houses which have been condemned by the Board of Health but are still occupied by Negro migratory workers, Belle Glade, Fla.', 'Burley tobacco is placed on sticks to wilt after cutting, before it is taken into the barn for drying and curing on the Russell Spears' farm, vicinity of Lexington, Ky.', 'Shasta dam under construction, California', 'Trampas, New Mexico'
$ year <i64> 1942, 1942, 1940, 1943, 1943, 1943, 1941, 1940, 1942, 1943
$ month <i64> 6, 10, 10, 2, 3, 3, 1, 9, 6, 1
$ state <str> 'Alabama', 'Texas', 'New Mexico', 'Tennessee', 'California', 'Texas', 'Florida', 'Kentucky', 'California', 'New Mexico'
$ city <str> 'Sheffield', 'Fort Worth', 'Pie Town', 'Nashville', 'Cajon', 'Amarillo', 'Belle Glade', 'Lexington', 'Redding', 'Trampas'
$ county <str> 'Colbert County', 'Tarrant County', 'Catron County', 'Davidson County', 'San County', 'Potter County', 'Palm Beach', 'Fayette County', 'Shasta County', 'Taos County'
$ longitude <f64> -87.6986407, -97.3208496, -108.1347836, -86.7844432, -116.9625269, -101.8312969, -80.6675577, -84.4777153, -122.3916754, -105.7589053
$ latitude <f64> 34.7650887, 32.725409, 34.2983884, 36.1658899, 32.7947731, 35.2219971, 26.6845104, 37.9886892, 40.5865396, 36.1311359
22.27 MNIST 
The mnist dataset is a 1,000-sample subset of the classic MNIST database, widely considered the “Hello World” of computer vision. It consists of grayscale images of handwritten digits ranging from 0 to 9. The table links the ground-truth numeric label to the corresponding image file via the filepath column and assigns each observation to a specific data partition (index) for training or testing purposes.
mnist = pl.read_csv("data/mnist_1000.csv")
mnist.glimpse()Rows: 1000
Columns: 3
$ label <i64> 3, 9, 9, 8, 8, 5, 9, 7, 3, 2
$ filepath <str> 'media/mnist_1000/00000.png', 'media/mnist_1000/00001.png', 'media/mnist_1000/00002.png', 'media/mnist_1000/00003.png', 'media/mnist_1000/00004.png', 'media/mnist_1000/00005.png', 'media/mnist_1000/00006.png', 'media/mnist_1000/00007.png', 'media/mnist_1000/00008.png', 'media/mnist_1000/00009.png'
$ index <str> 'test', 'test', 'train', 'test', 'train', 'train', 'train', 'train', 'train', 'train'
The emnist dataset is a 10,000-sample subset of EMNIST (Extended MNIST), a more challenging dataset that expands the original digit recognition task to include handwritten letters. Unlike the standard MNIST, the label column here is alphanumeric, containing a mix of digits and both uppercase and lowercase characters (e.g., ‘g’, ‘P’, ‘4’), reflecting the greater complexity of the classification problem.
emnist = pl.read_csv("data/emnist_10000.csv")
emnist.glimpse()Rows: 10000
Columns: 3
$ label <str> 'g', 'b', 'P', '9', 't', '4', 'M', 'b', 'e', 'A'
$ filepath <str> 'media/emnist_10000/00000.png', 'media/emnist_10000/00001.png', 'media/emnist_10000/00002.png', 'media/emnist_10000/00003.png', 'media/emnist_10000/00004.png', 'media/emnist_10000/00005.png', 'media/emnist_10000/00006.png', 'media/emnist_10000/00007.png', 'media/emnist_10000/00008.png', 'media/emnist_10000/00009.png'
$ index <str> 'train', 'train', 'train', 'train', 'test', 'train', 'train', 'train', 'train', 'train'
The fmnist dataset represents a 10,000-sample subset of Fashion-MNIST, designed by Zalando Research as a modern, more difficult drop-in replacement for the original digit dataset. Instead of handwritten digits, the images depict distinct articles of clothing and accessories. The label column provides the text description of the class—such as “dress,” “pullover,” or “ankle boot”—making it intuitive to interpret model errors (e.g., confusing a “shirt” with a “t-shirt”).
fmnist = pl.read_csv("data/fashionmnist_10000.csv")
fmnist.glimpse()Rows: 10000
Columns: 3
$ label <str> 'dress', 'shirt', 'pullover', 'pullover', 'ankle boot', 'pullover', 'trouser', 'ankle boot', 'ankle boot', 'pullover'
$ filepath <str> 'media/fashionmnist_10000/00000.png', 'media/fashionmnist_10000/00001.png', 'media/fashionmnist_10000/00002.png', 'media/fashionmnist_10000/00003.png', 'media/fashionmnist_10000/00004.png', 'media/fashionmnist_10000/00005.png', 'media/fashionmnist_10000/00006.png', 'media/fashionmnist_10000/00007.png', 'media/fashionmnist_10000/00008.png', 'media/fashionmnist_10000/00009.png'
$ index <str> 'test', 'test', 'train', 'test', 'train', 'train', 'train', 'train', 'train', 'test'
22.28 ImageNet 
The inet dataset is a 1,000-sample subset of Imagenette, a corpus derived from the massive ImageNet database but restricted to ten distinct, easily distinguishable classes (e.g., “church,” “gas_pump,” “garbage_truck”). Designed as a lightweight benchmark for rapid prototyping, it links each ground-truth label to its source image via filepath. Crucially, this version comes pre-packaged with two state-of-the-art embedding vectors (vit and siglip), allowing researchers to immediately compare how different vision transformer architectures represent these visually distinct objects without needing to run heavy inference tasks.
inet = pl.read_parquet("data/imagenette_1000.parquet")
inet.drop(c.vit, c.siglip).glimpse()Rows: 1000
Columns: 3
$ label <str> 'church', 'tench', 'church', 'gas_pump', 'tench', 'garbage_truck', 'tench', 'gas_pump', 'gas_pump', 'garbage_truck'
$ filepath <str> 'media/imagenette_1000/00000.png', 'media/imagenette_1000/00001.png', 'media/imagenette_1000/00002.png', 'media/imagenette_1000/00003.png', 'media/imagenette_1000/00004.png', 'media/imagenette_1000/00005.png', 'media/imagenette_1000/00006.png', 'media/imagenette_1000/00007.png', 'media/imagenette_1000/00008.png', 'media/imagenette_1000/00009.png'
$ index <str> 'train', 'train', 'test', 'train', 'train', 'train', 'test', 'test', 'train', 'train'
The woof dataset is a 1,000-sample subset of Imagewoof, a significantly harder classification challenge also drawn from ImageNet. Unlike the broad categories in Imagenette, this dataset focuses exclusively on ten specific dog breeds (e.g., “shih_tzu,” “border_terrier,” “dingo”), forcing models to discern fine-grained features rather than gross structural differences. Like its sibling dataset, it includes standard metadata (label, filepath, index) and is enriched with pre-computed vit and siglip embeddings, making it an ideal testbed for evaluating the discriminative power of modern vision models on subtle, fine-grained tasks.
woof = pl.read_parquet("data/imagewoof_1000.parquet")
woof.drop(c.vit, c.siglip).glimpse()Rows: 1000
Columns: 3
$ label <str> 'shih_tzu', 'border_terrier', 'australian_terrier', 'golden_retriever', 'dingo', 'english_foxhound', 'border_terrier', 'old_english_sheepdog', 'rhodesian_ridgeback', 'rhodesian_ridgeback'
$ filepath <str> 'media/imagewoof_1000/00000.png', 'media/imagewoof_1000/00001.png', 'media/imagewoof_1000/00002.png', 'media/imagewoof_1000/00003.png', 'media/imagewoof_1000/00004.png', 'media/imagewoof_1000/00005.png', 'media/imagewoof_1000/00006.png', 'media/imagewoof_1000/00007.png', 'media/imagewoof_1000/00008.png', 'media/imagewoof_1000/00009.png'
$ index <str> 'test', 'train', 'train', 'test', 'test', 'test', 'train', 'train', 'train', 'train'
22.29 Oxford Flowers 
The flowers dataset is a 1,000-sample subset of the Oxford Flowers dataset, a standard benchmark for fine-grained image classification. Unlike general object recognition, this dataset focuses on distinguishing between closely related floral species, with label entries such as “foxglove,” “clematis,” and “bishop of llandaff.” Each row links the specific flower class to its image via filepath and assigns it to a training or testing partition (index). Consistent with the previous vision datasets, it is enriched with pre-computed vit and siglip embedding vectors, enabling researchers to immediately apply clustering or similarity search to explore how different models encode botanical features.
flowers = pl.read_parquet("data/flowers_1000.parquet")
flowers.drop(c.vit, c.siglip).glimpse()Rows: 1000
Columns: 3
$ label <str> 'foxglove', 'clematis', 'tree mallow', 'bishop of llandaff', 'daffodil', 'hippeastrum', 'bougainvillea', 'rose', 'bougainvillea', 'azalea'
$ filepath <str> 'media/flowers_1000/00000.png', 'media/flowers_1000/00001.png', 'media/flowers_1000/00002.png', 'media/flowers_1000/00003.png', 'media/flowers_1000/00004.png', 'media/flowers_1000/00005.png', 'media/flowers_1000/00006.png', 'media/flowers_1000/00007.png', 'media/flowers_1000/00008.png', 'media/flowers_1000/00009.png'
$ index <str> 'test', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train', 'train'
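The similarity search mentioned above reduces to cosine similarity on the embedding matrix: normalize the rows to unit length, take dot products with a query vector, and sort. The sketch below uses small random vectors in place of the real `vit`/`siglip` columns (the dimensions are made up; the actual embeddings are much wider).

```python
import numpy as np

# Random unit vectors standing in for an embedding column (6 images, 8 dims).
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def nearest(query_idx: int, emb: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k rows most cosine-similar to emb[query_idx], excluding itself."""
    sims = emb @ emb[query_idx]   # dot product of unit vectors = cosine similarity
    order = np.argsort(-sims)     # descending similarity
    return order[order != query_idx][:k]

neighbors = nearest(0, emb)
```

On the real frames the embedding column would first be stacked into a matrix (e.g. `np.stack(flowers["siglip"].to_list())`, assuming the column stores one vector per row as the description suggests).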
22.30 Caltech-UCSD Birds 
The birds dataset is a curated subset of the Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset, a renowned benchmark for fine-grained visual categorization. While the full collection covers 200 species, this version (birds10) restricts the task to 10 distinct classes (e.g., “canary”), making it more accessible for rapid model testing. Each entry links the species label to its image filepath and assigns it to a data partition (index). Like the other classification datasets in this series, it comes ready-to-use with pre-computed vit and siglip embeddings, allowing for immediate exploration of how transformer models distinguish between subtle avian features.
birds = pl.read_parquet("data/birds10.parquet")
birds.drop(c.vit, c.siglip).glimpse()Rows: 1555
Columns: 3
$ label <str> 'canary', 'canary', 'canary', 'canary', 'canary', 'canary', 'canary', 'canary', 'canary', 'canary'
$ filepath <str> 'media/birds10/00000.png', 'media/birds10/00001.png', 'media/birds10/00002.png', 'media/birds10/00003.png', 'media/birds10/00004.png', 'media/birds10/00005.png', 'media/birds10/00006.png', 'media/birds10/00007.png', 'media/birds10/00008.png', 'media/birds10/00009.png'
$ index <str> 'test', 'train', 'train', 'train', 'train', 'test', 'train', 'train', 'train', 'test'
The birds_bbox dataset complements the classification data by focusing on object detection and localization. Also derived from the CUB-200-2011 collection, it contains 1,000 observations where the primary task is not just naming the bird, but locating it within the frame. In addition to the label and filepath, this dataset provides precise coordinates for a bounding box: bbox_x0 and bbox_y0 define the top-left corner, while bbox_x1 and bbox_y1 mark the bottom-right. This granular spatial data is essential for training models to separate the subject from complex natural backgrounds.
birds_bbox = pl.read_csv("data/birds_1000.csv")
birds_bbox.glimpse()Rows: 1000
Columns: 7
$ label <str> 'Gray_Catbird', 'Sayornis', 'Tennessee_Warbler', 'White_throated_Sparrow', 'Ring_billed_Gull', 'Tree_Swallow', 'Florida_Jay', 'Yellow_breasted_Chat', 'Rusty_Blackbird', 'House_Sparrow'
$ filepath <str> 'media/birds_1000/00000.png', 'media/birds_1000/00001.png', 'media/birds_1000/00002.png', 'media/birds_1000/00003.png', 'media/birds_1000/00004.png', 'media/birds_1000/00005.png', 'media/birds_1000/00006.png', 'media/birds_1000/00007.png', 'media/birds_1000/00008.png', 'media/birds_1000/00009.png'
$ bbox_x0 <f64> 15.0, 131.0, 40.0, 99.0, 104.0, 165.0, 147.0, 99.0, 86.0, 137.0
$ bbox_y0 <f64> 44.0, 85.0, 5.0, 42.0, 32.0, 133.0, 89.0, 60.0, 148.0, 77.0
$ bbox_x1 <f64> 480.0, 488.0, 345.0, 448.0, 451.0, 423.0, 360.0, 377.0, 410.0, 365.0
$ bbox_y1 <f64> 331.0, 326.0, 239.0, 344.0, 284.0, 286.0, 235.0, 353.0, 300.0, 277.0
$ index <str> 'train', 'train', 'test', 'train', 'train', 'train', 'test', 'train', 'train', 'train'