2  EDA I: Organizing Data

2.1 Setup

We will begin this and every subsequent chapter by listing the core modules and datasets that must be loaded before getting started. Datasets and sources that require new functions will be loaded within the body of the text.

import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

country = pl.read_csv("data/countries.csv")

2.2 Introduction

We will use a variety of methods to modify our datasets. All data operations shown in this chapter come from the Polars library—the same one we used to read our data. These operations take a DataFrame as input and produce a modified DataFrame as output. This consistency makes it easy to chain operations and build complex data transformations step by step. The operations we learn in this chapter are fundamental building blocks for data analysis; while they may seem simple individually, they become powerful when combined.

2.3 Sorting Rows

Organizing data in a meaningful order is often crucial for understanding patterns and trends. We’ll use the .sort() method to reorder the dataset by arranging rows according to one or more columns. This is particularly useful for identifying extreme values, understanding distributions, or preparing data for specific types of analysis. Let’s look at an example of sorting the countries dataset by population, then break down how it works.

(
    country
    .sort(c.pop)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"ISL" "Iceland" "Europe" "Northern Europe" 0.398 84.92 65.0 -19.0 0.972 67444 26.1 75.62 107.2 98.78217 "isl"
"MLT" "Malta" "Europe" "Southern Europe" 0.545 83.48 35.883333 14.5 0.924 63314 31.0 62.95 107.8 99.96455 "eng|mlt"
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra"
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd"
"IDN" "Indonesia" "Asia" "South-eastern Asia" 285.721 72.25 -2.0 118.0 0.728 15391 37.9 56.95 85.8 77.51606 "ind"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn"
"IND" "India" "Asia" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin"

There are several important things to note in this code. First, the code is wrapped in a pair of parentheses, each on its own line, and all other lines are indented by four spaces. Second, we place the dataset on its own line and then call the sort method on the following line. Third, we use the c. prefix followed by the column name to refer to a column. Finally, the results are printed by default, but the original dataset is not modified. After running this code, the country DataFrame still has its original row order. All of these features are shared across the entire set of DataFrame methods.

The standard way to specify columns in the functions in this chapter is the c. format. You can also specify a column using c("<str>"), where <str> is the column name. This is useful when the column name contains spaces or special characters and cannot be represented using dot notation. Some functions, such as sort, also allow specifying the column name directly in quotes without referencing c. We will avoid this in our code, but you may encounter it in other sources.

You can reverse the sorting order of sort by setting its descending parameter to True.

(
    country
    .sort(c.pop, descending=True)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"IND" "India" "Asia" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin"
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"IDN" "Indonesia" "Asia" "South-eastern Asia" 285.721 72.25 -2.0 118.0 0.728 15391 37.9 56.95 85.8 77.51606 "ind"
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd"
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr"
"MLT" "Malta" "Europe" "Southern Europe" 0.545 83.48 35.883333 14.5 0.924 63314 31.0 62.95 107.8 99.96455 "eng|mlt"
"ISL" "Iceland" "Europe" "Northern Europe" 0.398 84.92 65.0 -19.0 0.972 67444 26.1 75.62 107.2 98.78217 "isl"

Now the countries with the highest population appear first. This descending order is often useful when we want to focus on the most extreme cases or identify outliers in our data. For example, the code below sorts by region (alphabetical order) and then, within each region, by population.

(
    country
    .sort(c.region, c.pop)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra"
"MUS" "Mauritius" "Africa" "Eastern Africa" 1.268 76.35 -20.2 57.5 0.806 28927 36.8 57.59 92.8 95.45876 "eng|fra"
"BWA" "Botswana" "Africa" "Southern Africa" 2.562 63.08 -22.2 23.7 0.731 18189 53.3 33.32 116.2 77.26922 "eng"
"GAB" "Gabon" "Africa" "Middle Africa" 2.593 68.68 -0.683331 11.5 0.733 19543 38.0 51.04 93.6 49.20331 "fra"
"GMB" "Gambia" "Africa" "Western Africa" 2.822 68.16 13.5 -15.5 0.524 3199 null 46.91 76.7 46.90722 "eng"
"GBR" "United Kingdom of Great Britai… "Europe" "Northern Europe" 69.551 81.93 54.6 -2.0 0.946 53993 32.8 66.58 121.8 99.1103 "eng"
"DEU" "Germany" "Europe" "Western Europe" 84.075 82.13 51.0 10.0 0.959 64355 31.7 67.92 109.4 99.22504 "deu"
"RUS" "Russian Federation" "Europe" "Eastern Europe" 143.997 73.89 66.416667 94.25 0.832 40871 36.0 58.65 165.1 88.24222 "rus"
"NZL" "New Zealand" "Oceania" "Australia and New Zealand" 5.252 82.65 -41.2 174.0 0.938 49598 null 69.76 108.3 100.0 "eng|mri|nzs"
"AUS" "Australia" "Oceania" "Australia and New Zealand" 26.974 83.79 -25.0 133.0 0.958 61943 32.4 70.25 101.6 99.99923 "asf"

Finally, we can store the result in a new variable using the following code:

country_saved = (
    country
    .sort(c.region, c.pop)
)

After running the code, country_saved contains the reordered dataset, while country still contains the original data. The same format can be used to save the results from the other methods described in the sections below.

2.4 Selecting Columns

In some cases—particularly when a dataset grows large—you may want to select or reorder its columns. Use the .select method with the same c. notation to specify the columns to include.

(
    country
    .select(c.iso, c.lat, c.lon)
)
shape: (135, 3)
iso lat lon
str f64 f64
"SEN" 14.366667 -14.283333
"VEN" 8.0 -67.0
"FIN" 65.0 27.0
"USA" 39.828175 -98.5795
"LKA" 7.0 81.0
"ALB" 41.0 20.0
"MYS" 3.7805111 102.314362
"SLV" 13.668889 -88.866111
"CYP" 35.0 33.0
"PAK" 30.0 71.0

Selecting columns is something we will do frequently in this text to clarify which parts of the dataset we want you to examine. It is less important for most data science pipelines, but it is particularly useful when performing more complex transformations, such as those in Chapter 4.

There is a similar function, .drop, which lets us select columns to remove from the data. Dropping columns is useful when a dataset has many columns and you want to remove only a few.

2.5 Selecting Rows

Often, we want to take a subset of rows from a dataset for a particular analysis or visualization. There are several functions for doing this, depending on the selection criteria. The .head() method, for example, returns a specified number of rows from the top of the dataset. The example below selects the first four rows.

(
    country
    .head(n=4)
)
shape: (4, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"VEN" "Venezuela, Bolivarian Republic… "Americas" "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl"
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"

The method .tail(n=&lt;int&gt;) works the same way, except that it selects rows starting from the bottom of the table. The methods .drop_nulls() and .drop_nans() can be called without arguments and return only the rows that contain no missing or invalid (NaN) values, respectively.

The .sample method randomly selects rows. You can specify the number of rows with n=<int> or a fraction with fraction=<float>. Below is an example that takes a random 20% sample of the dataset.

(
    country
    .sample(fraction=0.2)
)
shape: (27, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"AZE" "Azerbaijan" "Asia" "Western Asia" 10.398 71.73 40.3 47.7 0.789 23178 26.6 52.14 99.5 96.12412 "aze"
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn"
"AUT" "Austria" "Europe" "Western Europe" 9.114 83.17 48.0 14.0 0.93 64665 30.2 66.36 146.3 99.97291 "asq|deu"
"SAU" "Saudi Arabia" "Asia" "Western Asia" 34.566 75.66 23.716667 44.116667 0.9 64966 null 69.53 205.0 95.30723 "ara"
"LKA" "Sri Lanka" "Asia" "Southern Asia" 23.229 78.51 7.0 81.0 0.776 14380 39.3 36.02 83.1 90.77437 "sin|sin|tam|tam"
"CAN" "Canada" "Americas" "Northern America" 40.127 83.15 56.0 -109.0 0.939 58422 33.3 68.41 75.5 98.92568 "eng|fra"
"TGO" "Togo" "Africa" "Western Africa" 9.722 66.47 8.25 1.183333 0.571 3008 42.4 43.65 38.7 17.19931 "fra"
"BFA" "Burkina Faso" "Africa" "Western Africa" 24.075 62.97 12.266667 -2.066667 0.459 2661 47.3 44.62 35.3 21.77792 "bib|dyu|mos"
"NOR" "Norway" "Europe" "Northern Europe" 5.623 83.78 65.0 11.0 0.97 94896 27.7 72.49 114.5 98.05439 "nob|nor|nno"
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur"

We can select rows that meet a condition defined by the dataset’s variables using the .filter() method. Similar to the sort() method, we use the syntax c. followed by variable names to define relationships between variables. Below is code that selects only the rows where the Human Development Index (HDI) is greater than 0.9.

(
    country
    .filter(c.hdi > 0.9)
)
shape: (32, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"SGP" "Singapore" "Asia" "South-eastern Asia" 5.871 85.63 1.3 103.8 0.946 137906 null 66.54 145.5 100.0 "eng|msa|cmn|tam"
"AUT" "Austria" "Europe" "Western Europe" 9.114 83.17 48.0 14.0 0.93 64665 30.2 66.36 146.3 99.97291 "asq|deu"
"DEU" "Germany" "Europe" "Western Europe" 84.075 82.13 51.0 10.0 0.959 64355 31.7 67.92 109.4 99.22504 "deu"
"NLD" "Netherlands, Kingdom of the" "Europe" "Western Europe" 18.347 82.55 52.366667 4.883333 0.955 73384 null 72.55 114.4 97.71353 "nld"
"IRL" "Ireland" "Europe" "Northern Europe" 5.308 82.94 53.0 -8.0 0.949 119406 null 68.17 103.1 89.3966 "eng|gle"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"CAN" "Canada" "Americas" "Northern America" 40.127 83.15 56.0 -109.0 0.939 58422 33.3 68.41 75.5 98.92568 "eng|fra"
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur"

We can also filter categorical variables using equality comparisons. To prevent the page from becoming too cluttered, we will avoid printing this and some other results in future examples.

(
    country
    .filter(c.region == "Europe")
)

This selects all countries in Europe. Note that we use the double-equals operator (==) to test equality; a single equals sign (=) is used only for assignment. The resulting dataset contains only rows where region == "Europe".

Often, we want to apply multiple conditions simultaneously. We can combine conditions using the & (and) and | (or) operators. When using these operators, wrap each condition in parentheses:

(
    country
    .filter((c.region == "Europe") & (c.hdi > 0.9))
)

Sometimes we want to know whether a value belongs to a larger set. We can do this with the .is_in method, as shown in the following code. Note that the collection of elements passed to .is_in must be enclosed in square brackets. A detailed explanation of the object being created is provided in Chapter 6.

(
    country
    .filter(c.region.is_in(["Europe", "Africa"]))
)
shape: (75, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"GAB" "Gabon" "Africa" "Middle Africa" 2.593 68.68 -0.683331 11.5 0.733 19543 38.0 51.04 93.6 49.20331 "fra"
"BGR" "Bulgaria" "Europe" "Eastern Europe" 6.715 74.33 42.75 25.5 0.845 36211 40.3 55.9 137.1 86.00395 "bul"
"TZA" "Tanzania, United Republic of" "Africa" "Eastern Africa" 70.546 68.59 -6.306944 34.853889 0.555 3924 40.5 40.42 46.9 26.78297 "eng|swa"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"LTU" "Lithuania" "Europe" "Northern Europe" 2.83 77.19 55.2 24.0 0.895 49761 35.3 65.53 157.9 92.73568 "lit"
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr"
"GMB" "Gambia" "Africa" "Western Africa" 2.822 68.16 13.5 -15.5 0.524 3199 null 46.91 76.7 46.90722 "eng"
"ALB" "Albania" "Europe" "Southern Europe" 2.772 79.67 41.0 20.0 0.81 20362 30.8 54.45 91.9 98.5473 "sqi"

These operators, including != (not equal) and >= (greater than or equal to), allow us to select rows based on nearly any relationship in the dataset required for our analysis.

2.6 Grouping Data

There is a special method called .group_by(...) that allows you to group a dataset by one or more variables using a syntax similar to .sort(...). When used in a method chain, the next method is applied separately to each group of rows that share the same value(s) for the grouping variable(s). For example, the following code selects the first country from each region in the dataset.

(
    country
    .group_by(c.region)
    .head(n=1)
)
shape: (5, 15)
region iso full_name subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"Asia" "LKA" "Sri Lanka" "Southern Asia" 23.229 78.51 7.0 81.0 0.776 14380 39.3 36.02 83.1 90.77437 "sin|sin|tam|tam"
"Europe" "FIN" "Finland" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"Africa" "SEN" "Senegal" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"Americas" "VEN" "Venezuela, Bolivarian Republic… "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl"
"Oceania" "AUS" "Australia" "Australia and New Zealand" 26.974 83.79 -25.0 133.0 0.958 61943 32.4 70.25 101.6 99.99923 "asf"

This can be useful for various types of analysis. For example, we can determine the largest countries by population in each region using the following code:

(
    country
    .sort(c.pop, descending=True)
    .group_by(c.region)
    .head(n=1)
)
shape: (5, 15)
region iso full_name subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"Africa" "NGA" "Nigeria" "Western Africa" 237.528 66.07 9.0 8.0 0.56 5848 35.1 48.69 52.4 41.19972 "eng"
"Americas" "USA" "United States of America" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"Europe" "RUS" "Russian Federation" "Eastern Europe" 143.997 73.89 66.416667 94.25 0.832 40871 36.0 58.65 165.1 88.24222 "rus"
"Oceania" "AUS" "Australia" "Australia and New Zealand" 26.974 83.79 -25.0 133.0 0.958 61943 32.4 70.25 101.6 99.99923 "asf"
"Asia" "IND" "India" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin"

The following sections describe additional methods that can operate on grouped datasets.

2.7 Modifying Columns

Creating or modifying columns in a dataset is a fundamental operation in data analysis. Analysts often need to compute new variables from existing data, create indicator (dummy) variables, or transform variables for analysis.

The basic pattern for adding a new column is to use the .with_columns() method. In the method call, specify the values you want to create or modify and define them in terms of existing columns. For example, the following multiplies the population column (values in millions) by 1,000 to obtain the population in thousands.

(
    country
    .with_columns(population_1k = c.pop * 1000)
)
shape: (135, 16)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang population_1k
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str f64
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol" 18932.0
"VEN" "Venezuela, Bolivarian Republic… "Americas" "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl" 28517.0
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe" 5623.0
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng" 347276.0
"LKA" "Sri Lanka" "Asia" "Southern Asia" 23.229 78.51 7.0 81.0 0.776 14380 39.3 36.02 83.1 90.77437 "sin|sin|tam|tam" 23229.0
"ALB" "Albania" "Europe" "Southern Europe" 2.772 79.67 41.0 20.0 0.81 20362 30.8 54.45 91.9 98.5473 "sqi" 2772.0
"MYS" "Malaysia" "Asia" "South-eastern Asia" 35.978 76.03 3.7805111 102.314362 0.819 35990 46.2 58.68 118.2 95.69194 "msa" 35978.0
"SLV" "El Salvador" "Americas" "Central America" 6.366 76.98 13.668889 -88.866111 0.678 12221 38.3 64.82 126.9 86.19786 "spa" 6366.0
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur" 1371.0
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd" 255220.0

Common mathematical transformations of variables can be accessed by using a . after the variable name, followed by a transformation function such as sqrt() or log10(). There are also methods to fill in missing values: fill_null() to replace missing values with a specified default, fill_nan() to replace invalid numbers with a default, forward_fill() to fill missing values with the first valid value above, and backward_fill() to fill them with the first valid value below.

The following example demonstrates calling methods on columns: we take the square root and the sine of the population and save each as a new column. Neither transformation is meaningful or particularly useful; they are just to demonstrate the technique.

(
    country
    .with_columns(
      pop_sqrt = c.pop.sqrt(),
      pop_sin = c.pop.sin()
    )
)

Note the format of the code above. When a column definition is long (or when defining multiple columns at once), put each definition on its own line and indent it by four spaces.

Polars includes a method called .alias that can be used within .with_columns to create or overwrite a column in a dataset. It was originally the standard technique, but it has largely been replaced by the equals-sign method, which many people, including me, find more intuitive. It’s useful to understand only because you’ll often see it in the Polars documentation.

2.8 Aggregation

Although examining individual rows can be informative, we often need summary statistics to understand the overall patterns in our data. Polars provides powerful aggregation tools that let us compute statistics across rows within specific groups. The .agg() method (short for “aggregate”) is the primary tool for computing summary statistics. It allows us to specify exactly which statistics to compute and for which columns.

(
    country
    .group_by(c.region)
    .agg(
      hdi_avg = c.hdi.mean()
    )
)
shape: (5, 2)
region hdi_avg
str f64
"Africa" 0.583351
"Europe" 0.898895
"Asia" 0.760897
"Oceania" 0.948
"Americas" 0.783579

In this example, we compute the average HDI across countries within each region. The syntax uses c. followed by the column name, then the aggregation function with parentheses. Note that only the grouping variable(s) and the columns created by the .agg() function appear in the result.

We can compute multiple statistics at once by including additional items in our aggregation. It is often useful to include the special function pl.len(), which counts the number of rows in each group.

(
    country
    .group_by(c.region)
    .agg(
        hdi_avg = c.hdi.mean(),
        hdi_median = c.hdi.median(),
        count = pl.len()
    )
)
shape: (5, 4)
region hdi_avg hdi_median count
str f64 f64 u32
"Asia" 0.760897 0.766 39
"Europe" 0.898895 0.9115 38
"Americas" 0.783579 0.786 19
"Africa" 0.583351 0.571 37
"Oceania" 0.948 0.948 2

This operation computes the average and median HDI and the number of countries in each region. The result is a new DataFrame where each row represents a region and each column shows one of the computed statistics. This grouped analysis is essential for understanding how variables vary across categories in our data.

In some cases, we may want to perform one or more aggregation functions on an entire DataFrame. In Polars, it is not possible to apply the .agg method without first grouping the data (this is different from R or Pandas). Instead, use .select and the same syntax as .agg inside the function call. While using the column-selector function to perform aggregation may seem unusual, it makes sense given how Polars optimizes chains of data processing. The specific reason, however, is well beyond the scope of this text.

Polars provides a wide range of aggregation functions that can be used with the .agg() method. Here are some of the most commonly used functions:

  • .mean(), .median(), .min(), .max(), .sum(), .quantile(<f64>)
  • .first(), .last()
  • .any(), .all()
  • .n_unique(), .count(), .len()
  • pl.corr(..., ...) (correlation)
  • pl.concat_str(...)

These cover most use cases in our analyses.

There are three ways to count the number of observations in a group. We can use the .len() method applied to any column (it doesn’t matter which) to get the total number of rows, or use .count() on a column to get the number of non-missing values in that column. The special function pl.len() returns the same result as applying .len() to any column, but makes it clear that the result does not depend on which column we choose. For this reason, I prefer pl.len() unless I specifically want the number of non-missing values. There is also .n_unique() to find the number of unique non-missing values in a column.

2.9 Pipes

The syntax of chaining methods is very powerful. It makes our code more readable, avoids errors from naming intermediate values, and allows Polars to create internally optimized functions that chain the sequence of methods in the most efficient way. One possible roadblock to method chaining is that we only have access to methods built into Polars. To overcome this, Polars allows us to call an arbitrary function with the DataFrame as its first argument using the .pipe method. .pipe takes the function to apply as its first argument; additional arguments are passed on to the function.

As a simple example, we can apply the helper function print_rows to a DataFrame, along with a number of rows, to show a specific number of rows (instead of the default 10). Here, for example, we pass 4 to show the first two and the last two rows.

(
    country
    .pipe(print_rows, 4)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"VEN" "Venezuela, Bolivarian Republic… "Americas" "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl"
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur"
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd"

We can also omit the number, and it will print all of the rows. This is particularly useful when we want to inspect the full set of results after reducing the data to its final columns and rows.

The pipe method will be particularly helpful in Chapter 3, where we will use the .pipe method to pass a DataFrame to the ggplot function to create a data visualization directly within a chain of methods.

2.10 Further Functions

Let’s finish with a few additional functions that are useful inside the with_columns method. We can create new columns that depend on conditions in our existing data using the special function pl.when. The code below shows its use in creating a variable equal to 1 for countries in the Northern Hemisphere and 0 for those in the Southern Hemisphere.

(
    country
    .with_columns(
        is_north = pl.when(c.lat >= 0).then(1).otherwise(0)
    )
)

We can also access the next or previous values of a column using shift(&lt;i64&gt;), where &lt;i64&gt; specifies the number of rows to shift: positive values access rows above, and negative values access rows below. For example, the code below computes each country's population as a percentage of the next-smallest country's population.

(
    country
    .sort(c.pop, descending=True)
    .with_columns(
        perc_larger = c.pop / c.pop.shift(-1) * 100
    )
)

The .with_columns method can also compute values within a group, returning aggregations repeated for each row. However, it does not work together with the .group_by() method. Instead, apply the .over() method to an aggregation expression to indicate the grouping. For example, here we compute the percentage of its region's population that each country accounts for.

(
    country
    .with_columns(
        pop_region = c.pop / c.pop.sum().over("region") * 100
    )
    .sort(c.pop_region, descending=True)
)
shape: (135, 16)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang pop_region
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str f64
"IND" "India" "Asia" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin" 18.762505
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn" 18.150232
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng" 4.451068
"IDN" "Indonesia" "Asia" "South-eastern Asia" 285.721 72.25 -2.0 118.0 0.728 15391 37.9 56.95 85.8 77.51606 "ind" 3.662112
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd" 3.271178
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra" 0.011317
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz" 0.008716
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr" 0.008113
"MLT" "Malta" "Europe" "Southern Europe" 0.545 83.48 35.883333 14.5 0.924 63314 31.0 62.95 107.8 99.96455 "eng|mlt" 0.006985
"ISL" "Iceland" "Europe" "Northern Europe" 0.398 84.92 65.0 -19.0 0.972 67444 26.1 75.62 107.2 98.78217 "isl" 0.005101

The .over() and .shift() methods can also be used within .filter().

2.11 R’s dplyr and Python Pandas

If you’re familiar with dplyr in R or pandas in Python, the code in this chapter should feel familiar. Specific function names differ, and each library has its own quirks. Ultimately, the important thing is to understand how to compose the methods above to transform data flexibly into the format you need for a particular application.

If you’re coming from dplyr, the main differences are prefixing column names with c., using methods instead of pipes (the same idea, different syntax), and using methods in place of aggregate functions. Some methods are renamed—for example, agg replaces summarize and sort replaces arrange—but they otherwise behave similarly.

Transitioning from Pandas requires a few more changes, but most are designed to make the code simpler and more consistent. We no longer have an index column that needs to be reset after every aggregation. Nor do we have aggregation functions wrapped in complex tuples or the eval and query methods buried inside an extra layer of quotes.

As with Pandas and dplyr, Polars also provides functions for pivots and table joins. We will see those in Chapter 4. Although some arguments have different names across the three libraries, the differences in these transformations are relatively minor.