import numpy as np
import polars as pl
from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())
country = pl.read_csv("data/countries.csv")

2 EDA I: Organizing Data
2.1 Setup
We will begin this and every subsequent chapter by listing the core modules and datasets that must be loaded before getting started. Datasets and sources that require new functions will be loaded within the body of the text.
2.2 Introduction
We will use a variety of methods to modify our datasets. All data operations shown in this chapter come from the Polars library—the same one we used to read our data. These operations take a DataFrame as input and produce a modified DataFrame as output. This consistency makes it easy to chain operations and build complex data transformations step by step. The operations we learn in this chapter are fundamental building blocks for data analysis; while they may seem simple individually, they become powerful when combined.
2.3 Sorting Rows
Organizing data in a meaningful order is often crucial for understanding patterns and trends. We’ll use the .sort() method to reorder the dataset by arranging rows according to one or more columns. This is particularly useful for identifying extreme values, understanding distributions, or preparing data for specific types of analysis. Let’s look at an example of sorting the countries dataset by population, then break down how it works.
(
    country
    .sort(c.pop)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "ISL" | "Iceland" | "Europe" | "Northern Europe" | 0.398 | 84.92 | 65.0 | -19.0 | 0.972 | 67444 | 26.1 | 75.62 | 107.2 | 98.78217 | "isl" |
| "MLT" | "Malta" | "Europe" | "Southern Europe" | 0.545 | 83.48 | 35.883333 | 14.5 | 0.924 | 63314 | 31.0 | 62.95 | 107.8 | 99.96455 | "eng|mlt" |
| "MNE" | "Montenegro" | "Europe" | "Southern Europe" | 0.633 | 76.92 | 42.766667 | 19.216667 | 0.862 | 29492 | 29.4 | 58.13 | 185.1 | 96.657 | "cnr|cnr" |
| "LUX" | "Luxembourg" | "Europe" | "Western Europe" | 0.68 | 83.82 | 49.77 | 6.13 | 0.922 | 131038 | 34.9 | 70.16 | 143.3 | 97.60054 | "fra|deu|ltz" |
| "COM" | "Comoros" | "Africa" | "Eastern Africa" | 0.883 | 70.26 | -12.3 | 43.7 | 0.603 | 3718 | 45.3 | 35.88 | 25.2 | 35.85662 | "ara|fra" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "PAK" | "Pakistan" | "Asia" | "Southern Asia" | 255.22 | 66.71 | 30.0 | 71.0 | 0.544 | 5717 | 29.6 | 45.49 | 49.8 | 61.92651 | "eng|urd" |
| "IDN" | "Indonesia" | "Asia" | "South-eastern Asia" | 285.721 | 72.25 | -2.0 | 118.0 | 0.728 | 15391 | 37.9 | 56.95 | 85.8 | 77.51606 | "ind" |
| "USA" | "United States of America" | "Americas" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" |
| "CHN" | "China" | "Asia" | "Eastern Asia" | 1416.096 | 78.81 | 35.844722 | 103.451944 | 0.797 | 25505 | 38.2 | 61.45 | 63.6 | 88.06831 | "cmn" |
| "IND" | "India" | "Asia" | "Southern Asia" | 1463.866 | 72.4 | 22.8 | 83.0 | 0.685 | 10608 | 35.7 | 46.76 | 60.5 | 63.83039 | "eng|hin" |
There are several important things to note in this code. First, the code is wrapped in a pair of parentheses, each on its own line, and all other lines are indented by four spaces. Second, we place the dataset on its own line and then call the sort method on the following line. Third, we use the c. prefix followed by the column name to refer to a column. Finally, the results are printed by default, but the original dataset is not modified: after running this code, the country DataFrame still has its original row order. All of these features are shared across the entire set of DataFrame methods.
The standard way to specify columns in the functions in this chapter is the c. format. You can also specify a column using c("<str>"), where <str> is the column name. This is useful when the column name contains spaces or special characters and cannot be represented using dot notation. Some functions, such as sort, also allow specifying the column name directly in quotes without referencing c. We will avoid this in our code, but you may encounter it in other sources.
You can reverse the sorting order of sort by setting its descending parameter to True.
(
    country
    .sort(c.pop, descending=True)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "IND" | "India" | "Asia" | "Southern Asia" | 1463.866 | 72.4 | 22.8 | 83.0 | 0.685 | 10608 | 35.7 | 46.76 | 60.5 | 63.83039 | "eng|hin" |
| "CHN" | "China" | "Asia" | "Eastern Asia" | 1416.096 | 78.81 | 35.844722 | 103.451944 | 0.797 | 25505 | 38.2 | 61.45 | 63.6 | 88.06831 | "cmn" |
| "USA" | "United States of America" | "Americas" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" |
| "IDN" | "Indonesia" | "Asia" | "South-eastern Asia" | 285.721 | 72.25 | -2.0 | 118.0 | 0.728 | 15391 | 37.9 | 56.95 | 85.8 | 77.51606 | "ind" |
| "PAK" | "Pakistan" | "Asia" | "Southern Asia" | 255.22 | 66.71 | 30.0 | 71.0 | 0.544 | 5717 | 29.6 | 45.49 | 49.8 | 61.92651 | "eng|urd" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "COM" | "Comoros" | "Africa" | "Eastern Africa" | 0.883 | 70.26 | -12.3 | 43.7 | 0.603 | 3718 | 45.3 | 35.88 | 25.2 | 35.85662 | "ara|fra" |
| "LUX" | "Luxembourg" | "Europe" | "Western Europe" | 0.68 | 83.82 | 49.77 | 6.13 | 0.922 | 131038 | 34.9 | 70.16 | 143.3 | 97.60054 | "fra|deu|ltz" |
| "MNE" | "Montenegro" | "Europe" | "Southern Europe" | 0.633 | 76.92 | 42.766667 | 19.216667 | 0.862 | 29492 | 29.4 | 58.13 | 185.1 | 96.657 | "cnr|cnr" |
| "MLT" | "Malta" | "Europe" | "Southern Europe" | 0.545 | 83.48 | 35.883333 | 14.5 | 0.924 | 63314 | 31.0 | 62.95 | 107.8 | 99.96455 | "eng|mlt" |
| "ISL" | "Iceland" | "Europe" | "Northern Europe" | 0.398 | 84.92 | 65.0 | -19.0 | 0.972 | 67444 | 26.1 | 75.62 | 107.2 | 98.78217 | "isl" |
Now the countries with the highest population appear first. This descending order is often useful when we want to focus on the most extreme cases or identify outliers in our data. For example, the code below sorts by region (alphabetical order) and then, within each region, by population.
(
    country
    .sort(c.region, c.pop)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "COM" | "Comoros" | "Africa" | "Eastern Africa" | 0.883 | 70.26 | -12.3 | 43.7 | 0.603 | 3718 | 45.3 | 35.88 | 25.2 | 35.85662 | "ara|fra" |
| "MUS" | "Mauritius" | "Africa" | "Eastern Africa" | 1.268 | 76.35 | -20.2 | 57.5 | 0.806 | 28927 | 36.8 | 57.59 | 92.8 | 95.45876 | "eng|fra" |
| "BWA" | "Botswana" | "Africa" | "Southern Africa" | 2.562 | 63.08 | -22.2 | 23.7 | 0.731 | 18189 | 53.3 | 33.32 | 116.2 | 77.26922 | "eng" |
| "GAB" | "Gabon" | "Africa" | "Middle Africa" | 2.593 | 68.68 | -0.683331 | 11.5 | 0.733 | 19543 | 38.0 | 51.04 | 93.6 | 49.20331 | "fra" |
| "GMB" | "Gambia" | "Africa" | "Western Africa" | 2.822 | 68.16 | 13.5 | -15.5 | 0.524 | 3199 | null | 46.91 | 76.7 | 46.90722 | "eng" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "GBR" | "United Kingdom of Great Britai… | "Europe" | "Northern Europe" | 69.551 | 81.93 | 54.6 | -2.0 | 0.946 | 53993 | 32.8 | 66.58 | 121.8 | 99.1103 | "eng" |
| "DEU" | "Germany" | "Europe" | "Western Europe" | 84.075 | 82.13 | 51.0 | 10.0 | 0.959 | 64355 | 31.7 | 67.92 | 109.4 | 99.22504 | "deu" |
| "RUS" | "Russian Federation" | "Europe" | "Eastern Europe" | 143.997 | 73.89 | 66.416667 | 94.25 | 0.832 | 40871 | 36.0 | 58.65 | 165.1 | 88.24222 | "rus" |
| "NZL" | "New Zealand" | "Oceania" | "Australia and New Zealand" | 5.252 | 82.65 | -41.2 | 174.0 | 0.938 | 49598 | null | 69.76 | 108.3 | 100.0 | "eng|mri|nzs" |
| "AUS" | "Australia" | "Oceania" | "Australia and New Zealand" | 26.974 | 83.79 | -25.0 | 133.0 | 0.958 | 61943 | 32.4 | 70.25 | 101.6 | 99.99923 | "asf" |
Finally, store a copy of the dataset in a named variable using the following code:
country_saved = (
    country
    .sort(c.region, c.pop)
)

After running the code, country_saved contains the reordered dataset, while country still contains the original data. The same format can be used to save the results from the other methods described in the sections below.
2.4 Selecting Columns
In some cases—particularly when a dataset grows large—you may want to select or reorder its columns. Use the .select method with the same c. notation to specify the columns to include.
(
    country
    .select(c.iso, c.lat, c.lon)
)

| iso | lat | lon |
|---|---|---|
| str | f64 | f64 |
| "SEN" | 14.366667 | -14.283333 |
| "VEN" | 8.0 | -67.0 |
| "FIN" | 65.0 | 27.0 |
| "USA" | 39.828175 | -98.5795 |
| "LKA" | 7.0 | 81.0 |
| … | … | … |
| "ALB" | 41.0 | 20.0 |
| "MYS" | 3.7805111 | 102.314362 |
| "SLV" | 13.668889 | -88.866111 |
| "CYP" | 35.0 | 33.0 |
| "PAK" | 30.0 | 71.0 |
Selecting columns is something we will do frequently in this text to clarify which parts of the dataset we want you to examine. It is less important for most data science pipelines, but it is particularly useful when performing more complex transformations, such as those in Chapter 4.
There is a similar function, .drop, which lets us select columns to remove from the data. Dropping columns is useful when a dataset has many columns and you want to remove only a few.
2.5 Selecting Rows
Often, we want to take a subset of rows from a dataset for a particular analysis or visualization. There are several functions for doing this, depending on the selection criteria. The .head() method, for example, returns a specified number of rows from the top of the dataset. The example below selects the first four rows.
(
    country
    .head(n=4)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "SEN" | "Senegal" | "Africa" | "Western Africa" | 18.932 | 70.43 | 14.366667 | -14.283333 | 0.53 | 4871 | 38.1 | 50.93 | 66.0 | 54.93987 | "pbp|fra|wol" |
| "VEN" | "Venezuela, Bolivarian Republic… | "Americas" | "South America" | 28.517 | 76.18 | 8.0 | -67.0 | 0.709 | 8899 | 44.8 | 57.65 | 96.8 | 95.66913 | "spa|vsl" |
| "FIN" | "Finland" | "Europe" | "Northern Europe" | 5.623 | 82.84 | 65.0 | 27.0 | 0.948 | 57574 | 27.7 | 76.99 | 156.4 | 99.44798 | "fin|swe" |
| "USA" | "United States of America" | "Americas" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" |
The method .tail(n=<int>) works the same way, except that it selects rows starting from the bottom of the table. The methods .drop_nulls() and .drop_nans(), called without arguments, return only the rows that contain no missing values or no invalid numeric values, respectively.
The .sample method randomly selects rows. You can specify the number of rows with n=<int> or a fraction with fraction=<float>. Below is an example that takes a random 20% sample of the dataset.
(
    country
    .sample(fraction=0.2)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "AZE" | "Azerbaijan" | "Asia" | "Western Asia" | 10.398 | 71.73 | 40.3 | 47.7 | 0.789 | 23178 | 26.6 | 52.14 | 99.5 | 96.12412 | "aze" |
| "CHN" | "China" | "Asia" | "Eastern Asia" | 1416.096 | 78.81 | 35.844722 | 103.451944 | 0.797 | 25505 | 38.2 | 61.45 | 63.6 | 88.06831 | "cmn" |
| "AUT" | "Austria" | "Europe" | "Western Europe" | 9.114 | 83.17 | 48.0 | 14.0 | 0.93 | 64665 | 30.2 | 66.36 | 146.3 | 99.97291 | "asq|deu" |
| "SAU" | "Saudi Arabia" | "Asia" | "Western Asia" | 34.566 | 75.66 | 23.716667 | 44.116667 | 0.9 | 64966 | null | 69.53 | 205.0 | 95.30723 | "ara" |
| "LKA" | "Sri Lanka" | "Asia" | "Southern Asia" | 23.229 | 78.51 | 7.0 | 81.0 | 0.776 | 14380 | 39.3 | 36.02 | 83.1 | 90.77437 | "sin|sin|tam|tam" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "CAN" | "Canada" | "Americas" | "Northern America" | 40.127 | 83.15 | 56.0 | -109.0 | 0.939 | 58422 | 33.3 | 68.41 | 75.5 | 98.92568 | "eng|fra" |
| "TGO" | "Togo" | "Africa" | "Western Africa" | 9.722 | 66.47 | 8.25 | 1.183333 | 0.571 | 3008 | 42.4 | 43.65 | 38.7 | 17.19931 | "fra" |
| "BFA" | "Burkina Faso" | "Africa" | "Western Africa" | 24.075 | 62.97 | 12.266667 | -2.066667 | 0.459 | 2661 | 47.3 | 44.62 | 35.3 | 21.77792 | "bib|dyu|mos" |
| "NOR" | "Norway" | "Europe" | "Northern Europe" | 5.623 | 83.78 | 65.0 | 11.0 | 0.97 | 94896 | 27.7 | 72.49 | 114.5 | 98.05439 | "nob|nor|nno" |
| "CYP" | "Cyprus" | "Asia" | "Western Asia" | 1.371 | 81.77 | 35.0 | 33.0 | 0.913 | 55720 | 31.2 | 60.71 | 123.1 | 99.41781 | "ell|tur" |
We can select rows that meet a condition defined by the dataset’s variables using the .filter() method. Similar to the sort() method, we use the syntax c. followed by variable names to define relationships between variables. Below is code that selects only the rows where the Human Development Index (HDI) is greater than 0.9.
(
    country
    .filter(c.hdi > 0.9)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "FIN" | "Finland" | "Europe" | "Northern Europe" | 5.623 | 82.84 | 65.0 | 27.0 | 0.948 | 57574 | 27.7 | 76.99 | 156.4 | 99.44798 | "fin|swe" |
| "USA" | "United States of America" | "Americas" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" |
| "SGP" | "Singapore" | "Asia" | "South-eastern Asia" | 5.871 | 85.63 | 1.3 | 103.8 | 0.946 | 137906 | null | 66.54 | 145.5 | 100.0 | "eng|msa|cmn|tam" |
| "AUT" | "Austria" | "Europe" | "Western Europe" | 9.114 | 83.17 | 48.0 | 14.0 | 0.93 | 64665 | 30.2 | 66.36 | 146.3 | 99.97291 | "asq|deu" |
| "DEU" | "Germany" | "Europe" | "Western Europe" | 84.075 | 82.13 | 51.0 | 10.0 | 0.959 | 64355 | 31.7 | 67.92 | 109.4 | 99.22504 | "deu" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "NLD" | "Netherlands, Kingdom of the" | "Europe" | "Western Europe" | 18.347 | 82.55 | 52.366667 | 4.883333 | 0.955 | 73384 | null | 72.55 | 114.4 | 97.71353 | "nld" |
| "IRL" | "Ireland" | "Europe" | "Northern Europe" | 5.308 | 82.94 | 53.0 | -8.0 | 0.949 | 119406 | null | 68.17 | 103.1 | 89.3966 | "eng|gle" |
| "LUX" | "Luxembourg" | "Europe" | "Western Europe" | 0.68 | 83.82 | 49.77 | 6.13 | 0.922 | 131038 | 34.9 | 70.16 | 143.3 | 97.60054 | "fra|deu|ltz" |
| "CAN" | "Canada" | "Americas" | "Northern America" | 40.127 | 83.15 | 56.0 | -109.0 | 0.939 | 58422 | 33.3 | 68.41 | 75.5 | 98.92568 | "eng|fra" |
| "CYP" | "Cyprus" | "Asia" | "Western Asia" | 1.371 | 81.77 | 35.0 | 33.0 | 0.913 | 55720 | 31.2 | 60.71 | 123.1 | 99.41781 | "ell|tur" |
We can also filter categorical variables using equality comparisons. To prevent the page from becoming too cluttered, we will avoid printing this and some other results in future examples.
(
    country
    .filter(c.region == "Europe")
)

This selects all countries in Europe. Note that we use the double-equals operator (==) to test equality; a single equals sign (=) is used only for assignment. The resulting dataset contains only rows where region == "Europe".
Often, we want to apply multiple conditions simultaneously. We can combine conditions using the & (and) and | (or) operators. When using these operators, wrap each condition in parentheses:
(
    country
    .filter((c.region == "Europe") & (c.hdi > 0.9))
)

Sometimes we want to know whether a value belongs to a larger set. We can do this with the .is_in method, as shown in the following code. Note that the collection of elements passed to .is_in must be enclosed in square brackets. A detailed explanation of the object being created is provided in Chapter 6.
(
    country
    .filter(c.region.is_in(["Europe", "Africa"]))
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "SEN" | "Senegal" | "Africa" | "Western Africa" | 18.932 | 70.43 | 14.366667 | -14.283333 | 0.53 | 4871 | 38.1 | 50.93 | 66.0 | 54.93987 | "pbp|fra|wol" |
| "FIN" | "Finland" | "Europe" | "Northern Europe" | 5.623 | 82.84 | 65.0 | 27.0 | 0.948 | 57574 | 27.7 | 76.99 | 156.4 | 99.44798 | "fin|swe" |
| "GAB" | "Gabon" | "Africa" | "Middle Africa" | 2.593 | 68.68 | -0.683331 | 11.5 | 0.733 | 19543 | 38.0 | 51.04 | 93.6 | 49.20331 | "fra" |
| "BGR" | "Bulgaria" | "Europe" | "Eastern Europe" | 6.715 | 74.33 | 42.75 | 25.5 | 0.845 | 36211 | 40.3 | 55.9 | 137.1 | 86.00395 | "bul" |
| "TZA" | "Tanzania, United Republic of" | "Africa" | "Eastern Africa" | 70.546 | 68.59 | -6.306944 | 34.853889 | 0.555 | 3924 | 40.5 | 40.42 | 46.9 | 26.78297 | "eng|swa" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "LUX" | "Luxembourg" | "Europe" | "Western Europe" | 0.68 | 83.82 | 49.77 | 6.13 | 0.922 | 131038 | 34.9 | 70.16 | 143.3 | 97.60054 | "fra|deu|ltz" |
| "LTU" | "Lithuania" | "Europe" | "Northern Europe" | 2.83 | 77.19 | 55.2 | 24.0 | 0.895 | 49761 | 35.3 | 65.53 | 157.9 | 92.73568 | "lit" |
| "MNE" | "Montenegro" | "Europe" | "Southern Europe" | 0.633 | 76.92 | 42.766667 | 19.216667 | 0.862 | 29492 | 29.4 | 58.13 | 185.1 | 96.657 | "cnr|cnr" |
| "GMB" | "Gambia" | "Africa" | "Western Africa" | 2.822 | 68.16 | 13.5 | -15.5 | 0.524 | 3199 | null | 46.91 | 76.7 | 46.90722 | "eng" |
| "ALB" | "Albania" | "Europe" | "Southern Europe" | 2.772 | 79.67 | 41.0 | 20.0 | 0.81 | 20362 | 30.8 | 54.45 | 91.9 | 98.5473 | "sqi" |
These operators, including != (not equal) and >= (greater than or equal to), allow us to select rows based on nearly any relationship in the dataset required for our analysis.
2.6 Grouping Data
There is a special method called .group_by(...) that allows you to group a dataset by one or more variables using a syntax similar to .sort(...). When used in a method chain, the next method is applied separately to each group of rows that share the same value(s) for the grouping variable(s). For example, the following code selects the first country from each region in the dataset.
(
    country
    .group_by(c.region)
    .head(n=1)
)

| region | iso | full_name | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "Asia" | "LKA" | "Sri Lanka" | "Southern Asia" | 23.229 | 78.51 | 7.0 | 81.0 | 0.776 | 14380 | 39.3 | 36.02 | 83.1 | 90.77437 | "sin|sin|tam|tam" |
| "Europe" | "FIN" | "Finland" | "Northern Europe" | 5.623 | 82.84 | 65.0 | 27.0 | 0.948 | 57574 | 27.7 | 76.99 | 156.4 | 99.44798 | "fin|swe" |
| "Africa" | "SEN" | "Senegal" | "Western Africa" | 18.932 | 70.43 | 14.366667 | -14.283333 | 0.53 | 4871 | 38.1 | 50.93 | 66.0 | 54.93987 | "pbp|fra|wol" |
| "Americas" | "VEN" | "Venezuela, Bolivarian Republic… | "South America" | 28.517 | 76.18 | 8.0 | -67.0 | 0.709 | 8899 | 44.8 | 57.65 | 96.8 | 95.66913 | "spa|vsl" |
| "Oceania" | "AUS" | "Australia" | "Australia and New Zealand" | 26.974 | 83.79 | -25.0 | 133.0 | 0.958 | 61943 | 32.4 | 70.25 | 101.6 | 99.99923 | "asf" |
This can be useful for various types of analysis. For example, we can determine the largest countries by population in each region using the following code:
(
    country
    .sort(c.pop, descending=True)
    .group_by(c.region)
    .head(n=1)
)

| region | iso | full_name | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "Africa" | "NGA" | "Nigeria" | "Western Africa" | 237.528 | 66.07 | 9.0 | 8.0 | 0.56 | 5848 | 35.1 | 48.69 | 52.4 | 41.19972 | "eng" |
| "Americas" | "USA" | "United States of America" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" |
| "Europe" | "RUS" | "Russian Federation" | "Eastern Europe" | 143.997 | 73.89 | 66.416667 | 94.25 | 0.832 | 40871 | 36.0 | 58.65 | 165.1 | 88.24222 | "rus" |
| "Oceania" | "AUS" | "Australia" | "Australia and New Zealand" | 26.974 | 83.79 | -25.0 | 133.0 | 0.958 | 61943 | 32.4 | 70.25 | 101.6 | 99.99923 | "asf" |
| "Asia" | "IND" | "India" | "Southern Asia" | 1463.866 | 72.4 | 22.8 | 83.0 | 0.685 | 10608 | 35.7 | 46.76 | 60.5 | 63.83039 | "eng|hin" |
The following sections describe additional methods that can operate on grouped datasets.
2.7 Modifying Columns
Creating or modifying columns in a dataset is a fundamental operation in data analysis. Analysts often need to compute new variables from existing data, create indicator (dummy) variables, or transform variables for analysis.
The basic pattern for adding a new column is to use the .with_columns() method. In the method call, specify the values you want to create or modify and define them in terms of existing columns. For example, the following multiplies the population column (values in millions) by 1,000 to obtain the population in thousands.
(
    country
    .with_columns(population_1k = c.pop * 1000)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang | population_1k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str | f64 |
| "SEN" | "Senegal" | "Africa" | "Western Africa" | 18.932 | 70.43 | 14.366667 | -14.283333 | 0.53 | 4871 | 38.1 | 50.93 | 66.0 | 54.93987 | "pbp|fra|wol" | 18932.0 |
| "VEN" | "Venezuela, Bolivarian Republic… | "Americas" | "South America" | 28.517 | 76.18 | 8.0 | -67.0 | 0.709 | 8899 | 44.8 | 57.65 | 96.8 | 95.66913 | "spa|vsl" | 28517.0 |
| "FIN" | "Finland" | "Europe" | "Northern Europe" | 5.623 | 82.84 | 65.0 | 27.0 | 0.948 | 57574 | 27.7 | 76.99 | 156.4 | 99.44798 | "fin|swe" | 5623.0 |
| "USA" | "United States of America" | "Americas" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" | 347276.0 |
| "LKA" | "Sri Lanka" | "Asia" | "Southern Asia" | 23.229 | 78.51 | 7.0 | 81.0 | 0.776 | 14380 | 39.3 | 36.02 | 83.1 | 90.77437 | "sin|sin|tam|tam" | 23229.0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "ALB" | "Albania" | "Europe" | "Southern Europe" | 2.772 | 79.67 | 41.0 | 20.0 | 0.81 | 20362 | 30.8 | 54.45 | 91.9 | 98.5473 | "sqi" | 2772.0 |
| "MYS" | "Malaysia" | "Asia" | "South-eastern Asia" | 35.978 | 76.03 | 3.7805111 | 102.314362 | 0.819 | 35990 | 46.2 | 58.68 | 118.2 | 95.69194 | "msa" | 35978.0 |
| "SLV" | "El Salvador" | "Americas" | "Central America" | 6.366 | 76.98 | 13.668889 | -88.866111 | 0.678 | 12221 | 38.3 | 64.82 | 126.9 | 86.19786 | "spa" | 6366.0 |
| "CYP" | "Cyprus" | "Asia" | "Western Asia" | 1.371 | 81.77 | 35.0 | 33.0 | 0.913 | 55720 | 31.2 | 60.71 | 123.1 | 99.41781 | "ell|tur" | 1371.0 |
| "PAK" | "Pakistan" | "Asia" | "Southern Asia" | 255.22 | 66.71 | 30.0 | 71.0 | 0.544 | 5717 | 29.6 | 45.49 | 49.8 | 61.92651 | "eng|urd" | 255220.0 |
Common mathematical transformations can be applied by writing a . after the column reference, followed by a transformation method such as .sqrt() or .log10(). There are also methods for filling in missing values: .fill_null() to replace all missing values with a specified default, .fill_nan() to replace invalid numbers with a default, .forward_fill() to fill missing values with the first valid value above, and .backward_fill() to fill with the first valid value below.
The following example demonstrates applying these column methods: we take the square root and the sine of population and save each as a new column. Neither transformation is meaningful or particularly useful; they simply demonstrate the technique.
(
    country
    .with_columns(
        pop_sqrt = c.pop.sqrt(),
        pop_sin = c.pop.sin()
    )
)

Note the format of the code above. When a column definition is long (or when defining multiple columns at once), put each definition on its own line and indent it by four spaces.
Polars includes a method called .alias that can be used within .with_columns to create or overwrite a column in a dataset. It was originally the standard technique, but it has largely been replaced by the equals-sign method, which many people, including me, find more intuitive. It’s useful to understand only because you’ll often see it in the Polars documentation.
2.8 Aggregation
Although examining individual rows can be informative, we often need summary statistics to understand the overall patterns in our data. Polars provides powerful aggregation tools that let us compute statistics across rows within specific groups. The .agg() method (short for “aggregate”) is the primary tool for computing summary statistics. It allows us to specify exactly which statistics to compute and for which columns.
(
    country
    .group_by(c.region)
    .agg(
        hdi_avg = c.hdi.mean()
    )
)

| region | hdi_avg |
|---|---|
| str | f64 |
| "Africa" | 0.583351 |
| "Europe" | 0.898895 |
| "Asia" | 0.760897 |
| "Oceania" | 0.948 |
| "Americas" | 0.783579 |
In this example, we compute the average HDI across countries within each region. The syntax uses c. followed by the column name, then the aggregation function with parentheses. Note that only the grouping variable(s) and the columns created by the .agg() function appear in the result.
We can compute multiple statistics at once by including additional items in our aggregation. It is often useful to include the special function pl.len(), which counts the number of rows in each group.
(
    country
    .group_by(c.region)
    .agg(
        hdi_avg = c.hdi.mean(),
        hdi_median = c.hdi.median(),
        count = pl.len()
    )
)

| region | hdi_avg | hdi_median | count |
|---|---|---|---|
| str | f64 | f64 | u32 |
| "Asia" | 0.760897 | 0.766 | 39 |
| "Europe" | 0.898895 | 0.9115 | 38 |
| "Americas" | 0.783579 | 0.786 | 19 |
| "Africa" | 0.583351 | 0.571 | 37 |
| "Oceania" | 0.948 | 0.948 | 2 |
This operation computes the average HDI, the median HDI, and the number of countries in each region. The result is a new DataFrame where each row represents a region and each column shows one of the computed statistics. This grouped analysis is essential for understanding how variables vary across categories in our data.
In some cases, we may want to perform one or more aggregation functions on an entire DataFrame. In Polars, it is not possible to apply the .agg method without first grouping the data (this is different from R or Pandas). Instead, use .select and the same syntax as .agg inside the function call. While using the column-selector function to perform aggregation may seem unusual, it makes sense given how Polars optimizes chains of data processing. The specific reason, however, is well beyond the scope of this text.
Polars provides a wide range of aggregation functions that can be used with the .agg() method. Here are some of the most commonly used functions:
- .mean(), .median(), .min(), .max(), .sum(), .quantile(<f64>)
- .first(), .last()
- .any(), .all()
- .n_unique(), .count(), .len()
- pl.corr(..., ...) (correlation)
- pl.concat_str(...)
These cover most use cases in our analyses.
There are three ways to count the number of observations in a group. We can apply the .len() method to any column (it doesn’t matter which) to get the total number of rows, or use .count() on a column to get the number of non-missing values in that column. The special function pl.len() returns the same result as applying .len() to any column, but makes it clear that the result does not depend on which column we choose. For this reason, I prefer pl.len() unless I specifically want the number of non-missing values. There is also .n_unique(), which counts the distinct values in a column.
2.9 Pipes
The syntax of chaining methods is very powerful. It makes our code more readable, avoids errors from naming intermediate values, and allows Polars to create internally optimized functions that chain the sequence of methods in the most efficient way. One possible roadblock to method chaining is that we only have access to methods built into Polars. To overcome this, Polars allows us to call an arbitrary function with the DataFrame as its first argument using the .pipe method. .pipe takes the function to apply as its first argument; additional arguments are passed on to the function.
As a simple example, we can pipe a DataFrame to the helper function print_rows along with the number of rows to display (instead of the default 10). Here, we pass 4 to show the first two and the last two rows.
(
    country
    .pipe(print_rows, 4)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str |
| "SEN" | "Senegal" | "Africa" | "Western Africa" | 18.932 | 70.43 | 14.366667 | -14.283333 | 0.53 | 4871 | 38.1 | 50.93 | 66.0 | 54.93987 | "pbp|fra|wol" |
| "VEN" | "Venezuela, Bolivarian Republic… | "Americas" | "South America" | 28.517 | 76.18 | 8.0 | -67.0 | 0.709 | 8899 | 44.8 | 57.65 | 96.8 | 95.66913 | "spa|vsl" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "CYP" | "Cyprus" | "Asia" | "Western Asia" | 1.371 | 81.77 | 35.0 | 33.0 | 0.913 | 55720 | 31.2 | 60.71 | 123.1 | 99.41781 | "ell|tur" |
| "PAK" | "Pakistan" | "Asia" | "Southern Asia" | 255.22 | 66.71 | 30.0 | 71.0 | 0.544 | 5717 | 29.6 | 45.49 | 49.8 | 61.92651 | "eng|urd" |
We can also omit the number, in which case all of the rows are printed. This is particularly useful when we want to inspect every result after reducing the data to a final set of columns and rows.
The pipe method will be particularly helpful in Chapter 3, where we will use the .pipe method to pass a DataFrame to the ggplot function to create a data visualization directly within a chain of methods.
2.10 Further Functions
Let’s finish with a few additional functions that are useful inside the with_columns method. We can create new columns that depend on conditions in our existing data using the special function pl.when. The code below shows its use in creating a variable equal to 1 for countries in the Northern Hemisphere and 0 for those in the Southern Hemisphere.
(
    country
    .with_columns(
        is_north = pl.when(c.lat >= 0).then(1).otherwise(0)
    )
)

We can also access the next or previous values of a column using .shift(<i64>), where <i64> specifies the number of rows to shift: positive values access rows above, and negative values access rows below. For example, the code below expresses each country’s population as a percentage of the next-smallest country’s population.
(
    country
    .sort(c.pop, descending=True)
    .with_columns(
        perc_larger = c.pop / c.pop.shift(-1) * 100
    )
)

The .with_columns method can also compute values within groups, repeating each aggregated value across the rows of its group. However, it does not work with .group_by(). Instead, append the .over() method to the aggregation expression to indicate the grouping. For example, here we compute the percentage of its region’s population that each country holds.
(
    country
    .with_columns(
        pop_region = c.pop / c.pop.sum().over("region") * 100
    )
    .sort(c.pop_region, descending=True)
)

| iso | full_name | region | subregion | pop | lexp | lat | lon | hdi | gdp | gini | happy | cellphone | water_access | lang | pop_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | str | f64 | f64 | f64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | str | f64 |
| "IND" | "India" | "Asia" | "Southern Asia" | 1463.866 | 72.4 | 22.8 | 83.0 | 0.685 | 10608 | 35.7 | 46.76 | 60.5 | 63.83039 | "eng|hin" | 18.762505 |
| "CHN" | "China" | "Asia" | "Eastern Asia" | 1416.096 | 78.81 | 35.844722 | 103.451944 | 0.797 | 25505 | 38.2 | 61.45 | 63.6 | 88.06831 | "cmn" | 18.150232 |
| "USA" | "United States of America" | "Americas" | "Northern America" | 347.276 | 79.83 | 39.828175 | -98.5795 | 0.938 | 78389 | 47.7 | 65.21 | 91.7 | 99.72235 | "eng" | 4.451068 |
| "IDN" | "Indonesia" | "Asia" | "South-eastern Asia" | 285.721 | 72.25 | -2.0 | 118.0 | 0.728 | 15391 | 37.9 | 56.95 | 85.8 | 77.51606 | "ind" | 3.662112 |
| "PAK" | "Pakistan" | "Asia" | "Southern Asia" | 255.22 | 66.71 | 30.0 | 71.0 | 0.544 | 5717 | 29.6 | 45.49 | 49.8 | 61.92651 | "eng|urd" | 3.271178 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "COM" | "Comoros" | "Africa" | "Eastern Africa" | 0.883 | 70.26 | -12.3 | 43.7 | 0.603 | 3718 | 45.3 | 35.88 | 25.2 | 35.85662 | "ara|fra" | 0.011317 |
| "LUX" | "Luxembourg" | "Europe" | "Western Europe" | 0.68 | 83.82 | 49.77 | 6.13 | 0.922 | 131038 | 34.9 | 70.16 | 143.3 | 97.60054 | "fra|deu|ltz" | 0.008716 |
| "MNE" | "Montenegro" | "Europe" | "Southern Europe" | 0.633 | 76.92 | 42.766667 | 19.216667 | 0.862 | 29492 | 29.4 | 58.13 | 185.1 | 96.657 | "cnr|cnr" | 0.008113 |
| "MLT" | "Malta" | "Europe" | "Southern Europe" | 0.545 | 83.48 | 35.883333 | 14.5 | 0.924 | 63314 | 31.0 | 62.95 | 107.8 | 99.96455 | "eng|mlt" | 0.006985 |
| "ISL" | "Iceland" | "Europe" | "Northern Europe" | 0.398 | 84.92 | 65.0 | -19.0 | 0.972 | 67444 | 26.1 | 75.62 | 107.2 | 98.78217 | "isl" | 0.005101 |
The .over() and .shift() methods can also be used within .filter().
2.11 R’s dplyr and Python Pandas
If you’re familiar with dplyr in R or pandas in Python, the code in this chapter should feel familiar. Specific function names differ, and each library has its own quirks. Ultimately, the important thing is to understand how to compose the methods above to transform data flexibly into the format you need for a particular application.
If you’re coming from dplyr, the main differences are prefixing column names with c., using methods instead of pipes (the same idea, different syntax), and using methods in place of aggregate functions. Some methods are renamed—for example, agg replaces summarize and sort replaces arrange—but they otherwise behave similarly.
Transitioning from Pandas requires a few more changes, but most are designed to make the code simpler and more consistent. We no longer have an index column that needs to be reset after every aggregation. Nor do we have aggregation functions wrapped in complex tuples or the eval and query methods buried inside an extra layer of quotes.
As with Pandas and dplyr, Polars also provides functions for pivots and table joins. We will see those in Chapter 4. Although some arguments have different names across the three libraries, the differences in these transformations are relatively minor.