2  EDA I: Organizing Data

2.1 Setup

We will begin this and every subsequent chapter by listing the core modules and datasets that must be loaded before getting started. Datasets and sources that require new functions will be loaded within the body of the text.

import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

country = pl.read_csv("data/countries.csv")

2.2 Introduction

We will use a variety of methods to modify our datasets. All data operations shown in this chapter come from the Polars library—the same one we used to read our data. These operations take a DataFrame as input and produce a modified DataFrame as output. This consistency makes it easy to chain operations and build complex data transformations step by step. The operations we learn in this chapter are fundamental building blocks for data analysis; while they may seem simple individually, they become powerful when combined.

2.3 Sorting Rows

Organizing data in a meaningful order is often crucial for understanding patterns and trends. We’ll use the .sort() method to reorder the dataset by arranging rows according to one or more columns. This is particularly useful for identifying extreme values, understanding distributions, or preparing data for specific types of analysis. Let’s look at an example of sorting the countries dataset by population, then break down how it works.

(
    country
    .sort(c.pop)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"ISL" "Iceland" "Europe" "Northern Europe" 0.398 84.92 65.0 -19.0 0.972 67444 26.1 75.62 107.2 98.78217 "isl"
"MLT" "Malta" "Europe" "Southern Europe" 0.545 83.48 35.883333 14.5 0.924 63314 31.0 62.95 107.8 99.96455 "eng|mlt"
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra"
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd"
"IDN" "Indonesia" "Asia" "South-eastern Asia" 285.721 72.25 -2.0 118.0 0.728 15391 37.9 56.95 85.8 77.51606 "ind"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn"
"IND" "India" "Asia" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin"

There are several important things to note in this code. First, the code is wrapped in a pair of parentheses, each on its own line, and all other lines are indented by four spaces. Second, we place the dataset on its own line and then call the sort method on the following line. Third, we use the c. prefix followed by the column name to refer to a column. Finally, the results are printed by default, but the original dataset is not modified. After running this code, the country DataFrame still has its original row order. All of these features are shared across the entire set of DataFrame methods.

The standard way to specify columns in the functions in this chapter is the c. format. You can also specify a column using c("<str>"), where <str> is the column name. This is useful when the column name contains spaces or special characters and cannot be represented using dot notation. Some functions, such as sort, also allow specifying the column name directly in quotes without referencing c. We will avoid this in our code, but you may encounter it in other sources.

You can reverse the sorting order of sort by setting its descending parameter to True.

(
    country
    .sort(c.pop, descending=True)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"IND" "India" "Asia" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin"
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"IDN" "Indonesia" "Asia" "South-eastern Asia" 285.721 72.25 -2.0 118.0 0.728 15391 37.9 56.95 85.8 77.51606 "ind"
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd"
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr"
"MLT" "Malta" "Europe" "Southern Europe" 0.545 83.48 35.883333 14.5 0.924 63314 31.0 62.95 107.8 99.96455 "eng|mlt"
"ISL" "Iceland" "Europe" "Northern Europe" 0.398 84.92 65.0 -19.0 0.972 67444 26.1 75.62 107.2 98.78217 "isl"

Now the countries with the highest population appear first. This descending order is often useful when we want to focus on the most extreme cases or identify outliers in our data. For example, the code below sorts by region (alphabetical order) and then, within each region, by population.

(
    country
    .sort(c.region, c.pop)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra"
"MUS" "Mauritius" "Africa" "Eastern Africa" 1.268 76.35 -20.2 57.5 0.806 28927 36.8 57.59 92.8 95.45876 "eng|fra"
"BWA" "Botswana" "Africa" "Southern Africa" 2.562 63.08 -22.2 23.7 0.731 18189 53.3 33.32 116.2 77.26922 "eng"
"GAB" "Gabon" "Africa" "Middle Africa" 2.593 68.68 -0.683331 11.5 0.733 19543 38.0 51.04 93.6 49.20331 "fra"
"GMB" "Gambia" "Africa" "Western Africa" 2.822 68.16 13.5 -15.5 0.524 3199 null 46.91 76.7 46.90722 "eng"
"GBR" "United Kingdom of Great Britai… "Europe" "Northern Europe" 69.551 81.93 54.6 -2.0 0.946 53993 32.8 66.58 121.8 99.1103 "eng"
"DEU" "Germany" "Europe" "Western Europe" 84.075 82.13 51.0 10.0 0.959 64355 31.7 67.92 109.4 99.22504 "deu"
"RUS" "Russian Federation" "Europe" "Eastern Europe" 143.997 73.89 66.416667 94.25 0.832 40871 36.0 58.65 165.1 88.24222 "rus"
"NZL" "New Zealand" "Oceania" "Australia and New Zealand" 5.252 82.65 -41.2 174.0 0.938 49598 null 69.76 108.3 100.0 "eng|mri|nzs"
"AUS" "Australia" "Oceania" "Australia and New Zealand" 26.974 83.79 -25.0 133.0 0.958 61943 32.4 70.25 101.6 99.99923 "asf"

Finally, we can store the result in a new variable using the following code:

country_saved = (
    country
    .sort(c.region, c.pop)
)

After running the code, country_saved contains the reordered dataset, while country still contains the original data. The same format can be used to save the results from the other methods described in the sections below.

2.4 Selecting Columns

In some cases—particularly when a dataset grows large—you may want to select or reorder its columns. Use the .select method with the same c. notation to specify the columns to include.

(
    country
    .select(c.iso, c.lat, c.lon)
)
shape: (135, 3)
iso lat lon
str f64 f64
"SEN" 14.366667 -14.283333
"VEN" 8.0 -67.0
"FIN" 65.0 27.0
"USA" 39.828175 -98.5795
"LKA" 7.0 81.0
"ALB" 41.0 20.0
"MYS" 3.7805111 102.314362
"SLV" 13.668889 -88.866111
"CYP" 35.0 33.0
"PAK" 30.0 71.0

Selecting columns is something we will do frequently in this text to clarify which parts of the dataset we want you to examine. It is less important for most data science pipelines, but it is particularly useful when performing more complex transformations, such as those in Chapter 4.

There is a similar function, .drop, which lets us select columns to remove from the data. Dropping columns is useful when a dataset has many columns and you want to remove only a few.

2.5 Selecting Rows

Often, we want to take a subset of rows from a dataset for a particular analysis or visualization. There are several functions for doing this, depending on the selection criteria. The .head() method, for example, returns a specified number of rows from the top of the dataset. The example below selects the first four rows.

(
    country
    .head(n=4)
)
shape: (4, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"VEN" "Venezuela, Bolivarian Republic… "Americas" "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl"
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"

The method .tail(n=&lt;int&gt;) works the same way, except that it selects rows starting from the bottom of the table. The methods .drop_nulls() and .drop_nans() can be called without arguments and return only the rows that contain no missing or invalid (NaN) values, respectively.

The .sample method randomly selects rows. You can specify the number of rows with n=<int> or a fraction with fraction=<float>. Below is an example that takes a random 20% sample of the dataset.

(
    country
    .sample(fraction=0.2)
)
shape: (27, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"AZE" "Azerbaijan" "Asia" "Western Asia" 10.398 71.73 40.3 47.7 0.789 23178 26.6 52.14 99.5 96.12412 "aze"
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn"
"AUT" "Austria" "Europe" "Western Europe" 9.114 83.17 48.0 14.0 0.93 64665 30.2 66.36 146.3 99.97291 "asq|deu"
"SAU" "Saudi Arabia" "Asia" "Western Asia" 34.566 75.66 23.716667 44.116667 0.9 64966 null 69.53 205.0 95.30723 "ara"
"LKA" "Sri Lanka" "Asia" "Southern Asia" 23.229 78.51 7.0 81.0 0.776 14380 39.3 36.02 83.1 90.77437 "sin|sin|tam|tam"
"CAN" "Canada" "Americas" "Northern America" 40.127 83.15 56.0 -109.0 0.939 58422 33.3 68.41 75.5 98.92568 "eng|fra"
"TGO" "Togo" "Africa" "Western Africa" 9.722 66.47 8.25 1.183333 0.571 3008 42.4 43.65 38.7 17.19931 "fra"
"BFA" "Burkina Faso" "Africa" "Western Africa" 24.075 62.97 12.266667 -2.066667 0.459 2661 47.3 44.62 35.3 21.77792 "bib|dyu|mos"
"NOR" "Norway" "Europe" "Northern Europe" 5.623 83.78 65.0 11.0 0.97 94896 27.7 72.49 114.5 98.05439 "nob|nor|nno"
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur"

We can select rows that meet a condition defined by the dataset’s variables using the .filter() method. Similar to the sort() method, we use the syntax c. followed by variable names to define relationships between variables. Below is code that selects only the rows where the Human Development Index (HDI) is greater than 0.9.

(
    country
    .filter(c.hdi > 0.9)
)
shape: (32, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"SGP" "Singapore" "Asia" "South-eastern Asia" 5.871 85.63 1.3 103.8 0.946 137906 null 66.54 145.5 100.0 "eng|msa|cmn|tam"
"AUT" "Austria" "Europe" "Western Europe" 9.114 83.17 48.0 14.0 0.93 64665 30.2 66.36 146.3 99.97291 "asq|deu"
"DEU" "Germany" "Europe" "Western Europe" 84.075 82.13 51.0 10.0 0.959 64355 31.7 67.92 109.4 99.22504 "deu"
"NLD" "Netherlands, Kingdom of the" "Europe" "Western Europe" 18.347 82.55 52.366667 4.883333 0.955 73384 null 72.55 114.4 97.71353 "nld"
"IRL" "Ireland" "Europe" "Northern Europe" 5.308 82.94 53.0 -8.0 0.949 119406 null 68.17 103.1 89.3966 "eng|gle"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"CAN" "Canada" "Americas" "Northern America" 40.127 83.15 56.0 -109.0 0.939 58422 33.3 68.41 75.5 98.92568 "eng|fra"
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur"

We can also filter categorical variables using equality comparisons. To prevent the page from becoming too cluttered, we will avoid printing this and some other results in future examples.

(
    country
    .filter(c.region == "Europe")
)

This selects all countries in Europe. Note that we use the double-equals operator (==) to test equality; a single equals sign (=) is used only for assignment. The resulting dataset contains only rows where region == "Europe".

Often, we want to apply multiple conditions simultaneously. We can combine conditions using the & (and) and | (or) operators. When using these operators, wrap each condition in parentheses:

(
    country
    .filter((c.region == "Europe") & (c.hdi > 0.9))
)

Sometimes we want to know whether a value belongs to a larger set. We can do this with the .is_in method, as shown in the following code. Note that the collection of elements passed to .is_in must be enclosed in square brackets. A detailed explanation of the object being created is provided in Chapter 6.

(
    country
    .filter(c.region.is_in(["Europe", "Africa"]))
)
shape: (75, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"GAB" "Gabon" "Africa" "Middle Africa" 2.593 68.68 -0.683331 11.5 0.733 19543 38.0 51.04 93.6 49.20331 "fra"
"BGR" "Bulgaria" "Europe" "Eastern Europe" 6.715 74.33 42.75 25.5 0.845 36211 40.3 55.9 137.1 86.00395 "bul"
"TZA" "Tanzania, United Republic of" "Africa" "Eastern Africa" 70.546 68.59 -6.306944 34.853889 0.555 3924 40.5 40.42 46.9 26.78297 "eng|swa"
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz"
"LTU" "Lithuania" "Europe" "Northern Europe" 2.83 77.19 55.2 24.0 0.895 49761 35.3 65.53 157.9 92.73568 "lit"
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr"
"GMB" "Gambia" "Africa" "Western Africa" 2.822 68.16 13.5 -15.5 0.524 3199 null 46.91 76.7 46.90722 "eng"
"ALB" "Albania" "Europe" "Southern Europe" 2.772 79.67 41.0 20.0 0.81 20362 30.8 54.45 91.9 98.5473 "sqi"

These operators, including != (not equal) and >= (greater than or equal to), allow us to select rows based on nearly any relationship in the dataset required for our analysis.

2.6 Grouping Data

There is a special method called .group_by(...) that allows you to group a dataset by one or more variables using a syntax similar to .sort(...). When used in a method chain, the next method is applied separately to each group of rows that share the same value(s) for the grouping variable(s). For example, the following code selects the first country from each region in the dataset.

(
    country
    .group_by(c.region)
    .head(n=1)
)
shape: (5, 15)
region iso full_name subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"Asia" "LKA" "Sri Lanka" "Southern Asia" 23.229 78.51 7.0 81.0 0.776 14380 39.3 36.02 83.1 90.77437 "sin|sin|tam|tam"
"Europe" "FIN" "Finland" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe"
"Africa" "SEN" "Senegal" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"Americas" "VEN" "Venezuela, Bolivarian Republic… "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl"
"Oceania" "AUS" "Australia" "Australia and New Zealand" 26.974 83.79 -25.0 133.0 0.958 61943 32.4 70.25 101.6 99.99923 "asf"

This can be useful for various types of analysis. For example, we can determine the largest countries by population in each region using the following code:

(
    country
    .sort(c.pop, descending=True)
    .group_by(c.region)
    .head(n=1)
)
shape: (5, 15)
region iso full_name subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"Africa" "NGA" "Nigeria" "Western Africa" 237.528 66.07 9.0 8.0 0.56 5848 35.1 48.69 52.4 41.19972 "eng"
"Americas" "USA" "United States of America" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng"
"Europe" "RUS" "Russian Federation" "Eastern Europe" 143.997 73.89 66.416667 94.25 0.832 40871 36.0 58.65 165.1 88.24222 "rus"
"Oceania" "AUS" "Australia" "Australia and New Zealand" 26.974 83.79 -25.0 133.0 0.958 61943 32.4 70.25 101.6 99.99923 "asf"
"Asia" "IND" "India" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin"

The following sections describe additional methods that can operate on grouped datasets.

2.7 Modifying Columns

Creating or modifying columns in a dataset is a fundamental operation in data analysis. Analysts often need to compute new variables from existing data, create indicator (dummy) variables, or transform variables for analysis.

The basic pattern for adding a new column is to use the .with_columns() method. In the method call, specify the values you want to create or modify and define them in terms of existing columns. For example, the following multiplies the population column (values in millions) by 1,000 to obtain the population in thousands.

(
    country
    .with_columns(population_1k = c.pop * 1000)
)
shape: (135, 16)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang population_1k
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str f64
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol" 18932.0
"VEN" "Venezuela, Bolivarian Republic… "Americas" "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl" 28517.0
"FIN" "Finland" "Europe" "Northern Europe" 5.623 82.84 65.0 27.0 0.948 57574 27.7 76.99 156.4 99.44798 "fin|swe" 5623.0
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng" 347276.0
"LKA" "Sri Lanka" "Asia" "Southern Asia" 23.229 78.51 7.0 81.0 0.776 14380 39.3 36.02 83.1 90.77437 "sin|sin|tam|tam" 23229.0
"ALB" "Albania" "Europe" "Southern Europe" 2.772 79.67 41.0 20.0 0.81 20362 30.8 54.45 91.9 98.5473 "sqi" 2772.0
"MYS" "Malaysia" "Asia" "South-eastern Asia" 35.978 76.03 3.7805111 102.314362 0.819 35990 46.2 58.68 118.2 95.69194 "msa" 35978.0
"SLV" "El Salvador" "Americas" "Central America" 6.366 76.98 13.668889 -88.866111 0.678 12221 38.3 64.82 126.9 86.19786 "spa" 6366.0
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur" 1371.0
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd" 255220.0

Common mathematical transformations of variables can be accessed by using a . after the variable name, followed by a transformation function such as sqrt() or log10(). There are also methods to fill in missing values: fill_null() to replace missing values with a specified default, fill_nan() to replace invalid numbers with a default, forward_fill() to fill missing values with the first valid value above, and backward_fill() to fill them with the first valid value below.

The following example demonstrates calling methods on columns: we take the square root and the sine of the population and save each as a new column. Neither transformation is meaningful or particularly useful; they are just to demonstrate the technique.

(
    country
    .with_columns(
      pop_sqrt = c.pop.sqrt(),
      pop_sin = c.pop.sin()
    )
)

Note the format of the code above. When a column definition is long (or when defining multiple columns at once), put each definition on its own line and indent it by four spaces.

Polars includes a method called .alias that can be used within .with_columns to create or overwrite a column in a dataset. It was originally the standard technique, but it has largely been replaced by the equals-sign method, which many people, including me, find more intuitive. It’s useful to understand only because you’ll often see it in the Polars documentation.

2.8 Aggregation

Although examining individual rows can be informative, we often need summary statistics to understand the overall patterns in our data. Polars provides powerful aggregation tools that let us compute statistics across rows within specific groups. The .agg() method (short for “aggregate”) is the primary tool for computing summary statistics. It allows us to specify exactly which statistics to compute and for which columns.

(
    country
    .group_by(c.region)
    .agg(
      hdi_avg = c.hdi.mean()
    )
)
shape: (5, 2)
region hdi_avg
str f64
"Africa" 0.583351
"Europe" 0.898895
"Asia" 0.760897
"Oceania" 0.948
"Americas" 0.783579

In this example, we compute the average HDI across countries within each region. The syntax uses c. followed by the column name, then the aggregation function with parentheses. Note that only the grouping variable(s) and the columns created by the .agg() function appear in the result.

We can compute multiple statistics at once by including additional items in our aggregation. It is often useful to include the special function pl.len(), which counts the number of rows in each group.

(
    country
    .group_by(c.region)
    .agg(
        hdi_avg = c.hdi.mean(),
        hdi_median = c.hdi.median(),
        count = pl.len()
    )
)
shape: (5, 4)
region hdi_avg hdi_median count
str f64 f64 u32
"Asia" 0.760897 0.766 39
"Europe" 0.898895 0.9115 38
"Americas" 0.783579 0.786 19
"Africa" 0.583351 0.571 37
"Oceania" 0.948 0.948 2

This operation computes the average and median HDI and the number of countries in each region. The result is a new DataFrame where each row represents a region and each column shows one of the computed statistics. This grouped analysis is essential for understanding how variables vary across categories in our data.

In some cases, we may want to perform one or more aggregation functions on an entire DataFrame. In Polars, it is not possible to apply the .agg method without first grouping the data (this is different from R or Pandas). Instead, use .select and the same syntax as .agg inside the function call. While using the column-selector function to perform aggregation may seem unusual, it makes sense given how Polars optimizes chains of data processing. The specific reason, however, is well beyond the scope of this text.

Polars provides a wide range of aggregation functions that can be used with the .agg() method. Here are some of the most commonly used functions:

  • .mean(), .median(), .min(), .max(), .sum(), .quantile(<f64>)
  • .first(), .last()
  • .any(), .all()
  • .n_unique(), .count(), .len()
  • pl.corr(..., ...) (correlation)
  • pl.concat_str(...)

These cover most use cases in our analyses.

There are three ways to count the number of observations in a group. We can use the .len() method applied to any column (it doesn’t matter which) to get the total number of rows, or use .count() on a column to get the number of non-missing values in that column. The special function pl.len() returns the same result as applying .len() to any column, but makes it clear that the result does not depend on which column we choose. For this reason, I prefer pl.len() unless I specifically want the number of non-missing values. There is also .n_unique() to find the number of unique non-missing values in a column.

2.9 Pipes

The syntax of chaining methods is very powerful. It makes our code more readable, avoids errors from naming intermediate values, and allows Polars to create internally optimized functions that chain the sequence of methods in the most efficient way. One possible roadblock to method chaining is that we only have access to methods built into Polars. To overcome this, Polars allows us to call an arbitrary function with the DataFrame as its first argument using the .pipe method. .pipe takes the function to apply as its first argument; additional arguments are passed on to the function.

As a simple example, we can apply the helper function print_rows to a DataFrame, along with a number of rows, to show a specific number of rows (instead of the default 10). Here, for example, we pass 4 to show the first two and the last two rows.

(
    country
    .pipe(print_rows, 4)
)
shape: (135, 15)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str
"SEN" "Senegal" "Africa" "Western Africa" 18.932 70.43 14.366667 -14.283333 0.53 4871 38.1 50.93 66.0 54.93987 "pbp|fra|wol"
"VEN" "Venezuela, Bolivarian Republic… "Americas" "South America" 28.517 76.18 8.0 -67.0 0.709 8899 44.8 57.65 96.8 95.66913 "spa|vsl"
"CYP" "Cyprus" "Asia" "Western Asia" 1.371 81.77 35.0 33.0 0.913 55720 31.2 60.71 123.1 99.41781 "ell|tur"
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd"

We can also omit the number, and it will print all of the rows. This is particularly useful when we want to inspect the full set of results after reducing the data to its final columns and rows.

The pipe method will be particularly helpful in Chapter 3, where we will use the .pipe method to pass a DataFrame to the ggplot function to create a data visualization directly within a chain of methods.

2.10 Further Functions

Let’s finish with a few additional functions that are useful inside the with_columns method. We can create new columns that depend on conditions in our existing data using the special function pl.when. The code below shows its use in creating a variable equal to 1 for countries in the Northern Hemisphere and 0 for those in the Southern Hemisphere.

(
    country
    .with_columns(
        is_north = pl.when(c.lat >= 0).then(1).otherwise(0)
    )
)

We can also access the next or previous values of a column using shift(&lt;i64&gt;), where &lt;i64&gt; specifies the number of rows to shift: positive values access rows above, and negative values access rows below. For example, the code below computes each country's population as a percentage of the next-smallest country's population.

(
    country
    .sort(c.pop, descending=True)
    .with_columns(
        perc_larger = c.pop / c.pop.shift(-1) * 100
    )
)

The .with_columns method can also compute values within a group, returning aggregations repeated for each row. However, it does not work together with the .group_by() method. Instead, apply the .over() method to an aggregation expression to indicate the grouping. For example, here we compute the percentage of its region's population that each country accounts for.

(
    country
    .with_columns(
        pop_region = c.pop / c.pop.sum().over("region") * 100
    )
    .sort(c.pop_region, descending=True)
)
shape: (135, 16)
iso full_name region subregion pop lexp lat lon hdi gdp gini happy cellphone water_access lang pop_region
str str str str f64 f64 f64 f64 f64 i64 f64 f64 f64 f64 str f64
"IND" "India" "Asia" "Southern Asia" 1463.866 72.4 22.8 83.0 0.685 10608 35.7 46.76 60.5 63.83039 "eng|hin" 18.762505
"CHN" "China" "Asia" "Eastern Asia" 1416.096 78.81 35.844722 103.451944 0.797 25505 38.2 61.45 63.6 88.06831 "cmn" 18.150232
"USA" "United States of America" "Americas" "Northern America" 347.276 79.83 39.828175 -98.5795 0.938 78389 47.7 65.21 91.7 99.72235 "eng" 4.451068
"IDN" "Indonesia" "Asia" "South-eastern Asia" 285.721 72.25 -2.0 118.0 0.728 15391 37.9 56.95 85.8 77.51606 "ind" 3.662112
"PAK" "Pakistan" "Asia" "Southern Asia" 255.22 66.71 30.0 71.0 0.544 5717 29.6 45.49 49.8 61.92651 "eng|urd" 3.271178
"COM" "Comoros" "Africa" "Eastern Africa" 0.883 70.26 -12.3 43.7 0.603 3718 45.3 35.88 25.2 35.85662 "ara|fra" 0.011317
"LUX" "Luxembourg" "Europe" "Western Europe" 0.68 83.82 49.77 6.13 0.922 131038 34.9 70.16 143.3 97.60054 "fra|deu|ltz" 0.008716
"MNE" "Montenegro" "Europe" "Southern Europe" 0.633 76.92 42.766667 19.216667 0.862 29492 29.4 58.13 185.1 96.657 "cnr|cnr" 0.008113
"MLT" "Malta" "Europe" "Southern Europe" 0.545 83.48 35.883333 14.5 0.924 63314 31.0 62.95 107.8 99.96455 "eng|mlt" 0.006985
"ISL" "Iceland" "Europe" "Northern Europe" 0.398 84.92 65.0 -19.0 0.972 67444 26.1 75.62 107.2 98.78217 "isl" 0.005101

The .over() and .shift() methods can also be used within .filter().

2.11 R’s dplyr and Python Pandas

If you’re familiar with dplyr in R or pandas in Python, the code in this chapter should feel familiar. Specific function names differ, and each library has its own quirks. Ultimately, the important thing is to understand how to compose the methods above to transform data flexibly into the format you need for a particular application.

If you’re coming from dplyr, the main differences are prefixing column names with c., using methods instead of pipes (the same idea, different syntax), and using methods in place of aggregate functions. Some methods are renamed—for example, agg replaces summarize and sort replaces arrange—but they otherwise behave similarly.

Transitioning from Pandas requires a few more changes, but most are designed to make the code simpler and more consistent. We no longer have an index column that needs to be reset after every aggregation. Nor do we have aggregation functions wrapped in complex tuples or the eval and query methods buried inside an extra layer of quotes.

As with Pandas and dplyr, Polars also provides functions for pivots and table joins. We will see those in Chapter 4. Although some arguments have different names across the three libraries, the differences in these transformations are relatively minor.