4  EDA III: Restructuring Data

4.1 Setup

Load all of the modules and datasets needed for the chapter.

import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

country = pl.read_csv("data/countries.csv")

4.2 Introduction

In Chapter 2, we learned to use data operations to modify a dataset in Python. These modifications included taking a subset of a DataFrame, such as filtering rows or selecting a reduced set of columns. We saw how to add new columns and how to rearrange rows by sorting on one or more columns. We also explored grouping and summarizing our data to create a new set of summary statistics that aggregate the original dataset at different levels of granularity. Along the way, we saw how these modifications can help create informative data visualizations, particularly when there is too much data to label every row as a point on a single plot.

In this chapter, we’ll continue exploring how to modify data using increasingly advanced techniques. First, we’ll examine how to combine information from two different data tables. Then we’ll move on to pivots, a relatively complex but powerful method for reshaping data.

4.3 Primary and Foreign Keys

Primary keys are a core concept we need to understand to combine two datasets in Python. A primary key consists of one or more columns that can uniquely and minimally identify a row of data. For example, in our country dataset the column iso is a primary key. The column full_name is also a primary key. In database theory a distinction is made between candidate keys (any set of one or more columns that could serve as a primary key) and the specific primary key chosen for a table. This distinction is less applicable to data science work, where we will use the term primary key to refer to any possible way of identifying a row in our dataset.

A foreign key is a primary key from one table that appears in another table. For example, we might have a dataset of customers, with one column indicating the iso of the country in which they live. We join multiple tables by associating the primary key in one table with the matching foreign key in another. Notice that the difference between primary and foreign is the table in which the key appears, not the information itself. The value iso is both a primary key in the countries dataset and a foreign key in the customers dataset. The connection between these two tables is called a relation. This is where the term relational database comes from.

4.4 Joining by Relation

We will start with an abstract example before turning to a specific example using actual DataFrames in Python. Assume we have two tables, table_one and table_two. The column key_col appears in both tables: it is a primary key in table_two and a foreign key in table_one. This means it has a single unique row associated with it in table_two but possibly many instances in table_one.

To join the information in these two tables, we use the .join() method of the table containing the foreign key. The arguments include the second dataset and the name of the key used to combine the tables.

(
    table_one
    .join(table_two, on=c.key_col)
)

Assuming our data are correctly organized, the output will be a dataset with the same number of rows as table_one, where the first set of columns will also match table_one. After the original columns, there will be a version of all the columns from table_two that correspond to the matching values of our key.

Often some of the values in table_one will not have a matching value in table_two. In that case, the default behavior is to remove rows without matches; this is called an inner join. Alternatively, you can fill missing columns with null values, which results in a left join. We can indicate this to Python as follows:

(
    table_one
    .join(table_two, on=c.key_col, how="left")
)

The terminology of “left” and “right” tables comes from thinking about how the join would look when written on a single line. Consider the code table_one.join(table_two); the first table is to the left of the join function and the second table is to the right of the join function. As a rule of thumb, we try to place the table with the foreign key on the left and the table with the primary key on the right. The reason for this is that the output should match the number of observations in the table with the foreign key, so it is typically easier to reason about the join as a method of the foreign-key table. Note that the table with the foreign key will usually be much larger than the table with the primary key.

There are six different options that we can assign to the how parameter in the table join method. The default, inner, returns only rows that match in both tables. Left joins return all rows from the left DataFrame; right joins return all rows from the right DataFrame. Full joins (rarely used) return all rows from both tables. A semi join returns only rows from the left table that have a match in the right table; no additional columns are added. An anti join returns only rows in the left table that do not match any rows in the right table. These six options are called equi-joins because they work by finding where the key(s) in one table are equal to the key(s) in another table.

A cross join returns all combinations of rows from the two tables, with no key required. It has several interesting applications, which we will see in later chapters.

There are several other modifications that we can apply to the basic syntax. For example, if our key has multiple columns, we can pass a list to on instead of a single value. Here is the code to join on two matching columns from the two DataFrames.

(
    table_one
    .join(table_two, on=[c.key_col1, c.key_col2])
)

Or, if the key has different names in the two tables, we can describe them with left_on (the first table) and right_on (the second table). These can be individual values or lists, where each element in one list corresponds to the element in the other.

(
    table_one
    .join(table_two, left_on=c.key_col1, right_on=c.key_col2)
)

Note that Python will let you join on any set of columns that match between the two tables. The results can be surprising, and possibly very large, if we are not sure that the join column(s) are a primary key in one of the tables. Avoid joins that do not follow the pattern here (foreign key in the left table and a primary key in the right table).

4.5 Join Examples

The code outline in the previous section should make more sense once we see examples using actual datasets. To do this, we need multiple datasets that we can join together. We will create a smaller version of the country dataset that contains only two columns, which makes it easier to view the results without scrolling. We also load a dataset of country borders, where each row shows the relationship between a country and one of its neighbors.

c_sml = country.select(c.iso, c.gini)
border = pl.read_csv("data/countries_borders.csv")
border
shape: (829, 2)
iso iso_neighbor
str str
"AFG" "IRN"
"AFG" "PAK"
"AFG" "CHN"
"AFG" "TJK"
"AFG" "TKM"
"ZMB" "ZWE"
"ZWE" "BWA"
"ZWE" "MOZ"
"ZWE" "ZAF"
"ZWE" "ZMB"

We can use a relational join to add the Gini coefficient for the country referenced in the border data. Note that iso is present in both tables; it’s a primary key in c_sml and a foreign key in border. We can join them using the code below.

(
    border
    .join(c_sml, on=c.iso)
)
shape: (641, 3)
iso iso_neighbor gini
str str f64
"AFG" "IRN" 27.8
"AFG" "PAK" 27.8
"AFG" "CHN" 27.8
"AFG" "TJK" 27.8
"AFG" "TKM" 27.8
"ZMB" "ZWE" 57.1
"ZWE" "BWA" 50.3
"ZWE" "MOZ" 50.3
"ZWE" "ZAF" 50.3
"ZWE" "ZMB" 50.3

Notice that the Gini coefficient of the base country has been added to the data. The first five rows all correspond to borders with Afghanistan and therefore the same value of the Gini coefficient (27.8) is duplicated for each of the rows. Similarly, the last four rows correspond to Zimbabwe and repeat the Gini coefficient value of 50.3 in each. As with other data methods we use, the .join() method returns a new dataset without modifying the original data. We would need to save the output if we wanted to work further with the results.

The DataFrame after the join has only 641 observations compared to the 829 observations in the original. The reason is that our country data only contains those countries for which we have full economic data. This means that some of the countries in border do not have an entry in c_sml. The default relational join is an inner join, so these missing corresponding rows are removed from the result. Changing to a left join restores the DataFrame to the same number of observations as the original border data.

(
    border
    .join(c_sml, on=c.iso, how="left")
)
shape: (829, 3)
iso iso_neighbor gini
str str f64
"AFG" "IRN" 27.8
"AFG" "PAK" 27.8
"AFG" "CHN" 27.8
"AFG" "TJK" 27.8
"AFG" "TKM" 27.8
"ZMB" "ZWE" 57.1
"ZWE" "BWA" 50.3
"ZWE" "MOZ" 50.3
"ZWE" "ZAF" 50.3
"ZWE" "ZMB" 50.3

So far, we have added the Gini coefficient of the base country. We could also add the coefficient of the bordering country by associating the column iso_neighbor with the column iso in our countries DataFrame. Here is the code to add the neighboring Gini coefficient.

(
    border
    .join(c_sml, left_on=c.iso_neighbor, right_on=c.iso)
)
shape: (667, 3)
iso iso_neighbor gini
str str f64
"AFG" "IRN" 40.9
"AFG" "PAK" 29.6
"AFG" "CHN" 38.2
"AFG" "TJK" 34.0
"AFG" "UZB" 35.3
"ZMB" "ZWE" 50.3
"ZWE" "BWA" 53.3
"ZWE" "MOZ" 54.0
"ZWE" "ZAF" 63.1
"ZWE" "ZMB" 57.1

What if we wanted to add the Gini coefficient to both countries in each row? This requires doing two separate joins, one after the other. In the second join, we add the suffix _neighbor to the Gini coefficient to distinguish the two scores in the output.

(
    border
    .join(c_sml, on=c.iso)
    .join(c_sml, left_on=c.iso_neighbor, right_on=c.iso, suffix="_neighbor")
)
shape: (523, 4)
iso iso_neighbor gini gini_neighbor
str str f64 f64
"AFG" "IRN" 27.8 40.9
"AFG" "PAK" 27.8 29.6
"AFG" "CHN" 27.8 38.2
"AFG" "TJK" 27.8 34.0
"AFG" "UZB" 27.8 35.3
"ZMB" "ZWE" 57.1 50.3
"ZWE" "BWA" 50.3 53.3
"ZWE" "MOZ" 50.3 54.0
"ZWE" "ZAF" 50.3 63.1
"ZWE" "ZMB" 50.3 57.1

As noted above, we can add the option how= with the values inner (default), left, right, or full to indicate what should happen when keys appear only in one table. Left joins retain all keys from the first (left) table, right joins retain all keys from the second (right) table, and full joins retain all keys from both. When there is no matching value in the other table, null values are filled in. The default, inner, is often a good choice to avoid missing values.

Another use of a key-based join is to determine which elements match between one table and another without adding any extra columns. These are called filtering joins because they act like the .filter method in that they select a subset of rows. There are two variants: a semi-join retains only those rows that have a match in the second table, and an anti-join retains only those rows that do not. When there is only a single key, semi- and anti-joins can be replaced by a call to isin. The real power comes when there are multiple keys and we want to know whether a specific combination of keys matches (or does not match) those found in another table.

4.6 Conditional Joins

In this section, we introduce two additional kinds of joins that are helpful when working with certain types of complex data. Both of these allow us to join by relationships between keys that do not require values to be exactly equal. To start, let’s create two subsets of our countries data, separating out countries in Africa and in Asia, and sorting by the Gini coefficient. We remove any missing values from the output.

africa = (
    country
    .filter(c.region == "Africa")
    .select(c.full_name, c.gini)
    .sort(c.gini)
    .drop_nulls()
)
asia = (
    country
    .filter(c.region == "Asia")
    .select(c.full_name, c.gini)
    .sort(c.gini)
    .drop_nulls()
)

The function .join_asof combines two datasets by matching each row in the first (left) dataset to the nearest value in the second (right) dataset based on numeric values in a specified pair of columns. It always returns a dataset with the same number of rows as the first (left) dataset. The function is optimized to do this quickly even for large datasets; to achieve this, it requires that both datasets be sorted by the join key before being called. Below is an example where we associate each African country with the nearest Asian country in terms of their Gini coefficient. We set coalesce=False to ensure the matching country’s Gini coefficient is retained in the output.

(
    africa
    .join_asof(asia, on=c.gini, strategy="nearest", coalesce=False)
)
shape: (35, 4)
full_name gini full_name_right gini_right
str f64 str f64
"Guinea" 29.6 "Pakistan" 29.6
"Egypt" 31.5 "Korea, Republic of" 31.4
"Mauritania" 32.6 "Mongolia" 32.7
"Ethiopia" 35.0 "Thailand" 35.0
"Nigeria" 35.1 "Thailand" 35.0
"Botswana" 53.3 "Malaysia" 46.2
"Mozambique" 54.0 "Malaysia" 46.2
"Zambia" 57.1 "Malaysia" 46.2
"Namibia" 59.1 "Malaysia" 46.2
"South Africa" 63.1 "Malaysia" 46.2

The original motivation for the .join_asof method was to join two time-series datasets. The idea is that we might want the current (most recent) value from one table relative to another. Hence the method’s name and the fact that it uses the backward strategy by default, finding the nearest value in the second dataset that is less than or equal to (i.e., not later than) the value in the first. You can set the strategy to forward to look in the other direction, or to nearest to find the closest value in either direction. You can also add additional keys with the by, by_left, or by_right options; these keys must match exactly before the nearest search is performed. The tables must be pre-sorted by the join key(s) for the method to work.

If we need to do more exotic joins between two tables, we can use the .join_where method. It takes one or more expressions in the same format used by the .filter method to specify exactly which pairs of rows to keep in the joined dataset. To refer to columns that share the same name in both datasets, use the _right suffix to indicate the second dataset. Below is an example that finds all pairs of countries in Africa and Asia where the difference in the Gini coefficient is less than 5.

(
    africa
    .join_where(
        asia,
        (c.gini - c.gini_right).abs() < 5
    )
)
shape: (414, 4)
full_name gini full_name_right gini_right
str f64 str f64
"Guinea" 29.6 "Armenia" 25.2
"Guinea" 29.6 "United Arab Emirates" 26.0
"Guinea" 29.6 "Azerbaijan" 26.6
"Guinea" 29.6 "Afghanistan" 27.8
"Guinea" 29.6 "Kazakhstan" 27.8
"Cameroon" 46.6 "Philippines" 42.3
"Cameroon" 46.6 "Malaysia" 46.2
"Burkina Faso" 47.3 "Malaysia" 46.2
"Congo" 48.9 "Malaysia" 46.2
"Zimbabwe" 50.3 "Malaysia" 46.2

This method can be slow because it must check every combination of rows in the two datasets. However, it is very powerful when combining two reasonably sized datasets in a complex way. Notice that we could obtain the same result by starting with a cross join (which returns every combination of rows) and then applying a filter. The benefit of the .join_where method is that it combines the cross join and the filter into a single step, avoiding having to store all combinations of non-matching rows in memory.

4.7 Units of Observation

When designing a way to store tabular data, we often must decide whether to favor many rows and few columns or many columns and few rows. A related decision, particularly when choosing a format with more columns, is what information to store in columns versus rows. In some cases the answers are clear and obvious. In others, there are trade-offs about which types of analyses will be easiest or hardest to perform with the data. Let’s work through an example.

Assume that you are a biologist tracking the heights of 100 plants over the next 14 days. There are several different ways that this data could be collected. We could have each day’s measurements in a column and each plant as a row. We could have each plant as a column and each day as a row. Or we could have a row for each combination of plant and day, with 100*14 = 1,400 rows and three columns (plant ID, day ID, and height).

A core data-science concept will help us distinguish the three cases above. A unit of observation is the entity about which data are collected and recorded in a dataset. In other words, it is the “who” or “what” that each row of the table represents. In the example with one row for each plant and one column for each day, the unit of observation is a plant. In the example with one row for each day and one column for each plant, the unit of observation is a day. In the final example, the one with 1,400 rows, we call the unit of observation plant × day (read “plant crossed with day”), which means each row corresponds to a unique combination of one plant and one day. The key takeaway is that once we have picked the unit of observation, there is only a single way to construct our data table. Choosing the right unit of observation for our data will be a central concern in the next chapter.

Note that the concept of a unit of observation is closely linked to that of a primary key. By definition, a primary key must uniquely identify the “what” of the unit of observation. For example, a primary key for a table whose unit of observation is a plant must uniquely identify each plant. Likewise, a table with a plant × day unit of observation needs a key that uniquely identifies both the plant and the day.

4.8 Pivot and Unpivot

Sometimes the unit of observation in which a data table is stored does not match the unit needed for our analysis. If the stored format uses a compound unit of observation, we can simplify it by using an aggregation to remove one or more components. For example, we could group a plant × day table by plant ID and calculate summary statistics (such as the maximum height over all days) to generate a dataset with plant as the unit of observation. While aggregations are useful, by design they remove information present in the original dataset. Other times, we want to restructure the unit of observation without losing or adding any information—we simply need the data in a different shape for a particular analysis or visualization. To do this, we can use one of two additional data manipulation functions.

Consider the cellphone dataset; we show its first few rows below. The unit of observation is year × country: each row gives the number of cellphones per 100 people for that year and country.

cellphone = pl.read_csv("data/countries_cellphone.csv")
cellphone
shape: (3_480, 3)
iso year cell
str i64 f64
"AFG" 2003 0.87978
"AFG" 2004 2.54662
"AFG" 2005 4.91711
"AFG" 2006 9.9133
"AFG" 2007 18.0167
"ZWE" 2018 85.8627
"ZWE" 2019 86.4094
"ZWE" 2020 84.9604
"ZWE" 2021 90.2538
"ZWE" 2022 88.9958

We could have stored this dataset in a format where the unit of observation was year and each country had its own column. What if we want to work with the data in that format? For example, if we wanted to compute the difference in cellphone ownership between the United States and Canada, we would need the data in this format. We can convert our dataset using the .pivot() method, which requires indicating which column will be used to index the output (this becomes the unit of observation), which column will be used to create the new columns, and which column contains the values that will fill the table.

cell_wide = (
    cellphone
    .pivot(index="year", on="iso", values="cell")
)
cell_wide
shape: (20, 175)
year AFG AGO ALB … ZAF ZMB ZWE
i64 f64 f64 … f64 f64 f64
2003 0.87978 1.95054 35.278 … 34.7626 2.19418 2.97287
2004 2.54662 3.9784 40.6495 … 42.536 4.09548 3.4429
2005 4.91711 8.35159 49.7453 … 68.6198 8.10286 5.18375
2006 9.9133 15.2614 62.6026 … 79.3267 13.713 6.71982
2007 18.0167 23.8782 76.8284 … 83.715 21.0029 9.57239
2018 59.8109 42.4589 93.8031 … 157.692 86.0723 85.8627
2019 59.6471 45.8065 91.1635 … 162.739 93.0148 86.4094
2020 58.0461 43.7804 91.1881 … 158.447 100.235 84.9604
2021 56.6945 44.3869 92.4843 … 163.128 103.283 90.2538
2022 56.2655 66.6008 98.395 … 160.729 98.4373 88.9958

It is possible that some combinations of the new rows and columns do not exist in the original DataFrame. When this happens, Python fills in null values for the missing combinations. Sometimes, particularly when the values are counts, we may want to convert these missing values to zeros (or another default). To do that, we can call the method .fill_null(0), replacing zero with any other value that makes sense. This method can also be applied to individual columns as needed in other analyses.

Now, what if we want to go the other way? That is, suppose we stored the data originally in the format above (cell_wide) but want to return to the version we started with. This can be achieved by using the .unpivot() method, which requires specifying the index column to use. The new columns created to store the data are given the default names variable and value. These can be changed by setting the variable_name and value_name arguments.

(
  cell_wide
  .unpivot(index="year")
)
shape: (3_480, 3)
year variable value
i64 str f64
2003 "AFG" 0.87978
2004 "AFG" 2.54662
2005 "AFG" 4.91711
2006 "AFG" 9.9133
2007 "AFG" 18.0167
2018 "ZWE" 85.8627
2019 "ZWE" 86.4094
2020 "ZWE" 84.9604
2021 "ZWE" 90.2538
2022 "ZWE" 88.9958

The current version of Polars on Google Colab does not allow using the c. notation when specifying columns as we do with other Polars methods. Instead, you must specify the column as a string, similar to how we do it in plotnine graphics. This has already been changed in the development version of Polars for pivot and will likely change for unpivot in the near future. For now, remember that both pivot and unpivot require a different way of specifying columns compared to other Polars methods.

The .pivot() and .unpivot() methods are powerful tools for data manipulation and constitute the final major methods we need for manipulating tabular datasets. Although they may at first glance seem straightforward (and should be, at least in theory), care is needed when using them to reshape data, because they can easily introduce subtle errors. The best practice is to decide on the unit of observation up front and structure your data accordingly. Use these reshaping functions when necessary, but avoid unnecessary back-and-forth transformations that can introduce mistakes.

4.9 Expand List Columns

A final type of transformation arises when multiple elements occupy a single cell in a table. This usually occurs only after a data-cleaning step we’ve performed, rather than as raw input from a CSV file. For example, the function str.split, which we’ll learn more about in Chapter 7, can split the text in a DataFrame column into parts. We can use this to split the multiple language codes found in the lang column of our data.

(
    country
    .select(c.full_name, c.lang)
    .with_columns(
        lang_codes = c.lang.str.split("|")
    )
)
shape: (135, 3)
full_name lang lang_codes
str str list[str]
"Senegal" "pbp|fra|wol" ["pbp", "fra", "wol"]
"Venezuela, Bolivarian Republic… "spa|vsl" ["spa", "vsl"]
"Finland" "fin|swe" ["fin", "swe"]
"United States of America" "eng" ["eng"]
"Sri Lanka" "sin|sin|tam|tam" ["sin", "sin", … "tam"]
"Albania" "sqi" ["sqi"]
"Malaysia" "msa" ["msa"]
"El Salvador" "spa" ["spa"]
"Cyprus" "ell|tur" ["ell", "tur"]
"Pakistan" "eng|urd" ["eng", "urd"]

It is hard to work directly with the embedded list structure. Often the easiest way to simplify a DataFrame that contains list columns is to unravel the lists, duplicating the values in the other columns. Use the .explode() method, passing the column(s) you want to unravel.

(
    country
    .select(c.full_name, c.lang)
    .with_columns(
        lang_codes = c.lang.str.split("|")
    )
    .explode(c.lang_codes)
)
shape: (265, 3)
full_name lang lang_codes
str str str
"Senegal" "pbp|fra|wol" "pbp"
"Senegal" "pbp|fra|wol" "fra"
"Senegal" "pbp|fra|wol" "wol"
"Venezuela, Bolivarian Republic… "spa|vsl" "spa"
"Venezuela, Bolivarian Republic… "spa|vsl" "vsl"
"El Salvador" "spa" "spa"
"Cyprus" "ell|tur" "ell"
"Cyprus" "ell|tur" "tur"
"Pakistan" "eng|urd" "eng"
"Pakistan" "eng|urd" "urd"

The term “explode” warns us that the resulting dataset can become very large, particularly if we unravel multiple columns simultaneously.

4.10 Coming from R or Pandas

If you are familiar with joins and pivots in R’s dplyr or in Pandas, the code here should be relatively familiar, except for the specific defaults and argument names. What makes these operations difficult are the concepts, not the code, and those do not change across implementations.

References