4  EDA III: Restructuring Data

4.1 Setup

Load all of the modules and datasets needed for the chapter.

import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

country = pl.read_csv("data/countries.csv")

4.2 Introduction

In Chapter 2, we learned to use data operations to modify a dataset in Python. These modifications included taking a subset of a DataFrame, such as filtering rows or selecting a reduced set of columns. We saw how to add new columns and how to rearrange rows by sorting on one or more columns. We also explored grouping and summarizing our data to create a new set of summary statistics that aggregate the original dataset at different levels of granularity. Along the way, we saw how these modifications can help create informative data visualizations, particularly when there is too much data to label every row as a point on a single plot.

In this chapter, we’ll continue exploring how to modify data using increasingly advanced techniques. First, we’ll examine how to combine information from two different data tables. Then we’ll move on to pivots, a relatively complex but powerful method for reshaping data.

4.3 Primary and Foreign Keys

Primary keys are a core concept we need to understand to combine two datasets in Python. A primary key consists of one or more columns that can uniquely and minimally identify a row of data. For example, in our country dataset the column iso is a primary key. The column full_name is also a primary key. In database theory a distinction is made between candidate keys (any set of one or more columns that could serve as a primary key) and the specific primary key chosen for a table. This distinction is less applicable to data science work, where we will use the term primary key to refer to any possible way of identifying a row in our dataset.

A foreign key is a primary key from one table that appears in another table. For example, we might have a dataset of customers, with one column indicating the iso of the country in which they live. We join multiple tables by associating the primary key in one table with the matching foreign key in another. Notice that the difference between primary and foreign is the table in which the key appears, not the information itself. The value iso is both a primary key in the countries dataset and a foreign key in the customers dataset. The connection between these two tables is called a relation. This is where the term relational database comes from.

4.4 Joining by Relation

We will start with an abstract example before turning to a specific example using actual DataFrames in Python. Assume we have two tables, table_one and table_two. The column key_col appears in both tables: it is a primary key in table_two and a foreign key in table_one. This means it has a single unique row associated with it in table_two but possibly many instances in table_one.

To join the information in these two tables, we use the .join() method of the table containing the foreign key. The arguments include the second dataset and the name of the key used to combine the tables.

(
    table_one
    .join(table_two, on=c.key_col)
)

Assuming our data are correctly organized, the output will be a dataset with the same number of rows as table_one, where the first set of columns will also match table_one. After the original columns, there will be a version of all the columns from table_two that correspond to the matching values of our key.

Often some of the values in table_one will not have a matching value in table_two. In that case, the default behavior is to remove rows without matches; this is called an inner join. Alternatively, you can fill missing columns with null values, which results in a left join. We can indicate this to Python as follows:

(
    table_one
    .join(table_two, on=c.key_col, how="left")
)

The terminology of “left” and “right” tables comes from thinking about how the join would look when written on a single line. Consider the code table_one.join(table_two); the first table is to the left of the join function and the second table is to the right of the join function. As a rule of thumb, we try to place the table with the foreign key on the left and the table with the primary key on the right. The reason for this is that the output should match the number of observations in the table with the foreign key, so it is typically easier to reason about the join as a method of the foreign-key table. Note that the table with the foreign key will usually be much larger than the table with the primary key.

There are six different options that we can assign to the how parameter in the table join method. The default, inner, returns only rows that match in both tables. Left joins return all rows from the left DataFrame; right joins return all rows from the right DataFrame. Full joins (rarely used) return all rows from both tables. A semi join returns only rows from the left table that have a match in the right table; no additional columns are added. An anti join returns only rows in the left table that do not match any rows in the right table. These six options are called equi-joins because they work by finding where the key(s) in one table are equal to the key(s) in another table.

A cross join returns all combinations of rows from the two tables, with no key required. It has several interesting applications, which we will see in later chapters.

There are several other modifications that we can apply to the basic syntax. For example, if our key has multiple columns, we can pass a list to on instead of a single value. Here is the code to join on two matching columns from the two DataFrames.

(
    table_one
    .join(table_two, on=[c.key_col1, c.key_col2])
)

Or, if the key has different names in the two tables, we can describe them with left_on (the first table) and right_on (the second table). These can be individual values or lists, where each element in one list corresponds to the element in the other.

(
    table_one
    .join(table_two, left_on=c.key_col1, right_on=c.key_col2)
)

Note that Python will let you join on any set of columns that match between the two tables. The results can be surprising, and possibly very large, if we are not sure that the join column(s) are a primary key in one of the tables. Avoid joins that do not follow the pattern here (foreign key in the left table and a primary key in the right table).

4.5 Join Examples

The code outline in the previous section should make more sense once we see examples using actual datasets. To do this, we need multiple datasets that we can join together. We will create a smaller version of the country dataset that contains only two columns, which makes it easier to view the results without scrolling. We also load a dataset of country borders, where each row shows the relationship between a country and one of its neighbors.

c_sml = country.select(c.iso, c.gini)
border = pl.read_csv("data/countries_borders.csv")
border
shape: (829, 2)
iso iso_neighbor
str str
"AFG" "IRN"
"AFG" "PAK"
"AFG" "CHN"
"AFG" "TJK"
"AFG" "TKM"
"ZMB" "ZWE"
"ZWE" "BWA"
"ZWE" "MOZ"
"ZWE" "ZAF"
"ZWE" "ZMB"

We can use a relational join to add the Gini coefficient for the country referenced in the border data. Note that iso is present in both tables; it’s a primary key in c_sml and a foreign key in border. We can join them using the code below.

(
    border
    .join(c_sml, on=c.iso)
)
shape: (641, 3)
iso iso_neighbor gini
str str f64
"AFG" "IRN" 27.8
"AFG" "PAK" 27.8
"AFG" "CHN" 27.8
"AFG" "TJK" 27.8
"AFG" "TKM" 27.8
"ZMB" "ZWE" 57.1
"ZWE" "BWA" 50.3
"ZWE" "MOZ" 50.3
"ZWE" "ZAF" 50.3
"ZWE" "ZMB" 50.3

Notice that the Gini coefficient of the base country has been added to the data. The first five rows all correspond to borders with Afghanistan and therefore the same value of the Gini coefficient (27.8) is duplicated for each of the rows. Similarly, the last four rows correspond to Zimbabwe and repeat the Gini coefficient value of 50.3 in each. As with other data methods we use, the .join() method returns a new dataset without modifying the original data. We would need to save the output if we wanted to work further with the results.

The DataFrame after the join has only 641 observations compared to the 829 observations in the original. The reason is that our country data only contains those countries for which we have full economic data. This means that some of the countries in border do not have an entry in c_sml. The default relational join is an inner join, so these missing corresponding rows are removed from the result. Changing to a left join restores the DataFrame to the same number of observations as the original border data.

(
    border
    .join(c_sml, on=c.iso, how="left")
)
shape: (829, 3)
iso iso_neighbor gini
str str f64
"AFG" "IRN" 27.8
"AFG" "PAK" 27.8
"AFG" "CHN" 27.8
"AFG" "TJK" 27.8
"AFG" "TKM" 27.8
"ZMB" "ZWE" 57.1
"ZWE" "BWA" 50.3
"ZWE" "MOZ" 50.3
"ZWE" "ZAF" 50.3
"ZWE" "ZMB" 50.3

So far, we have added the Gini coefficient of the base country. We could also add the coefficient of the bordering country by associating the column iso_neighbor with the column iso in our countries DataFrame. Here is the code to add the neighboring Gini coefficient.

(
    border
    .join(c_sml, left_on=c.iso_neighbor, right_on=c.iso)
)
shape: (667, 3)
iso iso_neighbor gini
str str f64
"AFG" "IRN" 40.9
"AFG" "PAK" 29.6
"AFG" "CHN" 38.2
"AFG" "TJK" 34.0
"AFG" "UZB" 35.3
"ZMB" "ZWE" 50.3
"ZWE" "BWA" 53.3
"ZWE" "MOZ" 54.0
"ZWE" "ZAF" 63.1
"ZWE" "ZMB" 57.1

What if we wanted to add the Gini coefficient to both countries in each row? This requires doing two separate joins, one after the other. In the second join, we add the suffix _neighbor to the Gini coefficient to distinguish the two scores in the output.

(
    border
    .join(c_sml, on=c.iso)
    .join(c_sml, left_on=c.iso_neighbor, right_on=c.iso, suffix="_neighbor")
)
shape: (523, 4)
iso iso_neighbor gini gini_neighbor
str str f64 f64
"AFG" "IRN" 27.8 40.9
"AFG" "PAK" 27.8 29.6
"AFG" "CHN" 27.8 38.2
"AFG" "TJK" 27.8 34.0
"AFG" "UZB" 27.8 35.3
"ZMB" "ZWE" 57.1 50.3
"ZWE" "BWA" 50.3 53.3
"ZWE" "MOZ" 50.3 54.0
"ZWE" "ZAF" 50.3 63.1
"ZWE" "ZMB" 50.3 57.1

As noted above, we can add the option how= with the values inner (default), left, right, or full to indicate what should happen when keys appear only in one table. Left joins retain all keys from the first (left) table, right joins retain all keys from the second (right) table, and full joins retain all keys from both. When there is no matching value in the other table, null values are filled in. The default, inner, is often a good choice to avoid missing values.

Another use of a key-based join is to determine which elements match between one table and another without adding any extra columns. These are called filtering joins because they act like the .filter method in that they select a subset of rows. There are two variants: a semi-join retains only those rows that have a match in the second table, and an anti-join retains only those rows that do not. When there is only a single key, semi- and anti-joins can be replaced by a call to isin. The real power comes when there are multiple keys and we want to know whether a specific combination of keys matches (or does not match) those found in another table.

4.6 Conditional Joins

In this section, we introduce two additional kinds of joins that are helpful when working with certain types of complex data. Both of these allow us to join by relationships between keys that do not require values to be exactly equal. To start, let’s create two subsets of our countries data, separating out countries in Africa and in Asia, and sorting by the Gini coefficient. We remove any missing values from the output.

africa = (
    country
    .filter(c.region == "Africa")
    .select(c.full_name, c.gini)
    .sort(c.gini)
    .drop_nulls()
)
asia = (
    country
    .filter(c.region == "Asia")
    .select(c.full_name, c.gini)
    .sort(c.gini)
    .drop_nulls()
)

The function .join_asof combines two datasets by matching each row in the first (left) dataset to the nearest value in the second (right) dataset based on numeric values in a specified pair of columns. It always returns a dataset with the same number of rows as the first (left) dataset. The function is optimized to do this quickly even for large datasets; to achieve this, it requires that both datasets be sorted by the join key before being called. Below is an example where we associate each African country with the nearest Asian country in terms of their Gini coefficient. We set coalesce=False to ensure the matching country’s Gini coefficient is retained in the output.

(
    africa
    .join_asof(asia, on=c.gini, strategy="nearest", coalesce=False)
)
shape: (35, 4)
full_name gini full_name_right gini_right
str f64 str f64
"Guinea" 29.6 "Pakistan" 29.6
"Egypt" 31.5 "Korea, Republic of" 31.4
"Mauritania" 32.6 "Mongolia" 32.7
"Ethiopia" 35.0 "Thailand" 35.0
"Nigeria" 35.1 "Thailand" 35.0
"Botswana" 53.3 "Malaysia" 46.2
"Mozambique" 54.0 "Malaysia" 46.2
"Zambia" 57.1 "Malaysia" 46.2
"Namibia" 59.1 "Malaysia" 46.2
"South Africa" 63.1 "Malaysia" 46.2

The original motivation for the .join_asof method was to join two time-series datasets. The idea is that we might want the current (most recent) value from one table relative to another. Hence the method’s name and the fact that it uses the backward strategy by default, finding the nearest value in the second dataset that is less than or equal to (i.e., not later than) the value in the first. You can set the strategy to forward to look in the other direction, or to nearest to find the closest value in either direction. You can also add additional keys with the by, by_left, or by_right options; these keys must match exactly before the nearest search is performed. The tables must be pre-sorted by the join key(s) for the method to work.

If we need to do more exotic joins between two tables, we can use the .join_where method. It takes one or more expressions in the same format used by the .filter method to specify exactly which pairs of rows to keep in the joined dataset. To refer to columns that share the same name in both datasets, use the _right suffix to indicate the second dataset. Below is an example that finds all pairs of countries in Africa and Asia where the difference in the Gini coefficient is less than 5.

(
    africa
    .join_where(
        asia,
        (c.gini - c.gini_right).abs() < 5
    )
)
shape: (414, 4)
full_name gini full_name_right gini_right
str f64 str f64
"Guinea" 29.6 "Armenia" 25.2
"Guinea" 29.6 "United Arab Emirates" 26.0
"Guinea" 29.6 "Azerbaijan" 26.6
"Guinea" 29.6 "Afghanistan" 27.8
"Guinea" 29.6 "Kazakhstan" 27.8
"Cameroon" 46.6 "Philippines" 42.3
"Cameroon" 46.6 "Malaysia" 46.2
"Burkina Faso" 47.3 "Malaysia" 46.2
"Congo" 48.9 "Malaysia" 46.2
"Zimbabwe" 50.3 "Malaysia" 46.2

This method can be slow because it must check every combination of rows in the two datasets. However, it is very powerful when combining two reasonably sized datasets in a complex way. Notice that we could obtain the same result by starting with a cross join (which returns every combination of rows) and then applying a filter. The benefit of the .join_where method is that it combines the cross join and the filter into a single step, avoiding having to store all combinations of non-matching rows in memory.

4.7 Units of Observation

When designing a way to store tabular data, we often must decide whether to favor many rows and few columns or many columns and few rows. A related decision, particularly when choosing a format with more columns, is what information to store in columns versus rows. In some cases the answers are clear and obvious. In others, there are trade-offs about which types of analyses will be easiest or hardest to perform with the data. Let’s work through an example.

Assume that you are a biologist tracking the heights of 100 plants over the next 14 days. There are several different ways that this data could be collected. We could have each day’s measurements in a column and each plant as a row. We could have each plant as a column and each day as a row. Or we could have a row for each combination of plant and day, with 100*14 = 1,400 rows and three columns (plant ID, day ID, and height).

A core data-science concept will help us distinguish the three cases above. A unit of observation is the entity about which data are collected and recorded in a dataset. In other words, it is the “who” or “what” that each row of the table represents. In the example with one row for each plant and one column for each day, the unit of observation is a plant. In the example with one row for each day and one column for each plant, the unit of observation is a day. In the final example, the one with 1,400 rows, we call the unit of observation plant × day (read “plant crossed with day”), which means each row corresponds to a unique combination of one plant and one day. The key takeaway is that once we have picked the unit of observation, there is only a single way to construct our data table. Choosing the right unit of observation for our data will be a central concern in the next chapter.

Note that the concept of a unit of observation is closely linked to that of a primary key. By definition, a primary key must uniquely identify the “what” of the unit of observation. For example, a primary key for a table whose unit of observation is a plant must uniquely identify each plant. Likewise, a table with a plant × day unit of observation needs a key that uniquely identifies both the plant and the day.

4.8 Pivot and Unpivot

Sometimes the unit of observation in which a data table is stored does not match the unit needed for our analysis. If the stored format uses a compound unit of observation, we can simplify it by using an aggregation to remove one or more components. For example, we could group a plant × day table by plant ID and calculate summary statistics (such as the maximum height over all days) to generate a dataset with plant as the unit of observation. While aggregations are useful, by design they remove information present in the original dataset. Other times, we want to restructure the unit of observation without losing or adding any information—we simply need the data in a different shape for a particular analysis or visualization. To do this, we can use one of two additional data manipulation functions.

Consider the cellphone dataset; we show its first few rows below. The unit of observation is year × country: each row gives the number of cellphones per 100 people for that year and country.

cellphone = pl.read_csv("data/countries_cellphone.csv")
cellphone
shape: (3_480, 3)
iso year cell
str i64 f64
"AFG" 2003 0.87978
"AFG" 2004 2.54662
"AFG" 2005 4.91711
"AFG" 2006 9.9133
"AFG" 2007 18.0167
"ZWE" 2018 85.8627
"ZWE" 2019 86.4094
"ZWE" 2020 84.9604
"ZWE" 2021 90.2538
"ZWE" 2022 88.9958

We could have stored this dataset in a format where the unit of observation was year and each country had its own column. What if we want to work with the data in that format? For example, if we wanted to compute the difference in cellphone ownership between the United States and Canada, we would need the data in this format. We can convert our dataset using the .pivot() method, which requires indicating which column will be used to index the output (this becomes the unit of observation), which column will be used to create the new columns, and which column contains the values that will fill the table.

cell_wide = (
    cellphone
    .pivot(index="year", on="iso", values="cell")
)
cell_wide
shape: (20, 175)
year AFG AGO ALB … ZAF ZMB ZWE
i64 f64 f64 … f64 f64 f64
2003 0.87978 1.95054 35.278 … 34.7626 2.19418 2.97287
2004 2.54662 3.9784 40.6495 … 42.536 4.09548 3.4429
2005 4.91711 8.35159 49.7453 … 68.6198 8.10286 5.18375
2006 9.9133 15.2614 62.6026 … 79.3267 13.713 6.71982
2007 18.0167 23.8782 76.8284 … 83.715 21.0029 9.57239
2018 59.8109 42.4589 93.8031 … 157.692 86.0723 85.8627
2019 59.6471 45.8065 91.1635 … 162.739 93.0148 86.4094
2020 58.0461 43.7804 91.1881 … 158.447 100.235 84.9604
2021 56.6945 44.3869 92.4843 … 163.128 103.283 90.2538
2022 56.2655 66.6008 98.395 … 160.729 98.4373 88.9958

It is possible that some combinations of the new rows and columns do not exist in the original DataFrame. When this happens, Python fills in null values for the missing combinations. Sometimes, particularly when the values are counts, we may want to convert these missing values to zeros (or another default). To do that, we can call the method .fill_null(0), replacing zero with any other value that makes sense. This method can also be applied to individual columns as needed in other analyses.

Now, what if we want to go the other way? That is, suppose we stored the data originally in the format above (cell_wide) but want to return to the version we started with. This can be achieved by using the .unpivot() method, which requires specifying the index column to use. The new columns created to store the data are given the default names variable and value. These can be changed by setting the variable_name and value_name arguments.

(
  cell_wide
  .unpivot(index="year")
)
shape: (3_480, 3)
year variable value
i64 str f64
2003 "AFG" 0.87978
2004 "AFG" 2.54662
2005 "AFG" 4.91711
2006 "AFG" 9.9133
2007 "AFG" 18.0167
2018 "ZWE" 85.8627
2019 "ZWE" 86.4094
2020 "ZWE" 84.9604
2021 "ZWE" 90.2538
2022 "ZWE" 88.9958

The current version of Polars on Google Colab does not allow using the c. notation when specifying columns as we do with other Polars methods. Instead, you must specify the column as a string, similar to how we do it in plotnine graphics. This has already been changed in the development version of Polars for pivot and will likely change for unpivot in the near future. For now, remember that both pivot and unpivot require a different way of specifying columns compared to other Polars methods.

The .pivot() and .unpivot() methods are powerful tools for data manipulation and constitute the final major methods we need for manipulating tabular datasets. Although they may at first glance seem straightforward (and should be, at least in theory), care is needed when using them to reshape data, because they can easily introduce subtle errors. The best practice is to decide on the unit of observation up front and structure your data accordingly. Use these reshaping functions when necessary, but avoid unnecessary back-and-forth transformations that can introduce mistakes.

4.9 Expand List Columns

A final type of transformation arises when multiple elements occupy a single cell in a table. This usually occurs only after a data-cleaning step we’ve performed, rather than as raw input from a CSV file. For example, the function str.split, which we’ll learn more about in Chapter 7, can split the text in a DataFrame column into parts. We can use this to split the multiple language codes found in the lang column of our data.

(
    country
    .select(c.full_name, c.lang)
    .with_columns(
        lang_codes = c.lang.str.split("|")
    )
)
shape: (135, 3)
full_name lang lang_codes
str str list[str]
"Senegal" "pbp|fra|wol" ["pbp", "fra", "wol"]
"Venezuela, Bolivarian Republic… "spa|vsl" ["spa", "vsl"]
"Finland" "fin|swe" ["fin", "swe"]
"United States of America" "eng" ["eng"]
"Sri Lanka" "sin|sin|tam|tam" ["sin", "sin", … "tam"]
"Albania" "sqi" ["sqi"]
"Malaysia" "msa" ["msa"]
"El Salvador" "spa" ["spa"]
"Cyprus" "ell|tur" ["ell", "tur"]
"Pakistan" "eng|urd" ["eng", "urd"]

It is hard to work directly with the embedded list structure. Often the easiest way to simplify a DataFrame that contains list columns is to unravel the lists, duplicating the values in the other columns. Use the .explode() method, passing the column(s) you want to unravel.

(
    country
    .select(c.full_name, c.lang)
    .with_columns(
        lang_codes = c.lang.str.split("|")
    )
    .explode(c.lang_codes)
)
shape: (265, 3)
full_name lang lang_codes
str str str
"Senegal" "pbp|fra|wol" "pbp"
"Senegal" "pbp|fra|wol" "fra"
"Senegal" "pbp|fra|wol" "wol"
"Venezuela, Bolivarian Republic… "spa|vsl" "spa"
"Venezuela, Bolivarian Republic… "spa|vsl" "vsl"
"El Salvador" "spa" "spa"
"Cyprus" "ell|tur" "ell"
"Cyprus" "ell|tur" "tur"
"Pakistan" "eng|urd" "eng"
"Pakistan" "eng|urd" "urd"

The term “explode” warns us that the resulting dataset can become very large, particularly if we unravel multiple columns simultaneously.

4.10 Coming from R or Pandas

If you are familiar with joins and pivots in R’s dplyr or in Pandas, the code here should be relatively familiar, except for the specific defaults and argument names. What makes these operations difficult are the concepts, not the code, and those do not change across implementations.

References