import numpy as np
import polars as pl
from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())
country = pl.read_csv("data/countries.csv")4 EDA III: Restructuring Data
4.1 Setup
Load all of the modules and datasets needed for the chapter.
4.2 Introduction
In Chapter 2, we learned to use data operations to modify a dataset in Python. These modifications included taking a subset of a DataFrame, such as filtering rows or selecting a reduced set of columns. We saw how to add new columns and how to rearrange rows by sorting on one or more columns. We also explored grouping and summarizing our data to create a new set of summary statistics that aggregate the original dataset at different levels of granularity. Along the way, we saw how these modifications can help create informative data visualizations, particularly when there is too much data to label every row as a point on a single plot.
In this chapter, we’ll continue exploring how to modify data using increasingly advanced techniques. First, we’ll examine how to combine information from two different data tables. Then we’ll move on to pivots, a relatively complex but powerful method for reshaping data.
4.3 Primary and Foreign Keys
Primary keys are a core concept we need to understand to combine two datasets in Python. A primary key consists of one or more columns that can uniquely and minimally identify a row of data. For example, in our country dataset the column iso is a primary key. The column full_name is also a primary key. In database theory a distinction is made between candidate keys (any set of one or more columns that could serve as a primary key) and the specific primary key chosen for a table. This distinction is less applicable to data science work, where we will use the term primary key to refer to any possible way of identifying a row in our dataset.
A foreign key is a primary key from one table that appears in another table. For example, we might have a dataset of customers, with one column indicating the iso of the country in which they live. We join multiple tables by associating the primary key in one table with the matching foreign key in another. Notice that the difference between primary and foreign is the table in which the key appears, not the information itself. The value iso is both a primary key in the countries dataset and a foreign key in the customers dataset. The connection between these two tables is called a relation. This is where the term relational database comes from.
4.4 Joining by Relation
We will start with an abstract example before turning to a specific example using actual DataFrames in Python. Assume we have two tables, table_one and table_two. The column key_col appears in both tables: it is a primary key in table_two and a foreign key in table_one. This means it has a single unique row associated with it in table_two but possibly many instances in table_one.
To join the information in these two tables, we use the .join() method of the table containing the foreign key. The arguments include the second dataset and the name of the key used to combine the tables.
(
table_one
.join(table_two, on=c.key_col)
)Assuming our data are correctly organized, the output will be a dataset with the same number of rows as table_one, where the first set of columns will also match table_one. After the original columns, there will be a version of all the columns from table_two that correspond to the matching values of our key.
Often some of the values in table_one will not have a matching value in table_two. In that case, the default behavior is to remove rows without matches; this is called an inner join. Alternatively, you can fill missing columns with null values, which results in a left join. We can indicate this to Python as follows:
(
table_one
.join(table_two, on=c.key_col, how="left")
)The terminology of “left” and “right” tables comes from thinking about how the join would look when written on a single line. Consider the code table_one.join(table_two); the first table is to the left of the join function and the second table is to the right of the join function. As a rule of thumb, we try to place the table with the foreign key on the left and the table with the primary key on the right. The reason for this is that the output should match the number of observations in the table with the foreign key, so it typically makes it easier to reason about treating it as a method of the foreign-key table. Note that usually the table with the foreign key will be much larger than the table with the primary key.
There are six different options that we can assign to the how parameter in the table join method. The default, inner, returns only rows that match in both tables. Left joins return all rows from the left DataFrame; right joins return all rows from the right DataFrame. Full joins (rarely used) return all rows from both tables. A semi join returns only rows from the left table that have a match in the right table; no additional columns are added. An anti join returns only rows in the left table that do not match any rows in the right table. These six options are called equi-joins because they work by finding where the key(s) in one table are equal to the key(s) in another table.
A cross join returns all combinations of rows from the two tables, with no key required. It has several interesting applications, which we will see in later chapters.
There are several other modifications that we can apply to the basic syntax. For example, if our key has multiple columns, we can pass a list to on instead of a single value. Here is the code to join on two matching columns from the two DataFrames.
(
table_one
.join(table_two, on=[c.key_col1, c.key_col2])
)Or, if the key has different names in the two tables, we can describe them with left_on (the first table) and right_on (the second table). These can be individual values or lists, where each element in one list corresponds to the element in the other.
(
table_one
.join(table_two, left_on=c.key_col1, right_on=c.key_col1)
)Note that Python will let you join on any set of columns that match between the two tables. The results can be surprising, and possibly very large, if we are not sure that the join column(s) are a primary key in one of the tables. Avoid joins that do not follow the pattern here (foreign key in the left table and a primary key in the right table).
4.5 Join Examples
The code outline in the previous section should make more sense once we see examples using actual datasets. To do this, we need multiple datasets that we can join together. We will create a smaller version of the country dataset that contains only two columns, which makes it easier to view the results without scrolling. We also load a dataset of country borders, where each row shows the relationship between a country and one of its neighbors.
c_sml = country.select(c.iso, c.gini)
border = pl.read_csv("data/countries_borders.csv")
border| iso | iso_neighbor |
|---|---|
| str | str |
| "AFG" | "IRN" |
| "AFG" | "PAK" |
| "AFG" | "CHN" |
| "AFG" | "TJK" |
| "AFG" | "TKM" |
| … | … |
| "ZMB" | "ZWE" |
| "ZWE" | "BWA" |
| "ZWE" | "MOZ" |
| "ZWE" | "ZAF" |
| "ZWE" | "ZMB" |
We can use a relational join to add the Gini coefficient for the country referenced in the border data. Note that iso is present in both tables; it’s a primary key in c_sml and a foreign key in border. We can join them using the code below.
(
border
.join(c_sml, on=c.iso)
)| iso | iso_neighbor | gini |
|---|---|---|
| str | str | f64 |
| "AFG" | "IRN" | 27.8 |
| "AFG" | "PAK" | 27.8 |
| "AFG" | "CHN" | 27.8 |
| "AFG" | "TJK" | 27.8 |
| "AFG" | "TKM" | 27.8 |
| … | … | … |
| "ZMB" | "ZWE" | 57.1 |
| "ZWE" | "BWA" | 50.3 |
| "ZWE" | "MOZ" | 50.3 |
| "ZWE" | "ZAF" | 50.3 |
| "ZWE" | "ZMB" | 50.3 |
Notice that the Gini coefficient of the base country has been added to the data. The first five rows all correspond to borders with Afghanistan and therefore the same value of the Gini coefficient (27.8) is duplicated for each of the rows. Similarly, the last four rows correspond to Zimbabwe and repeat the Gini coefficient value of 50.3 in each. As with other data methods we use, the .join() method returns a new dataset without modifying the original data. We would need to save the output if we wanted to work further with the results.
The DataFrame after the join has only 641 observations compared to the 829 observations in the original. The reason is that our country data only contains those countries for which we have full economic data. This means that some of the countries in border do not have an entry in c_sml. The default relational join is an inner join, so these missing corresponding rows are removed from the result. Changing to a left join restores the DataFrame to the same number of observations as the original border data.
(
border
.join(c_sml, on=c.iso, how="left")
)| iso | iso_neighbor | gini |
|---|---|---|
| str | str | f64 |
| "AFG" | "IRN" | 27.8 |
| "AFG" | "PAK" | 27.8 |
| "AFG" | "CHN" | 27.8 |
| "AFG" | "TJK" | 27.8 |
| "AFG" | "TKM" | 27.8 |
| … | … | … |
| "ZMB" | "ZWE" | 57.1 |
| "ZWE" | "BWA" | 50.3 |
| "ZWE" | "MOZ" | 50.3 |
| "ZWE" | "ZAF" | 50.3 |
| "ZWE" | "ZMB" | 50.3 |
So far, we have added the Gini coefficient of the base country. We could also add the coefficient of the bordering country by associating the column iso_neighbor with the column iso in our countries DataFrame. Here is the code to add the neighboring Gini coefficient.
(
border
.join(c_sml, left_on=c.iso_neighbor, right_on=c.iso)
)| iso | iso_neighbor | gini |
|---|---|---|
| str | str | f64 |
| "AFG" | "IRN" | 40.9 |
| "AFG" | "PAK" | 29.6 |
| "AFG" | "CHN" | 38.2 |
| "AFG" | "TJK" | 34.0 |
| "AFG" | "UZB" | 35.3 |
| … | … | … |
| "ZMB" | "ZWE" | 50.3 |
| "ZWE" | "BWA" | 53.3 |
| "ZWE" | "MOZ" | 54.0 |
| "ZWE" | "ZAF" | 63.1 |
| "ZWE" | "ZMB" | 57.1 |
What if we wanted to add the Gini coefficient to both countries in each row? This requires doing two separate joins, one after the other. In the second join, we add the suffix _neighbor to the Gini coefficient to distinguish the two scores in the output.
(
border
.join(c_sml, on=c.iso)
.join(c_sml, left_on=c.iso_neighbor, right_on=c.iso, suffix="_neighbor")
)| iso | iso_neighbor | gini | gini_neighbor |
|---|---|---|---|
| str | str | f64 | f64 |
| "AFG" | "IRN" | 27.8 | 40.9 |
| "AFG" | "PAK" | 27.8 | 29.6 |
| "AFG" | "CHN" | 27.8 | 38.2 |
| "AFG" | "TJK" | 27.8 | 34.0 |
| "AFG" | "UZB" | 27.8 | 35.3 |
| … | … | … | … |
| "ZMB" | "ZWE" | 57.1 | 50.3 |
| "ZWE" | "BWA" | 50.3 | 53.3 |
| "ZWE" | "MOZ" | 50.3 | 54.0 |
| "ZWE" | "ZAF" | 50.3 | 63.1 |
| "ZWE" | "ZMB" | 50.3 | 57.1 |
As noted above, we can add the option how= with the values inner (default), left, right, or outer to indicate what should happen when keys appear only in one table. Left joins retain all keys from the first (left) table, right joins retain all keys from the second (right) table, and outer joins retain all keys. When there is no matching value in the other table, null values are filled in. The default, inner, is often a good choice to avoid missing values.
Another use of a key-based join is to determine which elements match between one table and another without adding any extra columns. These are called filtering joins because they act like the .filter method in that they select a subset of rows. There are two variants: a semi-join retains only those rows that have a match in the second table, and an anti-join retains only those rows that do not. When there is only a single key, semi- and anti-joins can be replaced by a call to isin. The real power comes when there are multiple keys and we want to know whether a specific combination of keys matches (or does not match) those found in another table.
4.6 Conditional Joins
In this section, we introduce two additional kinds of joins that are helpful when working with certain types of complex data. Both of these allow us to join by relationships between keys that do not require values to be exactly equal. To start, let’s create two subsets of our countries data, separating out countries in Africa and in Asia, and sorting by the Gini coefficient. We remove any missing values from the output.
africa = (
country
.filter(c.region == "Africa")
.select(c.full_name, c.gini)
.sort(c.gini)
.drop_nulls()
)
asia = (
country
.filter(c.region == "Asia")
.select(c.full_name, c.gini)
.sort(c.gini)
.drop_nulls()
)The function .join_asof combines two datasets by matching each row in the first (left) dataset to the nearest value in the second (right) dataset based on numeric values in a specified pair of columns. It always returns a dataset with the same number of rows as the first (left) dataset. The function is optimized to do this quickly even for large datasets; to achieve this, it requires that both datasets be sorted by the join key before being called. Below is an example where we associate each African country with the nearest Asian country in terms of their Gini coefficient. We set coalesce=False to ensure the matching country’s Gini coefficient is retained in the output.
(
africa
.join_asof(asia, on=c.gini, strategy="nearest", coalesce=False)
)| full_name | gini | full_name_right | gini_right |
|---|---|---|---|
| str | f64 | str | f64 |
| "Guinea" | 29.6 | "Pakistan" | 29.6 |
| "Egypt" | 31.5 | "Korea, Republic of" | 31.4 |
| "Mauritania" | 32.6 | "Mongolia" | 32.7 |
| "Ethiopia" | 35.0 | "Thailand" | 35.0 |
| "Nigeria" | 35.1 | "Thailand" | 35.0 |
| … | … | … | … |
| "Botswana" | 53.3 | "Malaysia" | 46.2 |
| "Mozambique" | 54.0 | "Malaysia" | 46.2 |
| "Zambia" | 57.1 | "Malaysia" | 46.2 |
| "Namibia" | 59.1 | "Malaysia" | 46.2 |
| "South Africa" | 63.1 | "Malaysia" | 46.2 |
The original motivation for the .join_asof method was to join two time-series datasets. The idea is that we might want the current (most recent) value from one table relative to another. Hence the function’s name and the fact that it uses the backward strategy by default to find the nearest value in the second dataset that is less than (i.e., earlier than) the value in the first. You can set the strategy to forward to look in the other direction, or to nearest to find the closest value. You can also add additional keys with the by, by_left, or by_right options; these keys must match exactly before performing the nearest search. The tables must be pre-sorted by the join key(s) for the function to work.
If we need to do more exotic joins between two tables, we can use the .join_where method. It takes one or more expressions in the same format used by the .filter method to specify exactly which pairs of rows to keep in the joined dataset. To refer to columns that share the same name in both datasets, use the _right suffix to indicate the second dataset. Below is an example that finds all pairs of countries in Africa and Asia where the difference in the Gini coefficient is less than 5.
(
africa
.join_where(
asia,
(c.gini - c.gini_right).abs() < 5
)
)| full_name | gini | full_name_right | gini_right |
|---|---|---|---|
| str | f64 | str | f64 |
| "Guinea" | 29.6 | "Armenia" | 25.2 |
| "Guinea" | 29.6 | "United Arab Emirates" | 26.0 |
| "Guinea" | 29.6 | "Azerbaijan" | 26.6 |
| "Guinea" | 29.6 | "Afghanistan" | 27.8 |
| "Guinea" | 29.6 | "Kazakhstan" | 27.8 |
| … | … | … | … |
| "Cameroon" | 46.6 | "Philippines" | 42.3 |
| "Cameroon" | 46.6 | "Malaysia" | 46.2 |
| "Burkina Faso" | 47.3 | "Malaysia" | 46.2 |
| "Congo" | 48.9 | "Malaysia" | 46.2 |
| "Zimbabwe" | 50.3 | "Malaysia" | 46.2 |
This method can be slow because it needs to manually check every combination of rows in the two datasets. However, it is very powerful and useful when combining two reasonably sized datasets in a complex way. Notice that we could obtain the same result by starting with a cross join (which returns every combination of rows) and then applying a filter. The benefit of the .join_where is that it avoids having to store in memory all combinations of non-matching rows by effectively combining the cross join and filtering into a single step.
4.7 Units of Observation
When designing a way to store tabular data, we often must decide whether to favor many rows and few columns or many columns and few rows. A related decision, particularly when choosing a format with more columns, is what information to store in columns versus rows. In some cases the answers are clear and obvious. In others, there are trade-offs about which types of analyses will be easiest or hardest to perform with the data. Let’s work through an example.
Assume that you are a biologist tracking the heights of 100 plants over the next 14 days. There are several different ways that this data could be collected. We could have each day’s measurements in a column and each plant as a row. We could have each plant as a column and each day as a row. Or we could have a row for each combination of plant and day, with 100*14 = 1,400 rows and three columns (plant ID, day ID, and height).
A core data-science concept will help us distinguish the three cases above. A unit of observation is the entity about which data are collected and recorded in a dataset. In other words, it is the “who” or “what” that each row of the table represents. In the example with one row for each plant and one column for each day, the unit of observation is a plant. In the example with one row for each day and one column for each plant, the unit of observation is a day. In the final example, the one with 1,400 rows, we call the unit of observation plant × day (read “plant crossed with day”), which means each row corresponds to a unique combination of one plant and one day. The key takeaway is that once we have picked the unit of observation, there is only a single way to construct our data table. Choosing the right unit of observation for our data will be a central concern in the next chapter.
Note that the concept of a unit of observation is closely linked to that of a primary key. By definition, a primary key must uniquely identify the “what” of the unit of observation. For example, a primary key for a table whose unit of observation is a plant must uniquely identify each plant. Likewise, a table with a plant × day unit of observation needs a key that uniquely identifies both the plant and the day.
4.8 Pivot and Unpivot
Sometimes the unit of observation in which a data table is stored does not match the unit needed for our analysis. If the stored format uses a compound unit of observation, we can simplify it by using an aggregation to remove one or more components. For example, we could group a plant × day table by plant ID and calculate summary statistics (such as the maximum height over all days) to generate a dataset with plant as the unit of observation. While aggregations are useful, by design they remove information present in the original dataset. Other times, we want to restructure the unit of observation without losing or adding any information—we simply need the data in a different shape for a particular analysis or visualization. To do this, we can use one of two additional data manipulation functions.
Consider the cellphone dataset; we show its first few rows below. The unit of observation is year × country: each row gives the number of cellphones per 100 people for that year and country.
cellphone = pl.read_csv("data/countries_cellphone.csv")
cellphone| iso | year | cell |
|---|---|---|
| str | i64 | f64 |
| "AFG" | 2003 | 0.87978 |
| "AFG" | 2004 | 2.54662 |
| "AFG" | 2005 | 4.91711 |
| "AFG" | 2006 | 9.9133 |
| "AFG" | 2007 | 18.0167 |
| … | … | … |
| "ZWE" | 2018 | 85.8627 |
| "ZWE" | 2019 | 86.4094 |
| "ZWE" | 2020 | 84.9604 |
| "ZWE" | 2021 | 90.2538 |
| "ZWE" | 2022 | 88.9958 |
We could have stored this dataset in a format where the unit of observation was year and each country had its own column. What if we want to work with the data in that format? For example, if we wanted to compute the difference in cellphone ownership between the United States and Canada, we would need the data in this format. We can convert our dataset using the .pivot() method, which requires indicating which column will be used to index the output (this becomes the unit of observation), which column will be used to create the new columns, and which column contains the values that will fill the table.
cell_wide = (
cellphone
.pivot(index="year", on="iso", values="cell")
)
cell_wide| year | AFG | AGO | ALB | AND | ARE | ARG | ARM | AUS | AUT | AZE | BDI | BEL | BEN | BFA | BGD | BGR | BHR | BHS | BIH | BLR | BLZ | BOL | BRA | BRB | BRN | BTN | BWA | CAF | CAN | CHE | CHL | CHN | CIV | CMR | COD | COG | … | SEN | SGP | SLB | SLV | SMR | SOM | STP | SUR | SVK | SVN | SWE | SWZ | SYC | SYR | TCD | TGO | THA | TJK | TLS | TON | TTO | TUN | TUR | TZA | UGA | UKR | URY | USA | UZB | VCT | VEN | VNM | VUT | YEM | ZAF | ZMB | ZWE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | … | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 2003 | 0.87978 | 1.95054 | 35.278 | 74.6812 | 70.8971 | 20.4096 | 3.76374 | 72.3885 | 89.5526 | 12.5837 | 0.90868 | 82.9377 | 2.98462 | 1.81994 | 0.96829 | 44.8576 | 54.6188 | 36.4188 | 25.7218 | 11.4045 | 22.899 | 14.1298 | 25.6741 | 52.6759 | 51.1674 | 0.35363 | 25.0588 | 0.97488 | 41.7477 | 84.3941 | 45.3566 | 20.8536 | 6.69626 | 6.66446 | 2.25179 | 9.57701 | … | 7.29792 | 86.4376 | 0.22739 | 19.1676 | 60.6975 | 2.01834 | 3.133 | 33.2775 | 68.4271 | 87.2851 | 98.2426 | 7.9685 | 56.563 | 6.62407 | 0.68509 | 4.38771 | 33.3244 | 0.71512 | 2.18764 | 10.7214 | 25.1251 | 19.0389 | 41.1128 | 3.49964 | 2.94538 | 13.6577 | 15.1631 | 55.4096 | 1.24776 | 55.6041 | 27.093 | 3.44629 | 3.88551 | 3.14667 | 34.7626 | 2.19418 | 2.97287 |
| 2004 | 2.54662 | 3.9784 | 40.6495 | 78.5281 | 83.1597 | 34.8114 | 6.72643 | 82.2081 | 97.7795 | 17.156 | 1.37553 | 87.6255 | 5.62957 | 2.93277 | 1.94647 | 61.0333 | 74.5619 | 54.743 | 33.9881 | 23.0022 | 27.6052 | 19.5647 | 35.9135 | 74.8745 | 57.3669 | 2.94185 | 28.9318 | 1.43102 | 46.7437 | 84.9756 | 57.2138 | 25.7141 | 8.54469 | 9.21627 | 3.49263 | 10.76 | … | 10.2163 | 95.6368 | 0.6321 | 30.5192 | 60.5787 | 4.86334 | 4.91107 | 41.4107 | 79.5226 | 92.7311 | 97.6783 | 13.5115 | 61.6736 | 12.7939 | 1.23991 | 5.83805 | 41.1989 | 1.98806 | 2.76856 | 15.6085 | 48.4019 | 36.7464 | 50.5995 | 5.09582 | 4.29039 | 29.0897 | 18.2681 | 63.1242 | 2.09033 | 63.81 | 31.9693 | 6.17384 | 5.10619 | 6.68019 | 42.536 | 4.09548 | 3.4429 |
| 2005 | 4.91711 | 8.35159 | 49.7453 | 83.3882 | 97.1993 | 56.4973 | 10.5791 | 90.7636 | 105.296 | 26.0781 | 2.01653 | 91.6576 | 7.07639 | 4.54743 | 6.21909 | 81.1551 | 82.2539 | 66.2329 | 38.9203 | 42.4035 | 34.3187 | 25.8676 | 46.6789 | 76.7344 | 64.8898 | 5.43918 | 30.6424 | 2.33165 | 52.4553 | 91.959 | 64.6505 | 30.0304 | 11.7071 | 13.1922 | 4.67216 | 15.101 | … | 15.3986 | 102.723 | 1.24203 | 40.1585 | 60.0238 | 4.69738 | 7.39487 | 44.8256 | 84.4454 | 88.0975 | 100.822 | 18.5353 | 65.8773 | 15.6797 | 2.03329 | 7.41937 | 46.1397 | 3.82687 | 3.48746 | 28.2689 | 68.3489 | 55.3921 | 62.901 | 7.56468 | 4.70367 | 64.0054 | 35.1492 | 68.8835 | 2.73172 | 62.8012 | 46.6505 | 11.8306 | 6.02202 | 9.99364 | 68.6198 | 8.10286 | 5.18375 |
| 2006 | 9.9133 | 15.2614 | 62.6026 | 86.7048 | 110.185 | 79.5273 | 42.1272 | 95.9663 | 112.226 | 38.1512 | 2.54273 | 93.3538 | 12.1442 | 7.07267 | 13.0671 | 108.025 | 91.2343 | 72.6192 | 46.4606 | 62.0314 | 40.9896 | 30.2062 | 53.5317 | 87.8212 | 82.6517 | 12.2359 | 43.9084 | 2.5128 | 57.2331 | 99.4337 | 75.4174 | 34.9802 | 19.7861 | 17.8682 | 7.28434 | 23.9093 | … | 25.9356 | 108.658 | 1.42394 | 64.1687 | 60.0857 | 5.01116 | 11.1231 | 60.9527 | 91.0056 | 90.8372 | 105.797 | 23.0535 | 77.833 | 23.8071 | 4.35335 | 11.7791 | 60.2777 | 30.4638 | 5.02766 | 28.2825 | 111.783 | 70.9333 | 75.1837 | 13.9199 | 6.97652 | 105.277 | 70.8316 | 76.8586 | 9.47736 | 78.2349 | 69.0163 | 22.9925 | 6.94692 | 12.6482 | 79.3267 | 13.713 | 6.71982 |
| 2007 | 18.0167 | 23.8782 | 76.8284 | 77.559 | 137.458 | 100.962 | 63.0765 | 101.458 | 119.466 | 51.1876 | 3.30806 | 101.06 | 22.9147 | 12.5405 | 23.2247 | 130.455 | 105.539 | 105.849 | 61.0557 | 72.7795 | 39.9574 | 33.5993 | 64.1626 | 94.953 | 98.7233 | 22.0203 | 60.2637 | 7.61859 | 61.2755 | 108.793 | 83.7065 | 41.2808 | 35.5088 | 25.1303 | 10.551 | 32.349 | … | 30.8471 | 128.882 | 2.17932 | 102.122 | 59.6944 | 5.29914 | 17.7366 | 71.5609 | 112.852 | 95.7428 | 110.588 | 34.8399 | 84.4864 | 29.8537 | 8.28717 | 19.2349 | 78.945 | 29.6445 | 7.78345 | 43.5929 | 110.578 | 75.1086 | 87.5976 | 19.9153 | 14.1477 | 119.088 | 91.174 | 82.5923 | 21.0203 | 99.0675 | 86.1507 | 53.835 | 11.75 | 17.8838 | 83.715 | 21.0029 | 9.57239 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2018 | 59.8109 | 42.4589 | 93.8031 | 109.915 | 214.842 | 131.224 | 123.0 | 110.033 | 124.23 | 102.987 | 53.2737 | 100.164 | 76.408 | 94.622 | 98.9289 | 119.395 | 139.735 | 98.9789 | 102.187 | 122.57 | 64.7802 | 99.4566 | 101.602 | 117.319 | 129.268 | 92.6023 | 147.065 | 26.2216 | 89.0495 | 126.681 | 133.27 | 116.229 | 123.098 | 74.3995 | 40.5014 | 91.189 | … | 104.056 | 152.074 | 67.9775 | 151.669 | 111.5 | 49.5263 | 77.3644 | 125.48 | 133.04 | 119.153 | 124.086 | 89.2004 | 155.551 | 87.4952 | 43.2279 | 74.3997 | 175.266 | 107.442 | 115.089 | 102.246 | 134.143 | 125.533 | 94.9086 | 75.7302 | 58.8754 | 127.754 | 131.363 | 103.935 | 71.7444 | 107.164 | 69.55 | 146.138 | 88.1403 | 44.8811 | 157.692 | 86.0723 | 85.8627 |
| 2019 | 59.6471 | 45.8065 | 91.1635 | 114.953 | 209.033 | 125.303 | 124.614 | 109.753 | 120.776 | 106.328 | 53.2996 | 100.165 | 81.3235 | 97.1499 | 100.4 | 116.617 | 129.143 | 107.812 | 112.255 | 123.23 | 66.0705 | 100.149 | 97.3747 | 117.112 | 129.845 | 95.3473 | 160.662 | 32.2627 | 90.9589 | 126.921 | 130.493 | 122.67 | 132.574 | 83.9044 | 39.94 | 91.0654 | … | 109.342 | 159.347 | 65.662 | 167.089 | 111.823 | 51.3194 | 77.6791 | 134.349 | 135.756 | 120.569 | 125.457 | 91.0519 | 164.617 | 87.1297 | 45.9379 | 73.7225 | 181.222 | 115.373 | 109.555 | 58.7722 | 145.716 | 124.387 | 94.6498 | 80.5836 | 59.0598 | 130.629 | 133.237 | 105.321 | 101.26 | 92.5178 | 63.8474 | 140.193 | 90.8331 | 43.399 | 162.739 | 93.0148 | 86.4094 |
| 2020 | 58.0461 | 43.7804 | 91.1881 | 103.877 | 194.468 | 121.181 | 120.682 | 104.93 | 120.132 | 101.597 | 52.5571 | 99.9101 | 85.2391 | 102.973 | 106.002 | 114.597 | 119.412 | 117.717 | 106.375 | 125.165 | 67.5517 | 99.8988 | 98.6456 | 104.722 | 119.934 | 96.7703 | 161.859 | 36.426 | 84.7748 | 127.376 | 129.414 | 120.497 | 138.664 | 85.2722 | 42.5028 | 97.6561 | … | 113.638 | 150.259 | 63.6671 | 159.575 | 112.169 | 53.1133 | 80.1173 | 146.875 | 133.633 | 121.331 | 123.549 | 104.747 | 152.545 | 79.1482 | 50.4343 | 75.164 | 162.328 | 118.819 | 103.911 | 58.6544 | 134.231 | 124.042 | 95.3961 | 84.0051 | 62.2824 | 129.341 | 133.902 | 103.855 | 99.4064 | 96.9438 | 58.1626 | 141.656 | 82.391 | 42.0038 | 158.447 | 100.235 | 84.9604 |
| 2021 | 56.6945 | 44.3869 | 92.4843 | 119.688 | 186.301 | 130.353 | 125.395 | 105.119 | 121.358 | 105.694 | 59.7008 | 101.463 | 94.9183 | 112.198 | 110.012 | 114.912 | 128.079 | 100.257 | 114.912 | 127.117 | 66.7769 | 100.809 | 104.825 | 112.223 | 133.597 | 100.331 | 173.252 | 35.817 | 87.4056 | 123.212 | 136.572 | 121.492 | 150.344 | 80.8081 | 47.2883 | 95.8615 | … | 115.325 | 157.978 | 62.1565 | 176.973 | 115.701 | 51.2059 | 85.2578 | 146.654 | 135.268 | 123.363 | 124.961 | 118.627 | 149.479 | 78.5558 | 58.0323 | 70.4497 | 168.485 | 116.225 | 102.669 | 60.9802 | 134.222 | 129.846 | 99.5415 | 86.0163 | 65.6221 | 135.026 | 138.113 | 106.325 | 102.41 | 95.5174 | 60.2345 | 136.806 | 81.6914 | 47.809 | 163.128 | 103.283 | 90.2538 |
| 2022 | 56.2655 | 66.6008 | 98.395 | 142.426 | 195.625 | 131.484 | 130.54 | 109.469 | 121.166 | 107.506 | 56.26 | 101.68 | 105.745 | 119.463 | 109.867 | 116.679 | 139.636 | 101.624 | 118.944 | 128.315 | 67.175 | 101.572 | 101.245 | 114.581 | 116.108 | 95.1298 | 178.205 | 38.8218 | 91.2387 | 123.971 | 135.114 | 124.195 | 161.232 | 89.8274 | 48.6774 | 93.6259 | … | 118.15 | 173.005 | 62.0735 | 183.249 | 120.27 | 49.6801 | 86.7785 | 149.052 | 136.029 | 126.459 | 140.782 | 121.492 | 163.43 | 78.6568 | 65.4378 | 72.2148 | 176.223 | 126.161 | 108.158 | 61.7258 | 133.632 | 131.875 | 103.721 | 93.016 | 70.1401 | 120.261 | 139.829 | 108.786 | 105.472 | 102.334 | 66.4668 | 137.852 | 81.6203 | 46.4534 | 160.729 | 98.4373 | 88.9958 |
It is possible that some combinations of the new rows and columns do not exist in the original DataFrame. When this happens, Python will fill in null values for the missing combinations. Sometimes, particularly if these are counts that we are looking at, we may want to convert all of these missing values to zeros (or another default). To do that, we can call the method .fill_null(0). The value zero can be changed to any other value that makes sense. This can also be used on individual columns as needed in other analyses.
Now, what if we want to go the other way? That is, suppose we stored the data originally in the format above (cell_wide) but want to return to the version we started with. This can be achieved by using the .unpivot() method, which requires specifying the index column to use. The new columns created to store the data are given the default names variable and value. These can be changed by setting the variable_name and value_name arguments.
(
cell_wide
.unpivot(index="year")
)| year | variable | value |
|---|---|---|
| i64 | str | f64 |
| 2003 | "AFG" | 0.87978 |
| 2004 | "AFG" | 2.54662 |
| 2005 | "AFG" | 4.91711 |
| 2006 | "AFG" | 9.9133 |
| 2007 | "AFG" | 18.0167 |
| … | … | … |
| 2018 | "ZWE" | 85.8627 |
| 2019 | "ZWE" | 86.4094 |
| 2020 | "ZWE" | 84.9604 |
| 2021 | "ZWE" | 90.2538 |
| 2022 | "ZWE" | 88.9958 |
The current version of Polars on Google Colab does not allow using the c. notation when specifying columns as we do with other Polars methods. Instead, you must specify the column as a string, similar to how we do it in plotnine graphics. This has already been changed in the development version of Polars for pivot and will likely change for unpivot in the near future. For now, remember that both pivot and unpivot require a different way of specifying columns compared to other Polars methods.
The .pivot() and .unpivot() methods are powerful tools for data manipulation and constitute the final major methods we need for manipulating tabular datasets. Although they may at first glance seem straightforward (and should be, at least in theory), care is needed when using them to reshape data, because they can easily introduce subtle errors. The best practice is to decide on the unit of observation up front and structure your data accordingly. Use these reshaping functions when necessary, but avoid unnecessary back-and-forth transformations that can introduce mistakes.
4.9 Expand List Columns
A final type of transformation arises when multiple elements occupy a single cell in a table. This usually occurs only after a data-cleaning step we’ve performed, rather than as raw input from a CSV file. For example, the function str.split, which we’ll learn more about in Chapter 7, can split the text in a DataFrame column into parts. We can use this to split the multiple language codes found in the lang column of our data.
(
country
.select(c.full_name, c.lang)
.with_columns(
lang_codes = c.lang.str.split("|")
)
)| full_name | lang | lang_codes |
|---|---|---|
| str | str | list[str] |
| "Senegal" | "pbp|fra|wol" | ["pbp", "fra", "wol"] |
| "Venezuela, Bolivarian Republic… | "spa|vsl" | ["spa", "vsl"] |
| "Finland" | "fin|swe" | ["fin", "swe"] |
| "United States of America" | "eng" | ["eng"] |
| "Sri Lanka" | "sin|sin|tam|tam" | ["sin", "sin", … "tam"] |
| … | … | … |
| "Albania" | "sqi" | ["sqi"] |
| "Malaysia" | "msa" | ["msa"] |
| "El Salvador" | "spa" | ["spa"] |
| "Cyprus" | "ell|tur" | ["ell", "tur"] |
| "Pakistan" | "eng|urd" | ["eng", "urd"] |
It is hard to work directly with the embedding list structure. Often the easiest way to simplify a data frame that contains list elements is to unravel the lists, duplicating the other columns. Use the .explode() method, passing the column(s) you want to unravel.
(
country
.select(c.full_name, c.lang)
.with_columns(
lang_codes = c.lang.str.split("|")
)
.explode(c.lang_codes)
)| full_name | lang | lang_codes |
|---|---|---|
| str | str | str |
| "Senegal" | "pbp|fra|wol" | "pbp" |
| "Senegal" | "pbp|fra|wol" | "fra" |
| "Senegal" | "pbp|fra|wol" | "wol" |
| "Venezuela, Bolivarian Republic… | "spa|vsl" | "spa" |
| "Venezuela, Bolivarian Republic… | "spa|vsl" | "vsl" |
| … | … | … |
| "El Salvador" | "spa" | "spa" |
| "Cyprus" | "ell|tur" | "ell" |
| "Cyprus" | "ell|tur" | "tur" |
| "Pakistan" | "eng|urd" | "eng" |
| "Pakistan" | "eng|urd" | "urd" |
The term “explode” warns us that the resulting dataset can become very large, particularly if we unravel multiple columns simultaneously.
4.10 Coming from R or Pandas
If you are familiar with joins and pivots in R’s dplyr or in Pandas, the code here should be relatively familiar, except for the specific defaults and argument names. What makes these operations difficult are the concepts, not the code, and those do not change across implementations.