import numpy as np
import polars as pl
from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())
meta = pl.read_csv("data/wiki_uk_meta.csv.gz")
page_views = pl.read_csv("data/wiki_uk_page_views.csv")
page_revisions = pl.read_csv("data/wiki_uk_page_revisions.csv")17 Temporal Data
17.1 Setup
Load all of the modules and datasets needed for the chapter.
17.2 Introduction
In this chapter, we discuss techniques for working with data that has some temporal component. This includes any kind of data that has one or more variables that record dates or times, as well as any dataset that has a meaningful ordering of its rows. For example, the annotation object that we created for textual data in Chapter 19 has a meaningful ordering and can be treated as having a temporal ordering even if it is not associated with fixed timestamps. We will start by focusing specifically on datasets that contain explicit information about dates and times. In the later sections, we will illustrate window functions and range joins, both of which have a wider set of applications to all ordered datasets.
As we saw in Chapter 4, it is possible to store information about dates and times in a tabular dataset. There are many different formats for storing this information; we recommend that most users start by recording these with separate columns for each numeric component of the date or time. This makes it easier to avoid errors and to record partial information, the latter being a common complication of many humanities datasets. We will begin by looking at a dataset related to the Wikipedia pages we saw in the previous two chapters that has date information stored in such a format.
In showing the application of line graphs in Chapter 3, we saw how to visualize a dataset of food prices over a 140-year period. This visualization was fairly straightforward. There was exactly one row for each year. We were able to treat the year variable as any other continuous measurement, with the only change being that it made sense to connect dots with a line when building the visualization. Here we will work with a slightly more complex example corresponding to the Wikipedia pages from the preceding chapters.
17.3 Temporal Data and Ordering
Let’s start by loading some data with a temporal component. Below, we read in data related to the 75 Wikipedia pages from a selection of British authors. Here, we have a different set of information about the pages than we used in the text and network analysis chapters. For each page, we have grabbed page view statistics for a 60-day period from Wikipedia. In other words, we have a record of how many people looked at a particular page each day, for each author. The data are organized with one row for each combination of item and day.
page_views| doc_id | year | month | day | views |
|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 |
| "Marie de France" | 2023 | 8 | 1 | 121 |
| "Marie de France" | 2023 | 8 | 2 | 138 |
| "Marie de France" | 2023 | 8 | 3 | 138 |
| "Marie de France" | 2023 | 8 | 4 | 129 |
| "Marie de France" | 2023 | 8 | 5 | 104 |
| … | … | … | … | … |
| "Seamus Heaney" | 2023 | 9 | 25 | 719 |
| "Seamus Heaney" | 2023 | 9 | 26 | 802 |
| "Seamus Heaney" | 2023 | 9 | 27 | 694 |
| "Seamus Heaney" | 2023 | 9 | 28 | 655 |
| "Seamus Heaney" | 2023 | 9 | 29 | 719 |
The time variables are given the way we recommended in Chapter 4, with individual columns for year, month, and day. Here, our dataset is already ordered (within each item type) from the earliest records to the latest. If this were not the case, because all of our variables are stored as numbers, we could use the .sort() method to sort by year, followed by month, followed by day, to get the same ordering.
How could we show the change in page views over time for a particular variable? One approach is to add a numeric column running down the dataset using a row number. Below is an example of the code to create a line plot using this approach:
(
page_views
.filter(c.doc_id == "Geoffrey Chaucer")
.with_row_index("row_number")
.pipe(ggplot, aes("row_number", "views"))
+ geom_line(color="red")
+ labs(
title="Page Views for Geoffrey Chaucer (by Day Number)",
x="Day Number",
y="Views"
)
)In this case, our starting plot is not a bad place to begin. The x-axis corresponds to the day number, and in many applications that may be exactly what we need. We can clearly see that the number of page views for Chaucer has a relatively stable count, possibly with some periodic swings over the course of the week. There is one day about two-thirds of the way through the plot in which the count spikes. Notice, though, that it is very hard to tell anything from the plot about exactly what days of the year are being represented. We cannot easily see which day has the spike in views, for example. Also, note that the correspondence between the row number and day only works because the data are uniformly sampled (one observation each day) and there is no missing data.
Another way to work with dates is to convert the data to a fractional year format. Here, the months and days are added to form a fractional day. A quick way to do this is to compute the following fractional year:
\[ year_{frac} = year + \frac{month - 1}{12} + \frac{day - 1}{12 \cdot 31}\]
We are subtracting one from the month and day so, for example, on a date such as 1 July 2020 (halfway through the year) we have the fractional year equal to 2020.5. We could make this even more exact by accounting for the fact that some months have fewer than 31 days, but as a first pass this works relatively well.
(
page_views
.filter(c.doc_id == "Geoffrey Chaucer")
.with_columns(
year_frac = c.year + (c.month - 1) / 12 + (c.day - 1) / (12 * 31)
)
.pipe(ggplot, aes("year_frac", "views"))
+ geom_line(color="red")
+ labs(
title="Page Views for Geoffrey Chaucer (Fractional Year)",
x="Fractional Year",
y="Views"
)
)This revised visualization improves on several aspects of the original plot. For one thing, we can roughly see exactly what dates correspond to each data point. Also, the code will work fine regardless of whether the data are sorted, evenly distributed, or contain any missing values. As a downside, the axis labels take some explaining. We can extend the same approach to working with time data. For example, if we also had the (24-hour) time of our data points the formula would become:
\[ year_{frac} = year + \frac{month - 1}{12} + \frac{day - 1}{12 \cdot 31} + \frac{hour - 1}{24 \cdot 12 \cdot 31}\]
If we are only interested in the time since a specific event, say the start of an experiment, we can use the same approach but take the difference relative to a specific fractional year.
Fractional times have a number of important applications. Fractional times are convenient because they can represent an arbitrarily precise date or date-time with an ordinary number. This means that they can be used in other models and applications without any special treatment. They may require different model assumptions, but at least the code should work with minimal effort. This is a great way to explore our data. However, particularly when we want to create nice publishable visualizations, it can be useful to work with specific functions for manipulating dates and times.
17.4 Date Objects
Most of the variables that we have worked with so far are either strings or numbers. Dates are in some ways like numbers: they have a natural ordering, we can talk about the difference between two numbers, and it makes sense to color and plot them on a continuous scale. However, they do have some unique properties, particularly when we want to extract information such as the day of the week from a date, that require a unique data type. To create a date object in Polars, we can use the pl.date() function to construct a date from its components.
chaucer_with_dates = (
page_views
.filter(c.doc_id == "Geoffrey Chaucer")
.with_columns(
date = pl.date(c.year, c.month, c.day)
)
)
chaucer_with_dates| doc_id | year | month | day | views | date |
|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | date |
| "Geoffrey Chaucer" | 2023 | 8 | 1 | 2133 | 2023-08-01 |
| "Geoffrey Chaucer" | 2023 | 8 | 2 | 1977 | 2023-08-02 |
| "Geoffrey Chaucer" | 2023 | 8 | 3 | 1860 | 2023-08-03 |
| "Geoffrey Chaucer" | 2023 | 8 | 4 | 1739 | 2023-08-04 |
| "Geoffrey Chaucer" | 2023 | 8 | 5 | 1849 | 2023-08-05 |
| … | … | … | … | … | … |
| "Geoffrey Chaucer" | 2023 | 9 | 25 | 2384 | 2023-09-25 |
| "Geoffrey Chaucer" | 2023 | 9 | 26 | 2188 | 2023-09-26 |
| "Geoffrey Chaucer" | 2023 | 9 | 27 | 1950 | 2023-09-27 |
| "Geoffrey Chaucer" | 2023 | 9 | 28 | 2590 | 2023-09-28 |
| "Geoffrey Chaucer" | 2023 | 9 | 29 | 2082 | 2023-09-29 |
Notice that the new column has the special data type Date. If we build a visualization using a date object, plotnine is able to make helpful built-in choices about how to label the axis. For example, the following code will make a line plot that has nicely labeled values on the x-axis.
(
chaucer_with_dates
.pipe(ggplot, aes("date", "views"))
+ geom_line(color="red")
+ labs(
title="Page Views for Geoffrey Chaucer",
x="Date",
y="Views"
)
)The output shows that the algorithm decided to label the dates appropriately. We can manually change the frequency of the labels using scale_x_date() and setting the date_breaks option. For example, the code below will display one label for each week:
(
chaucer_with_dates
.pipe(ggplot, aes("date", "views"))
+ geom_line(color="red")
+ scale_x_date(date_breaks="1 week", date_labels="%Y-%m-%d")
+ theme(axis_text_x=element_text(angle=45, hjust=1))
+ labs(
title="Page Views for Geoffrey Chaucer (Weekly Labels)",
x="Date",
y="Views"
)
)Once we have a date object, we can also extract useful information from it. For example, we can extract the weekday of the date using the .dt namespace in Polars. Here, we will compute the weekday and then calculate the average number of page views for each day of the week:
(
chaucer_with_dates
.with_columns(
weekday = c.date.dt.weekday()
)
.group_by(c.weekday)
.agg(
views_avg = c.views.mean()
)
.sort(c.views_avg, descending=True)
)| weekday | views_avg |
|---|---|
| i8 | f64 |
| 2 | 2273.111111 |
| 4 | 2098.555556 |
| 3 | 1988.111111 |
| 1 | 1975.75 |
| 7 | 1865.5 |
| 5 | 1828.222222 |
| 6 | 1762.375 |
Here we see the average number of page views by day of the week, where Monday is 1 and Sunday is 7. We can also use the date objects to filter the dataset. For example, we can filter the dataset to only include those dates after 15 January 2020, about two-thirds of the way through our dataset:
(
chaucer_with_dates
.filter(c.date > pl.date(2020, 1, 15))
)| doc_id | year | month | day | views | date |
|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | date |
| "Geoffrey Chaucer" | 2023 | 8 | 1 | 2133 | 2023-08-01 |
| "Geoffrey Chaucer" | 2023 | 8 | 2 | 1977 | 2023-08-02 |
| "Geoffrey Chaucer" | 2023 | 8 | 3 | 1860 | 2023-08-03 |
| "Geoffrey Chaucer" | 2023 | 8 | 4 | 1739 | 2023-08-04 |
| "Geoffrey Chaucer" | 2023 | 8 | 5 | 1849 | 2023-08-05 |
| … | … | … | … | … | … |
| "Geoffrey Chaucer" | 2023 | 9 | 25 | 2384 | 2023-09-25 |
| "Geoffrey Chaucer" | 2023 | 9 | 26 | 2188 | 2023-09-26 |
| "Geoffrey Chaucer" | 2023 | 9 | 27 | 1950 | 2023-09-27 |
| "Geoffrey Chaucer" | 2023 | 9 | 28 | 2590 | 2023-09-28 |
| "Geoffrey Chaucer" | 2023 | 9 | 29 | 2082 | 2023-09-29 |
Note that we use the pl.date() function to create a date literal for comparison. Polars will also accept string representations of dates in ISO format for filtering.
Polars provides many methods in the .dt namespace for extracting components from date and datetime objects. Common examples include .dt.year(), .dt.month(), .dt.day(), .dt.weekday(), .dt.ordinal_day() (day of year), and .dt.quarter(). These are useful for creating features for analysis or for grouping data by time periods.
17.5 Datetime Objects
The page_views dataset records the date of each observation. Sometimes we have data that describes the time of an event more specifically in terms of hours, minutes, and even possibly seconds. We will use the term datetime to describe an object that stores the precise time that an event occurs. The idea is that to describe the time that something happens we need to specify a date and a time. Later in the chapter, we will see an object that stores time without reference to a particular day. Whereas dates have a natural precision (a single day), we might desire to work with datetime objects of different levels of granularity. In some cases we might have just hours of the day and in others we might have access to records at the level of a millisecond such as data from radio and TV. In Polars, datetime objects are stored with microsecond precision by default, but we can regard the precision as whatever granularity we have given in our data for all practical purposes.
Datetime objects largely function the same as date objects. Let’s grab another dataset from Wikipedia that has precise timestamps. Below, we read in a dataset consisting of the last 500 edits made to each of the 75 British author pages in our collection.
page_revisions = (
page_revisions
.with_columns(
datetime = c.datetime.str.to_datetime(time_zone="UTC")
)
)
page_revisions| doc_id | user | datetime | size | comment |
|---|---|---|---|---|
| str | str | datetime[μs, UTC] | i64 | str |
| "Marie de France" | "YurikBot" | 2006-07-06 23:24:54 UTC | 3061 | "robot Adding: [[it:Maria di F… |
| "Marie de France" | "YurikBot" | 2006-07-24 18:39:08 UTC | 3085 | "robot Adding: [[pt:Maria de F… |
| "Marie de France" | "Ccarroll" | 2006-09-10 13:36:44 UTC | 3085 | null |
| "Marie de France" | "63.231.20.88" | 2006-09-18 04:59:53 UTC | 3105 | null |
| "Marie de France" | "ExplicitImplicity" | 2006-09-23 22:27:20 UTC | 3104 | "i believe it is stupid to tran… |
| … | … | … | … | … |
| "Seamus Heaney" | "InternetArchiveBot" | 2023-09-21 04:51:45 UTC | 85491 | "Rescuing 1 sources and tagging… |
| "Seamus Heaney" | "81.110.56.233" | 2023-09-24 12:10:49 UTC | 85493 | null |
| "Seamus Heaney" | "149.50.162.177" | 2023-09-25 16:06:01 UTC | 85482 | "/* Death */" |
| "Seamus Heaney" | "--WikiUser1234945--" | 2023-09-25 16:06:11 UTC | 85493 | "Reverted edits by [[Special:Co… |
| "Seamus Heaney" | "BD2412" | 2023-09-27 02:39:46 UTC | 85493 | "/* 1957–1969 */Clean up spacin… |
Notice that each row has a record in the column datetime that provides a precise datetime object giving the second at which the page was modified. The data were stored using the ISO-8601 format (“YYYY-MM-DD HH:MM:SS”), which Polars can automatically parse with the .str.to_datetime() method.
Our page_revisions dataset includes several pieces of information about each of the edits. We have a username for the person who made the edit (recall that anyone can edit a Wikipedia page), the size in bytes of the page after the edit was made, and a short comment describing what was done in the change. Looking at the page size over time shows when large additions and deletions were made to each record. The code below yields a temporal visualization:
selected_authors = ["Geoffrey Chaucer", "Emily Brontë"]
(
page_revisions
.filter(c.doc_id.is_in(selected_authors))
.pipe(ggplot, aes("datetime", "size", color="doc_id"))
+ geom_line()
+ scale_color_manual(values=["red", "blue"])
+ labs(
title="Wikipedia Page Size Over Time",
x="Date",
y="Page Size (bytes)",
color="Author"
)
)Looking at the plot, we can see that there are a few very large edits (both deletions and additions), likely consisting of large sections added and subtracted from the page. If we want to visualize when these large changes occurred, it would be useful to include a more granular set of labels on the x-axis. We can do this using scale_x_datetime() with custom formatting:
(
page_revisions
.filter(c.doc_id.is_in(selected_authors))
.pipe(ggplot, aes("datetime", "size", color="doc_id"))
+ geom_line()
+ scale_color_manual(values=["red", "blue"])
+ scale_x_datetime(date_breaks="6 months", date_labels="%b %Y")
+ theme(axis_text_x=element_text(angle=90, hjust=1))
+ labs(
title="Wikipedia Page Size Over Time (Custom Labels)",
x="Date",
y="Page Size (bytes)",
color="Author"
)
)We can also filter our dataset by a particular range of dates or times. This is useful to zoom into a specific region of our data to investigate patterns that may be otherwise lost. For example, if we wanted to see all of the page sizes for two authors from 2021 onward:
(
page_revisions
.filter(c.doc_id.is_in(selected_authors))
.filter(c.datetime > pl.datetime(2021, 1, 1, time_zone="UTC"))
)| doc_id | user | datetime | size | comment |
|---|---|---|---|---|
| str | str | datetime[μs, UTC] | i64 | str |
| "Geoffrey Chaucer" | "86.31.15.58" | 2021-01-18 09:42:33 UTC | 68291 | "/* Origin */" |
| "Geoffrey Chaucer" | "86.31.15.58" | 2021-01-18 09:43:16 UTC | 68288 | "/* Taco Bell */" |
| "Geoffrey Chaucer" | "88.108.207.22" | 2021-01-18 11:15:33 UTC | 68284 | "/* Career */" |
| "Geoffrey Chaucer" | "Pahunkat" | 2021-01-18 11:16:09 UTC | 68288 | "Rollback edit(s) by [[Special:… |
| "Geoffrey Chaucer" | "88.108.207.22" | 2021-01-18 11:16:40 UTC | 68293 | "/* Origin */" |
| … | … | … | … | … |
| "Emily Brontë" | "HeyElliott" | 2023-09-16 20:05:37 UTC | 41598 | "[[WP:LQ]]" |
| "Emily Brontë" | "Qwerfjkl (bot)" | 2023-09-21 16:34:46 UTC | 41586 | "Converting Gutenberg author ID… |
| "Emily Brontë" | "Ficaia" | 2023-09-27 12:47:22 UTC | 41717 | null |
| "Emily Brontë" | "Keith D" | 2023-09-27 20:36:18 UTC | 41688 | "Remove namespace | Replaced cu… |
| "Emily Brontë" | "Keith D" | 2023-09-27 20:39:51 UTC | 41693 | "/* Adulthood */ Cite fix" |
Notice that the filter includes data from 2021, even though we use a strictly greater than condition. The reason for this is that pl.datetime(2021, 1, 1) is interpreted as the exact time corresponding to 1 January 2021 at 00:00. Any record that comes at any other time during the year of 2021 will be included in the filter.
17.6 Language and Time Zones
So far, we have primarily worked with numeric summaries of the date and datetime objects. In the previous sections, notice that our example of working with the days of the week used numeric values (1-7) rather than names. Polars provides methods to extract string names for weekdays and months, though the output depends on your system locale. Below, for example, we extract various datetime components:
(
page_revisions
.head(10)
.with_columns(
weekday_num = c.datetime.dt.weekday(),
month_num = c.datetime.dt.month(),
hour = c.datetime.dt.hour(),
date_only = c.datetime.dt.date()
)
.select(c.datetime, c.weekday_num, c.month_num, c.hour, c.date_only)
)| datetime | weekday_num | month_num | hour | date_only |
|---|---|---|---|---|
| datetime[μs, UTC] | i8 | i8 | i8 | date |
| 2006-07-06 23:24:54 UTC | 4 | 7 | 23 | 2006-07-06 |
| 2006-07-24 18:39:08 UTC | 1 | 7 | 18 | 2006-07-24 |
| 2006-09-10 13:36:44 UTC | 7 | 9 | 13 | 2006-09-10 |
| 2006-09-18 04:59:53 UTC | 1 | 9 | 4 | 2006-09-18 |
| 2006-09-23 22:27:20 UTC | 6 | 9 | 22 | 2006-09-23 |
| 2006-09-23 22:31:03 UTC | 6 | 9 | 22 | 2006-09-23 |
| 2006-10-03 18:18:09 UTC | 2 | 10 | 18 | 2006-10-03 |
| 2006-10-03 18:19:32 UTC | 2 | 10 | 18 | 2006-10-03 |
| 2006-10-10 18:48:26 UTC | 2 | 10 | 18 | 2006-10-10 |
| 2006-10-11 05:29:32 UTC | 3 | 10 | 5 | 2006-10-11 |
If you need weekday or month names as strings, you can create a mapping from the numeric values. For example, you could use pl.when() chains or join with a lookup table. The numeric representation is often more convenient for analysis and avoids locale-dependent issues.
Another regional issue that arises when working with dates and times are time zones. While seemingly not too difficult a concept, getting time zones to work correctly with complex datasets can be incredibly complicated. A wide range of programming bugs have been attributed to all sorts of edge-cases surrounding the processing of time zones.
All times stored in Polars can be timezone-aware or timezone-naive. By default, times are stored as timezone-naive (no timezone information). The data can be localized to a specific timezone using the .dt.replace_time_zone() method, and converted between time zones using .dt.convert_time_zone(). All of the times recorded in the dataset page_revisions are given in UTC. This is not surprising; most technical sources with a global focus will use this convention.
We can convert between time zones using Polars datetime methods. Since we parsed our data with time_zone="UTC", the datetime column is already timezone-aware. We can convert to another timezone using .dt.convert_time_zone(). Let’s convert our UTC times to New York time:
(
page_revisions
.head(10)
.with_columns(
datetime_utc = c.datetime,
datetime_nyc = c.datetime.dt.convert_time_zone("America/New_York")
)
.with_columns(
hour_utc = c.datetime_utc.dt.hour(),
hour_nyc = c.datetime_nyc.dt.hour()
)
.select(c.datetime_utc, c.datetime_nyc, c.hour_utc, c.hour_nyc)
)| datetime_utc | datetime_nyc | hour_utc | hour_nyc |
|---|---|---|---|
| datetime[μs, UTC] | datetime[μs, America/New_York] | i8 | i8 |
| 2006-07-06 23:24:54 UTC | 2006-07-06 19:24:54 EDT | 23 | 19 |
| 2006-07-24 18:39:08 UTC | 2006-07-24 14:39:08 EDT | 18 | 14 |
| 2006-09-10 13:36:44 UTC | 2006-09-10 09:36:44 EDT | 13 | 9 |
| 2006-09-18 04:59:53 UTC | 2006-09-18 00:59:53 EDT | 4 | 0 |
| 2006-09-23 22:27:20 UTC | 2006-09-23 18:27:20 EDT | 22 | 18 |
| 2006-09-23 22:31:03 UTC | 2006-09-23 18:31:03 EDT | 22 | 18 |
| 2006-10-03 18:18:09 UTC | 2006-10-03 14:18:09 EDT | 18 | 14 |
| 2006-10-03 18:19:32 UTC | 2006-10-03 14:19:32 EDT | 18 | 14 |
| 2006-10-10 18:48:26 UTC | 2006-10-10 14:48:26 EDT | 18 | 14 |
| 2006-10-11 05:29:32 UTC | 2006-10-11 01:29:32 EDT | 5 | 1 |
We can use the timezone information to display data in a useful way to a local audience. For example, the code below displays the frequency of updates as a function of the hour of the day in New York City:
hourly_edits = (
page_revisions
.with_columns(
datetime_nyc = c.datetime.dt.convert_time_zone("America/New_York")
)
.with_columns(
hour_nyc = c.datetime_nyc.dt.hour()
)
.group_by(c.hour_nyc)
.agg(
count = pl.len()
)
.sort(c.hour_nyc)
)
(
hourly_edits
.pipe(ggplot, aes("hour_nyc", "count"))
+ geom_col()
+ labs(
title="Wikipedia Edits by Hour (New York Time)",
x="Hour of Day",
y="Number of Edits"
)
)While certainly many editors are living in other English-speaking cities (London, Los Angeles, or Mumbai), it is generally easier for people to do the mental math for what times correspond relative to their own time zone than relative to UTC.
17.7 Truncating Dates Data
We have shown above how to create date and datetime objects using Polars functions. Also, we have seen how to extract the components from date and datetime objects with the .dt namespace. There are a variety of other functions that help us create and manipulate these temporal objects. While we will not give an entire list of all the available functions in Polars datetime functionality, let’s look at a few of the most useful and representative examples for converting between different ways of representing information about time.
The page_revisions dataset has revisions recorded with the precision of a second. This is likely overly granular for many applications; it might be better to have the data in a format that is only at the level of an hour, for example. We can truncate any date or datetime object by using the .dt.truncate() method along with a specific duration. Setting the duration to “1h” (one hour), for example, will remove all of the minutes and seconds from the time:
(
page_revisions
.head(10)
.with_columns(
datetime_hour = c.datetime.dt.truncate("1h"),
datetime_day = c.datetime.dt.truncate("1d")
)
.select(c.datetime, c.datetime_hour, c.datetime_day)
)| datetime | datetime_hour | datetime_day |
|---|---|---|
| datetime[μs, UTC] | datetime[μs, UTC] | datetime[μs, UTC] |
| 2006-07-06 23:24:54 UTC | 2006-07-06 23:00:00 UTC | 2006-07-06 00:00:00 UTC |
| 2006-07-24 18:39:08 UTC | 2006-07-24 18:00:00 UTC | 2006-07-24 00:00:00 UTC |
| 2006-09-10 13:36:44 UTC | 2006-09-10 13:00:00 UTC | 2006-09-10 00:00:00 UTC |
| 2006-09-18 04:59:53 UTC | 2006-09-18 04:00:00 UTC | 2006-09-18 00:00:00 UTC |
| 2006-09-23 22:27:20 UTC | 2006-09-23 22:00:00 UTC | 2006-09-23 00:00:00 UTC |
| 2006-09-23 22:31:03 UTC | 2006-09-23 22:00:00 UTC | 2006-09-23 00:00:00 UTC |
| 2006-10-03 18:18:09 UTC | 2006-10-03 18:00:00 UTC | 2006-10-03 00:00:00 UTC |
| 2006-10-03 18:19:32 UTC | 2006-10-03 18:00:00 UTC | 2006-10-03 00:00:00 UTC |
| 2006-10-10 18:48:26 UTC | 2006-10-10 18:00:00 UTC | 2006-10-10 00:00:00 UTC |
| 2006-10-11 05:29:32 UTC | 2006-10-11 05:00:00 UTC | 2006-10-11 00:00:00 UTC |
The benefit of using the .dt.truncate() method is that we could then group, join, or summarize the data in a way that treats each value of datetime the same as long as they occur during the same hour. There is also a .dt.round() method for rounding the datetime object to the nearest desired unit. In the special case in which we want to extract just the date part of a datetime, we can use the .dt.date() method. The code below illustrates this process, as well as showing how reducing the temporal granularity can be a useful first step before grouping and summarizing:
(
page_revisions
.with_columns(
date = c.datetime.dt.date()
)
.group_by(c.date)
.agg(
count = pl.len()
)
.sort(c.date, descending=True)
)| date | count |
|---|---|
| date | u32 |
| 2023-09-30 | 10 |
| 2023-09-29 | 11 |
| 2023-09-28 | 9 |
| 2023-09-27 | 14 |
| 2023-09-26 | 12 |
| … | … |
| 2003-02-01 | 1 |
| 2003-01-29 | 2 |
| 2003-01-14 | 1 |
| 2003-01-04 | 1 |
| 2002-10-26 | 1 |
Above, we effectively remove the time component of the datetime object and treat the variable as having only a date element. Occasionally, we might want to do the opposite: considering only the time component of a datetime object without worrying about the specific date. For example, we might want to summarize the number of edits that are made based on the time of the day. We can do this by extracting the hour component:
(
page_revisions
.with_columns(
hour = c.datetime.dt.hour()
)
.group_by(c.hour)
.agg(
count = pl.len()
)
.sort(c.hour)
)| hour | count |
|---|---|
| i8 | u32 |
| 0 | 1316 |
| 1 | 1229 |
| 2 | 1149 |
| 3 | 1010 |
| 4 | 982 |
| … | … |
| 19 | 1837 |
| 20 | 1838 |
| 21 | 1797 |
| 22 | 1675 |
| 23 | 1349 |
This creates a variable that stores the hour without any date information, which is useful for analyzing patterns that repeat daily.
17.8 Window Functions
At the start of this chapter, we considered time series to be a sequence of events without too much focus on the specific dates and times. This viewpoint can be a useful construct when we want to look at changes over time. For example, we have the overall size of each Wikipedia page after an edit. A measurement that would be useful is the difference in page size made by an edit. To add a variable to a dataset, we usually use the .with_columns() method, and that will again work here. However, in this case we need to reference values that come before or after a certain value. This requires the use of window functions.
A window function transforms a variable in a dataset into a new variable with the same length in a way that takes into account the entire ordering of the data. Two examples of window functions that are useful when working with time series data are .shift() with positive and negative values, which give access to rows preceding or following a row, respectively. Let’s apply this to our dataset of page revisions to get the previous and next values of the page size variable.
(
page_revisions
.with_columns(
size_last = c.size.shift(1),
size_next = c.size.shift(-1)
)
.select(c.doc_id, c.datetime, c.size, c.size_last, c.size_next)
)| doc_id | datetime | size | size_last | size_next |
|---|---|---|---|---|
| str | datetime[μs, UTC] | i64 | i64 | i64 |
| "Marie de France" | 2006-07-06 23:24:54 UTC | 3061 | null | 3085 |
| "Marie de France" | 2006-07-24 18:39:08 UTC | 3085 | 3061 | 3085 |
| "Marie de France" | 2006-09-10 13:36:44 UTC | 3085 | 3085 | 3105 |
| "Marie de France" | 2006-09-18 04:59:53 UTC | 3105 | 3085 | 3104 |
| "Marie de France" | 2006-09-23 22:27:20 UTC | 3104 | 3105 | 3132 |
| … | … | … | … | … |
| "Seamus Heaney" | 2023-09-21 04:51:45 UTC | 85491 | 85341 | 85493 |
| "Seamus Heaney" | 2023-09-24 12:10:49 UTC | 85493 | 85491 | 85482 |
| "Seamus Heaney" | 2023-09-25 16:06:01 UTC | 85482 | 85493 | 85493 |
| "Seamus Heaney" | 2023-09-25 16:06:11 UTC | 85493 | 85482 | 85493 |
| "Seamus Heaney" | 2023-09-27 02:39:46 UTC | 85493 | 85493 | null |
Notice that the first value of size_last is missing because there is no last value for the first item in our data. Similarly, the variable size_next will have a missing value at the end of the dataset. As written above, the code incorrectly crosses the time points at the boundary of each page. That is, for the first row of the second page (Geoffrey Chaucer) it thinks that the size of the last page is the size of the final page of the Marie de France record. To fix this, we can use the .over() method to apply the window function within groups. Window functions respect the grouping of the data:
(
page_revisions
.sort(c.doc_id, c.datetime)
.with_columns(
size_last = c.size.shift(1).over(c.doc_id),
size_next = c.size.shift(-1).over(c.doc_id)
)
.select(c.doc_id, c.datetime, c.size, c.size_last, c.size_next)
.slice(495, 10)
)| doc_id | datetime | size | size_last | size_next |
|---|---|---|---|---|
| str | datetime[μs, UTC] | i64 | i64 | i64 |
| "A. A. Milne" | 2023-08-05 20:59:54 UTC | 44019 | 43990 | 44038 |
| "A. A. Milne" | 2023-08-18 09:26:21 UTC | 44038 | 44019 | 44044 |
| "A. A. Milne" | 2023-08-21 22:05:31 UTC | 44044 | 44038 | 44045 |
| "A. A. Milne" | 2023-08-21 22:07:46 UTC | 44045 | 44044 | 44077 |
| "A. A. Milne" | 2023-08-31 20:41:54 UTC | 44077 | 44045 | null |
| "Alexander Pope" | 2016-04-17 04:03:45 UTC | 31345 | null | 31531 |
| "Alexander Pope" | 2016-05-05 22:17:54 UTC | 31531 | 31345 | 31532 |
| "Alexander Pope" | 2016-05-05 22:24:29 UTC | 31532 | 31531 | 31585 |
| "Alexander Pope" | 2016-05-07 19:05:49 UTC | 31585 | 31532 | 31614 |
| "Alexander Pope" | 2016-05-09 17:22:21 UTC | 31614 | 31585 | 31585 |
Notice that now, correctly, the dataset has a missing size_next for the final Marie de France record and a missing size_last for the first Geoffrey Chaucer record. Now, let’s use this to compute the change in the page sizes for each of the revisions:
(
page_revisions
.sort(c.doc_id, c.datetime)
.with_columns(
size_diff = c.size - c.size.shift(1).over(c.doc_id)
)
.select(c.doc_id, c.datetime, c.size, c.size_diff)
)| doc_id | datetime | size | size_diff |
|---|---|---|---|
| str | datetime[μs, UTC] | i64 | i64 |
| "A. A. Milne" | 2015-09-04 16:30:26 UTC | 30702 | null |
| "A. A. Milne" | 2015-09-22 14:51:08 UTC | 30400 | -302 |
| "A. A. Milne" | 2015-11-07 10:23:31 UTC | 30389 | -11 |
| "A. A. Milne" | 2015-11-08 00:48:23 UTC | 30397 | 8 |
| "A. A. Milne" | 2015-11-16 03:12:19 UTC | 30427 | 30 |
| … | … | … | … |
| "William Wordsworth" | 2023-09-25 11:19:40 UTC | 41466 | 0 |
| "William Wordsworth" | 2023-09-27 15:10:27 UTC | 41464 | -2 |
| "William Wordsworth" | 2023-09-27 15:11:44 UTC | 41466 | 2 |
| "William Wordsworth" | 2023-09-27 15:13:48 UTC | 41467 | 1 |
| "William Wordsworth" | 2023-09-27 15:18:10 UTC | 41466 | -1 |
In the above output, we can see the changes in page sizes. If we wanted to find reversions in the dataset, we could apply the .shift() function several times. As an alternative, we can also give a parameter to .shift() to indicate that we want to go back (or forward) more than one row. Let’s put this together to indicate which commits seem to be a reversion (the page size exactly matches the page size from two commits prior) as well as the overall size of the reversion:
(
page_revisions
.sort(c.doc_id, c.datetime)
.with_columns(
size_diff = c.size - c.size.shift(1).over(c.doc_id),
size_two_back = c.size.shift(2).over(c.doc_id),
is_reversion = c.size == c.size.shift(2).over(c.doc_id)
)
.filter(c.is_reversion)
.select(c.doc_id, c.datetime, c.size_diff, c.is_reversion)
)| doc_id | datetime | size_diff | is_reversion |
|---|---|---|---|
| str | datetime[μs, UTC] | i64 | bool |
| "A. A. Milne" | 2015-12-17 18:13:11 UTC | -6 | true |
| "A. A. Milne" | 2016-01-07 18:55:26 UTC | 30407 | true |
| "A. A. Milne" | 2016-04-14 17:38:45 UTC | 36 | true |
| "A. A. Milne" | 2016-04-17 15:06:58 UTC | -77 | true |
| "A. A. Milne" | 2016-04-27 20:04:19 UTC | -1 | true |
| … | … | … | … |
| "William Wordsworth" | 2023-09-18 06:35:43 UTC | -1 | true |
| "William Wordsworth" | 2023-09-24 14:54:33 UTC | 8 | true |
| "William Wordsworth" | 2023-09-25 11:19:40 UTC | 0 | true |
| "William Wordsworth" | 2023-09-27 15:11:44 UTC | 2 | true |
| "William Wordsworth" | 2023-09-27 15:18:10 UTC | -1 | true |
These reversions can be studied to see the nature of the Wikipedia editing process. For example, how long do these reversions tend to take? Are certain pages more likely to undergo reversions? Do these take place during a certain time of the day? These are all questions that we should now be able to address using this dataset and the tools described above.
Polars provides many other window functions beyond .shift(). Common examples include .cum_sum() for cumulative sums, .rolling_mean() for moving averages, .rank() for ranking values, and .diff() which computes the difference between consecutive values (equivalent to x - x.shift(1)). All of these can be combined with .over() to apply within groups.
17.9 Range Joins
We will finish this chapter by looking at range joins, which allow for combining datasets based on inequalities between keys contained in two different datasets. Range joins are functionality that can greatly simplify some operations that arise when working with temporal data. Recall that all of the join functions that we saw in Chapter 4 work by finding a correspondence where keys from one dataset equal the values of keys in another dataset. In some cases it happens that we want to join two tables on inequalities rather than exact values.
Take, for example, the metadata table for the 75 authors in our Wikipedia collection. Recall that this dataset contains the years that each author was born and the years each author died. What if we wanted to make a dataset by joining the metadata to itself, matching each author with other authors that would have been alive in overlapping years? We can do this using Polars’ .join_where() method:
(
meta
.join_where(
meta,
(c.born <= c.born_right) & (c.died > c.born_right) & (c.doc_id != c.doc_id_right)
)
.select(c.doc_id, c.doc_id_right, c.born, c.died, c.born_right, c.died_right)
)| doc_id | doc_id_right | born | died | born_right | died_right |
|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 |
| "Stephen Spender" | "Seamus Heaney" | 1909 | 1995 | 1939 | 1939 |
| "Daphne du Maurier" | "Seamus Heaney" | 1907 | 1989 | 1939 | 1939 |
| "Daphne du Maurier" | "Stephen Spender" | 1907 | 1989 | 1909 | 1995 |
| "Daphne du Maurier" | "W. H. Auden" | 1907 | 1989 | 1907 | 1973 |
| "Daphne du Maurier" | "Louis MacNeice" | 1907 | 1989 | 1907 | 1963 |
| … | … | … | … | … | … |
| "William Langland" | "Geoffrey Chaucer" | 1332 | 1386 | 1343 | 1400 |
| "John Gower" | "Thomas Malory" | 1330 | 1408 | 1405 | 1471 |
| "John Gower" | "Margery Kempe" | 1330 | 1408 | 1373 | 1438 |
| "John Gower" | "Geoffrey Chaucer" | 1330 | 1408 | 1343 | 1400 |
| "John Gower" | "William Langland" | 1330 | 1408 | 1332 | 1386 |
The .join_where() method takes one or more expressions to specify which pairs of rows to keep in the joined dataset. To refer to columns in the second (right) dataset that share the same name, use the _right suffix. In the example above, we find all pairs where the first author’s lifespan overlaps with the second author’s birth, excluding self-matches.
The resulting dataset would make an interesting type of network, showing temporal overlap of authors in the dataset. We will investigate this in the following chapter.
The .join_where() method can be slow because it needs to check every combination of rows in the two datasets. For large datasets, consider whether you can use .join_asof() instead, which is optimized for nearest-value matching and requires the data to be sorted. The .join_asof() method is particularly useful when you want to match each row to the nearest (or nearest preceding/following) value in another dataset.
17.10 Temporal Durations
Sometimes we want to work not with specific dates or times, but with durations—the amount of time between two events. Polars represents durations with the Duration data type. We can compute durations by subtracting one datetime from another, or by using the pl.duration() function to create specific intervals.
(
page_revisions
.sort(c.doc_id, c.datetime)
.with_columns(
time_since_last = c.datetime - c.datetime.shift(1).over(c.doc_id)
)
.select(c.doc_id, c.datetime, c.time_since_last)
.drop_nulls()
)| doc_id | datetime | time_since_last |
|---|---|---|
| str | datetime[μs, UTC] | duration[μs] |
| "A. A. Milne" | 2015-09-22 14:51:08 UTC | 17d 22h 20m 42s |
| "A. A. Milne" | 2015-11-07 10:23:31 UTC | 45d 19h 32m 23s |
| "A. A. Milne" | 2015-11-08 00:48:23 UTC | 14h 24m 52s |
| "A. A. Milne" | 2015-11-16 03:12:19 UTC | 8d 2h 23m 56s |
| "A. A. Milne" | 2015-11-17 07:52:16 UTC | 1d 4h 39m 57s |
| … | … | … |
| "William Wordsworth" | 2023-09-25 11:19:40 UTC | 1h 17m |
| "William Wordsworth" | 2023-09-27 15:10:27 UTC | 2d 3h 50m 47s |
| "William Wordsworth" | 2023-09-27 15:11:44 UTC | 1m 17s |
| "William Wordsworth" | 2023-09-27 15:13:48 UTC | 2m 4s |
| "William Wordsworth" | 2023-09-27 15:18:10 UTC | 4m 22s |
The resulting duration can be converted to various units using methods like .dt.total_days(), .dt.total_hours(), .dt.total_minutes(), or .dt.total_seconds(). This is useful for analyzing how frequently events occur or how long processes take.
(
page_revisions
.sort(c.doc_id, c.datetime)
.with_columns(
time_since_last = c.datetime - c.datetime.shift(1).over(c.doc_id)
)
.with_columns(
days_since_last = c.time_since_last.dt.total_days()
)
.select(c.doc_id, c.datetime, c.days_since_last)
.drop_nulls()
.group_by(c.doc_id)
.agg(
avg_days_between_edits = c.days_since_last.mean()
)
.sort(c.avg_days_between_edits)
)| doc_id | avg_days_between_edits |
|---|---|
| str | f64 |
| "George Orwell" | 1.138277 |
| "James Joyce" | 1.224449 |
| "Oscar Wilde" | 1.58517 |
| "Thomas More" | 1.791583 |
| "Percy Bysshe Shelley" | 1.815631 |
| … | … |
| "John Gower" | 24.467105 |
| "Edward Upward" | 27.0 |
| "Katherine Philipps" | 31.182692 |
| "Rex Warner" | 39.488506 |
| "Charlotte Smith" | 168.222222 |
This shows us which Wikipedia pages are edited most frequently (smallest average days between edits) and which are edited less often.
17.11 Summary
In this chapter, we explored techniques for working with temporal data in Polars. We learned how to create date and datetime objects from their components, extract useful information from them, and visualize temporal patterns. We saw how to work with time zones and how to truncate temporal data to different levels of granularity. Window functions allowed us to compare values across time within groups, and range joins provided a way to combine datasets based on temporal overlap. These tools are essential for any analysis involving time-based data.






