7 Strings

Practice Notebooks

Notebook07a [Colab↗]
Notebook07b [Colab↗]

7.1 Setup

Load all of the modules and datasets needed for the chapter.

import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

wiki = pl.read_csv("data/wiki_uk_meta.csv.gz", ignore_errors=True)

7.2 Introduction

Text data appears everywhere in data science: names, addresses, product descriptions, social media posts, and scraped web content. Working effectively with strings requires learning a small set of operations that allow us to search, extract, and transform text within a DataFrame. In this chapter, we explore the string methods available in Polars, which provide a consistent and efficient way to manipulate text columns. We will also introduce regular expressions, a powerful pattern-matching language that underlies many of these operations.

To illustrate these methods, we use a dataset of British writers compiled from Wikipedia. The dataset contains metadata about 75 authors spanning from the medieval period to the present day.

wiki.glimpse()

Rows: 75
Columns: 7
$ doc_id <str> 'Marie de France', 'Geoffrey Chaucer', 'John Gower', 'William Langland', 'Margery Kempe', 'Thomas Malory', 'Thomas More', 'Edmund Spenser', 'Walter Raleigh', 'Philip Sidney'
$ born   <i64> 1160, 1343, 1330, 1332, 1373, 1405, 1478, 1552, 1552, 1554
$ died   <i64> 1215, 1400, 1408, 1386, 1438, 1471, 1535, 1599, 1618, 1586
$ era    <str> 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Sixteenth C', 'Sixteenth C', 'Sixteenth C', 'Sixteenth C'
$ gender <str> 'female', 'male', 'male', 'male', 'female', 'male', 'male', 'male', 'male', 'male'
$ link   <str> 'Marie_de_France', 'Geoffrey_Chaucer', 'John_Gower', 'William_Langland', 'Margery_Kempe', 'Thomas_Malory', 'Thomas_More', 'Edmund_Spenser', 'Walter_Raleigh', 'Philip_Sidney'
$ short  <str> 'Marie d. F.', 'Chaucer', 'Gower', 'Langland', 'Kempe', 'Malory', 'More', 'Spenser', 'Raleigh', 'Sidney'

The doc_id column contains each author’s full name, while short provides a shortened version. The link column gives the Wikipedia URL suffix for each author’s page. We also have birth and death years, a categorical era column, and gender. This mix of structured and semi-structured text gives us plenty of opportunities to practice string manipulation.

7.3 Filtering with Contains

A common task is selecting rows where a text column contains a particular pattern. The contains method checks whether each value in a string column matches a given pattern. Let’s find all authors whose names include “William”:

(
    wiki
    .filter(c.doc_id.str.contains("William"))
)

shape: (4, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"William Langland"	1332	1386	"Early"	"male"	"William_Langland"	"Langland"
"William Shakespeare"	1564	1616	"Sixteenth C"	"male"	"William_Shakespeare"	"Shakespeare"
"William Blake"	1757	1827	"Romantic"	"male"	"William_Blake"	"Blake"
"William Wordsworth"	1770	1850	"Romantic"	"male"	"William_Wordsworth"	"Wordsworth"

The pattern matching is case-sensitive. To find authors from the sixteenth century, we filter on the era column:

(
    wiki
    .filter(c.era.str.contains("Sixteenth"))
)

shape: (6, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"Thomas More"	1478	1535	"Sixteenth C"	"male"	"Thomas_More"	"More"
"Edmund Spenser"	1552	1599	"Sixteenth C"	"male"	"Edmund_Spenser"	"Spenser"
"Walter Raleigh"	1552	1618	"Sixteenth C"	"male"	"Walter_Raleigh"	"Raleigh"
"Philip Sidney"	1554	1586	"Sixteenth C"	"male"	"Philip_Sidney"	"Sidney"
"Christopher Marlowe"	1564	1593	"Sixteenth C"	"male"	"Christopher_Marlowe"	"Marlowe"
"William Shakespeare"	1564	1616	"Sixteenth C"	"male"	"William_Shakespeare"	"Shakespeare"

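Filters on string columns combine with other conditions using the standard logical operators & (and), | (or), and ~ (not). Below we find female authors born before 1800, pairing an equality test on gender with a numeric comparison on born: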

(
    wiki
    .filter(
        (c.gender == "female") & (c.born < 1800)
    )
)

shape: (11, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"Marie de France"	1160	1215	"Early"	"female"	"Marie_de_France"	"Marie d. F."
"Margery Kempe"	1373	1438	"Early"	"female"	"Margery_Kempe"	"Kempe"
"Emilia Lanier"	1569	1645	"Seventeenth C"	"female"	"Emilia_Lanier"	"Lanier"
"Katherine Philipps"	1632	1664	"Seventeenth C"	"female"	"Katherine_Philipps"	"Philipps"
"Margaret Cavendish"	1623	1673	"Seventeenth C"	"female"	"Margaret_Cavendish"	"Cavendish"
…	…	…	…	…	…	…
"Mary Robinson"	1757	1800	"Romantic"	"female"	"Mary_Robinson_(poet)"	"Robinson"
"Mary Wollstonecraft"	1759	1797	"Romantic"	"female"	"Mary_Wollstonecraft"	"Wollstonecraft"
"Ann Radcliffe"	1764	1823	"Romantic"	"female"	"Ann_Radcliffe"	"Radcliffe"
"Jane Austen"	1775	1817	"Romantic"	"female"	"Jane_Austen"	"Austen"
"Felicia Hemans"	1793	1835	"Romantic"	"female"	"Felicia_Hemans"	"Hemans"

7.4 Regular Expression Basics

So far we have searched for literal text, but the contains method (and most other string methods in Polars) actually accepts regular expressions by default. Regular expressions are patterns that describe sets of strings. They give us far more flexibility than searching for exact text.

Let’s start with two simple but powerful patterns: ^ matches the start of a string, and $ matches the end. To find all authors whose names start with the letter “M”:

(
    wiki
    .filter(c.doc_id.str.contains(r"^M"))
)

shape: (6, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"Marie de France"	1160	1215	"Early"	"female"	"Marie_de_France"	"Marie d. F."
"Margery Kempe"	1373	1438	"Early"	"female"	"Margery_Kempe"	"Kempe"
"Margaret Cavendish"	1623	1673	"Seventeenth C"	"female"	"Margaret_Cavendish"	"Cavendish"
"Mary Robinson"	1757	1800	"Romantic"	"female"	"Mary_Robinson_(poet)"	"Robinson"
"Mary Wollstonecraft"	1759	1797	"Romantic"	"female"	"Mary_Wollstonecraft"	"Wollstonecraft"
"Matthew Arnold"	1822	1888	"Victorian"	"male"	"Matthew_Arnold"	"Arnold"

The r before the string creates a raw string literal, which is a Python convention for regular expressions. It prevents Python from treating backslashes as escape characters, which becomes important with more complex patterns.
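
The effect of the r prefix can be checked in plain Python, without Polars. This short sketch uses \b, which Python reads as a backspace character in an ordinary string; the variable names are only for illustration.

```python
# In an ordinary string, Python converts "\b" into a single backspace character,
# so a regex engine would never see the backslash.
plain = "\bcat"

# In a raw string, the backslash and the "b" are kept as two separate characters.
raw = r"\bcat"

print(len(plain))  # 4: backspace, "c", "a", "t"
print(len(raw))    # 5: backslash, "b", "c", "a", "t"
```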

To find names that end with the letter “e”:

(
    wiki
    .filter(c.doc_id.str.contains(r"e$"))
)

shape: (17, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"Marie de France"	1160	1215	"Early"	"female"	"Marie_de_France"	"Marie d. F."
"Margery Kempe"	1373	1438	"Early"	"female"	"Margery_Kempe"	"Kempe"
"Thomas More"	1478	1535	"Sixteenth C"	"male"	"Thomas_More"	"More"
"Christopher Marlowe"	1564	1593	"Sixteenth C"	"male"	"Christopher_Marlowe"	"Marlowe"
"William Shakespeare"	1564	1616	"Sixteenth C"	"male"	"William_Shakespeare"	"Shakespeare"
…	…	…	…	…	…	…
"Oscar Wilde"	1854	1900	"Victorian"	"male"	"Oscar_Wilde"	"Wilde"
"James Joyce"	1882	1941	"Twentieth C"	"male"	"James_Joyce"	"Joyce"
"D. H. Lawrence"	1885	1930	"Twentieth C"	"male"	"D._H._Lawrence"	"Lawrence"
"A. A. Milne"	1882	1956	"Twentieth C"	"male"	"A._A._Milne"	"Milne"
"Louis MacNeice"	1907	1963	"Twentieth C"	"male"	"Louis_MacNeice"	"MacNeice"

We can combine these anchors with character classes. Square brackets define a set of characters to match. The pattern [A-Z] matches any uppercase letter, while [a-z] matches any lowercase letter. To find names where the first character is lowercase (unusual for English names), we write:

(
    wiki
    .filter(c.doc_id.str.contains(r"^[a-z]"))
)

shape: (0, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str

No results—every name starts with a capital letter, as expected. But what about names that contain a word starting with a lowercase letter? The pattern \s[a-z] matches a space followed by a lowercase letter:

(
    wiki
    .filter(c.doc_id.str.contains(r"\s[a-z]"))
)

shape: (2, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"Marie de France"	1160	1215	"Early"	"female"	"Marie_de_France"	"Marie d. F."
"Daphne du Maurier"	1907	1989	"Twentieth C"	"female"	"Daphne_du_Maurier"	"Maurier"

This finds “Marie de France” and “Daphne du Maurier,” whose names include the lowercase particles “de” and “du.” The \s is a shorthand character class meaning “any whitespace character.” There are several useful shorthands: \d matches any digit, \w matches any “word character” (letters, digits, and underscore), and \S, \D, and \W match the opposite of their lowercase counterparts.

Quantifiers let us specify repetition. The + means “one or more” of the preceding element. The pattern \d+ matches one or more consecutive digits:

(
    wiki
    .filter(c.link.str.contains(r"\d+"))
)

shape: (0, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str

Here the result is empty because none of the Wikipedia links in this dataset contains a digit. Other quantifiers include * (zero or more) and ? (zero or one).

If you want to search for literal text and avoid interpreting special characters as regex operators, set literal=True:

(
    wiki
    .filter(c.era.str.contains("C", literal=True))
)

shape: (36, 7)

doc_id	born	died	era	gender	link	short
str	i64	i64	str	str	str	str
"Thomas More"	1478	1535	"Sixteenth C"	"male"	"Thomas_More"	"More"
"Edmund Spenser"	1552	1599	"Sixteenth C"	"male"	"Edmund_Spenser"	"Spenser"
"Walter Raleigh"	1552	1618	"Sixteenth C"	"male"	"Walter_Raleigh"	"Raleigh"
"Philip Sidney"	1554	1586	"Sixteenth C"	"male"	"Philip_Sidney"	"Sidney"
"Christopher Marlowe"	1564	1593	"Sixteenth C"	"male"	"Christopher_Marlowe"	"Marlowe"
…	…	…	…	…	…	…
"Stephen Spender"	1909	1995	"Twentieth C"	"male"	"Stephen_Spender"	"Spender"
"Christopher Isherwood"	1904	1986	"Twentieth C"	"male"	"Christopher_Isherwood"	"Isherwood"
"Edward Upward"	1903	2009	"Twentieth C"	"male"	"Edward_Upward"	"Upward"
"Rex Warner"	1905	1986	"Twentieth C"	"male"	"Rex_Warner"	"Warner"
"Seamus Heaney"	1939	1939	"Twentieth C"	"male"	"Seamus_Heaney"	"Heaney"

7.5 Aggregating Strings

When working with grouped data, we sometimes want to combine text values into a single string. The join method concatenates all values in a string column, separated by a delimiter you specify. This is particularly useful after grouping.

For example, we can list all authors in each era, separated by commas:

(
    wiki
    .group_by(c.era)
    .agg(
        authors = c.short.sort().str.join(", ")
    )
)

shape: (7, 2)

era	authors
str	str
"Romantic"	"Austen, Blake, Burns, Byron, C…
"Twentieth C"	"Auden, Beckett, C. S. Lewis, C…
"Seventeenth C"	"Cavendish, Donne, Herbert, Hob…
"Early"	"Chaucer, Gower, Kempe, Langlan…
"Sixteenth C"	"Marlowe, More, Raleigh, Shakes…
"Restoration"	"Boswell, Dryden, Johnson, Lock…
"Victorian"	"Arnold, Browning, C. Brontë, D…

Notice that we sorted the names before joining to produce alphabetized lists. You can also call unique() before joining if you want to eliminate duplicates. The combination of sort, unique, and join is a common pattern for producing clean, readable summaries.

7.6 Extracting Text with Patterns

The extract method pulls out the first match of a regular expression pattern. It is typically used inside with_columns to create new columns. Its second argument selects which part of the match to return, and passing 0 returns the entire match. For example, we can extract the first word from each author’s full name using the pattern ^\w+, which means “one or more word characters at the start of the string”:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        first_name = c.doc_id.str.extract(r"^\w+", 0)
    )
)

shape: (75, 2)

doc_id	first_name
str	str
"Marie de France"	"Marie"
"Geoffrey Chaucer"	"Geoffrey"
"John Gower"	"John"
"William Langland"	"William"
"Margery Kempe"	"Margery"
…	…
"Stephen Spender"	"Stephen"
"Christopher Isherwood"	"Christopher"
"Edward Upward"	"Edward"
"Rex Warner"	"Rex"
"Seamus Heaney"	"Seamus"

Similarly, the pattern \w+$ extracts the last word. The second argument 0 asks extract for the entire match:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        last_word = c.doc_id.str.extract(r"\w+$", 0)
    )
)

shape: (75, 2)

doc_id	last_word
str	str
"Marie de France"	"France"
"Geoffrey Chaucer"	"Chaucer"
"John Gower"	"Gower"
"William Langland"	"Langland"
"Margery Kempe"	"Kempe"
…	…
"Stephen Spender"	"Spender"
"Christopher Isherwood"	"Isherwood"
"Edward Upward"	"Upward"
"Rex Warner"	"Warner"
"Seamus Heaney"	"Heaney"

7.7 Capture Groups

When a pattern contains parentheses, they create capture groups that let you extract specific parts of a match. By default, extract returns group 1 (the first set of parentheses). You can specify which group to extract by passing a second argument.

Consider a pattern that matches a first name, then anything in the middle, then a last word: ^(\w+).*\s(\w+)$. The first group captures the opening word, and the second captures the final word. We can extract each separately:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        first_name = c.doc_id.str.extract(r"^(\w+).*\s(\w+)$", 1),
        last_word = c.doc_id.str.extract(r"^(\w+).*\s(\w+)$", 2)
    )
)

shape: (75, 3)

doc_id	first_name	last_word
str	str	str
"Marie de France"	"Marie"	"France"
"Geoffrey Chaucer"	"Geoffrey"	"Chaucer"
"John Gower"	"John"	"Gower"
"William Langland"	"William"	"Langland"
"Margery Kempe"	"Margery"	"Kempe"
…	…	…
"Stephen Spender"	"Stephen"	"Spender"
"Christopher Isherwood"	"Christopher"	"Isherwood"
"Edward Upward"	"Edward"	"Upward"
"Rex Warner"	"Rex"	"Warner"
"Seamus Heaney"	"Seamus"	"Heaney"

The . in the pattern matches any character, and .* matches zero or more of any character—this is how we skip over the middle portion of the name.

If you want all matches of a pattern rather than just the first, use extract_all, which returns a list:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        capital_letters = c.doc_id.str.extract_all(r"[A-Z]")
    )
)

shape: (75, 2)

doc_id	capital_letters
str	list[str]
"Marie de France"	["M", "F"]
"Geoffrey Chaucer"	["G", "C"]
"John Gower"	["J", "G"]
"William Langland"	["W", "L"]
"Margery Kempe"	["M", "K"]
…	…
"Stephen Spender"	["S", "S"]
"Christopher Isherwood"	["C", "I"]
"Edward Upward"	["E", "U"]
"Rex Warner"	["R", "W"]
"Seamus Heaney"	["S", "H"]

7.8 Replacing Text

The replace method substitutes the first occurrence of a pattern with a new string, while replace_all substitutes every occurrence. Below we standardize the era labels by removing the “ C” suffix:

(
    wiki
    .select(c.era)
    .with_columns(
        era_clean = c.era.str.replace(" C", "")
    )
    .unique()
)

shape: (7, 2)

era	era_clean
str	str
"Victorian"	"Victorian"
"Sixteenth C"	"Sixteenth"
"Restoration"	"Restoration"
"Twentieth C"	"Twentieth"
"Romantic"	"Romantic"
"Early"	"Early"
"Seventeenth C"	"Seventeenth"

Replacements can use capture groups from the pattern. In the replacement string, $1 refers to the first capture group, $2 to the second, and so on. Here we swap the first and last words of each name:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        swapped = c.doc_id.str.replace(r"^(\w+)(.*\s)(\w+)$", "$3$2$1")
    )
)

shape: (75, 2)

doc_id	swapped
str	str
"Marie de France"	"France de Marie"
"Geoffrey Chaucer"	"Chaucer Geoffrey"
"John Gower"	"Gower John"
"William Langland"	"Langland William"
"Margery Kempe"	"Kempe Margery"
…	…
"Stephen Spender"	"Spender Stephen"
"Christopher Isherwood"	"Isherwood Christopher"
"Edward Upward"	"Upward Edward"
"Rex Warner"	"Warner Rex"
"Seamus Heaney"	"Heaney Seamus"

7.9 Working with Substrings

Sometimes we need to extract a fixed portion of a string rather than matching a pattern. The slice method returns a substring given a starting position and length. Positions are zero-indexed, meaning the first character is at position 0.

(
    wiki
    .select(c.doc_id)
    .with_columns(
        initials = c.doc_id.str.slice(0, 1)
    )
)

shape: (75, 2)

doc_id	initials
str	str
"Marie de France"	"M"
"Geoffrey Chaucer"	"G"
"John Gower"	"J"
"William Langland"	"W"
"Margery Kempe"	"M"
…	…
"Stephen Spender"	"S"
"Christopher Isherwood"	"C"
"Edward Upward"	"E"
"Rex Warner"	"R"
"Seamus Heaney"	"S"

For cleaning whitespace, strip_chars removes leading and trailing spaces (or other characters you specify). The case conversion methods to_lowercase, to_uppercase, and to_titlecase are useful for standardizing text:

(
    wiki
    .select(c.era)
    .with_columns(
        era_upper = c.era.str.to_uppercase(),
        era_lower = c.era.str.to_lowercase()
    )
    .unique()
)

shape: (7, 3)

era	era_upper	era_lower
str	str	str
"Seventeenth C"	"SEVENTEENTH C"	"seventeenth c"
"Early"	"EARLY"	"early"
"Sixteenth C"	"SIXTEENTH C"	"sixteenth c"
"Victorian"	"VICTORIAN"	"victorian"
"Romantic"	"ROMANTIC"	"romantic"
"Restoration"	"RESTORATION"	"restoration"
"Twentieth C"	"TWENTIETH C"	"twentieth c"

7.10 Counting Strings

Several string methods return numeric values rather than text. The len_chars method returns the number of characters in each string:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        name_length = c.doc_id.str.len_chars()
    )
    .sort(c.name_length, descending=True)
)

shape: (75, 2)

doc_id	name_length
str	u32
"Elizabeth Barrett Browning"	26
"Samuel Taylor Coleridge"	23
"Alfred, Lord Tennyson"	21
"Christopher Isherwood"	21
"Percy Bysshe Shelley"	20
…	…
"John Locke"	10
"Lord Byron"	10
"John Clare"	10
"John Keats"	10
"Rex Warner"	10

The count_matches method counts how many times a pattern appears in each string. Here we count the spaces in each name; adding one to that count gives the number of words:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        spaces = c.doc_id.str.count_matches(" "),
        word_count = c.doc_id.str.count_matches(" ") + 1
    )
)

shape: (75, 3)

doc_id	spaces	word_count
str	u32	u32
"Marie de France"	2	3
"Geoffrey Chaucer"	1	2
"John Gower"	1	2
"William Langland"	1	2
"Margery Kempe"	1	2
…	…	…
"Stephen Spender"	1	2
"Christopher Isherwood"	1	2
"Edward Upward"	1	2
"Rex Warner"	1	2
"Seamus Heaney"	1	2

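The find method returns the zero-indexed position of the first match, or null if the pattern is not found. The position is measured in bytes, which coincides with the character position for plain ASCII text: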

(
    wiki
    .select(c.doc_id)
    .with_columns(
        space_pos = c.doc_id.str.find(" ")
    )
)

shape: (75, 2)

doc_id	space_pos
str	u32
"Marie de France"	5
"Geoffrey Chaucer"	8
"John Gower"	4
"William Langland"	7
"Margery Kempe"	7
…	…
"Stephen Spender"	7
"Christopher Isherwood"	11
"Edward Upward"	6
"Rex Warner"	3
"Seamus Heaney"	6

7.11 Splitting Strings

The split method divides a string into a list of substrings based on a delimiter. This creates a list column, which often needs further processing. The most common follow-up is to use explode to expand each list element into its own row.

(
    wiki
    .select(c.doc_id)
    .with_columns(
        name_parts = c.doc_id.str.split(" ")
    )
)

shape: (75, 2)

doc_id	name_parts
str	list[str]
"Marie de France"	["Marie", "de", "France"]
"Geoffrey Chaucer"	["Geoffrey", "Chaucer"]
"John Gower"	["John", "Gower"]
"William Langland"	["William", "Langland"]
"Margery Kempe"	["Margery", "Kempe"]
…	…
"Stephen Spender"	["Stephen", "Spender"]
"Christopher Isherwood"	["Christopher", "Isherwood"]
"Edward Upward"	["Edward", "Upward"]
"Rex Warner"	["Rex", "Warner"]
"Seamus Heaney"	["Seamus", "Heaney"]

When we explode the list column, each word becomes a separate row:

(
    wiki
    .select(c.doc_id, c.short)
    .with_columns(
        name_parts = c.doc_id.str.split(" ")
    )
    .explode(c.name_parts)
)

shape: (164, 3)

doc_id	short	name_parts
str	str	str
"Marie de France"	"Marie d. F."	"Marie"
"Marie de France"	"Marie d. F."	"de"
"Marie de France"	"Marie d. F."	"France"
"Geoffrey Chaucer"	"Chaucer"	"Geoffrey"
"Geoffrey Chaucer"	"Chaucer"	"Chaucer"
…	…	…
"Edward Upward"	"Upward"	"Upward"
"Rex Warner"	"Warner"	"Rex"
"Rex Warner"	"Warner"	"Warner"
"Seamus Heaney"	"Heaney"	"Seamus"
"Seamus Heaney"	"Heaney"	"Heaney"

This pattern is useful for tasks like building a vocabulary of unique words or counting word frequencies across a corpus.

7.12 RegEx Reference

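Polars uses Rust-based regular expressions. The full syntax is documented on the regex crate page. The patterns used in this chapter were the anchors ^ and $, the character classes [A-Z] and [a-z], the shorthands \s, \d, and \w with their negations \S, \D, and \W, the wildcard ., the quantifiers +, *, and ?, and parentheses for capture groups. A full summary is provided in Chapter 21.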

7.13 Coming from R or Pandas

If you have used the stringi package in R or the .str accessor in Pandas, the methods here will feel familiar. Polars uses the same general approach of namespacing string operations under .str. The main differences are in function names and in the regular expression engine: Polars uses Rust’s regex crate, which is similar to PCRE but omits some features, such as lookaround assertions and backreferences. The core concepts of pattern matching, extraction, and replacement transfer directly.
