7  Strings

7.1 Setup

Load all of the modules and datasets needed for the chapter.

import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

wiki = pl.read_csv("data/wiki_uk_meta.csv.gz", ignore_errors=True)

7.2 Introduction

Text data appears everywhere in data science: names, addresses, product descriptions, social media posts, and scraped web content. Working effectively with strings requires learning a small set of operations that allow us to search, extract, and transform text within a DataFrame. In this chapter, we explore the string methods available in Polars, which provide a consistent and efficient way to manipulate text columns. We will also introduce regular expressions, a powerful pattern-matching language that underlies many of these operations.

To illustrate these methods, we use a dataset of British writers compiled from Wikipedia. The dataset contains metadata about 75 authors spanning from the medieval period to the present day.

wiki.glimpse()
Rows: 75
Columns: 7
$ doc_id <str> 'Marie de France', 'Geoffrey Chaucer', 'John Gower', 'William Langland', 'Margery Kempe', 'Thomas Malory', 'Thomas More', 'Edmund Spenser', 'Walter Raleigh', 'Philip Sidney'
$ born   <i64> 1160, 1343, 1330, 1332, 1373, 1405, 1478, 1552, 1552, 1554
$ died   <i64> 1215, 1400, 1408, 1386, 1438, 1471, 1535, 1599, 1618, 1586
$ era    <str> 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Sixteenth C', 'Sixteenth C', 'Sixteenth C', 'Sixteenth C'
$ gender <str> 'female', 'male', 'male', 'male', 'female', 'male', 'male', 'male', 'male', 'male'
$ link   <str> 'Marie_de_France', 'Geoffrey_Chaucer', 'John_Gower', 'William_Langland', 'Margery_Kempe', 'Thomas_Malory', 'Thomas_More', 'Edmund_Spenser', 'Walter_Raleigh', 'Philip_Sidney'
$ short  <str> 'Marie d. F.', 'Chaucer', 'Gower', 'Langland', 'Kempe', 'Malory', 'More', 'Spenser', 'Raleigh', 'Sidney'

The doc_id column contains each author’s full name, while short provides a shortened version. The link column gives the Wikipedia URL suffix for each author’s page. We also have birth and death years, a categorical era column, and gender. This mix of structured and semi-structured text gives us plenty of opportunities to practice string manipulation.

7.3 Filtering with Contains

A common task is selecting rows where a text column contains a particular pattern. The contains method checks whether each value in a string column matches a given pattern. Let’s find all authors whose names include “William”:

(
    wiki
    .filter(c.doc_id.str.contains("William"))
)
shape: (4, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"William Langland" 1332 1386 "Early" "male" "William_Langland" "Langland"
"William Shakespeare" 1564 1616 "Sixteenth C" "male" "William_Shakespeare" "Shakespeare"
"William Blake" 1757 1827 "Romantic" "male" "William_Blake" "Blake"
"William Wordsworth" 1770 1850 "Romantic" "male" "William_Wordsworth" "Wordsworth"

The pattern matching is case-sensitive. To find authors from the sixteenth century, we filter on the era column:

(
    wiki
    .filter(c.era.str.contains("Sixteenth"))
)
shape: (6, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"Thomas More" 1478 1535 "Sixteenth C" "male" "Thomas_More" "More"
"Edmund Spenser" 1552 1599 "Sixteenth C" "male" "Edmund_Spenser" "Spenser"
"Walter Raleigh" 1552 1618 "Sixteenth C" "male" "Walter_Raleigh" "Raleigh"
"Philip Sidney" 1554 1586 "Sixteenth C" "male" "Philip_Sidney" "Sidney"
"Christopher Marlowe" 1564 1593 "Sixteenth C" "male" "Christopher_Marlowe" "Marlowe"
"William Shakespeare" 1564 1616 "Sixteenth C" "male" "William_Shakespeare" "Shakespeare"

We can combine conditions on text columns with other conditions using the standard logical operators: & for "and", | for "or", and ~ for "not". Below we find female authors born before 1800:

(
    wiki
    .filter(
        (c.gender == "female") & (c.born < 1800)
    )
)
shape: (11, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"Marie de France" 1160 1215 "Early" "female" "Marie_de_France" "Marie d. F."
"Margery Kempe" 1373 1438 "Early" "female" "Margery_Kempe" "Kempe"
"Emilia Lanier" 1569 1645 "Seventeenth C" "female" "Emilia_Lanier" "Lanier"
"Katherine Philipps" 1632 1664 "Seventeenth C" "female" "Katherine_Philipps" "Philipps"
"Margaret Cavendish" 1623 1673 "Seventeenth C" "female" "Margaret_Cavendish" "Cavendish"
"Mary Robinson" 1757 1800 "Romantic" "female" "Mary_Robinson_(poet)" "Robinson"
"Mary Wollstonecraft" 1759 1797 "Romantic" "female" "Mary_Wollstonecraft" "Wollstonecraft"
"Ann Radcliffe" 1764 1823 "Romantic" "female" "Ann_Radcliffe" "Radcliffe"
"Jane Austen" 1775 1817 "Romantic" "female" "Jane_Austen" "Austen"
"Felicia Hemans" 1793 1835 "Romantic" "female" "Felicia_Hemans" "Hemans"

7.4 Regular Expression Basics

So far we have searched for literal text, but the contains method (and most other string methods in Polars) actually accepts regular expressions by default. Regular expressions are patterns that describe sets of strings. They give us far more flexibility than searching for exact text.

Let’s start with two simple but powerful patterns: ^ matches the start of a string, and $ matches the end. To find all authors whose names start with the letter “M”:

(
    wiki
    .filter(c.doc_id.str.contains(r"^M"))
)
shape: (6, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"Marie de France" 1160 1215 "Early" "female" "Marie_de_France" "Marie d. F."
"Margery Kempe" 1373 1438 "Early" "female" "Margery_Kempe" "Kempe"
"Margaret Cavendish" 1623 1673 "Seventeenth C" "female" "Margaret_Cavendish" "Cavendish"
"Mary Robinson" 1757 1800 "Romantic" "female" "Mary_Robinson_(poet)" "Robinson"
"Mary Wollstonecraft" 1759 1797 "Romantic" "female" "Mary_Wollstonecraft" "Wollstonecraft"
"Matthew Arnold" 1822 1888 "Victorian" "male" "Matthew_Arnold" "Arnold"

The r before the string creates a raw string literal, which is a Python convention for regular expressions. It prevents Python from treating backslashes as escape characters, which becomes important with more complex patterns.
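To see what the raw-string prefix changes, this short sketch compares ordinary and raw string literals in plain Python. No Polars is needed here; the variable names are illustrative.

```python
# In an ordinary string the backslash must be doubled to survive as a backslash.
plain = "\\d+"
# In a raw string the backslash is kept as written.
raw = r"\d+"
print(plain == raw)  # True: both hold the three characters \, d, +

# Without the r prefix, Python interprets \t as a single tab character.
tab_plain = "\t"
tab_raw = r"\t"
print(len(tab_plain), len(tab_raw))  # 1 2
```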

To find names that end with the letter “e”:

(
    wiki
    .filter(c.doc_id.str.contains(r"e$"))
)
shape: (17, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"Marie de France" 1160 1215 "Early" "female" "Marie_de_France" "Marie d. F."
"Margery Kempe" 1373 1438 "Early" "female" "Margery_Kempe" "Kempe"
"Thomas More" 1478 1535 "Sixteenth C" "male" "Thomas_More" "More"
"Christopher Marlowe" 1564 1593 "Sixteenth C" "male" "Christopher_Marlowe" "Marlowe"
"William Shakespeare" 1564 1616 "Sixteenth C" "male" "William_Shakespeare" "Shakespeare"
"Oscar Wilde" 1854 1900 "Victorian" "male" "Oscar_Wilde" "Wilde"
"James Joyce" 1882 1941 "Twentieth C" "male" "James_Joyce" "Joyce"
"D. H. Lawrence" 1885 1930 "Twentieth C" "male" "D._H._Lawrence" "Lawrence"
"A. A. Milne" 1882 1956 "Twentieth C" "male" "A._A._Milne" "Milne"
"Louis MacNeice" 1907 1963 "Twentieth C" "male" "Louis_MacNeice" "MacNeice"

We can combine these anchors with character classes. Square brackets define a set of characters to match. The pattern [A-Z] matches any uppercase letter, while [a-z] matches any lowercase letter. To find names where the first character is lowercase (unusual for English names), we write:

(
    wiki
    .filter(c.doc_id.str.contains(r"^[a-z]"))
)
shape: (0, 7)
doc_id born died era gender link short
str i64 i64 str str str str

No results: every name starts with a capital letter, as expected. But what about names that contain a word starting with a lowercase letter? The pattern \s[a-z] matches a space followed by a lowercase letter:

(
    wiki
    .filter(c.doc_id.str.contains(r"\s[a-z]"))
)
shape: (2, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"Marie de France" 1160 1215 "Early" "female" "Marie_de_France" "Marie d. F."
"Daphne du Maurier" 1907 1989 "Twentieth C" "female" "Daphne_du_Maurier" "Maurier"

This finds “Marie de France” and “Daphne du Maurier,” whose names include the lowercase words “de” and “du.” The \s is a shorthand character class meaning “any whitespace character.” There are several useful shorthands: \d matches any digit, \w matches any “word character” (letters, digits, and underscore), and \S, \D, and \W match the opposite of their lowercase counterparts.

Quantifiers let us specify repetition. The + means “one or more” of the preceding element. The pattern \d+ matches one or more consecutive digits:

(
    wiki
    .filter(c.link.str.contains(r"\d+"))
)
shape: (0, 7)
doc_id born died era gender link short
str i64 i64 str str str str

This would find authors whose Wikipedia link contains digits. None of the links in our dataset do, so the result is empty. Other quantifiers include * (zero or more) and ? (zero or one).

If you want to search for literal text and avoid interpreting special characters as regex operators, set literal=True:

(
    wiki
    .filter(c.era.str.contains("C", literal=True))
)
shape: (36, 7)
doc_id born died era gender link short
str i64 i64 str str str str
"Thomas More" 1478 1535 "Sixteenth C" "male" "Thomas_More" "More"
"Edmund Spenser" 1552 1599 "Sixteenth C" "male" "Edmund_Spenser" "Spenser"
"Walter Raleigh" 1552 1618 "Sixteenth C" "male" "Walter_Raleigh" "Raleigh"
"Philip Sidney" 1554 1586 "Sixteenth C" "male" "Philip_Sidney" "Sidney"
"Christopher Marlowe" 1564 1593 "Sixteenth C" "male" "Christopher_Marlowe" "Marlowe"
"Stephen Spender" 1909 1995 "Twentieth C" "male" "Stephen_Spender" "Spender"
"Christopher Isherwood" 1904 1986 "Twentieth C" "male" "Christopher_Isherwood" "Isherwood"
"Edward Upward" 1903 2009 "Twentieth C" "male" "Edward_Upward" "Upward"
"Rex Warner" 1905 1986 "Twentieth C" "male" "Rex_Warner" "Warner"
"Seamus Heaney" 1939 1939 "Twentieth C" "male" "Seamus_Heaney" "Heaney"

7.5 Aggregating Strings

When working with grouped data, we sometimes want to combine text values into a single string. The join method concatenates all values in a string column, separated by a delimiter you specify. This is particularly useful after grouping.

For example, we can list all authors in each era, separated by commas:

(
    wiki
    .group_by(c.era)
    .agg(
        authors = c.short.sort().str.join(", ")
    )
)
shape: (7, 2)
era authors
str str
"Romantic" "Austen, Blake, Burns, Byron, C…
"Twentieth C" "Auden, Beckett, C. S. Lewis, C…
"Seventeenth C" "Cavendish, Donne, Herbert, Hob…
"Early" "Chaucer, Gower, Kempe, Langlan…
"Sixteenth C" "Marlowe, More, Raleigh, Shakes…
"Restoration" "Boswell, Dryden, Johnson, Lock…
"Victorian" "Arnold, Browning, C. Brontë, D…

Notice that we sorted the names before joining to produce alphabetized lists. You can also call unique() before joining if you want to eliminate duplicates. The combination of sort, unique, and join is a common pattern for producing clean, readable summaries.

7.6 Extracting Text with Patterns

The extract method pulls out the first match of a regular expression pattern. It is typically used inside with_columns to create new columns. For example, we can extract the first word from each author’s full name using the pattern ^\w+, which means “one or more word characters at the start of the string.” By default, extract returns the first capture group (covered in the next section) and yields null when the pattern has none, so we pass a group index of 0 to request the whole match:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        first_name = c.doc_id.str.extract(r"^\w+", 0)
    )
)
shape: (75, 2)
doc_id first_name
str str
"Marie de France" "Marie"
"Geoffrey Chaucer" "Geoffrey"
"John Gower" "John"
"William Langland" "William"
"Margery Kempe" "Margery"
"Stephen Spender" "Stephen"
"Christopher Isherwood" "Christopher"
"Edward Upward" "Edward"
"Rex Warner" "Rex"
"Seamus Heaney" "Seamus"

Similarly, the pattern \w+$ extracts the last word (again passing 0 as the group index so that the whole match is returned):

(
    wiki
    .select(c.doc_id)
    .with_columns(
        last_word = c.doc_id.str.extract(r"\w+$", 0)
    )
)
shape: (75, 2)
doc_id last_word
str str
"Marie de France" "France"
"Geoffrey Chaucer" "Chaucer"
"John Gower" "Gower"
"William Langland" "Langland"
"Margery Kempe" "Kempe"
"Stephen Spender" "Spender"
"Christopher Isherwood" "Isherwood"
"Edward Upward" "Upward"
"Rex Warner" "Warner"
"Seamus Heaney" "Heaney"

7.7 Capture Groups

When a pattern contains parentheses, they create capture groups that let you extract specific parts of a match. By default, extract returns group 1 (the first set of parentheses). You can specify which group to extract by passing a group index as the second argument; index 0 returns the entire match.

Consider a pattern that matches a first name, then anything in the middle, then a last word: ^(\w+).*\s(\w+)$. The first group captures the opening word, and the second captures the final word. We can extract each separately:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        first_name = c.doc_id.str.extract(r"^(\w+).*\s(\w+)$", 1),
        last_word = c.doc_id.str.extract(r"^(\w+).*\s(\w+)$", 2)
    )
)
shape: (75, 3)
doc_id first_name last_word
str str str
"Marie de France" "Marie" "France"
"Geoffrey Chaucer" "Geoffrey" "Chaucer"
"John Gower" "John" "Gower"
"William Langland" "William" "Langland"
"Margery Kempe" "Margery" "Kempe"
"Stephen Spender" "Stephen" "Spender"
"Christopher Isherwood" "Christopher" "Isherwood"
"Edward Upward" "Edward" "Upward"
"Rex Warner" "Rex" "Warner"
"Seamus Heaney" "Seamus" "Heaney"

The . in the pattern matches any character (except a newline, by default), and .* matches zero or more of any character. This is how we skip over the middle portion of the name.

If you want all matches of a pattern rather than just the first, use extract_all, which returns a list:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        capital_letters = c.doc_id.str.extract_all(r"[A-Z]")
    )
)
shape: (75, 2)
doc_id capital_letters
str list[str]
"Marie de France" ["M", "F"]
"Geoffrey Chaucer" ["G", "C"]
"John Gower" ["J", "G"]
"William Langland" ["W", "L"]
"Margery Kempe" ["M", "K"]
"Stephen Spender" ["S", "S"]
"Christopher Isherwood" ["C", "I"]
"Edward Upward" ["E", "U"]
"Rex Warner" ["R", "W"]
"Seamus Heaney" ["S", "H"]

7.8 Replacing Text

The replace method substitutes the first occurrence of a pattern with a new string, while replace_all substitutes every occurrence. Below we standardize the era labels by removing the “ C” suffix:

(
    wiki
    .select(c.era)
    .with_columns(
        era_clean = c.era.str.replace(" C", "")
    )
    .unique()
)
shape: (7, 2)
era era_clean
str str
"Victorian" "Victorian"
"Sixteenth C" "Sixteenth"
"Restoration" "Restoration"
"Twentieth C" "Twentieth"
"Romantic" "Romantic"
"Early" "Early"
"Seventeenth C" "Seventeenth"

Replacements can use capture groups from the pattern. In the replacement string, $1 refers to the first capture group, $2 to the second, and so on. Here we swap the first and last words of each name:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        swapped = c.doc_id.str.replace(r"^(\w+)(.*\s)(\w+)$", "$3$2$1")
    )
)
shape: (75, 2)
doc_id swapped
str str
"Marie de France" "France de Marie"
"Geoffrey Chaucer" "Chaucer Geoffrey"
"John Gower" "Gower John"
"William Langland" "Langland William"
"Margery Kempe" "Kempe Margery"
"Stephen Spender" "Spender Stephen"
"Christopher Isherwood" "Isherwood Christopher"
"Edward Upward" "Upward Edward"
"Rex Warner" "Warner Rex"
"Seamus Heaney" "Heaney Seamus"

7.9 Working with Substrings

Sometimes we need to extract a fixed portion of a string rather than matching a pattern. The slice method returns a substring given a starting position and length. Positions are zero-indexed, meaning the first character is at position 0.

(
    wiki
    .select(c.doc_id)
    .with_columns(
        initials = c.doc_id.str.slice(0, 1)
    )
)
shape: (75, 2)
doc_id initials
str str
"Marie de France" "M"
"Geoffrey Chaucer" "G"
"John Gower" "J"
"William Langland" "W"
"Margery Kempe" "M"
"Stephen Spender" "S"
"Christopher Isherwood" "C"
"Edward Upward" "E"
"Rex Warner" "R"
"Seamus Heaney" "S"

For cleaning whitespace, strip_chars removes leading and trailing whitespace (or any set of characters you specify). The case conversion methods to_lowercase, to_uppercase, and to_titlecase are useful for standardizing text:

(
    wiki
    .select(c.era)
    .with_columns(
        era_upper = c.era.str.to_uppercase(),
        era_lower = c.era.str.to_lowercase()
    )
    .unique()
)
shape: (7, 3)
era era_upper era_lower
str str str
"Seventeenth C" "SEVENTEENTH C" "seventeenth c"
"Early" "EARLY" "early"
"Sixteenth C" "SIXTEENTH C" "sixteenth c"
"Victorian" "VICTORIAN" "victorian"
"Romantic" "ROMANTIC" "romantic"
"Restoration" "RESTORATION" "restoration"
"Twentieth C" "TWENTIETH C" "twentieth c"

7.10 Counting Strings

Several string methods return numeric values rather than text. The len_chars method returns the number of characters in each string:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        name_length = c.doc_id.str.len_chars()
    )
    .sort(c.name_length, descending=True)
)
shape: (75, 2)
doc_id name_length
str u32
"Elizabeth Barrett Browning" 26
"Samuel Taylor Coleridge" 23
"Alfred, Lord Tennyson" 21
"Christopher Isherwood" 21
"Percy Bysshe Shelley" 20
"John Locke" 10
"Lord Byron" 10
"John Clare" 10
"John Keats" 10
"Rex Warner" 10

The count_matches method counts how many times a pattern appears in each string. Here we count the number of spaces in each name; adding one gives the number of words:

(
    wiki
    .select(c.doc_id)
    .with_columns(
        spaces = c.doc_id.str.count_matches(" "),
        word_count = c.doc_id.str.count_matches(" ") + 1
    )
)
shape: (75, 3)
doc_id spaces word_count
str u32 u32
"Marie de France" 2 3
"Geoffrey Chaucer" 1 2
"John Gower" 1 2
"William Langland" 1 2
"Margery Kempe" 1 2
"Stephen Spender" 1 2
"Christopher Isherwood" 1 2
"Edward Upward" 1 2
"Rex Warner" 1 2
"Seamus Heaney" 1 2

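The find method returns the starting index of the first match, or null if the pattern is not found. The index is a byte offset, which coincides with the character position for plain ASCII text: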

(
    wiki
    .select(c.doc_id)
    .with_columns(
        space_pos = c.doc_id.str.find(" ")
    )
)
shape: (75, 2)
doc_id space_pos
str u32
"Marie de France" 5
"Geoffrey Chaucer" 8
"John Gower" 4
"William Langland" 7
"Margery Kempe" 7
"Stephen Spender" 7
"Christopher Isherwood" 11
"Edward Upward" 6
"Rex Warner" 3
"Seamus Heaney" 6

7.11 Splitting Strings

The split method divides a string into a list of substrings based on a delimiter. This creates a list column, which often needs further processing. The most common follow-up is to use explode to expand each list element into its own row.

(
    wiki
    .select(c.doc_id)
    .with_columns(
        name_parts = c.doc_id.str.split(" ")
    )
)
shape: (75, 2)
doc_id name_parts
str list[str]
"Marie de France" ["Marie", "de", "France"]
"Geoffrey Chaucer" ["Geoffrey", "Chaucer"]
"John Gower" ["John", "Gower"]
"William Langland" ["William", "Langland"]
"Margery Kempe" ["Margery", "Kempe"]
"Stephen Spender" ["Stephen", "Spender"]
"Christopher Isherwood" ["Christopher", "Isherwood"]
"Edward Upward" ["Edward", "Upward"]
"Rex Warner" ["Rex", "Warner"]
"Seamus Heaney" ["Seamus", "Heaney"]

When we explode the list column, each word becomes a separate row:

(
    wiki
    .select(c.doc_id, c.short)
    .with_columns(
        name_parts = c.doc_id.str.split(" ")
    )
    .explode(c.name_parts)
)
shape: (164, 3)
doc_id short name_parts
str str str
"Marie de France" "Marie d. F." "Marie"
"Marie de France" "Marie d. F." "de"
"Marie de France" "Marie d. F." "France"
"Geoffrey Chaucer" "Chaucer" "Geoffrey"
"Geoffrey Chaucer" "Chaucer" "Chaucer"
"Edward Upward" "Upward" "Upward"
"Rex Warner" "Warner" "Rex"
"Rex Warner" "Warner" "Warner"
"Seamus Heaney" "Heaney" "Seamus"
"Seamus Heaney" "Heaney" "Heaney"

This pattern is useful for tasks like building a vocabulary of unique words or counting word frequencies across a corpus.

7.12 RegEx Reference

Polars uses Rust-based regular expressions. The full syntax is documented on the regex crate page. Here is a summary of the patterns we have used in this chapter, along with a few additional ones. A full summary is provided in Chapter 21.

7.13 Coming from R or Pandas

If you have used the stringi package in R or the .str accessor in Pandas, the methods here will feel familiar. Polars uses the same general approach of namespacing string operations under .str. The main differences are in function names and the specific regular expression engine: Rust’s regex crate is similar to but not identical to PCRE and, for example, does not support lookaround or backreferences. The core concepts of pattern matching, extraction, and replacement transfer directly.
