7 Strings
7.1 Setup
Load all of the modules and datasets needed for the chapter.
import numpy as np
import polars as pl
from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())
wiki = pl.read_csv("data/wiki_uk_meta.csv.gz", ignore_errors=True)
7.2 Introduction
Text data appears everywhere in data science: names, addresses, product descriptions, social media posts, and scraped web content. Working effectively with strings requires learning a small set of operations that allow us to search, extract, and transform text within a DataFrame. In this chapter, we explore the string methods available in Polars, which provide a consistent and efficient way to manipulate text columns. We will also introduce regular expressions, a powerful pattern-matching language that underlies many of these operations.
To illustrate these methods, we use a dataset of British writers compiled from Wikipedia. The dataset contains metadata about 75 authors spanning from the medieval period to the present day.
wiki.glimpse()
Rows: 75
Columns: 7
$ doc_id <str> 'Marie de France', 'Geoffrey Chaucer', 'John Gower', 'William Langland', 'Margery Kempe', 'Thomas Malory', 'Thomas More', 'Edmund Spenser', 'Walter Raleigh', 'Philip Sidney'
$ born <i64> 1160, 1343, 1330, 1332, 1373, 1405, 1478, 1552, 1552, 1554
$ died <i64> 1215, 1400, 1408, 1386, 1438, 1471, 1535, 1599, 1618, 1586
$ era <str> 'Early', 'Early', 'Early', 'Early', 'Early', 'Early', 'Sixteenth C', 'Sixteenth C', 'Sixteenth C', 'Sixteenth C'
$ gender <str> 'female', 'male', 'male', 'male', 'female', 'male', 'male', 'male', 'male', 'male'
$ link <str> 'Marie_de_France', 'Geoffrey_Chaucer', 'John_Gower', 'William_Langland', 'Margery_Kempe', 'Thomas_Malory', 'Thomas_More', 'Edmund_Spenser', 'Walter_Raleigh', 'Philip_Sidney'
$ short <str> 'Marie d. F.', 'Chaucer', 'Gower', 'Langland', 'Kempe', 'Malory', 'More', 'Spenser', 'Raleigh', 'Sidney'
The doc_id column contains each author’s full name, while short provides a shortened version. The link column gives the Wikipedia URL suffix for each author’s page. We also have birth and death years, a categorical era column, and gender. This mix of structured and semi-structured text gives us plenty of opportunities to practice string manipulation.
7.3 Filtering with Contains
A common task is selecting rows where a text column contains a particular pattern. The contains method checks whether each value in a string column matches a given pattern. Let’s find all authors whose names include “William”:
(
wiki
.filter(c.doc_id.str.contains("William"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "William Langland" | 1332 | 1386 | "Early" | "male" | "William_Langland" | "Langland" |
| "William Shakespeare" | 1564 | 1616 | "Sixteenth C" | "male" | "William_Shakespeare" | "Shakespeare" |
| "William Blake" | 1757 | 1827 | "Romantic" | "male" | "William_Blake" | "Blake" |
| "William Wordsworth" | 1770 | 1850 | "Romantic" | "male" | "William_Wordsworth" | "Wordsworth" |
The pattern matching is case-sensitive. To find authors from the sixteenth century, we filter on the era column:
(
wiki
.filter(c.era.str.contains("Sixteenth"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "Thomas More" | 1478 | 1535 | "Sixteenth C" | "male" | "Thomas_More" | "More" |
| "Edmund Spenser" | 1552 | 1599 | "Sixteenth C" | "male" | "Edmund_Spenser" | "Spenser" |
| "Walter Raleigh" | 1552 | 1618 | "Sixteenth C" | "male" | "Walter_Raleigh" | "Raleigh" |
| "Philip Sidney" | 1554 | 1586 | "Sixteenth C" | "male" | "Philip_Sidney" | "Sidney" |
| "Christopher Marlowe" | 1564 | 1593 | "Sixteenth C" | "male" | "Christopher_Marlowe" | "Marlowe" |
| "William Shakespeare" | 1564 | 1616 | "Sixteenth C" | "male" | "William_Shakespeare" | "Shakespeare" |
Conditions on string columns combine with other conditions through the standard logical operators. Below, an exact-match test on gender is joined with a numeric test on born to find female authors born before 1800:
(
wiki
.filter(
(c.gender == "female") & (c.born < 1800)
)
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "Marie de France" | 1160 | 1215 | "Early" | "female" | "Marie_de_France" | "Marie d. F." |
| "Margery Kempe" | 1373 | 1438 | "Early" | "female" | "Margery_Kempe" | "Kempe" |
| "Emilia Lanier" | 1569 | 1645 | "Seventeenth C" | "female" | "Emilia_Lanier" | "Lanier" |
| "Katherine Philipps" | 1632 | 1664 | "Seventeenth C" | "female" | "Katherine_Philipps" | "Philipps" |
| "Margaret Cavendish" | 1623 | 1673 | "Seventeenth C" | "female" | "Margaret_Cavendish" | "Cavendish" |
| … | … | … | … | … | … | … |
| "Mary Robinson" | 1757 | 1800 | "Romantic" | "female" | "Mary_Robinson_(poet)" | "Robinson" |
| "Mary Wollstonecraft" | 1759 | 1797 | "Romantic" | "female" | "Mary_Wollstonecraft" | "Wollstonecraft" |
| "Ann Radcliffe" | 1764 | 1823 | "Romantic" | "female" | "Ann_Radcliffe" | "Radcliffe" |
| "Jane Austen" | 1775 | 1817 | "Romantic" | "female" | "Jane_Austen" | "Austen" |
| "Felicia Hemans" | 1793 | 1835 | "Romantic" | "female" | "Felicia_Hemans" | "Hemans" |
7.4 Regular Expression Basics
So far we have searched for literal text, but the contains method (and most other string methods in Polars) actually accepts regular expressions by default. Regular expressions are patterns that describe sets of strings. They give us far more flexibility than searching for exact text.
Let’s start with two simple but powerful patterns: ^ matches the start of a string, and $ matches the end. To find all authors whose names start with the letter “M”:
(
wiki
.filter(c.doc_id.str.contains(r"^M"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "Marie de France" | 1160 | 1215 | "Early" | "female" | "Marie_de_France" | "Marie d. F." |
| "Margery Kempe" | 1373 | 1438 | "Early" | "female" | "Margery_Kempe" | "Kempe" |
| "Margaret Cavendish" | 1623 | 1673 | "Seventeenth C" | "female" | "Margaret_Cavendish" | "Cavendish" |
| "Mary Robinson" | 1757 | 1800 | "Romantic" | "female" | "Mary_Robinson_(poet)" | "Robinson" |
| "Mary Wollstonecraft" | 1759 | 1797 | "Romantic" | "female" | "Mary_Wollstonecraft" | "Wollstonecraft" |
| "Matthew Arnold" | 1822 | 1888 | "Victorian" | "male" | "Matthew_Arnold" | "Arnold" |
The r before the string creates a raw string literal, which is a Python convention for regular expressions. It prevents Python from treating backslashes as escape characters, which becomes important with more complex patterns.
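The effect of the r prefix can be checked in plain Python, independent of Polars. This is only a small illustration of how the two kinds of string literal differ:

```python
# In a normal string "\n" is one character (a newline);
# in a raw string the backslash and the "n" stay as two characters.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))

# Without the r prefix, every regex backslash has to be doubled.
doubled = "\\s[a-z]"
raw_pattern = r"\s[a-z]"
print(doubled == raw_pattern)
```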
To find names that end with the letter “e”:
(
wiki
.filter(c.doc_id.str.contains(r"e$"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "Marie de France" | 1160 | 1215 | "Early" | "female" | "Marie_de_France" | "Marie d. F." |
| "Margery Kempe" | 1373 | 1438 | "Early" | "female" | "Margery_Kempe" | "Kempe" |
| "Thomas More" | 1478 | 1535 | "Sixteenth C" | "male" | "Thomas_More" | "More" |
| "Christopher Marlowe" | 1564 | 1593 | "Sixteenth C" | "male" | "Christopher_Marlowe" | "Marlowe" |
| "William Shakespeare" | 1564 | 1616 | "Sixteenth C" | "male" | "William_Shakespeare" | "Shakespeare" |
| … | … | … | … | … | … | … |
| "Oscar Wilde" | 1854 | 1900 | "Victorian" | "male" | "Oscar_Wilde" | "Wilde" |
| "James Joyce" | 1882 | 1941 | "Twentieth C" | "male" | "James_Joyce" | "Joyce" |
| "D. H. Lawrence" | 1885 | 1930 | "Twentieth C" | "male" | "D._H._Lawrence" | "Lawrence" |
| "A. A. Milne" | 1882 | 1956 | "Twentieth C" | "male" | "A._A._Milne" | "Milne" |
| "Louis MacNeice" | 1907 | 1963 | "Twentieth C" | "male" | "Louis_MacNeice" | "MacNeice" |
We can combine these anchors with character classes. Square brackets define a set of characters to match. The pattern [A-Z] matches any uppercase letter, while [a-z] matches any lowercase letter. To find names where the first character is lowercase (unusual for English names), we write:
(
wiki
.filter(c.doc_id.str.contains(r"^[a-z]"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
No results—every name starts with a capital letter, as expected. But what about names that contain a word starting with a lowercase letter? The pattern \s[a-z] matches a space followed by a lowercase letter:
(
wiki
.filter(c.doc_id.str.contains(r"\s[a-z]"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "Marie de France" | 1160 | 1215 | "Early" | "female" | "Marie_de_France" | "Marie d. F." |
| "Daphne du Maurier" | 1907 | 1989 | "Twentieth C" | "female" | "Daphne_du_Maurier" | "Maurier" |
This finds “Marie de France” and “Daphne du Maurier,” whose names include the lowercase words “de” and “du.” The \s is a shorthand character class meaning “any whitespace character.” There are several useful shorthands: \d matches any digit, \w matches any “word character” (letters, digits, and underscore), and \S, \D, and \W match the opposite of their lowercase counterparts.
Quantifiers let us specify repetition. The + means “one or more” of the preceding element. The pattern \d+ matches one or more consecutive digits:
(
wiki
.filter(c.link.str.contains(r"\d+"))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
No link in this dataset contains a digit, so the result is empty. A link such as “Mary_I_of_England” would not match either, because the “I” is a letter rather than a digit. Other quantifiers include * (zero or more) and ? (zero or one).
If you want to search for literal text and avoid interpreting special characters as regex operators, set literal=True:
(
wiki
.filter(c.era.str.contains("C", literal=True))
)

| doc_id | born | died | era | gender | link | short |
|---|---|---|---|---|---|---|
| str | i64 | i64 | str | str | str | str |
| "Thomas More" | 1478 | 1535 | "Sixteenth C" | "male" | "Thomas_More" | "More" |
| "Edmund Spenser" | 1552 | 1599 | "Sixteenth C" | "male" | "Edmund_Spenser" | "Spenser" |
| "Walter Raleigh" | 1552 | 1618 | "Sixteenth C" | "male" | "Walter_Raleigh" | "Raleigh" |
| "Philip Sidney" | 1554 | 1586 | "Sixteenth C" | "male" | "Philip_Sidney" | "Sidney" |
| "Christopher Marlowe" | 1564 | 1593 | "Sixteenth C" | "male" | "Christopher_Marlowe" | "Marlowe" |
| … | … | … | … | … | … | … |
| "Stephen Spender" | 1909 | 1995 | "Twentieth C" | "male" | "Stephen_Spender" | "Spender" |
| "Christopher Isherwood" | 1904 | 1986 | "Twentieth C" | "male" | "Christopher_Isherwood" | "Isherwood" |
| "Edward Upward" | 1903 | 2009 | "Twentieth C" | "male" | "Edward_Upward" | "Upward" |
| "Rex Warner" | 1905 | 1986 | "Twentieth C" | "male" | "Rex_Warner" | "Warner" |
| "Seamus Heaney" | 1939 | 1939 | "Twentieth C" | "male" | "Seamus_Heaney" | "Heaney" |
7.5 Aggregating Strings
When working with grouped data, we sometimes want to combine text values into a single string. The join method concatenates all values in a string column, separated by a delimiter you specify. This is particularly useful after grouping.
For example, we can list all authors in each era, separated by commas:
(
wiki
.group_by(c.era)
.agg(
authors = c.short.sort().str.join(", ")
)
)

| era | authors |
|---|---|
| str | str |
| "Romantic" | "Austen, Blake, Burns, Byron, C… |
| "Twentieth C" | "Auden, Beckett, C. S. Lewis, C… |
| "Seventeenth C" | "Cavendish, Donne, Herbert, Hob… |
| "Early" | "Chaucer, Gower, Kempe, Langlan… |
| "Sixteenth C" | "Marlowe, More, Raleigh, Shakes… |
| "Restoration" | "Boswell, Dryden, Johnson, Lock… |
| "Victorian" | "Arnold, Browning, C. Brontë, D… |
Notice that we sorted the names before joining to produce alphabetized lists. You can also call unique() before joining if you want to eliminate duplicates. The combination of sort, unique, and join is a common pattern for producing clean, readable summaries.
7.6 Extracting Text with Patterns
The extract method pulls out the first match of a regular expression pattern. It is typically used inside with_columns to create new columns. Its second argument, group_index, selects which part of the match to return and defaults to 1, the first capture group (covered in the next section). The patterns in this section have no capture groups, so we pass group_index=0 to return the whole match; with the default, every value would come back as null. For example, we can extract the first word from each author’s full name using the pattern ^\w+, which means “one or more word characters at the start of the string”:
(
wiki
.select(c.doc_id)
.with_columns(
first_name = c.doc_id.str.extract(r"^\w+", group_index=0)
)
)

| doc_id | first_name |
|---|---|
| str | str |
| "Marie de France" | "Marie" |
| "Geoffrey Chaucer" | "Geoffrey" |
| "John Gower" | "John" |
| "William Langland" | "William" |
| "Margery Kempe" | "Margery" |
| … | … |
| "Stephen Spender" | "Stephen" |
| "Christopher Isherwood" | "Christopher" |
| "Edward Upward" | "Edward" |
| "Rex Warner" | "Rex" |
| "Seamus Heaney" | "Seamus" |
Similarly, the pattern \w+$ extracts the last word:
(
wiki
.select(c.doc_id)
.with_columns(
last_word = c.doc_id.str.extract(r"\w+$", group_index=0)
)
)

| doc_id | last_word |
|---|---|
| str | str |
| "Marie de France" | "France" |
| "Geoffrey Chaucer" | "Chaucer" |
| "John Gower" | "Gower" |
| "William Langland" | "Langland" |
| "Margery Kempe" | "Kempe" |
| … | … |
| "Stephen Spender" | "Spender" |
| "Christopher Isherwood" | "Isherwood" |
| "Edward Upward" | "Upward" |
| "Rex Warner" | "Warner" |
| "Seamus Heaney" | "Heaney" |
7.7 Capture Groups
When a pattern contains parentheses, they create capture groups that let you extract specific parts of a match. By default, extract returns group 1 (the first set of parentheses). You can specify which group to extract by passing a second argument.
Consider a pattern that matches a first name, then anything in the middle, then a last word: ^(\w+).*\s(\w+)$. The first group captures the opening word, and the second captures the final word. We can extract each separately:
(
wiki
.select(c.doc_id)
.with_columns(
first_name = c.doc_id.str.extract(r"^(\w+).*\s(\w+)$", 1),
last_word = c.doc_id.str.extract(r"^(\w+).*\s(\w+)$", 2)
)
)

| doc_id | first_name | last_word |
|---|---|---|
| str | str | str |
| "Marie de France" | "Marie" | "France" |
| "Geoffrey Chaucer" | "Geoffrey" | "Chaucer" |
| "John Gower" | "John" | "Gower" |
| "William Langland" | "William" | "Langland" |
| "Margery Kempe" | "Margery" | "Kempe" |
| … | … | … |
| "Stephen Spender" | "Stephen" | "Spender" |
| "Christopher Isherwood" | "Christopher" | "Isherwood" |
| "Edward Upward" | "Edward" | "Upward" |
| "Rex Warner" | "Rex" | "Warner" |
| "Seamus Heaney" | "Seamus" | "Heaney" |
The . in the pattern matches any character, and .* matches zero or more of any character—this is how we skip over the middle portion of the name.
If you want all matches of a pattern rather than just the first, use extract_all, which returns a list:
(
wiki
.select(c.doc_id)
.with_columns(
capital_letters = c.doc_id.str.extract_all(r"[A-Z]")
)
)

| doc_id | capital_letters |
|---|---|
| str | list[str] |
| "Marie de France" | ["M", "F"] |
| "Geoffrey Chaucer" | ["G", "C"] |
| "John Gower" | ["J", "G"] |
| "William Langland" | ["W", "L"] |
| "Margery Kempe" | ["M", "K"] |
| … | … |
| "Stephen Spender" | ["S", "S"] |
| "Christopher Isherwood" | ["C", "I"] |
| "Edward Upward" | ["E", "U"] |
| "Rex Warner" | ["R", "W"] |
| "Seamus Heaney" | ["S", "H"] |
7.8 Replacing Text
The replace method substitutes the first occurrence of a pattern with a new string, while replace_all substitutes every occurrence. Below we standardize the era labels by removing the “ C” suffix:
(
wiki
.select(c.era)
.with_columns(
era_clean = c.era.str.replace(" C", "")
)
.unique()
)

| era | era_clean |
|---|---|
| str | str |
| "Victorian" | "Victorian" |
| "Sixteenth C" | "Sixteenth" |
| "Restoration" | "Restoration" |
| "Twentieth C" | "Twentieth" |
| "Romantic" | "Romantic" |
| "Early" | "Early" |
| "Seventeenth C" | "Seventeenth" |
Replacements can use capture groups from the pattern. In the replacement string, $1 refers to the first capture group, $2 to the second, and so on. Here we swap the first and last words of each name:
(
wiki
.select(c.doc_id)
.with_columns(
swapped = c.doc_id.str.replace(r"^(\w+)(.*\s)(\w+)$", "$3$2$1")
)
)

| doc_id | swapped |
|---|---|
| str | str |
| "Marie de France" | "France de Marie" |
| "Geoffrey Chaucer" | "Chaucer Geoffrey" |
| "John Gower" | "Gower John" |
| "William Langland" | "Langland William" |
| "Margery Kempe" | "Kempe Margery" |
| … | … |
| "Stephen Spender" | "Spender Stephen" |
| "Christopher Isherwood" | "Isherwood Christopher" |
| "Edward Upward" | "Upward Edward" |
| "Rex Warner" | "Warner Rex" |
| "Seamus Heaney" | "Heaney Seamus" |
7.9 Working with Substrings
Sometimes we need to extract a fixed portion of a string rather than matching a pattern. The slice method returns a substring given a starting position and length. Positions are zero-indexed, meaning the first character is at position 0. Here we take the first character of each name:
(
wiki
.select(c.doc_id)
.with_columns(
initials = c.doc_id.str.slice(0, 1)
)
)

| doc_id | initials |
|---|---|
| str | str |
| "Marie de France" | "M" |
| "Geoffrey Chaucer" | "G" |
| "John Gower" | "J" |
| "William Langland" | "W" |
| "Margery Kempe" | "M" |
| … | … |
| "Stephen Spender" | "S" |
| "Christopher Isherwood" | "C" |
| "Edward Upward" | "E" |
| "Rex Warner" | "R" |
| "Seamus Heaney" | "S" |
For cleaning whitespace, strip_chars removes leading and trailing whitespace (or other characters you specify). The case conversion methods to_lowercase, to_uppercase, and to_titlecase are useful for standardizing text:
(
wiki
.select(c.era)
.with_columns(
era_upper = c.era.str.to_uppercase(),
era_lower = c.era.str.to_lowercase()
)
.unique()
)

| era | era_upper | era_lower |
|---|---|---|
| str | str | str |
| "Seventeenth C" | "SEVENTEENTH C" | "seventeenth c" |
| "Early" | "EARLY" | "early" |
| "Sixteenth C" | "SIXTEENTH C" | "sixteenth c" |
| "Victorian" | "VICTORIAN" | "victorian" |
| "Romantic" | "ROMANTIC" | "romantic" |
| "Restoration" | "RESTORATION" | "restoration" |
| "Twentieth C" | "TWENTIETH C" | "twentieth c" |
7.10 Counting Strings
Several string methods return numeric values rather than text. The len_chars method returns the number of characters in each string:
(
wiki
.select(c.doc_id)
.with_columns(
name_length = c.doc_id.str.len_chars()
)
.sort(c.name_length, descending=True)
)

| doc_id | name_length |
|---|---|
| str | u32 |
| "Elizabeth Barrett Browning" | 26 |
| "Samuel Taylor Coleridge" | 23 |
| "Alfred, Lord Tennyson" | 21 |
| "Christopher Isherwood" | 21 |
| "Percy Bysshe Shelley" | 20 |
| … | … |
| "John Locke" | 10 |
| "Lord Byron" | 10 |
| "John Clare" | 10 |
| "John Keats" | 10 |
| "Rex Warner" | 10 |
The count_matches method counts how many times a pattern appears in each string. Here we count the number of spaces in each name; adding one to that count gives the number of words:
(
wiki
.select(c.doc_id)
.with_columns(
spaces = c.doc_id.str.count_matches(" "),
word_count = c.doc_id.str.count_matches(" ") + 1
)
)

| doc_id | spaces | word_count |
|---|---|---|
| str | u32 | u32 |
| "Marie de France" | 2 | 3 |
| "Geoffrey Chaucer" | 1 | 2 |
| "John Gower" | 1 | 2 |
| "William Langland" | 1 | 2 |
| "Margery Kempe" | 1 | 2 |
| … | … | … |
| "Stephen Spender" | 1 | 2 |
| "Christopher Isherwood" | 1 | 2 |
| "Edward Upward" | 1 | 2 |
| "Rex Warner" | 1 | 2 |
| "Seamus Heaney" | 1 | 2 |
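The find method returns the starting index of the first match, or null if the pattern is not found. The index is a byte offset, which coincides with the character position whenever the text before the match is plain ASCII, as it is here: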
(
wiki
.select(c.doc_id)
.with_columns(
space_pos = c.doc_id.str.find(" ")
)
)

| doc_id | space_pos |
|---|---|
| str | u32 |
| "Marie de France" | 5 |
| "Geoffrey Chaucer" | 8 |
| "John Gower" | 4 |
| "William Langland" | 7 |
| "Margery Kempe" | 7 |
| … | … |
| "Stephen Spender" | 7 |
| "Christopher Isherwood" | 11 |
| "Edward Upward" | 6 |
| "Rex Warner" | 3 |
| "Seamus Heaney" | 6 |
7.11 Splitting Strings
The split method divides a string into a list of substrings based on a delimiter. This creates a list column, which often needs further processing. The most common follow-up is to use explode to expand each list element into its own row.
(
wiki
.select(c.doc_id)
.with_columns(
name_parts = c.doc_id.str.split(" ")
)
)

| doc_id | name_parts |
|---|---|
| str | list[str] |
| "Marie de France" | ["Marie", "de", "France"] |
| "Geoffrey Chaucer" | ["Geoffrey", "Chaucer"] |
| "John Gower" | ["John", "Gower"] |
| "William Langland" | ["William", "Langland"] |
| "Margery Kempe" | ["Margery", "Kempe"] |
| … | … |
| "Stephen Spender" | ["Stephen", "Spender"] |
| "Christopher Isherwood" | ["Christopher", "Isherwood"] |
| "Edward Upward" | ["Edward", "Upward"] |
| "Rex Warner" | ["Rex", "Warner"] |
| "Seamus Heaney" | ["Seamus", "Heaney"] |
When we explode the list column, each word becomes a separate row:
(
wiki
.select(c.doc_id, c.short)
.with_columns(
name_parts = c.doc_id.str.split(" ")
)
.explode(c.name_parts)
)

| doc_id | short | name_parts |
|---|---|---|
| str | str | str |
| "Marie de France" | "Marie d. F." | "Marie" |
| "Marie de France" | "Marie d. F." | "de" |
| "Marie de France" | "Marie d. F." | "France" |
| "Geoffrey Chaucer" | "Chaucer" | "Geoffrey" |
| "Geoffrey Chaucer" | "Chaucer" | "Chaucer" |
| … | … | … |
| "Edward Upward" | "Upward" | "Upward" |
| "Rex Warner" | "Warner" | "Rex" |
| "Rex Warner" | "Warner" | "Warner" |
| "Seamus Heaney" | "Heaney" | "Seamus" |
| "Seamus Heaney" | "Heaney" | "Heaney" |
This pattern is useful for tasks like building a vocabulary of unique words or counting word frequencies across a corpus.
7.12 RegEx Reference
Polars uses Rust-based regular expressions. The full syntax is documented on the regex crate page. Here is a summary of the patterns we have used in this chapter, along with a few additional ones. A full summary is provided in Chapter 21.
7.13 Coming from R or Pandas
If you have used the stringi package in R or the .str accessor in Pandas, the methods here will feel familiar. Polars uses the same general approach of namespacing string operations under .str. The main differences are in function names and the specific regular expression engine (Rust’s regex crate, which is similar to but not identical to PCRE). The core concepts of pattern matching, extraction, and replacement transfer directly.