import pandas as pd

from siuba.data import penguins
from siuba import _, mutate, summarize, group_by, filter

# Example data used throughout this page
fruits = pd.Series([
    "apple",
    "apricot",
    "avocado",
    "banana",
    "bell pepper"
])

df_fruits = pd.DataFrame({"name": fruits})
String operations (str) 📝
Overview
String operations allow you to perform actions like:
- Match: detect when a string matches a pattern.
- Transform: e.g. convert something from mIxED to lower case, or replace part of it.
- Extract: grab specific parts of a string value (e.g. a matching pattern).
This page covers different methods for performing these actions, but ultimately focuses on str.contains(), str.replace(), and str.extract() for common match, transform, and extract tasks.
Using string methods
siuba uses pandas methods, so you can use any of the string methods pandas makes available, like .str.upper().
fruits.str.upper()
0 APPLE
1 APRICOT
2 AVOCADO
3 BANANA
4 BELL PEPPER
dtype: object
Note that most string methods use .str.<method_name>() syntax. These are called “string accessor methods”, since they are accessed from a special place (.str).
Using in verbs
Use string methods as you would any other methods inside verbs.
mutate(df_fruits, loud = _.name.str.upper())
 | name | loud |
---|---|---|
0 | apple | APPLE |
1 | apricot | APRICOT |
2 | avocado | AVOCADO |
3 | banana | BANANA |
4 | bell pepper | BELL PEPPER |
Matching patterns
Fixed text
There are three common approaches for simple string matches:
- An exact match with ==.
- A match from an anchor point, using str.startswith() or str.endswith().
- A match from any point, using str.contains().
# exact match
fruits == "banana"

# starts with "ap"
fruits.str.startswith("ap")

# ends with "cado"
fruits.str.endswith("cado")

# has an "e" anywhere
fruits.str.contains("e", regex=False)
0 True
1 False
2 False
3 False
4 True
dtype: bool
All these operations return a boolean Series, so they can be used to filter rows.
filter(df_fruits, _.name.str.startswith("ap"))
 | name |
---|---|
0 | apple |
1 | apricot |
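For comparison, here is a minimal sketch of the same filter written with plain pandas boolean indexing. It rebuilds df_fruits so it runs on its own; the names starts_with_ap and ap_rows are only illustrative.

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {"name": ["apple", "apricot", "avocado", "banana", "bell pepper"]}
)

# Boolean Series: True where the name starts with "ap"
starts_with_ap = df_fruits["name"].str.startswith("ap")

# Indexing with the boolean Series keeps only the True rows
ap_rows = df_fruits[starts_with_ap]
print(ap_rows)
```

The boolean Series is the same kind of object that the siuba filter() call above receives from _.name.str.startswith("ap").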
str.contains() patterns
Use str.contains(...) to perform matches with regular expressions, a special string syntax for specifying patterns to match.
For example, you can use "^" or "$" to match the start or end of a string, respectively.
# check if starts with "ap" ----
penguins.species.str.contains("^ap")
0 False
1 False
...
342 False
343 False
Name: species, Length: 344, dtype: bool
# check if ends with "a" ----
penguins.species.str.contains("a$")
0 False
1 False
...
342 False
343 False
Name: species, Length: 344, dtype: bool
Note that "$" and "^" are called anchor points.
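Since the penguins output above shows only False values, a small sketch on the fruits data may make the effect of anchors easier to see. It assumes the same five fruit names used earlier on this page; the variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# "^ap" only matches "ap" at the very start of the string
starts_ap = fruits.str.contains("^ap")

# "a$" only matches an "a" at the very end of the string
ends_a = fruits.str.contains("a$")

# Without an anchor, "a" may match anywhere in the string
has_a = fruits.str.contains("a")

print(pd.DataFrame({"starts_ap": starts_ap, "ends_a": ends_a, "has_a": has_a}))
```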
Transforming strings
String transformations take a string and return a new, changed version. For example, you can convert all the letters to lower, upper, or title case.
Simple transformations
fruits.str.lower()
fruits.str.upper()
0 APPLE
1 APRICOT
2 AVOCADO
3 BANANA
4 BELL PEPPER
dtype: object
str.replace() patterns
Use .str.replace(..., regex=True) with regular expressions to replace patterns in strings.
For example, the code below uses "p.", where . is called a wildcard, which matches any character.
fruits.str.replace("p.", "XX", regex=True)
0 aXXle
1 aXXicot
2 avocado
3 banana
4 bell XXXXer
dtype: object
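To show what the regex argument changes, the sketch below runs the same "p." pattern with regex=False and regex=True. The labels Series is made-up data chosen so that the two modes give different results.

```python
import pandas as pd

# Hypothetical data: one entry contains a literal "p.", the other does not
labels = pd.Series(["p.s.", "pass"])

# regex=False: "p." is literal text, so only an actual "p." is replaced
literal = labels.str.replace("p.", "XX", regex=False)

# regex=True: "." is a wildcard, so "p" plus any one character is replaced
pattern = labels.str.replace("p.", "XX", regex=True)

print(literal.tolist())
print(pattern.tolist())
```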
Extracting parts
.str[] to slice
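This section has no example in the source, so the following is only a minimal sketch of how .str[] indexing behaves in pandas: it applies ordinary Python indexing or slicing to each string. The variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# First three characters of each entry
first_three = fruits.str[:3]

# Last character of each entry
last_char = fruits.str[-1]

print(first_three.tolist())
print(last_char.tolist())
```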
.str.extract() patterns
Use str.extract() with a regular expression to pull out a matching piece of text.
For example, the regular expression "a(.*)" contains the following pieces:
- a matches the literal letter “a”.
- .* has a . which matches anything, and * which modifies it to apply 0 or more times.
- The parentheses around .* mark the part of the match that gets extracted.
fruits.str.extract("a(.*)")
 | 0 |
---|---|
0 | pple |
1 | pricot |
2 | vocado |
3 | nana |
4 | NaN |
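The result above is a DataFrame with one column per parenthesized group. As a hedged sketch of two common variations in pandas: expand=False returns a Series when there is a single group, and named groups become column names. The group names first and rest are made up for illustration.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# One group with expand=False gives a Series instead of a DataFrame
after_a = fruits.str.extract("a(.*)", expand=False)

# Two named groups give a DataFrame with columns "first" and "rest"
parts = fruits.str.extract("(?P<first>.)(?P<rest>.*)")

print(after_a)
print(parts)
```

Entries with no match, like "bell pepper" for the first pattern, come back as missing values.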
Split and flatten
.str.split() into list-entries
Use .str.split() to split each entry on a character, producing a list per row of split strings.
fruits.str.split("pp")
0 [a, le]
1 [apricot]
2 [avocado]
3 [banana]
4 [bell pe, er]
dtype: object
Seeing each entry as a list may be surprising, since list-entries are fairly rare in pandas.
.str.join() is the inverse of split
penguins.species.str.split("e").str.join("e")
0 Adelie
1 Adelie
...
342 Chinstrap
343 Chinstrap
Name: species, Length: 344, dtype: object
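A small sketch of the same round trip on the fruits data, plus a join with a different separator to show that .str.join() does not have to reuse the string that was split on. Variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# Split on "pp", giving one list per row
pieces = fruits.str.split("pp")

# Joining with the same separator restores the original strings
rejoined = pieces.str.join("pp")

# Joining with another separator changes the text between the pieces
dashed = pieces.str.join("-")

print(rejoined.tolist())
print(dashed.tolist())
```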
.explode() to unnest entries
Use .explode() to take a column with list-entries (like those returned by .str.split()) and unnest each entry, so there is 1 row for each element in each list.
splits = fruits.str.split("pp")
splits
0 [a, le]
1 [apricot]
2 [avocado]
3 [banana]
4 [bell pe, er]
dtype: object
Notice that the result above has 5 list-entries (rows). The first and last rows each hold two elements, ["a", "le"] and ["bell pe", "er"], so there are 7 elements total.
The .explode() method makes each of the 7 elements its own row.
splits.explode()
0 a
0 le
...
4 bell pe
4 er
Length: 7, dtype: object
Be careful to note that it’s .explode() and not .str.explode(), since it can be used on lists of other things as well!
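The exploded output keeps the original row labels, which is why 0 and 4 appear twice. The sketch below makes that visible and shows one way to renumber the rows with reset_index(drop=True), if the repeated labels are not wanted. Variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

splits = fruits.str.split("pp")

# Each list element becomes its own row, labelled with its source row
exploded = splits.explode()

# Optional: replace the repeated labels with a fresh 0..n-1 index
renumbered = exploded.reset_index(drop=True)

print(exploded.index.tolist())
print(renumbered.index.tolist())
```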
.str.findall() for advanced splitting
Use .str.findall() to collect every match of a regular expression in each string, as a list per row.
For example, the code below uses "pp?", where ? means the preceding character (“p”) is optional for matching:
fruits.str.findall("pp?")
0 [pp]
1 [p]
2 []
3 []
4 [p, pp]
dtype: object
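If only the number of matches is needed, pandas also provides .str.count(), which takes the same kind of pattern. A minimal sketch with the same "pp?" pattern, placed next to .str.findall() for comparison:

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# All matches of "p" optionally followed by a second "p"
matches = fruits.str.findall("pp?")

# Number of matches per row, without building the lists
n_matches = fruits.str.count("pp?")

print(matches.tolist())
print(n_matches.tolist())
```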
More regular expressions
Anchor points
- ^ matches the beginning of a string.
- $ matches the end of a string.
Repetition qualifiers
- * matches 0 or more
- + matches 1 or more
- ? matches 0 or 1
Grouping
()
{}
[]
Alternatives
|
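The grouping and alternative symbols above are listed without descriptions. The sketch below shows how they are usually read in Python's re module, whose syntax pandas string methods follow: () groups part of a pattern, {} sets a repeat count, [] matches one character from a set, and | separates alternatives. The helper matches() and the chosen patterns are illustrative, not taken from the source.

```python
import re

fruits = ["apple", "apricot", "avocado", "banana", "bell pepper"]


def matches(pattern):
    """Return one True/False per fruit: does the pattern match anywhere?"""
    return [bool(re.search(pattern, fruit)) for fruit in fruits]


# () groups "na" so that + applies to the whole group
ends_with_na_repeats = matches("(na)+$")

# {2} requires exactly two "p" characters in a row
double_p = matches("p{2}")

# [ae] matches a single character that is either "a" or "e"
b_then_a_or_e = matches("^b[ae]")

# | matches either the pattern on its left or the one on its right
cot_or_cado = matches("cot$|cado$")

print(ends_with_na_repeats, double_p, b_then_a_or_e, cot_or_cado)
```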