String operations (str) đź“ť

Warning

This page is largely complete, but is actively being refined / improved.

Overview

String operations allow you to perform actions like:

  • Match: detect when a string matches a pattern.
  • Transform: e.g. convert something from mIxED to lower case, or replace part of it.
  • Extract: grab specific parts of string value (e.g. a matching pattern).

This page will cover different methods for performing these actions, but will ultimately focus on str.contains(), str.replace(), and str.extract() for common match, transform, and extract tasks.

from siuba.data import penguins
from siuba import _, mutate, summarize, group_by, filter

fruits = pd.Series([
        "apple",
        "apricot",
        "avocado",
        "banana",
        "bell pepper"
])

df_fruits = pd.DataFrame({"name": fruits})

Using string methods

siuba uses Pandas methods, so can use any of the string methods it makes available, like .str.upper().

fruits.str.upper()
0          APPLE
1        APRICOT
2        AVOCADO
3         BANANA
4    BELL PEPPER
dtype: object

Note that most string methods use .str.<method_name>() syntax. These are called “string accessor methods”, since they are accessed from a special place (.str).

Using in verbs

Use string methods as you would any other methods inside verbs.

mutate(df_fruits, loud = _.name.str.upper())
name loud
0 apple APPLE
1 apricot APRICOT
2 avocado AVOCADO
3 banana BANANA
4 bell pepper BELL PEPPER

Matching patterns

Fixed text

There are three common approaches for simple string matches:

  • An exact match with ==.
  • A match from an anchor point, using str.startswith() or str.endswith().
  • A match from any point, using str.contains()
# exact match
fruits == "banana"

# starts with "ap"
fruits.str.startswith("ap")

# ends with "cado"
fruits.str.endswith("cado")

# has an "e" anywhere
fruits.str.contains("e", regex=False)
0     True
1    False
2    False
3    False
4     True
dtype: bool

All these operations return a boolean Series, so can be used to filter rows.

filter(df_fruits, _.name.str.startswith("ap"))
name
0 apple
1 apricot
Warning

Note that for str.contains() we set the regex=False argument. This is because—unlike operations like str.startswith()—pandas by default assumes you are passing something called a regular expression to str.contains().

str.contains() patterns

Use str.contains(...) to perform matches with regular expressions—a special string syntax for specifying patterns to match.

For example, you can use "^" or "$" to match the start or end of a string, respectively.

# check if starts with "ap" ----

penguins.species.str.contains("^ap")
0      False
1      False
       ...  
342    False
343    False
Name: species, Length: 344, dtype: bool
# check if endswith with "a" ----

penguins.species.str.contains("a$")
0      False
1      False
       ...  
342    False
343    False
Name: species, Length: 344, dtype: bool

Note that "$" and "^" are called anchor points.

Transforming strings

String transformations take a string and return a new, changed version. For example, by converting all the letters to lower, upper, or title case.

Simple transformations

fruits.str.lower()

fruits.str.upper()
0          APPLE
1        APRICOT
2        AVOCADO
3         BANANA
4    BELL PEPPER
dtype: object

str.replace() patterns

Use .str.replace(..., regex=True) with regular expressions to replace patterns in strings.

For example, the code below uses "p.", where . is called a wildcard–which matches any character.

fruits.str.replace("p.", "XX", regex=True)
0          aXXle
1        aXXicot
2        avocado
3         banana
4    bell XXXXer
dtype: object

Extracting parts

.str[] to slice

Warning

It is currently not possible to apply a sequence of slices to .str. You can only apply the same slice to every string in the Series.

.str.extract() patterns

Use str.extract() with a regular expression to pull out a matching piece of text.

For example the regular expression “^(.*) ” contains the following pieces:

  • a matches the literal letter “a”
  • .* has a . which matches anything, and * which modifies it to apply 0 or more times.
fruits.str.extract("a(.*)")
0
0 pple
1 pricot
2 vocado
3 nana
4 NaN

Split and flatten

.str.split() into list-entries

Use .str.split() to split each entry on a character, producing a list per row of split strings.

fruits.str.split("pp")
0          [a, le]
1        [apricot]
2        [avocado]
3         [banana]
4    [bell pe, er]
dtype: object

Seeing each entry be a list may surprising, and is fairly rare in pandas.

.str.join() is the inverse of split

penguins.species.str.split("e").str.join("e")
0         Adelie
1         Adelie
         ...    
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object

.explode() to unnest entries

Use .str.explode() to take a column with list-entries (like those returned by .str.split()) and unnest each entry, so there is 1 row per each element in each list.

splits = fruits.str.split("pp")
splits
0          [a, le]
1        [apricot]
2        [avocado]
3         [banana]
4    [bell pe, er]
dtype: object

Notice that the result above has 4 list-entries (rows). The first and last rows are the splits ["a", "le"] and ["bell pe", "er"], so there are 7 elements total.

The .explode() method makes each of the 7 elements its own row.

splits.explode()
0          a
0         le
      ...   
4    bell pe
4         er
Length: 7, dtype: object

Be careful to note that it’s .explode() and not .str.explode(), since it can be used on lists of other things as well!

.str.findall() for advanced splitting

For example, the code below uses "pp?", where ? means the preceding character (“p”) is optional for matching:

fruits.str.findall("pp?")
0       [pp]
1        [p]
2         []
3         []
4    [p, pp]
dtype: object

More regular expressions

Anchor points

  • ^ - matches the beginning of a string.
  • $ - matches the end of a string.

Repetition qualifiers

  • * - matches 0 or more
  • + - matches 1 or more
  • ? - matches 0 or 1

Grouping

  • ()
  • {}
  • []

Alternatives

  • |