Mutate to transform

The mutate() function creates a new column of data, or overwrite an existing one.

We’ll use a subset of the mtcars dataset for examples.

from siuba import _, group_by, mutate, select
from siuba.data import mtcars

small_cars = mtcars[["mpg", "cyl", "hp"]]

Basics

small_cars >> mutate(mpg_per_cyl = _.mpg / _.cyl)
mpg cyl hp mpg_per_cyl
0 21.0 6 110 3.500
1 21.0 6 110 3.500
... ... ... ... ...
30 15.0 8 335 1.875
31 21.4 4 109 5.350

32 rows × 4 columns

Replacing columns

When a created column is given the same name as an existing column, it replaces that column in the data.

small_cars >> mutate(mpg = _.mpg - _.mpg.mean(), new_column = 1)
mpg cyl hp new_column
0 0.909375 6 110 1
1 0.909375 6 110 1
... ... ... ... ...
30 -5.090625 8 335 1
31 1.309375 4 109 1

32 rows × 4 columns

Note that replacement columns are put in the same position as the original columns. For example, in the result above, the mpg column is still in the first position on the left.

Using previous arguments

Arguments can refer to columns that were created in earlier arguments.

small_cars >> mutate(cyl2 = _.cyl * 2, cyl4 = _.cyl2 * 2)
mpg cyl hp cyl2 cyl4
0 21.0 6 110 12 24
1 21.0 6 110 12 24
... ... ... ... ... ...
30 15.0 8 335 16 32
31 21.4 4 109 8 16

32 rows × 5 columns

In the code above, cyl4 uses the earlier argument cyl2.

Grouped mutates

(small_cars
  >> group_by(_.cyl)
  >> mutate(
       hp_mean = _.hp.mean(),
       demeaned_hp = _.hp - _.hp_mean
     )
  )

(grouped data frame)

mpg cyl hp hp_mean demeaned_hp
0 21.0 6 110 122.285714 -12.285714
1 21.0 6 110 122.285714 -12.285714
... ... ... ... ... ...
30 15.0 8 335 209.214286 125.785714
31 21.4 4 109 82.636364 26.363636

32 rows × 5 columns

(small_cars
  >> group_by(_.cyl)
  >> mutate(
       hp_per_cyl = _.hp / _.cyl,
       diff = _.hp_per_cyl - _.hp_per_cyl.shift(1)
     )
  )

(grouped data frame)

mpg cyl hp hp_per_cyl diff
0 21.0 6 110 18.333333 NaN
1 21.0 6 110 18.333333 0.000
... ... ... ... ... ...
30 15.0 8 335 41.875000 8.875
31 21.4 4 109 27.250000 -1.000

32 rows × 5 columns