from siuba import _, group_by, mutate, select
from siuba.data import mtcars
= mtcars[["mpg", "cyl", "hp"]] small_cars
Mutate to transform
The mutate()
function creates a new column of data, or overwrite an existing one.
We’ll use a subset of the mtcars
dataset for examples.
Basics
>> mutate(mpg_per_cyl = _.mpg / _.cyl) small_cars
mpg | cyl | hp | mpg_per_cyl | |
---|---|---|---|---|
0 | 21.0 | 6 | 110 | 3.500 |
1 | 21.0 | 6 | 110 | 3.500 |
... | ... | ... | ... | ... |
30 | 15.0 | 8 | 335 | 1.875 |
31 | 21.4 | 4 | 109 | 5.350 |
32 rows × 4 columns
Replacing columns
When a created column is given the same name as an existing column, it replaces that column in the data.
>> mutate(mpg = _.mpg - _.mpg.mean(), new_column = 1) small_cars
mpg | cyl | hp | new_column | |
---|---|---|---|---|
0 | 0.909375 | 6 | 110 | 1 |
1 | 0.909375 | 6 | 110 | 1 |
... | ... | ... | ... | ... |
30 | -5.090625 | 8 | 335 | 1 |
31 | 1.309375 | 4 | 109 | 1 |
32 rows × 4 columns
Note that replacement columns are put in the same position as the original columns. For example, in the result above, the mpg
column is still in the first position on the left.
Using previous arguments
Arguments can refer to columns that were created in earlier arguments.
>> mutate(cyl2 = _.cyl * 2, cyl4 = _.cyl2 * 2) small_cars
mpg | cyl | hp | cyl2 | cyl4 | |
---|---|---|---|---|---|
0 | 21.0 | 6 | 110 | 12 | 24 |
1 | 21.0 | 6 | 110 | 12 | 24 |
... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 335 | 16 | 32 |
31 | 21.4 | 4 | 109 | 8 | 16 |
32 rows × 5 columns
In the code above, cyl4
uses the earlier argument cyl2
.
Grouped mutates
(small_cars>> group_by(_.cyl)
>> mutate(
= _.hp.mean(),
hp_mean = _.hp - _.hp_mean
demeaned_hp
) )
(grouped data frame)
mpg | cyl | hp | hp_mean | demeaned_hp | |
---|---|---|---|---|---|
0 | 21.0 | 6 | 110 | 122.285714 | -12.285714 |
1 | 21.0 | 6 | 110 | 122.285714 | -12.285714 |
... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 335 | 209.214286 | 125.785714 |
31 | 21.4 | 4 | 109 | 82.636364 | 26.363636 |
32 rows × 5 columns
(small_cars>> group_by(_.cyl)
>> mutate(
= _.hp / _.cyl,
hp_per_cyl = _.hp_per_cyl - _.hp_per_cyl.shift(1)
diff
) )
(grouped data frame)
mpg | cyl | hp | hp_per_cyl | diff | |
---|---|---|---|---|---|
0 | 21.0 | 6 | 110 | 18.333333 | NaN |
1 | 21.0 | 6 | 110 | 18.333333 | 0.000 |
... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 335 | 41.875000 | 8.875 |
31 | 21.4 | 4 | 109 | 27.250000 | -1.000 |
32 rows × 5 columns