Group by

This function is used to specify groups in your data for verbs—like mutate(), filter(), and summarize()—to perform operations over.

For example, in the mtcars dataset, there are 3 possible values for cylinders (cyl). You could use group_by to say that you want to perform operations separately for each of these 3 groups of values.

An important compliment to group_by() is ungroup(), which removes all current groupings.

from siuba import _, group_by, ungroup, filter, mutate, summarize
from siuba.data import mtcars

small_cars = mtcars[["cyl", "gear", "hp"]]

small_cars
cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Group by column

The simplest way to use group by is to specify your grouping column directly. This is shown below, by grouping mtcars according to its 3 groups of cylinder values (4, 6, or 8 cylinders).

g_cyl = small_cars >> group_by(_.cyl)

g_cyl

(grouped data frame)

cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Note that the result is simply a pandas GroupedDataFrame, which is what is returned if you use mtcars.groupby('cyl'). Normally, a GroupedDataFrame doesn’t print out a preview of itself, but siuba modifies it to do so, since this is very handy.

The group_by function is most often used with filter, mutate, and summarize.

Filter

# keep rows where hp is greater than mean hp within cyl group
g_cyl >> filter(_.hp > _.hp.mean())

(grouped data frame)

cyl gear hp
2 4 4 93
6 8 3 245
... ... ... ...
30 8 5 335
31 4 4 109

15 rows × 3 columns

Mutate

g_cyl >> mutate(avg_hp = _.hp.mean())

(grouped data frame)

cyl gear hp avg_hp
0 6 4 110 122.285714
1 6 4 110 122.285714
... ... ... ... ...
30 8 5 335 209.214286
31 4 4 109 82.636364

32 rows × 4 columns

Summarize

g_cyl >> summarize(avg_hp = _.hp.mean())
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286

Group by multiple columns

In order to group by multiple columns, simply specify them all as arguments to group_by.

small_cars >> group_by(_.cyl, _.gear)

(grouped data frame)

cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns

Group by an expression

small_cars >> group_by(high_hp = _.hp > 300)

(grouped data frame)

cyl gear hp high_hp
0 6 4 110 False
1 6 4 110 False
... ... ... ... ...
30 8 5 335 True
31 4 4 109 False

32 rows × 4 columns

Count rows

from siuba import _, group_by, count

# count number of rows per group
mtcars >> group_by(_.cyl, _.gear) >> summarize(n = _.shape[0])

# equivalent
mtcars >> count(_.cyl, _.gear)
cyl gear n
0 4 3 1
1 4 4 8
... ... ... ...
6 8 3 12
7 8 5 2

8 rows × 3 columns

Ungroup

small_cars >> group_by(_.cyl) >> ungroup()
cyl gear hp
0 6 4 110
1 6 4 110
... ... ... ...
30 8 5 335
31 4 4 109

32 rows × 3 columns