Group by

This function is used to specify groups in your data for verbs—like mutate(), filter(), and summarize()—to perform operations over.

For example, in the mtcars dataset, there are 3 possible values for cylinders (cyl). You could use group_by to say that you want to perform operations separately for each of these 3 groups of values.

An important compliment to group_by() is ungroup(), which removes all current groupings.

from siuba import _, group_by, ungroup, filter, mutate, summarize
from siuba.data import mtcars

small_cars = mtcars[["cyl", "gear", "hp"]]

small_cars

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns

Group by column

The simplest way to use group by is to specify your grouping column directly. This is shown below, by grouping mtcars according to its 3 groups of cylinder values (4, 6, or 8 cylinders).

g_cyl = small_cars >> group_by(_.cyl)

g_cyl

(grouped data frame)

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns

Note that the result is simply a pandas GroupedDataFrame, which is what is returned if you use mtcars.groupby('cyl'). Normally, a GroupedDataFrame doesn’t print out a preview of itself, but siuba modifies it to do so, since this is very handy.

The group_by function is most often used with filter, mutate, and summarize.

Filter

# keep rows where hp is greater than mean hp within cyl group
g_cyl >> filter(_.hp > _.hp.mean())

(grouped data frame)

	cyl	gear	hp
2	4	4	93
6	8	3	245
...	...	...	...
30	8	5	335
31	4	4	109

15 rows × 3 columns

Mutate

g_cyl >> mutate(avg_hp = _.hp.mean())

(grouped data frame)

	cyl	gear	hp	avg_hp
0	6	4	110	122.285714
1	6	4	110	122.285714
...	...	...	...	...
30	8	5	335	209.214286
31	4	4	109	82.636364

32 rows × 4 columns

Summarize

g_cyl >> summarize(avg_hp = _.hp.mean())

	cyl	avg_hp
0	4	82.636364
1	6	122.285714
2	8	209.214286

Group by multiple columns

In order to group by multiple columns, simply specify them all as arguments to group_by.

small_cars >> group_by(_.cyl, _.gear)

(grouped data frame)

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns

Group by an expression

small_cars >> group_by(high_hp = _.hp > 300)

(grouped data frame)

	cyl	gear	hp	high_hp
0	6	4	110	False
1	6	4	110	False
...	...	...	...	...
30	8	5	335	True
31	4	4	109	False

32 rows × 4 columns

Count rows

from siuba import _, group_by, count

# count number of rows per group
mtcars >> group_by(_.cyl, _.gear) >> summarize(n = _.shape[0])

# equivalent
mtcars >> count(_.cyl, _.gear)

	cyl	gear	n
0	4	3	1
1	4	4	8
...	...	...	...
6	8	3	12
7	8	5	2

8 rows × 3 columns

Ungroup

small_cars >> group_by(_.cyl) >> ungroup()

	cyl	gear	hp
0	6	4	110
1	6	4	110
...	...	...	...
30	8	5	335
31	4	4	109

32 rows × 3 columns