This function is used to specify groups in your data for verbs—like mutate(), filter(), and summarize()—to perform operations over.
For example, in the mtcars dataset, there are 3 possible values for cylinders (cyl). You could use group_by to say that you want to perform operations separately for each of these 3 groups of values.
An important compliment to group_by() is ungroup(), which removes all current groupings.
The simplest way to use group by is to specify your grouping column directly. This is shown below, by grouping mtcars according to its 3 groups of cylinder values (4, 6, or 8 cylinders).
g_cyl = small_cars >> group_by(_.cyl)g_cyl
(grouped data frame)
cyl
gear
hp
0
6
4
110
1
6
4
110
...
...
...
...
30
8
5
335
31
4
4
109
32 rows × 3 columns
Note that the result is simply a pandas GroupedDataFrame, which is what is returned if you use mtcars.groupby('cyl'). Normally, a GroupedDataFrame doesn’t print out a preview of itself, but siuba modifies it to do so, since this is very handy.
The group_by function is most often used with filter, mutate, and summarize.
Filter
# keep rows where hp is greater than mean hp within cyl groupg_cyl >>filter(_.hp > _.hp.mean())
(grouped data frame)
cyl
gear
hp
2
4
4
93
6
8
3
245
...
...
...
...
30
8
5
335
31
4
4
109
15 rows × 3 columns
Mutate
g_cyl >> mutate(avg_hp = _.hp.mean())
(grouped data frame)
cyl
gear
hp
avg_hp
0
6
4
110
122.285714
1
6
4
110
122.285714
...
...
...
...
...
30
8
5
335
209.214286
31
4
4
109
82.636364
32 rows × 4 columns
Summarize
g_cyl >> summarize(avg_hp = _.hp.mean())
cyl
avg_hp
0
4
82.636364
1
6
122.285714
2
8
209.214286
Group by multiple columns
In order to group by multiple columns, simply specify them all as arguments to group_by.
small_cars >> group_by(_.cyl, _.gear)
(grouped data frame)
cyl
gear
hp
0
6
4
110
1
6
4
110
...
...
...
...
30
8
5
335
31
4
4
109
32 rows × 3 columns
Group by an expression
small_cars >> group_by(high_hp = _.hp >300)
(grouped data frame)
cyl
gear
hp
high_hp
0
6
4
110
False
1
6
4
110
False
...
...
...
...
...
30
8
5
335
True
31
4
4
109
False
32 rows × 4 columns
Count rows
from siuba import _, group_by, count# count number of rows per groupmtcars >> group_by(_.cyl, _.gear) >> summarize(n = _.shape[0])# equivalentmtcars >> count(_.cyl, _.gear)