from siuba.data import mtcars
from siuba import _, summarize, group_by, select
Summarize to aggregate
The summarize() function creates new columns in your table, based on an aggregation. An aggregation takes data and reduces it to a single number. When applied to grouped data, this function returns one row per group.
Summarize over all rows
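mtcars >> summarize(avg_mpg = _.mpg.mean())
With no grouping, the aggregation runs over every row of the table and the result has a single row. For reference, here is the input table:
mtcars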
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
32 rows × 11 columns
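To make the all-rows case concrete, here is a minimal plain-Python sketch of what an ungrouped summarize does conceptually. It is not how siuba is implemented. The helper name `summarize_all` and the list-of-dicts table are assumptions made for illustration, and the data is only the four rows visible in the excerpt above, not the full dataset.

```python
from statistics import mean

# Hypothetical stand-in for a table: a list of row dicts holding the four
# rows visible in the excerpt above (not the full 32-row dataset).
rows = [
    {"mpg": 21.0, "cyl": 6},
    {"mpg": 21.0, "cyl": 6},
    {"mpg": 15.0, "cyl": 8},
    {"mpg": 21.4, "cyl": 4},
]


def summarize_all(rows, **aggregations):
    """Apply each named aggregation to the whole table; return one summary row."""
    return {name: agg(rows) for name, agg in aggregations.items()}


# Each aggregation receives all rows and reduces them to a single number.
summary = summarize_all(
    rows,
    avg_mpg=lambda rs: mean(r["mpg"] for r in rs),
)
print(summary)
```

However many rows go in, one row comes out, with one value per named aggregation.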
Summarize over groups
Use group_by() to split the data up, apply some aggregation, and then combine results.
(mtcars
  >> group_by(_.cyl)
  >> summarize(
       avg = _.mpg.mean(),
       range = _.mpg.max() - _.mpg.min(),
       avg_per_cyl = (_.mpg / _.cyl).mean()
  )
)
| | cyl | avg | range | avg_per_cyl |
---|---|---|---|---|
0 | 4 | 26.663636 | 12.5 | 6.665909 |
1 | 6 | 19.742857 | 3.6 | 3.290476 |
2 | 8 | 15.100000 | 8.8 | 1.887500 |
Note that cyl has 3 unique values (4, 6, and 8), so the resulting table has 3 rows.
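The split, apply, combine pattern behind a grouped summarize can be sketched in plain Python. This is an illustration only, not siuba's implementation. The helper `summarize_by` is a hypothetical name, and the mini-table uses made-up values rather than rows from mtcars.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mini-table with made-up values (not rows from mtcars).
rows = [
    {"cyl": 4, "mpg": 30.0},
    {"cyl": 4, "mpg": 20.0},
    {"cyl": 6, "mpg": 18.0},
    {"cyl": 8, "mpg": 16.0},
    {"cyl": 8, "mpg": 10.0},
]


def summarize_by(rows, key, **aggregations):
    """Group rows by `key`, then reduce each group to one summary row."""
    # Split: bucket the rows by the value of the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Apply and combine: one output row per group, holding the group key
    # plus one value per named aggregation.
    result = []
    for value in sorted(groups):
        out = {key: value}
        for name, agg in aggregations.items():
            out[name] = agg(groups[value])
        result.append(out)
    return result


summary = summarize_by(
    rows,
    "cyl",
    avg=lambda rs: mean(r["mpg"] for r in rs),
    range=lambda rs: max(r["mpg"] for r in rs) - min(r["mpg"] for r in rs),
)
for row in summary:
    print(row)
```

The number of output rows equals the number of distinct values in the grouping column, which is why three cyl values give a three-row result.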