from siuba.data import mtcars
from siuba import _, summarize, group_by, select

Summarize to aggregate
The summarize() function creates new columns in your table based on an aggregation. An aggregation takes data and reduces it to a single number. When applied to ungrouped data, summarize() returns a single row. When applied to grouped data, it returns one row per grouping.
Summarize over all rows
mtcars >> summarize(avg_mpg = _.mpg.mean())
mtcars

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
| 31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
32 rows × 11 columns
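
As a rough illustration of what an ungrouped aggregation produces, here is a minimal pandas-only sketch. It assumes a hand-built four-row sample (the four rows visible in the table above, with only the mpg and cyl columns) rather than the full mtcars data, and the names `sample` and `result` are made up for this example. It is meant to show the shape of the result, not how siuba implements summarize().

```python
import pandas as pd

# Four rows copied by hand from the table preview above (mpg and cyl only).
# This is an illustrative sample, not the full 32-row dataset.
sample = pd.DataFrame({
    "mpg": [21.0, 21.0, 15.0, 21.4],
    "cyl": [6, 6, 8, 4],
})

# Aggregating an ungrouped table collapses all rows into a single row
# holding one column per aggregation.
result = pd.DataFrame({"avg_mpg": [sample["mpg"].mean()]})

print(result.shape)  # one row, one column
```

However many rows go in, the result has exactly one row, because no grouping was applied.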
Summarize over groups
Use group_by() to split the data into groups. summarize() then applies the aggregation to each group and combines the results into one table.
(mtcars
  >> group_by(_.cyl)
  >> summarize(
      avg = _.mpg.mean(),
      range = _.mpg.max() - _.mpg.min(),
      avg_per_cyl = (_.mpg / _.cyl).mean()
  )
)

| | cyl | avg | range | avg_per_cyl |
|---|---|---|---|---|
| 0 | 4 | 26.663636 | 12.5 | 6.665909 |
| 1 | 6 | 19.742857 | 3.6 | 3.290476 |
| 2 | 8 | 15.100000 | 8.8 | 1.887500 |
Note there are 3 unique groupings for cyl (4, 6, and 8), so the resulting table has 3 rows.
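
The split, apply, and combine steps behind a grouped summarize can be spelled out in plain Python. The sketch below is only a conceptual illustration using the standard library. It assumes a hand-copied four-row sample instead of the full dataset, so the numbers differ from the table above, and the names `rows`, `groups`, and `summary` are hypothetical. It does not reflect how siuba or pandas perform grouping internally.

```python
from collections import defaultdict
from statistics import mean

# Hand-copied (mpg, cyl) values: a small illustrative sample, not all of mtcars.
rows = [
    {"mpg": 21.0, "cyl": 6},
    {"mpg": 21.0, "cyl": 6},
    {"mpg": 15.0, "cyl": 8},
    {"mpg": 21.4, "cyl": 4},
]

# Split: bucket the rows by their cyl value.
groups = defaultdict(list)
for row in rows:
    groups[row["cyl"]].append(row)

# Apply and combine: reduce each bucket to one output row,
# then collect the rows into a single summary table.
summary = []
for cyl, members in sorted(groups.items()):
    mpg = [r["mpg"] for r in members]
    summary.append({
        "cyl": cyl,
        "avg": mean(mpg),
        "range": max(mpg) - min(mpg),
        "avg_per_cyl": mean(r["mpg"] / r["cyl"] for r in members),
    })

for row in summary:
    print(row)
```

The sample contains three distinct cyl values, so `summary` ends up with three rows, one per group.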