```python
import numpy as np
import pandas as pd

np.random.seed(123)

students = pd.DataFrame({
    'student_id': np.repeat(np.arange(2000), 10),
    'course_id': np.random.randint(1, 20, 20000),
    'score': np.random.randint(1, 100, 20000)
})

g_students = students.groupby('student_id')
```
# Optimized grouped pandas ops

## Problem: combining grouped operations is slow
If you just need to make a single calculation, then pandas methods are very fast. For example, take the code below, which calculates the minimum score for each student.
```python
%%timeit
g_students.score.min()
```

```
278 µs ± 398 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
This took very little time: less than a millisecond (one thousandth of a second!).
However, now suppose you wanted to do something more complex. Let's say you wanted to get the rows corresponding to each student's minimum score. In pandas, there are two ways to do this:
- using `transform` with a lambda
- using both the `students` and `g_students` data frames

These are shown below.
```python
%%timeit
is_student_min = g_students.score.transform(lambda x: x == x.min())
df_min1 = students[is_student_min]
```

```
394 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
%%timeit
is_student_min = students.score == g_students.score.transform('min')
df_min2 = students[is_student_min]
```

```
776 µs ± 823 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Note that while the first approach could be expressed using only the grouped data (`g_students`), it took almost 400 ms to run, roughly 500 times slower than the second! On the other hand, while the second approach was fairly quick, it required juggling two forms of the data (`students` and `g_students`).
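Since `%%timeit` doesn't persist variables, we can re-run the assignments once outside of it and confirm (a quick check of my own, not timed above) that both approaches select the same rows:

```python
# Re-run both approaches outside %%timeit so the results persist
is_student_min1 = g_students.score.transform(lambda x: x == x.min())
df_min1 = students[is_student_min1]

is_student_min2 = students.score == g_students.score.transform('min')
df_min2 = students[is_student_min2]

print(df_min1.equals(df_min2))  # True: identical rows either way
```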
Siuba attempts to optimize these operations to be quick AND require less data juggling.
## Siuba filtering is succinct AND performant
```python
from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
from siuba import _
```
```python
%%timeit
df_min3 = fast_filter(g_students, _.score == _.score.min())
```

```
1.59 ms ± 956 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
%%timeit
fast_mutate(students, is_low_score = _.score == _.score.min())
```

```
505 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
```python
%%timeit
fast_summarize(g_students, lowest_percent = _.score.min() / 100.)
```

```
1.54 ms ± 979 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
## How do the optimizations work?
Siuba replaces important parts of the call tree, like `==` and `.min()`, with functions that take a grouped series and return a grouped series. Because it then becomes grouped series all the way down, these operations are nicely composable.
```python
_.score == _.score.min()
```

```
█─==
├─█─.
│ ├─_
│ └─'score'
└─█─'__call__'
  └─█─.
    ├─█─.
    │ ├─_
    │ └─'score'
    └─'min'
```
After the expressions are executed, the verb in charge handles the output. For example, `fast_filter` uses the result (usually a boolean Series) to keep only rows where the result is True.
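For intuition, here is a toy sketch (hypothetical, not siuba's actual code) of what a "grouped series in, grouped series out" replacement for `==` could look like: it operates on the underlying values, then re-wraps the result with the same grouper.

```python
# Hypothetical sketch of an elementwise grouped operation.
# lhs and rhs are pandas SeriesGroupBy objects sharing the same grouper.
def grouped_eq(lhs, rhs):
    res = lhs.obj == rhs.obj            # compare the underlying Series elementwise
    return res.groupby(lhs.grouper)     # re-wrap so the result is still grouped

# A filter-style verb can then use the boolean result to subset the original rows
mask = grouped_eq(g_students.score, g_students.score)   # trivially all True
kept = students[mask.obj]
```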
Below is an example of how siuba replaces the `mean` function.
```python
from siuba.experimental.pd_groups.translate import method_agg_op

f_mean = method_agg_op('mean', False, None)

# result is a subclass of SeriesGroupBy
res_agg = f_mean(g_students.score)

print(res_agg)
print(res_agg.obj.head())
```
```
<siuba.experimental.pd_groups.groupby.GroupByAgg object at 0x7f4b05e31e20>
student_id
0    48.9
1    51.2
2    41.3
3    46.8
4    47.6
Name: score, dtype: float64
```
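As a quick check of my own (not part of the original output), the underlying Series should agree with what pandas computes directly:

```python
import pandas as pd

# The wrapped result should match pandas' own grouped mean
pd.testing.assert_series_equal(res_agg.obj, g_students.score.mean())
```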
## Defining custom grouped operations
I'm in the process of writing out documentation on custom operations. For more context on strategic decisions made during their implementation, see this architecture doc.
An example of implementing a custom cumulative mean function is below.
```python
from siuba.siu import symbolic_dispatch
from pandas.core.groupby import SeriesGroupBy, GroupBy
from pandas import Series


@symbolic_dispatch(cls = Series)
def cummean(x):
    """Return a same-length array, containing the cumulative mean."""
    return x.expanding().mean()


@cummean.register(SeriesGroupBy)
def _cummean_grouped(x) -> SeriesGroupBy:
    grouper = x.grouper
    n_entries = x.obj.notna().groupby(grouper).cumsum()

    res = x.cumsum() / n_entries

    return res.groupby(grouper)
```
```python
from siuba import _, mutate
from siuba.data import mtcars

# a pandas DataFrameGroupBy object
g_cyl = mtcars.groupby("cyl")

mutate(g_students, cumul_mean = cummean(_.score))
```
(grouped data frame)

|       | student_id | course_id | score | cumul_mean |
|-------|------------|-----------|-------|------------|
| 0     | 0          | 14        | 38    | 38.000000  |
| 1     | 0          | 3         | 40    | 39.000000  |
| 2     | 0          | 3         | 17    | 31.666667  |
| 3     | 0          | 7         | 74    | 42.250000  |
| 4     | 0          | 18        | 29    | 39.600000  |
| ...   | ...        | ...       | ...   | ...        |
| 19995 | 1999       | 10        | 16    | 37.833333  |
| 19996 | 1999       | 9         | 74    | 43.000000  |
| 19997 | 1999       | 3         | 1     | 37.750000  |
| 19998 | 1999       | 11        | 32    | 37.111111  |
| 19999 | 1999       | 17        | 10    | 34.400000  |

20000 rows × 4 columns
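As a sanity check on the grouped implementation (my own verification, not from the original notebook), the result should match a per-group expanding mean computed independently with `transform`:

```python
import pandas as pd

# Reference: expanding mean computed separately within each group
expected = g_students.score.transform(lambda s: s.expanding().mean())

# cummean dispatches on the concrete SeriesGroupBy, returning a grouped series
result = cummean(g_students.score)

pd.testing.assert_series_equal(result.obj, expected, check_names=False)
```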
Note that this approach requires that the custom grouped function (`_cummean_grouped`) specify a type annotation for its return value.