import pandas as pd
pd.set_option("display.max_rows", 5)
Pandas backend
The pandas backend has two key responsibilities:
- Implementing functions that enable complex and fast expressions over grouped data.
- Implementing verb behavior over pandas’ DataFrameGroupBy (e.g. a grouped mutate).
While it might seem like a lot of work, this mostly involves using a few simple strategies to take advantage of logic already existing in the pandas library.
The strategy can be described as follows:
- Operations over grouped objects return grouped objects.
  - e.g. add(SeriesGroupBy, SeriesGroupBy) -> SeriesGroupBy.
- Operations that do aggregation return a subclass of SeriesGroupBy called GroupByAgg. This represents the case where each row is its own group. It also holds information about the original data, so it can be broadcast back to the original length.
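The two rules above can be sketched with plain pandas. This is only an illustration of the idea, not siuba's implementation: add_grouped is a hypothetical helper, the grouping keys are passed in explicitly for simplicity, and transform("mean") stands in for the broadcasting that GroupByAgg makes possible.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "g": ["a", "a", "b", "b"]})
keys = df["g"]
g_x = df.groupby("g")["x"]

def add_grouped(lhs, rhs, by):
    """Hypothetical helper: add two SeriesGroupBy objects, return a SeriesGroupBy."""
    # .obj is the underlying (ungrouped) Series; regroup the result by the same keys
    return (lhs.obj + rhs.obj).groupby(by)

# rule 1: an operation over grouped objects returns a grouped object
doubled = add_grouped(g_x, g_x, keys)
is_grouped = isinstance(doubled, type(g_x))

# rule 2: an aggregation yields one row per group ...
agg = g_x.mean()
# ... which can be broadcast back to the original length
broadcast = g_x.transform("mean")
demeaned = g_x.obj - broadcast
```

Here agg has one entry per group, while broadcast and demeaned have the same length and index as df.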
As a final note, while the SQL backend uses a custom backend class (LazyTbl), the backend for table verbs in this case is simply pandas’ own DataFrameGroupBy class.
Column op translation
from siuba.ops import mean
from siuba.data import mtcars
# equivalent to mtcars.hp.mean()
mean(mtcars.hp)
146.6875
import pandas as pd
from siuba.ops.utils import operation
mean2 = operation("mean", "Agg", 1)
# Series implementation just calls the Series' mean method
mean2.register(
    pd.Series,
    lambda self, *args, **kwargs: self.mean(*args, **kwargs)
)
<function __main__.<lambda>(self, *args, **kwargs)>
mean2(mtcars.hp)
146.6875
mean2.operation.name
'mean'
The purpose of the .operation data is to make it easy to generate new translations for functions. For example, if we want to translate pandas’ ser.str.upper() method, then it helps to know it uses the str accessor.
Using an existing translation
from siuba.experimental.pd_groups.translate import method_el_op
df = pd.DataFrame({
    "x": ["a", "b", "c"],
    "g": [0, 0, 1]
})
# notice method_el_op uses some details from .operation
upper = method_el_op("upper", is_property = False, accessor = "str")
lower = method_el_op("lower", is_property = False, accessor = "str")
g_df = df.groupby("g")
res = upper(g_df.x)
# note: .obj is how you "ungroup" a SeriesGroupBy
res.obj
0 A
1 B
2 C
Name: x, dtype: object
# convert to uppercase and back to lowercase
# equivalent to df.x.str.upper().str.lower()
res2 = lower(upper(g_df.x))
res2.obj
0 a
1 b
2 c
Name: x, dtype: object
isinstance(res, pd.core.groupby.SeriesGroupBy)
True
lower(upper(g_df.x))
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f66c8da1b20>
See the internals of functions like method_el_op for details.
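As a rough idea of what such a function might do, here is a minimal sketch written against plain pandas. It is not siuba's method_el_op: make_el_op is a hypothetical name, and regrouping via ngroup() is an assumption made to keep the example self-contained (it replaces the original group labels with integer group numbers).

```python
import pandas as pd

def make_el_op(name, accessor=None):
    """Hypothetical factory: build an elementwise op from a method name and accessor."""
    def op(grouped, *args, **kwargs):
        ser = grouped.obj
        # look up the accessor (e.g. .str) if one is given, else use the Series itself
        target = getattr(ser, accessor) if accessor is not None else ser
        result = getattr(target, name)(*args, **kwargs)
        # regroup so the result is again a SeriesGroupBy with the same row grouping
        return result.groupby(grouped.ngroup())
    return op

df = pd.DataFrame({"x": ["a", "b", "c"], "g": [0, 0, 1]})
g_df = df.groupby("g")

upper = make_el_op("upper", accessor="str")
lower = make_el_op("lower", accessor="str")

res = upper(g_df.x)
res2 = lower(upper(g_df.x))
```

Because each op returns a grouped object, calls compose the same way as in the example above: res.obj holds the uppercased values and res2.obj the round-tripped lowercase ones.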
New verb implementations
As with other backends, verbs use single dispatch to register new backend implementations.
from pandas import DataFrame
from pandas.core.groupby import DataFrameGroupBy
from siuba.dply.verbs import singledispatch2
@singledispatch2(DataFrame)
def my_verb(__data):
    print("Running default.")

# register grouped implementation ----
@my_verb.register(DataFrameGroupBy)
def _my_verb_gdf(__data):
    print("Running grouped!")
# test it out ----
from siuba.data import mtcars
my_verb(mtcars.groupby("cyl"))
Running grouped!
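The same dispatch pattern can be tried without siuba by using the standard library's functools.singledispatch. The sketch below is an analogue rather than siuba's singledispatch2, and count_rows is a hypothetical verb invented for illustration; it returns values instead of printing so the two implementations are easy to compare.

```python
from functools import singledispatch

import pandas as pd
from pandas.core.groupby import DataFrameGroupBy

@singledispatch
def count_rows(data):
    # fallback for types with no registered implementation
    raise NotImplementedError(f"count_rows not implemented for {type(data)}")

@count_rows.register(pd.DataFrame)
def _count_rows_df(data):
    # ungrouped: a single row count
    return len(data)

@count_rows.register(DataFrameGroupBy)
def _count_rows_gdf(data):
    # grouped: one row count per group
    return data.size().to_dict()

df = pd.DataFrame({"x": ["a", "b", "c"], "g": [0, 0, 1]})

n_total = count_rows(df)
n_per_group = count_rows(df.groupby("g"))
```

Dispatch is decided by the type of the first argument, so passing df.groupby("g") selects the grouped implementation, just as my_verb does above.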