[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)

Pandas backend

The pandas backend has two key responsibilities:

  1. Implementing functions that enable complex and fast expressions over grouped data.

  2. Implementing verb behavior over pandas’ DataFrameGroupBy (e.g. a grouped mutate).

While it might seem like a lot of work, this mostly involves using a few simple strategies to take advantage of logic already existing in the pandas library.

The strategy can be described as follows:

  • operations over grouped objects return grouped objects.

    • e.g. add(SeriesGroupBy, SeriesGroupBy) -> SeriesGroupBy.

  • operations that do aggregation return a subclass of SeriesGroupBy called GroupByAgg. This represents the case where each row is its own group. It also holds information about the original data, so it can be broadcast back to the original length.

As a final note, while the SQL backend uses a custom backend class (LazyTbl), the backend for table verbs in this case is just the pandas’ DataFrameGroupBy class itself.

Column op translation

image1

[2]:
from siuba.ops import mean
from siuba.data import mtcars

# equivalent to mtcars.hp.mean()
mean(mtcars.hp)
[2]:
146.6875
[3]:
import pandas as pd
from siuba.ops.utils import operation

mean2 = operation("mean", "Agg", 1)

# Series implementation just calls the Series' mean method
mean2.register(
    pd.Series,
    lambda self, *args, **kwargs: self.mean(*args, **kwargs)
)
[3]:
<function __main__.<lambda>(self, *args, **kwargs)>
[4]:
mean2(mtcars.hp)
[4]:
146.6875
[5]:
mean2.operation.name
[5]:
'mean'

The purpose of the .operation data is to make it easy to generate new translations for functions. For example, if we want to translate pandas’ ser.str.upper() method, then it helps to know it uses the str accessor.

Using an existing translation

[6]:
from siuba.experimental.pd_groups.translate import method_el_op

df = pd.DataFrame({
    "x": ["a", "b", "c"],
    "g": [0, 0, 1]
})

# notice method_ag_op uses some details from .operation
upper = method_el_op("upper", is_property = False, accessor = "str")
lower = method_el_op("lower", is_property = False, accessor = "str")

g_df = df.groupby("g")

res = upper(g_df.x)

# note: .obj is how you "ungroup" a SeriesGroupBy
res.obj
[6]:
0    A
1    B
2    C
Name: x, dtype: object
[7]:
# convert to uppercase and back to lowercase
# equivalent to df.x.str.upper().str.lower()
res2 = lower(upper(g_df.x))
res2.obj
[7]:
0    a
1    b
2    c
Name: x, dtype: object
[8]:
isinstance(res, pd.core.groupby.SeriesGroupBy)
[8]:
True
[9]:
lower(upper(g_df.x))
[9]:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f59fc1d39d0>

See the internals of functions like method_el_op for details.

New verb implementations

Like with other backends, verbs use single dispatch to register new backend implementations.

[10]:
from pandas import DataFrame
from pandas.core.groupby import DataFrameGroupBy
from siuba.dply.verbs import singledispatch2

@singledispatch2(DataFrame)
def my_verb(__data):
    print("Running default.")

# register grouped implementation ----
@my_verb.register(DataFrameGroupBy)
def _my_verb_gdf(__data):
    print("Running grouped!")


# test it out ----
from siuba.data import mtcars

my_verb(mtcars.groupby("cyl"))
Running grouped!

Edit page on github here. Interactive version: Binder badge