Custom table verbs

You may be wondering how a siuba function, like mutate(), could work on a SQL database. This is because these functions are defined using a technique called single dispatch. This approach allows you to define class-specific versions of a function.

# 🚧 TODO: expose verb_dispatch on siu module
import pandas as pd

from siuba.siu.dispatchers import verb_dispatch
from import mtcars

Create a new verb with verb_dispatch()

Use the verb_dispatch() decorator to create a new table verb function.

The code below creates a function called head(), with an implementation that works specifically on a DataFrame.

# DataFrame version of function ---

def head(__data, n = 5):
    return __data.head(n)

head(mtcars, 2)
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.9 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4

Register a SQL translation

We can define a SQL specific version, that acts on a SqlAlchemy Table by registering a new translation of our function head, with the @head.register decorator.

from siuba.sql import LazyTbl

def _head_sql(__data: LazyTbl, n = 5):
    new_select = __data.last_select.limit(n)
    return __data.append_op(new_select)

Here is the function being run on a mock postgresql backend:

from import cars_sql
from siuba import show_query

q = cars_sql >> head(3) >> show_query()
SELECT cars.cyl, cars.mpg, cars.hp 
FROM cars 

siuba uses SqlAlchemy internally to handle SQL queries. Most verbs in siuba involve creating a new subquery, or modifying the current query.

LazyTbl uses the properties .group_by and .order_by to track options set by group_by() and arrange(), respectively.

Grouped data

Since single dispatch functions define how to execute an action for a specific class of data, it allows siuba to handle grouped data in two ways:

  • pandas - register dispatchers for its special grouped data classes (DataFrameGroupBy, SeriesGroupBy).
  • SQL - use a single class for grouped and ungrouped data, with grouping info as an attribute (siuba.sql.LazyTbl).

For example, here is a simple verb that calculates the number of rows in a grouped DataFrame.

from pandas.core.groupby import DataFrameGroupBy

def size(__data):
    return __data.size()

4    11
6     7
8    14
dtype: int64

Why not use methods?

Why use singledispatch rather than create a class method like mtcars.head()?

There are two big benefits:

  1. Anyone can define and package a function. Using it is just a matter of importing it. With a method, you need to somehow put it onto the class representing your data. You end up with 300+ methods on a class.
  2. Your function might do something that is not the class’s core responsibility. In this case, it should not be part of the class definition.

See the post Single dispatch for democratizing data science tools for more.