SQL backend

import pandas as pd
pd.set_option("display.max_rows", 5)

from siuba.siu.format import Formatter

show_tree = lambda x: print(Formatter().format(x))
Warning

This document is being revised (though the code runs correctly!).

Step 1: Column Translation

Column translation requires three pieces:

  1. Locals: Functions for creating the sqlalchemy clause corresponding to an operation.
  2. Column Data: Classes representing columns under normal and aggregate settings.
  3. Translator: A class that can take a symbolic expression (e.g. _.x.mean()) and return it in call form: mean(_.x).
  4. Codata visitor: A class that takes the above call, and swaps in the sql dialect version of each call.
# Column data =================================================================

from siuba.sql.translate import SqlColumn, SqlColumnAgg

# used if you want to add a scalar or window translation
# (eg. something that gets used in mutate)
class WowSqlColumn(SqlColumn): pass

# used if you want to add a aggregate translation
# (eg. something that gets used in a query with a GROUP BY clause)
class WowSqlColumnAgg(WowSqlColumn, SqlColumn): pass


# Locals ======================================================================

from siuba.sql.translate import (
        win_over,
        win_agg,
        sql_agg,
        sql_scalar,
        sql_colmeth,
        sql_not_impl,
        )

scalar = {
    "__add__": sql_colmeth("__add__"),
    "round": sql_scalar("round"),
}

window = {
    "rank": win_over("rank"),
    "mean": win_agg("mean"),
}

aggregation = {
    "rank": sql_not_impl(),
    "mean": sql_agg("mean"),
}


# Translator ==================================================================

from siuba.sql.translate import SqlTranslator

translator = SqlTranslator.from_mappings(
    WowSqlColumn, WowSqlColumnAgg
)

# TODO: how to work in codata visitor?

Column Data

There are two kinds of data classes, corresponding to whether the generated outermost query in the generated SQL will use a GROUP BY clause.

Locals

The entries of each local dictionary are functions that take a sqlalchemy.sql.ClauseElement–which is the parent class of many sqlalchemy elements–and returns a ClauseElement.

from sqlalchemy import sql

expr_rank = window["rank"](WowSqlColumn(), sql.column("a_col"))
expr_rank
<siuba.sql.translate.RankOver object at 0x7f49c7825e50>
print(expr_rank)
rank() OVER (ORDER BY a_col)

Translator

Below, we set up a sqlalchemy select statement in order to demonstrate the translator in action.

from siuba import _


from sqlalchemy.sql import column, select

sel = select([column('x'), column('y')])

Then we feed the columns to the translated call.

call_add = translator.translate(_.x + _.y)

show_tree(call_add)
█─'__call__'
├─█─'__custom_func__'
│ └─<function singledispatch.<locals>.wrapper at 0x7f49c40fedc0>
├─█─.
│ ├─_
│ └─'x'
└─█─.
  ├─_
  └─'y'

Note that behind the scenes, the translator goes down the call tree and swaps functions like "__add__" with the local translations.

from siuba.siu.visitors import CodataVisitor
codata = CodataVisitor(WowSqlColumn, object)

call_add_final = codata.visit(call_add)

show_tree(call_add_final)
█─'__call__'
├─█─'__custom_func__'
│ └─<function sql_colmeth.<locals>.f at 0x7f49bff79550>
├─<__main__.WowSqlColumn object at 0x7f49c7825b50>
├─█─.
│ ├─_
│ └─'x'
└─█─.
  ├─_
  └─'y'
# the root node is __add__. shown as +.
_.x + _.y
█─+
├─█─.
│ ├─_
│ └─'x'
└─█─.
  ├─_
  └─'y'
# We can see this in action by calling the translation directly.
scalar["__add__"](WowSqlColumn(), sel.columns.x, sel.columns.y)
/tmp/ipykernel_4007/1405687756.py:2: SADeprecationWarning: The SelectBase.c and SelectBase.columns attributes are deprecated and will be removed in a future release; these attributes implicitly create a subquery that should be explicit.  Please call SelectBase.subquery() first in order to create a subquery, which then contains this attribute.  To access the columns that this SELECT object SELECTs from, use the SelectBase.selected_columns attribute. (deprecated since: 1.4)
  scalar["__add__"](WowSqlColumn(), sel.columns.x, sel.columns.y)
<sqlalchemy.sql.elements.BinaryExpression object at 0x7f49c7825460>

By default the translate method assumes the expression is using window functions, so operations like .mean() return SqlAlchemy Over clauses.

f_translate = translator.translate(_.x.mean())

f_translate_co = codata.visit(f_translate)
expr = f_translate_co(sel.columns)

expr
/tmp/ipykernel_4007/2826175941.py:4: SADeprecationWarning: The SelectBase.c and SelectBase.columns attributes are deprecated and will be removed in a future release; these attributes implicitly create a subquery that should be explicit.  Please call SelectBase.subquery() first in order to create a subquery, which then contains this attribute.  To access the columns that this SELECT object SELECTs from, use the SelectBase.selected_columns attribute. (deprecated since: 1.4)
  expr = f_translate_co(sel.columns)
<siuba.sql.translate.AggOver object at 0x7f49bfe8e550>
print(expr)
avg(anon_1.x) OVER ()

Keep in mind–since the translator doesn’t know about grouping variables–it returns an empty over clause. This separation of concerns:

  • translator: handles simple column ops, including returning over clauses to be filled.
  • backend: handles broader contexts–like if the data has been grouped, arranged, or limited–by visiting the result of each translation.

User defined functions

Remember those column data classes we made and passed to the translator? They allow users to register custom column operation functions!

Below we show a custom round function, which calls the sqlalchemy that usually corresponds to ROUND(<col>).

from siuba.siu import _, symbolic_dispatch
from sqlalchemy import sql

@symbolic_dispatch(cls = WowSqlColumn)
def round(self, col):
    print("running round function")
    
    return sql.function.round(col)

# Creates a special symbolic object
round(_)
█─'__call__'
├─█─'__custom_func__'
│ └─<function round at 0x7f49bfef5700>
└─_

Note that a special feature of symbolic_dispatch, is that it let’s you form complex expressions by passing _ to your function call.

# Symbolic objects let you express complex operations
round(_) + 9999
█─+
├─█─'__call__'
│ ├─█─'__custom_func__'
│ │ └─<function round at 0x7f49bfef5700>
│ └─_
└─9999

.

Step 2: Backend class

So far we’ve only discussed how to translate symbolic expressions like _.x + _.y in the mutate call below.

mutate(backend, _.x + _.y)

The last piece is implementing a backend that works with the mutate function itself. If you are just adding support for a new SQL dialect, you can use the LazyTbl class provided by siuba.sql. While the translations above work on columns of data, this class handles tables and queries over data (e.g. select statements).

More specifically, LazyTbl has 2 jobs:

  1. Representing a specific table of data from the database.
  2. Creating SQL queries via table verbs. This includes using translators above, as well as broader SQL constructs (e.g. ordering, grouping / partitions, limiting rows, etc..).
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
32

Representing SQL tables

from siuba.sql import LazyTbl

tbl_cars = LazyTbl(engine, "mtcars")
tbl_cars
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
index mpg cyl disp hp drat wt qsec vs am gear carb
0 0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

# .. may have more rows

Note that you can access a number of useful attributes.

# calls the underlying translator and codata
f_add = tbl_cars.shape_call(_.mpg + _.hp)
f_add(tbl_cars.last_op.columns)
<sqlalchemy.sql.elements.BinaryExpression object at 0x7f49bfbc4c10>
# the underlying sqlalchemy source table
tbl_cars.tbl
Table('mtcars', MetaData(bind=Engine(sqlite:///:memory:)), Column('index', BIGINT(), table=<mtcars>), Column('mpg', FLOAT(), table=<mtcars>), Column('cyl', BIGINT(), table=<mtcars>), Column('disp', FLOAT(), table=<mtcars>), Column('hp', BIGINT(), table=<mtcars>), Column('drat', FLOAT(), table=<mtcars>), Column('wt', FLOAT(), table=<mtcars>), Column('qsec', FLOAT(), table=<mtcars>), Column('vs', BIGINT(), table=<mtcars>), Column('am', BIGINT(), table=<mtcars>), Column('gear', BIGINT(), table=<mtcars>), Column('carb', BIGINT(), table=<mtcars>), schema=None)
# grouping info
from siuba import group_by, _

tbl2 = group_by(tbl_cars, rounded_mpg = _.mpg.round(10))
tbl2.group_by
('rounded_mpg',)

Table verbs

Backends register on a dispatcher called singledispatch2. This is shown below with an assign verb that is a limited implementation of mutate.

from siuba.dply.verbs import singledispatch2
from sqlalchemy.sql import select

@singledispatch2(LazyTbl)
def assign(__data, **kwargs):

    new_cols = []
    for k, expr in kwargs.items():

        # .shape_call mostly data.translator under the hood
        new_call = __data.shape_call(expr)
        sql_expr = new_call(__data.last_op.columns)
        
        new_cols.append(sql_expr.label(k))
    
    # copy of data, where .last_op is the new select statement
    new_data = __data.append_op(select(new_cols))
    
    return new_data


# test out
tbl_assigned = assign(tbl_cars, res = _.mpg + _.hp)

tbl_assigned 
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
res
0 131.0
1 131.0
2 115.8
3 131.4
4 193.7

# .. may have more rows

Note that the backend.last_op property holds the current select statement, so the verb can adapt it, or wrap it in another select statement.

print(tbl_assigned.last_op)
SELECT mtcars.mpg + mtcars.hp AS res 
FROM mtcars

Unit tests

Siuba generates a table of all supported operations per backend, and tests against a simple example for each, to ensure they produce the same result as the pandas backend.

However, some translations may deviate in the following ways:

  • returning a float rather than an int (or vice-versa).
  • an aggregation that works in summarize, but not in mutate verbs.

In order to mark translations as deviating, you can use the annotation functions.

from sqlalchemy import sql
from siuba.sql.translate import annotate, wrap_annotate

# puts an `operation` attribute on function, describing limitations
f_and = annotate(lambda col: col & False, input_type = "bool")
f_and(sql.column('a'))
<sqlalchemy.sql.elements.AsBoolean object at 0x7f49bfb47a90>
f_and.operation
{'input_type': 'bool'}
# creates a new function (wrapping the old one), and annotates that
f_and2 = wrap_annotate(f_and, something_else = False)

f_and2.operation
{'something_else': False}