SQL basics

Up to this point we’ve covered lazy expressions (_), and using table verbs. A major benefit of these two approaches is that they allow us to change how siuba behaves depending on the data source on which it is operating.

Setup

For these examples we first set up a sqlite database, with an mtcars table.

from sqlalchemy import create_engine
from siuba.sql import LazyTbl
from siuba import _, group_by, summarize, show_query, collect 
from siuba.data import mtcars

# copy in to sqlite, using the pandas .to_sql() method
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
32

Accessing tables

Use the LazyTbl class to connect to a SQL table. Printing the table will show a preview of the first few rows.

# Create a lazy SQL DataFrame
tbl_mtcars = LazyTbl(engine, "mtcars")
tbl_mtcars
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
index mpg cyl disp hp drat wt qsec vs am gear carb
0 0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

# .. may have more rows

Notice that we defined the variable tbl_mtcars to refer to the mtcars table in the database. When we print tbl_mtcars it shows a preview of the underlying data, along with some notes about the database being used: # DB Conn: Engine(sqlite:///:memory:).

Basic analysis

You don’t need to change your analysis code to run it on a SQL table. For example, the code below groups and summarizes the data.

# connect with siuba

tbl_query = (tbl_mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )

tbl_query
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286

# .. may have more rows

Under the hood, functions like summarize know how to convert the lazy expressions like _.hp.mean() shown in the code above to SQL.

Show query

By default, printing out a LazyTbl shows a preview of the data. Use show_query() to see the actual SQL query siuba will generate.

q = tbl_query >> show_query()
SELECT mtcars.cyl, avg(mtcars.hp) AS avg_hp 
FROM mtcars GROUP BY mtcars.cyl

Collect to DataFrame

Use collect() to fetch the full query results as a pandas DataFrame.

tbl_query >> collect()
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286