Lazy expressions

A siu expression specifies an action you want to perform, without performing it right away. This lets siuba verbs decide how to execute the action, depending on whether your data is a local DataFrame or a remote table.

from siuba import _

_.cyl == 4

█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4

Notice how the output represents each step in our lazy expression, with these pieces:

black box (█) - an operation, such as checking equality (==) or getting an attribute (.).
underscore (_) - a placeholder for the data.
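
To make the idea concrete, here is a minimal sketch of how a placeholder could record operations instead of running them. The `Lazy` class and its `run` method are hypothetical names made up for this illustration, and this is not how siuba implements `_` internally.

```python
from types import SimpleNamespace


class Lazy:
    """Toy placeholder that records steps (hypothetical, not siuba's code)."""

    def __init__(self, steps=()):
        # each step is a function that transforms the value produced so far
        self._steps = tuple(steps)

    def __getattr__(self, name):
        # only called for attributes not found normally, so `_steps` and
        # `run` still work; records "get attribute `name`" for later
        return Lazy(self._steps + ((lambda value: getattr(value, name)),))

    def __eq__(self, other):
        # records the equality check instead of performing it
        return Lazy(self._steps + ((lambda value: value == other),))

    def run(self, data):
        # replay the recorded steps, in order, on real data
        value = data
        for step in self._steps:
            value = step(value)
        return value


_ = Lazy()
expr = _.cyl == 4                 # nothing is computed yet
car = SimpleNamespace(cyl=4)      # stand-in for a row of data
result = expr.run(car)
print(result)
```

The expression is built once and can then be replayed on any object that has a `cyl` attribute, which is the property a verb needs in order to choose how and where to execute it.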

Method translation

You can include method calls like .isin() in a lazy expression.

from siuba import _, filter
from siuba.data import mtcars

expr = _.cyl.isin([2,4])

expr

█─'__call__'
├─█─.
│ ├─█─.
│ │ ├─_
│ │ └─'cyl'
│ └─'isin'
└─[2, 4]

When the expression is used in a verb like filter(), the verb calls the method on the underlying data. For a pandas DataFrame, _.cyl is a Series, so the Series.isin() method gets called.

# call our expr, which uses .isin
mtcars >> filter(expr)

# equivalent to...
mtcars >> filter(_.cyl.isin([2, 4]))

# or in pandas
mtcars[lambda d: d.cyl.isin([2, 4])]

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
2	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
7	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
...	...	...	...	...	...	...	...	...	...	...	...
27	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
31	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

11 rows × 11 columns

See the pandas.Series API documentation for details on all the available methods.

Use in pipes

Sometimes it is helpful to use pandas DataFrame methods in addition to siuba verbs. This can be done by piping the data to _.<some_method>().

Here is an example using the siuba verb count(), with the pandas method .sort_values().

from siuba import _, count
from siuba.data import mtcars

(mtcars
    >> count(_.cyl)         # this is a siuba verb
    >> _.sort_values("n")   # this is a pandas method
)

	cyl	n
1	6	7
0	4	11
2	8	14

Here is another example, using the DataFrame .shape attribute.

# siuba pipe
mtcars >> _.shape[0]

# regular pandas
mtcars.shape[0]
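
One way this kind of piping can work is through Python's reflected operator methods: when the left-hand side of `>>` does not handle the operation, Python asks the right-hand side. The sketch below shows that mechanism with the standard library only. The `Pipeable` class is a hypothetical stand-in, and a plain list plays the role of the data.

```python
class Pipeable:
    """Hypothetical stand-in: wraps a function so data can be piped into it."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python falls back to this method for `data >> self`
        # when `data` itself does not handle `>>`
        return self.func(data)


count_items = Pipeable(len)
first_item = Pipeable(lambda items: items[0])

rows = ["a", "b", "c"]
n_rows = rows >> count_items   # same as len(rows)
first = rows >> first_item     # same as rows[0]
print(n_rows, first)
```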

Call external functions

Use call() from siuba.siu to apply a function that is not a method of the data, such as pd.to_datetime(), inside a lazy expression.

import pandas as pd
from siuba import _, mutate
from siuba.siu import call

my_dates = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"]})

# plain pandas: parse the date strings eagerly
pd.to_datetime(my_dates.date)

0   2021-01-01
1   2021-01-02
Name: date, dtype: datetime64[ns]

# without call(): parsed is a plain copy of date, so its dtype stays object
my_dates >> mutate(parsed = _.date) >> _.parsed

0    2021-01-01
1    2021-01-02
Name: parsed, dtype: object

# call() defers pd.to_datetime until mutate() supplies the date column
my_dates >> mutate(parsed = call(pd.to_datetime, _.date))

	date	parsed
0	2021-01-01	2021-01-01
1	2021-01-02	2021-01-02
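
The examples above suggest why a wrapper like call() is needed: writing pd.to_datetime(_.date) directly would run the function immediately on the placeholder, before any data exists. The sketch below illustrates the deferral idea using only the standard library. `lazy_call` and `parse_dates` are hypothetical helpers (this is not siuba's `call`), and a dict of lists stands in for a DataFrame.

```python
from datetime import date
from operator import itemgetter


def lazy_call(func, getter):
    """Return a function that waits for data, then applies func to a column."""
    def run(data):
        return func(getter(data))
    return run


def parse_dates(column):
    # an external function that knows nothing about lazy expressions
    return [date.fromisoformat(text) for text in column]


my_dates = {"date": ["2021-01-01", "2021-01-02"]}

expr = lazy_call(parse_dates, itemgetter("date"))  # nothing runs yet
parsed = expr(my_dates)                            # the function runs here
print(parsed)
```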

Common challenges

Reserved words (`_.class`)

Most column names can be referred to using the _.some_name syntax. However, Python reserved words like class can't be used this way.

Use indexing (e.g. _["some_name"]) to refer to any column by name.

# bad: raises a SyntaxError
_.class

# good
_["class"]

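Moreover, attribute access on a pandas DataFrame resolves to the DataFrame's own methods and attributes first (e.g. _.shape or _.mean), so a column with one of those names is shadowed. Indexing solves this as well.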

df = pd.DataFrame({"mean": [1,2,3]})

# bad: df.mean is the DataFrame.mean method, not the column, so this raises a TypeError
df.mean + 1

# good (pandas)
df["mean"]

# good (siuba)
_["mean"]
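
A quick way to tell whether a column name is safe for attribute syntax is to check it against Python's own naming rules. The sketch below uses the standard library's `keyword` module. The helper name `needs_indexing` is made up for this example, and it only covers Python syntax, not clashes with pandas method names such as mean.

```python
import keyword


def needs_indexing(name):
    """True if `name` cannot be written as _.name in Python syntax."""
    return keyword.iskeyword(name) or not name.isidentifier()


for column in ["cyl", "class", "miles per gallon"]:
    print(column, needs_indexing(column))
```

Names that are reserved words or contain spaces are flagged, so they need the _["some_name"] form.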

Logical keywords: `and`, `or`, `in`

In Python libraries like pandas (and NumPy), the keywords and, or, and in cannot be used for element-wise logical operations. Special operators and methods take their place.

Below is some example data, along with the operators for logical operations.

import pandas as pd

df = pd.DataFrame({"x": [2, 3, 4, 5]})

python keyword	pandas	example
or	`|`	`(df.x < 3) | (df.x > 4)`
and	`&`	`(df.x > 3) & (df.x < 4)`
in	`.isin()`	`df.x.isin([3, 4, 5])`
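
A small sketch of why the keywords fall short: Python lets a class define what | and & mean (through `__or__` and `__and__`), but or and and always reduce their left operand to a single True or False. The `Column` class below is a hypothetical stand-in for a pandas Series, written with the standard library only. pandas similarly raises a ValueError when a Series is used where a single truth value is expected.

```python
class Column:
    """Hypothetical stand-in for a column of values (not a pandas Series)."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        return Column(v > other for v in self.values)

    def __lt__(self, other):
        return Column(v < other for v in self.values)

    def __or__(self, other):
        # element-wise "or", triggered by the | operator
        return Column(a or b for a, b in zip(self.values, other.values))

    def __and__(self, other):
        # element-wise "and", triggered by the & operator
        return Column(a and b for a, b in zip(self.values, other.values))

    def __bool__(self):
        # the `and` / `or` keywords ask for one True/False, which is ambiguous
        raise ValueError("truth value of a Column is ambiguous")


x = Column([2, 3, 4, 5])

either = (x < 3) | (x > 4)        # works: element-wise result
print(either.values)

try:
    (x < 3) or (x > 4)            # fails: `or` needs a single truth value
    keyword_failed = False
except ValueError:
    keyword_failed = True
print(keyword_failed)
```

The parentheses matter as well: | and & bind more tightly than comparison operators, so each comparison has to be wrapped before combining.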

Google colab overrides `_`

Google Colab uses a very old version of the ipykernel library, which has a bug that causes it to continuously overwrite the _ variable.

To fix this, rename the _ variable imported from siuba.

from siuba import _ as D, filter
from siuba.data import mtcars

mtcars >> filter(D.mpg > 30)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
17	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
18	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
19	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
27	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2