Lazy expressions

A siu expression is a way of specifying what action you want to perform, without performing it immediately. This lets siuba verbs decide how to execute the action, depending on whether your data is a local DataFrame or a remote table.

from siuba import _

_.cyl == 4
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4

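Notice how the output represents each step in our lazy expression as a tree: each black box (█) marks an operation, such as the equality check (==) or attribute access (.), and _ is a placeholder for the data.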
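
To make the idea of recording steps concrete, here is a minimal pure-Python sketch of a lazy expression. It is not siuba's implementation: the names Lazy, X, evaluate, and _invoke are hypothetical and exist only for this illustration. Each operation on the placeholder builds a new node instead of running, and evaluate() later walks the nodes against real data.

```python
import operator
from types import SimpleNamespace


def _invoke(func, *args):
    """Call a function that was looked up lazily (hypothetical helper)."""
    return func(*args)


class Lazy:
    """Toy lazy expression: records an operation instead of running it."""

    def __init__(self, func=None, *args):
        self.func = func  # None marks the bare placeholder
        self.args = args

    def __getattr__(self, name):
        # Only reached for attributes that are not found normally.
        if name.startswith("__"):
            raise AttributeError(name)
        return Lazy(getattr, self, name)

    def __eq__(self, other):
        return Lazy(operator.eq, self, other)

    def __call__(self, *args):
        return Lazy(_invoke, self, *args)


def evaluate(node, data):
    """Run a recorded expression against `data`."""
    if not isinstance(node, Lazy):
        return node  # a plain value, such as 4
    if node.func is None:
        return data  # the placeholder stands for the data itself
    return node.func(*[evaluate(arg, data) for arg in node.args])


X = Lazy()  # plays the role of _ in this sketch
row = SimpleNamespace(cyl=4, model="mazda")

expr = X.cyl == 4                  # nothing is computed yet
is_four = evaluate(expr, row)      # now the comparison runs
shouted = evaluate(X.model.upper(), row)
```

Because the expression is just data until evaluate() is called, whatever receives it is free to inspect it first and choose how to run it, which is the flexibility the section describes for siuba verbs.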

Method translation

You can include method calls like .isin() in a lazy expression.

from siuba import _, filter
from siuba.data import mtcars

expr = _.cyl.isin([2,4])

expr
█─'__call__'
├─█─.
│ ├─█─.
│ │ ├─_
│ │ └─'cyl'
│ └─'isin'
└─[2, 4]

When a lazy expression is used in a verb like filter(), the verb runs it against the underlying data. For a pandas DataFrame, _.cyl is a Series, so the Series.isin() method gets called.

# call our expr, which uses .isin
mtcars >> filter(expr)

# equivalent to...
mtcars >> filter(_.cyl.isin([2, 4]))

# or in pandas
mtcars[lambda d: d.cyl.isin([2, 4])]
     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...
27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2

11 rows × 11 columns

See the pandas.Series API documentation for details on all available methods.

Use in pipes

Sometimes it is helpful to use pandas DataFrame methods in addition to siuba verbs. This can be done by piping the data to _.<some_method>().

Here is an example using the siuba verb count(), with the pandas method .sort_values().

from siuba import _, count
from siuba.data import mtcars

(mtcars
    >> count(_.cyl)         # this is a siuba verb
    >> _.sort_values("n")   # this is a pandas method
)
   cyl   n
1    6   7
0    4  11
2    8  14

Here is another example, using the DataFrame .shape attribute.

# siuba pipe
mtcars >> _.shape[0]
32
# regular pandas
mtcars.shape[0]
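
For readers curious how an object on the right of >> can receive the data on its left, the sketch below shows the general Python mechanism, the reflected right-shift method __rrshift__. It is a toy under stated assumptions, not a description of siuba's internals: the Pipeable class and the sample rows are hypothetical.

```python
class Pipeable:
    """Toy wrapper: `data >> Pipeable(func)` returns func(data)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python tries the left operand's __rshift__ first. A list has
        # none, so Python falls back to this reflected method.
        return self.func(data)


rows = [{"cyl": 4}, {"cyl": 6}, {"cyl": 4}]

# One step: pipe the data into a function.
n_rows = rows >> Pipeable(len)

# Two steps: >> is left-associative, so each result feeds the next step.
n_four_cyl = (
    rows
    >> Pipeable(lambda data: [r for r in data if r["cyl"] == 4])
    >> Pipeable(len)
)
```

The same left-to-right chaining is what lets a pipe mix different kinds of steps, as in the count() and .sort_values() example above.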

Call external functions

Use call() from siuba.siu to apply an external function, such as pd.to_datetime(), to a column inside a lazy expression. In the example below, copying the column with mutate() leaves it as strings (dtype object), while wrapping pd.to_datetime in call() parses the dates.

import pandas as pd
from siuba import _, mutate
from siuba.siu import call

my_dates = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"]})

pd.to_datetime(my_dates.date)
0   2021-01-01
1   2021-01-02
Name: date, dtype: datetime64[ns]
my_dates >> mutate(parsed = _.date) >> _.parsed
0    2021-01-01
1    2021-01-02
Name: parsed, dtype: object
my_dates >> mutate(parsed = call(pd.to_datetime, _.date))
         date     parsed
0  2021-01-01 2021-01-01
1  2021-01-02 2021-01-02

Common challenges

Reserved words (_.class)

Most column names can be referred to using _.some_name syntax. However, Python reserved words like class can't be used in this way.

Use indexing (e.g. _["some_name"]) to refer to any column by name.

# bad: raises a SyntaxError
_.class

# good
_["class"]

pandas also reserves the names of its own attributes and methods (e.g. _.shape or _.mean), so a column with one of these names cannot be reached with dot syntax. This is also solved by indexing.

df = pd.DataFrame({"mean": [1,2,3]})

# bad: df.mean is the DataFrame.mean method, not the column (raises a TypeError)
df.mean + 1

# good (pandas)
df["mean"]

# good (siuba)
_["mean"]
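
As a rough way to tell in advance which names need indexing, the sketch below uses the standard library's keyword module and str.isidentifier(). The helper needs_indexing and its owner parameter are hypothetical names for this illustration, and list stands in for pandas.DataFrame so the example runs without pandas.

```python
import keyword


def needs_indexing(name, owner=None):
    """Return True when `name` cannot safely be reached with dot syntax.

    `owner` is an optional class whose own attributes would shadow
    a column of the same name.
    """
    if keyword.iskeyword(name):   # e.g. "class", "import", "lambda"
        return True
    if not name.isidentifier():   # e.g. "my col", which contains a space
        return True
    return owner is not None and hasattr(owner, name)


checks = {
    "cyl": needs_indexing("cyl"),
    "class": needs_indexing("class"),
    "my col": needs_indexing("my col"),
    # list stands in for pandas.DataFrame; list.count is a built-in method
    "count": needs_indexing("count", owner=list),
}
```

With pandas installed, passing owner=pd.DataFrame would presumably flag names such as "mean" and "shape" in the same way.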

Logical keywords: and, or, in

In Python libraries like pandas (and NumPy), the keywords and, or, and in cannot be used for element-wise logical operations. Special operators and methods are used instead.

Below is some example data, along with the operators for logical operations.

import pandas as pd

df = pd.DataFrame({"x": [2, 3, 4, 5]})
python keyword   pandas    example
or               |         (df.x < 3) | (df.x > 4)
and              &         (df.x > 3) & (df.x < 4)
in               .isin()   df.x.isin([3, 4, 5])
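
The sketch below illustrates why the operators are needed and why each comparison is wrapped in parentheses. It uses a toy Column class as a plain-Python stand-in for a pandas Series, so the class and its methods are hypothetical and only mimic the relevant behavior: Python lets a class define | and &, but or and and always ask for a single truth value.

```python
class Column:
    """Toy stand-in for a pandas Series (illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __lt__(self, other):
        return Column(v < other for v in self.values)

    def __gt__(self, other):
        return Column(v > other for v in self.values)

    def __or__(self, other):
        return Column(a or b for a, b in zip(self.values, other.values))

    def __and__(self, other):
        return Column(a and b for a, b in zip(self.values, other.values))

    def isin(self, options):
        return Column(v in options for v in self.values)

    def __bool__(self):
        # `or` and `and` need one True/False answer; pandas also raises
        # a ValueError in this situation.
        raise ValueError("truth value of a Column is ambiguous")


x = Column([2, 3, 4, 5])

either = (x < 3) | (x > 4)      # element-wise "or"
both = (x > 2) & (x < 5)        # element-wise "and"
member = x.isin([3, 4, 5])      # element-wise "in"

try:
    (x < 3) or (x > 4)          # the keyword asks for bool(x < 3)
    keyword_worked = True
except ValueError:
    keyword_worked = False

# & binds more tightly than comparisons, hence the parentheses above.
without_parens = 3 > 1 & 2 > 1  # parsed as 3 > (1 & 2) > 1
with_parens = (3 > 1) & (2 > 1)
```

Here without_parens and with_parens differ even for plain integers, which is why the table's examples parenthesize every comparison.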

Google colab overrides _

Google Colab uses a very old version of the ipykernel library, which has a bug that causes it to continuously overwrite the _ variable.

To fix this, rename the _ variable imported from siuba.

from siuba import _ as D, filter
from siuba.data import mtcars

mtcars >> filter(D.mpg > 30)
     mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
17  32.4    4  78.7   66  4.08  2.200  19.47   1   1     4     1
18  30.4    4  75.7   52  4.93  1.615  18.52   1   1     4     2
19  33.9    4  71.1   65  4.22  1.835  19.90   1   1     4     1
27  30.4    4  95.1  113  3.77  1.513  16.90   1   1     5     2