from siuba import _
_.cyl == 4
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4
A siu expression is a way of specifying what action you want to perform. This allows siuba verbs to decide how to execute the action, depending on whether your data is a local DataFrame or remote table.
Notice how the output represents each step in our lazy expression, with these pieces:
- black box (█) - an operation, such as checking equality (==) or getting an attribute (.).
- underscore (_) - a placeholder for some data.
You can include method calls like .isin() in a lazy expression.
expr = _.cyl.isin([2, 4])
expr
█─'__call__'
├─█─.
│ ├─█─.
│ │ ├─_
│ │ └─'cyl'
│ └─'isin'
└─[2, 4]
When used in a verb like filter(), the expression is called on the underlying data. So when the column is a pandas Series, the Series.isin() method gets called.
from siuba import _, filter
from siuba.data import mtcars
# call our expr, which uses .isin
mtcars >> filter(expr)
# equivalent to...
mtcars >> filter(_.cyl.isin([2, 4]))
# or in pandas
mtcars[lambda d: d.cyl.isin([2, 4])]
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| 7 | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
| 31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
11 rows × 11 columns
See the pandas.Series API documentation for details on all the methods available.
Sometimes it is helpful to use pandas DataFrame methods in addition to siuba verbs. This can be done by piping the data to _.<some_method>().
Here is an example using the siuba verb count() with the pandas method .sort_values().
from siuba import _, count
from siuba.data import mtcars
(mtcars
>> count(_.cyl) # this is a siuba verb
>> _.sort_values("n") # this is a pandas method
)
| | cyl | n |
|---|---|---|
| 1 | 6 | 7 |
| 0 | 4 | 11 |
| 2 | 8 | 14 |
Here is another example, using the DataFrame .shape attribute.
import pandas as pd
from siuba import _, mutate
from siuba.siu import call
my_dates = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"]})
pd.to_datetime(my_dates.date)
0 2021-01-01
1 2021-01-02
Name: date, dtype: datetime64[ns]
0 2021-01-01
1 2021-01-02
Name: parsed, dtype: object
_.class
Most column names can be referred to using the _.some_name syntax. However, Python reserved words like class can't be used in this way.
Use indexing (e.g. _["some_name"]) to refer to any column by name.
Moreover, pandas reserves names for its methods and attributes (e.g. _.shape or _.mean). This is also solved by indexing.
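The same limitation can be seen in plain pandas, which the sketch below uses to show why indexing is the safe choice. The DataFrame and its column names are made up for illustration; bracket indexing on _ mirrors the bracket indexing on the DataFrame shown here.

```python
import keyword

import pandas as pd

# Hypothetical data with two awkward column names.
df = pd.DataFrame({"class": ["a", "b"], "shape": ["round", "square"], "x": [1, 2]})

# `class` is a Python keyword, so writing df.class (or _.class) is a syntax error.
is_keyword = keyword.iskeyword("class")
try:
    compile("df.class", "<example>", "eval")
    attribute_syntax_ok = True
except SyntaxError:
    attribute_syntax_ok = False

# `shape` is a DataFrame attribute, so dot access returns the dimensions,
# not the column named "shape".
dims = df.shape

# Indexing by name reaches both columns.
class_column = df["class"]
shape_column = df["shape"]

print(is_keyword, attribute_syntax_ok, dims)
print(shape_column.tolist())
```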
and, or, in
In Python libraries like pandas (and numpy), logical comparisons are done using special operators rather than the keywords and, or, and in.
The table below lists each Python keyword with its pandas operator and an example.
python keyword | pandas | example |
---|---|---|
or | \| | (df.x < 3) \| (df.x > 4) |
and | & | (df.x > 3) & (df.x < 4) |
in | .isin() | df.x.isin([3, 4, 5]) |
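The operators in the table can be tried on a small DataFrame. The data below is made up for illustration, and the example uses plain pandas rather than siuba. Each comparison is wrapped in parentheses because | and & bind more tightly than < and >.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"x": [1, 2, 3, 3.5, 4, 5]})

either = df[(df.x < 3) | (df.x > 4)]   # element-wise "or"
both = df[(df.x > 3) & (df.x < 4)]     # element-wise "and"
member = df[df.x.isin([3, 4, 5])]      # element-wise "in"

# The plain keyword asks for a single True/False from a whole Series,
# which pandas refuses to guess.
try:
    (df.x > 3) and (df.x < 4)
    keyword_works = True
except ValueError:
    keyword_works = False

print(either.x.tolist())
print(both.x.tolist())
print(member.x.tolist())
print(keyword_works)
```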
_
Google Colab uses a very old version of the ipykernel library, which has a bug that causes it to continuously overwrite the _ variable.
To fix this, import _ from siuba under a different name.
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
| 18 | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
| 19 | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
| 27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |