[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)

Filter

This function keeps rows of data that meet all specified conditions.

At its core, it follows the three following rules:

  • if each condition is true for a row, then it keeps that row.

  • it does not keep a row when a condition results in NA values.

  • it works with normal functions, including lambdas.

[2]:
from siuba import _, filter, group_by
from siuba.data import mtcars

Filter basics

This section demonstrates each of the three rules listed above.

For example, in the code below, a row must meet two conditions to be kept in the data. You can verbalize these conditions as, “cyl is equal to four, and gear is equal to five”.

[3]:
mtcars >> filter(_.cyl == 4, _.gear == 5)
[3]:
mpg cyl disp hp drat wt qsec vs am gear carb
26 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Filters with OR conditions

In order to keep a row when one of several conditions is met, use the bar (|) operator.

[4]:
mtcars >> filter((_.cyl == 4) | (_.gear == 5))
[4]:
mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
7 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.78 18.60 1 1 4 2

14 rows × 11 columns

The code above keeps rows where cyl is equal to 4 OR gear is equal to 5.

Be sure to explicitly put parentheses around both sides. Otherwise, python will group the operation like _.cyl == (4 | _.gear) == 5.

Dropping NAs

As with most subsetting in pandas, when a condition evaluates to an NA value, the row is automatically excluded. This is different from pandas indexing, where NA values produce errors.

[5]:
df = pd.DataFrame({
    "x": [True, False, None],
    })

df
[5]:
x
0 True
1 False
2 None
[6]:
df >> filter(_.x)
[6]:
x
0 True

Finally, while the simplest method is to use filter with siu expressions, both these forms work:

  • _.cyl == 4

  • lambda _: _.cyl == 4

[7]:
mtcars >> filter(lambda _: _.cyl == 4, lambda _: _.gear == 5)
[7]:
mpg cyl disp hp drat wt qsec vs am gear carb
26 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Under the hood, filter will call the lambda on the underlying data. However, keep in mind that using a lambda function prevents siuba from working with SQL, and from optimizing filter conditions over pandas DataFrames in the future.

Grouped filters

In the example below, we keep rows where the horsepower (hp) is above the median horsepower for each cylinder group. This means that the filter performas the following operations.

  • calculates the median hp for each cyl grouping.

  • for each row, based on the cyl group it belongs to, tests where the hp is greater than the median hp.

  • keeps only rows where this test passes.

[8]:
(mtcars
  >> group_by(_.cyl)
  >> filter(_.hp > _.hp.median())
  )
[8]:

(grouped data frame)

mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
6 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.78 18.60 1 1 4 2

15 rows × 11 columns

Filter with siuba vector functions

The siuba library includes a number of helper functions for operating on arrays of data (i.e. pandas Series or numpy arrays).

These functions can also do other useful calculations inside a filter, like…

  • return the row number of an entry

  • calculate a cumulative sum

  • shift all values down (lag) or up (lead) one

Keeping the two lowest horsepower rows per cylinder

For example, in the code below we filter to…

  • sort the data by ascending horse power (hp).

  • get the row number of each entry within a cyl group.

  • filter it to keep the two lowest hp cars per group.

Note that since there are 3 cyl groups (4, 6, or 8 cylinders), this returns 6 rows.

[9]:
from siuba.dply.vector import row_number, lag
from siuba import arrange

(mtcars
  >> arrange(_.hp)
  >> group_by(_.cyl)
  >> filter(row_number(_) <= 2)
  )
[9]:

(grouped data frame)

mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
... ... ... ... ... ... ... ... ... ... ... ...
21 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

6 rows × 11 columns

Comparing shifts in hp across rows

Below we sort the data by ascending horsepower (hp), then filter to keep rows where it increases by 50.

[10]:
(mtcars
  >> arrange(_.hp)
  >> filter(_.hp - lag(_.hp) > 50)
  )
[10]:
mpg cyl disp hp drat wt qsec vs am gear carb
30 15.0 8 301.0 335 3.54 3.57 14.6 0 1 5 8

Edit page on github here. Interactive version: Binder badge