Filter rows

The filter() function keeps rows of data that meet all specified conditions.

Basics

Here is a simple filter().

from siuba import _, filter, group_by
from siuba.data import mtcars

mtcars >> filter(_.cyl == 4, _.gear == 5)
mpg cyl disp hp drat wt qsec vs am gear carb
26 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

It keeps rows where the cyl column is equal to 4, AND the gear column is equal to five.

Filters with OR conditions

In order to keep a row when one of several conditions is met, use the bar (|) operator.

mtcars >> filter((_.cyl == 4) | (_.gear == 5))
mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
7 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.78 18.60 1 1 4 2

14 rows × 11 columns

The code above keeps rows where cyl is equal to 4 OR gear is equal to 5.

Be sure to explicitly put parentheses around both sides. Otherwise, python will group the operation like _.cyl == (4 | _.gear) == 5.

Dropping NA values

Filter drops rows where conditions return False or not available (NA) values. This section will cover how to tell what is considered NA, and how to drop rows of data with NA values.

What counts as NA?

Use pandas.isna() to determine whether a value is considered to be NA.

df = pd.DataFrame({
    "x": [True, False, None],
    })

df.x
0     True
1    False
2     None
Name: x, dtype: object

Notice in the code above that the last value is None. We can confirm pandas sees this as an NA with the code below.

pd.isna(df.x)
0    False
1    False
2     True
Name: x, dtype: bool

Since None is considered an NA, its row gets removed in the filter below.

df >> filter(_.x)
x
0 True

Drop only by NA

If you want to remove only by NA values from your data, use the pandas .notna() method.

This effectively says, “keep any values of x that are not NA”.

df >> filter(_.x.notna())
x
0 True
1 False

Keeping NA values

In general, use an OR condition with _.some_col.isna() to keep rows with missing values.

df = pd.DataFrame({
    "some_number": [1.1, None],
})

df
some_number
0 1.1
1 NaN

This strategy should work almost all the time.

(df.some_number > 1) | df.some_number.isna()
0    True
1    True
Name: some_number, dtype: bool
Warning

Be careful with columns that contain None values. In plain python None or True is True, but in pandas it’s False.

None or True                 # True
pd.Series([None]) | True     # False

Grouped filters

In the example below, we keep rows where the horsepower (hp) is above the median horsepower for each cylinder group.

(mtcars
  >> group_by(_.cyl)
  >> filter(_.hp > _.hp.median())
  )

(grouped data frame)

mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
6 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.78 18.60 1 1 4 2

15 rows × 11 columns

This means that the filter performs the following operations:

  • splits the rows of data into groups, corresponding to their value of _.cyl
  • calculates median hp within each group.
  • for each row, tests where the hp is greater than the group’s median hp.