Select columns

This function lets you select specific columns of your data to keep.

There are three different building blocks that can used in a selection:

import pandas as pd

pd.set_option("display.max_rows", 5)

from siuba import _, select
from siuba.data import penguins

penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
... ... ... ... ... ... ... ... ...
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

344 rows × 8 columns

Select by name or position

The simplest way to select a column to keep is to refer to it by name or position.

select(penguins, _.species, _.island, 6, -1)
species island sex year
0 Adelie Torgersen male 2007
1 Adelie Torgersen female 2007
... ... ... ... ...
342 Chinstrap Dream male 2009
343 Chinstrap Dream female 2009

344 rows × 4 columns

The code above does the following:

  • selects by name the species and island columns.
  • selects by position the index 6 and -1 columns (the last item).

Selecting by position should produce the same results as indexing a list of names.

penguins.columns[6]       # "sex"
penguins.columns[-1]      # "year"

Excluding columns

You can remove a column from the data by putting a tilde operator (~) in front of it.

penguins >> select(~_.body_mass_g, ~_.sex, ~_.year)
species island bill_length_mm bill_depth_mm flipper_length_mm
0 Adelie Torgersen 39.1 18.7 181.0
1 Adelie Torgersen 39.5 17.4 186.0
... ... ... ... ... ...
342 Chinstrap Dream 50.8 19.0 210.0
343 Chinstrap Dream 50.2 18.7 198.0

344 rows × 5 columns

The code above keeps all columns except body_mass_g, sex, and year.

Note that the ~ operator flips the value of True and False in pandas, and is called the “invert operator”.

~pd.Series([True, False])
0    False
1     True
dtype: bool

Renaming columns

You can rename a specified column by using the equality operator (==). This operation takes the following form.

  • _.new_name == _.old_name
penguins >> select(_.species_name == _.species)
species_name
0 Adelie
1 Adelie
... ...
342 Chinstrap
343 Chinstrap

344 rows × 1 columns

Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…

  • select(_.a == _.x, _.b == _.y)
  • dict(a = "x", b = "y")

both create new entries named “a” and “b”.

Select by slice

When the columns are adjacent to each other, you can select them using _["start_col":"end_col"].

penguins >> select(_.species, _["bill_length_mm":"body_mass_g"])
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 Adelie 39.1 18.7 181.0 3750.0
1 Adelie 39.5 17.4 186.0 3800.0
... ... ... ... ... ...
342 Chinstrap 50.8 19.0 210.0 4100.0
343 Chinstrap 50.2 18.7 198.0 3775.0

344 rows × 5 columns

You can use three methods to specify a column in a slice:

  • _.some_col
  • "some_col"
  • a position number

Exclusion

You can exclude slice selections using the ~ operator.

penguins >> select(~_["bill_length_mm":"body_mass_g"])
species island sex year
0 Adelie Torgersen male 2007
1 Adelie Torgersen female 2007
... ... ... ... ...
342 Chinstrap Dream male 2009
343 Chinstrap Dream female 2009

344 rows × 4 columns

Position number

Note that when position number is used to slice columns, the end position is not included in the selection.

# these are equivalent

penguins >> select(0, 1)
penguins >> select(_[0:2])
species island
0 Adelie Torgersen
1 Adelie Torgersen
... ... ...
342 Chinstrap Dream
343 Chinstrap Dream

344 rows × 2 columns

Select by pattern (e.g. endswith)

penguins >> select(_.species, _.endswith("mm"))
species bill_length_mm bill_depth_mm flipper_length_mm
0 Adelie 39.1 18.7 181.0
1 Adelie 39.5 17.4 186.0
... ... ... ... ...
342 Chinstrap 50.8 19.0 210.0
343 Chinstrap 50.2 18.7 198.0

344 rows × 4 columns

penguins >> select(_.contains("length"))
bill_length_mm flipper_length_mm
0 39.1 181.0
1 39.5 186.0
... ... ...
342 50.8 210.0
343 50.2 198.0

344 rows × 2 columns

Pandas comparison

import pandas as pd

from siuba.data import mtcars
from siuba import select, _

Click between tabs to compare code across siuba and pandas.

# keep cyl column
mtcars >> select(_.cyl)

# keep all *except* cyl column
mtcars >> select(-_.cyl)

# complex select, plus rename cyl to cylinder
mtcars >> select(_.cylinder == _.cyl, _.startswith("m"))
# keep cyl column
mtcars[["cyl"]]

# keep all *except* cyl column
mtcars.drop(["cyl"], axis=1)

# complex select, plus rename cyl to cylinder
cols = mtcars.columns
mtcars.loc[:, (cols == "cyl") | cols.str.startswith("m")] \
      .rename({"cyl": "cylinder"})