Select columns

This function lets you select specific columns of your data to keep.

There are three different building blocks that can used in a selection:

including columns using a name, slice, or position).
excluding columns by putting - beforehand.
matching based on a pattern, like all columns that start with "ttl_".

import pandas as pd

pd.set_option("display.max_rows", 5)

from siuba import _, select
from siuba.data import penguins

penguins

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
...	...	...	...	...	...	...	...	...
342	Chinstrap	Dream	50.8	19.0	210.0	4100.0	male	2009
343	Chinstrap	Dream	50.2	18.7	198.0	3775.0	female	2009

344 rows × 8 columns

Select by name or position

The simplest way to select a column to keep is to refer to it by name or position.

select(penguins, _.species, _.island, 6, -1)

	species	island	sex	year
0	Adelie	Torgersen	male	2007
1	Adelie	Torgersen	female	2007
...	...	...	...	...
342	Chinstrap	Dream	male	2009
343	Chinstrap	Dream	female	2009

344 rows × 4 columns

The code above does the following:

selects by name the species and island columns.
selects by position the index 6 and -1 columns (the last item).

Selecting by position should produce the same results as indexing a list of names.

penguins.columns[6]       # "sex"
penguins.columns[-1]      # "year"

Excluding columns

You can remove a column from the data by putting a tilde operator (~) in front of it.

penguins >> select(~_.body_mass_g, ~_.sex, ~_.year)

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm
0	Adelie	Torgersen	39.1	18.7	181.0
1	Adelie	Torgersen	39.5	17.4	186.0
...	...	...	...	...	...
342	Chinstrap	Dream	50.8	19.0	210.0
343	Chinstrap	Dream	50.2	18.7	198.0

344 rows × 5 columns

The code above keeps all columns except body_mass_g, sex, and year.

Note that the ~ operator flips the value of True and False in pandas, and is called the “invert operator”.

~pd.Series([True, False])

0    False
1     True
dtype: bool

Renaming columns

You can rename a specified column by using the equality operator (==). This operation takes the following form.

_.new_name == _.old_name

penguins >> select(_.species_name == _.species)

	species_name
0	Adelie
1	Adelie
...	...
342	Chinstrap
343	Chinstrap

344 rows × 1 columns

Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…

select(_.a == _.x, _.b == _.y)
dict(a = "x", b = "y")

both create new entries named “a” and “b”.

Select by slice

When the columns are adjacent to each other, you can select them using _["start_col":"end_col"].

penguins >> select(_.species, _["bill_length_mm":"body_mass_g"])

	species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	Adelie	39.1	18.7	181.0	3750.0
1	Adelie	39.5	17.4	186.0	3800.0
...	...	...	...	...	...
342	Chinstrap	50.8	19.0	210.0	4100.0
343	Chinstrap	50.2	18.7	198.0	3775.0

344 rows × 5 columns

You can use three methods to specify a column in a slice:

_.some_col
"some_col"
a position number

Exclusion

You can exclude slice selections using the ~ operator.

penguins >> select(~_["bill_length_mm":"body_mass_g"])

	species	island	sex	year
0	Adelie	Torgersen	male	2007
1	Adelie	Torgersen	female	2007
...	...	...	...	...
342	Chinstrap	Dream	male	2009
343	Chinstrap	Dream	female	2009

344 rows × 4 columns

Position number

Note that when position number is used to slice columns, the end position is not included in the selection.

# these are equivalent

penguins >> select(0, 1)
penguins >> select(_[0:2])

	species	island
0	Adelie	Torgersen
1	Adelie	Torgersen
...	...	...
342	Chinstrap	Dream
343	Chinstrap	Dream

344 rows × 2 columns

Select by pattern (e.g. endswith)

penguins >> select(_.species, _.endswith("mm"))

	species	bill_length_mm	bill_depth_mm	flipper_length_mm
0	Adelie	39.1	18.7	181.0
1	Adelie	39.5	17.4	186.0
...	...	...	...	...
342	Chinstrap	50.8	19.0	210.0
343	Chinstrap	50.2	18.7	198.0

344 rows × 4 columns

penguins >> select(_.contains("length"))

	bill_length_mm	flipper_length_mm
0	39.1	181.0
1	39.5	186.0
...	...	...
342	50.8	210.0
343	50.2	198.0

344 rows × 2 columns

Pandas comparison

import pandas as pd

from siuba.data import mtcars
from siuba import select, _

Click between tabs to compare code across siuba and pandas.

# keep cyl column
mtcars >> select(_.cyl)

# keep all *except* cyl column
mtcars >> select(-_.cyl)

# complex select, plus rename cyl to cylinder
mtcars >> select(_.cylinder == _.cyl, _.startswith("m"))

# keep cyl column
mtcars[["cyl"]]

# keep all *except* cyl column
mtcars.drop(["cyl"], axis=1)

# complex select, plus rename cyl to cylinder
cols = mtcars.columns
mtcars.loc[:, (cols == "cyl") | cols.str.startswith("m")] \
      .rename({"cyl": "cylinder"})