This function lets you select specific columns of your data to keep.
There are three different building blocks that can used in a selection:
- including columns using a name, slice, or position).
- excluding columns by putting
-
beforehand.
- matching based on a pattern, like all columns that start with
"ttl_"
.
import pandas as pd
pd.set_option("display.max_rows", 5)
from siuba import _, select
from siuba.data import penguins
penguins
|
species |
island |
bill_length_mm |
bill_depth_mm |
flipper_length_mm |
body_mass_g |
sex |
year |
0 |
Adelie |
Torgersen |
39.1 |
18.7 |
181.0 |
3750.0 |
male |
2007 |
1 |
Adelie |
Torgersen |
39.5 |
17.4 |
186.0 |
3800.0 |
female |
2007 |
... |
... |
... |
... |
... |
... |
... |
... |
... |
342 |
Chinstrap |
Dream |
50.8 |
19.0 |
210.0 |
4100.0 |
male |
2009 |
343 |
Chinstrap |
Dream |
50.2 |
18.7 |
198.0 |
3775.0 |
female |
2009 |
344 rows × 8 columns
Select by name or position
The simplest way to select a column to keep is to refer to it by name or position.
select(penguins, _.species, _.island, 6, -1)
|
species |
island |
sex |
year |
0 |
Adelie |
Torgersen |
male |
2007 |
1 |
Adelie |
Torgersen |
female |
2007 |
... |
... |
... |
... |
... |
342 |
Chinstrap |
Dream |
male |
2009 |
343 |
Chinstrap |
Dream |
female |
2009 |
344 rows × 4 columns
The code above does the following:
- selects by name the
species
and island
columns.
- selects by position the index
6
and -1
columns (the last item).
Selecting by position should produce the same results as indexing a list of names.
penguins.columns[6] # "sex"
penguins.columns[-1] # "year"
Excluding columns
You can remove a column from the data by putting a tilde operator (~
) in front of it.
penguins >> select(~_.body_mass_g, ~_.sex, ~_.year)
|
species |
island |
bill_length_mm |
bill_depth_mm |
flipper_length_mm |
0 |
Adelie |
Torgersen |
39.1 |
18.7 |
181.0 |
1 |
Adelie |
Torgersen |
39.5 |
17.4 |
186.0 |
... |
... |
... |
... |
... |
... |
342 |
Chinstrap |
Dream |
50.8 |
19.0 |
210.0 |
343 |
Chinstrap |
Dream |
50.2 |
18.7 |
198.0 |
344 rows × 5 columns
The code above keeps all columns except body_mass_g
, sex
, and year
.
Note that the ~
operator flips the value of True and False in pandas, and is called the “invert operator”.
~pd.Series([True, False])
0 False
1 True
dtype: bool
Renaming columns
You can rename a specified column by using the equality operator (==
). This operation takes the following form.
penguins >> select(_.species_name == _.species)
|
species_name |
0 |
Adelie |
1 |
Adelie |
... |
... |
342 |
Chinstrap |
343 |
Chinstrap |
344 rows × 1 columns
Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…
select(_.a == _.x, _.b == _.y)
dict(a = "x", b = "y")
both create new entries named “a” and “b”.
Select by slice
When the columns are adjacent to each other, you can select them using _["start_col":"end_col"]
.
penguins >> select(_.species, _["bill_length_mm":"body_mass_g"])
|
species |
bill_length_mm |
bill_depth_mm |
flipper_length_mm |
body_mass_g |
0 |
Adelie |
39.1 |
18.7 |
181.0 |
3750.0 |
1 |
Adelie |
39.5 |
17.4 |
186.0 |
3800.0 |
... |
... |
... |
... |
... |
... |
342 |
Chinstrap |
50.8 |
19.0 |
210.0 |
4100.0 |
343 |
Chinstrap |
50.2 |
18.7 |
198.0 |
3775.0 |
344 rows × 5 columns
You can use three methods to specify a column in a slice:
_.some_col
"some_col"
- a position number
Exclusion
You can exclude slice selections using the ~
operator.
penguins >> select(~_["bill_length_mm":"body_mass_g"])
|
species |
island |
sex |
year |
0 |
Adelie |
Torgersen |
male |
2007 |
1 |
Adelie |
Torgersen |
female |
2007 |
... |
... |
... |
... |
... |
342 |
Chinstrap |
Dream |
male |
2009 |
343 |
Chinstrap |
Dream |
female |
2009 |
344 rows × 4 columns
Position number
Note that when position number is used to slice columns, the end position is not included in the selection.
# these are equivalent
penguins >> select(0, 1)
penguins >> select(_[0:2])
|
species |
island |
0 |
Adelie |
Torgersen |
1 |
Adelie |
Torgersen |
... |
... |
... |
342 |
Chinstrap |
Dream |
343 |
Chinstrap |
Dream |
344 rows × 2 columns
Select by pattern (e.g. endswith)
penguins >> select(_.species, _.endswith("mm"))
|
species |
bill_length_mm |
bill_depth_mm |
flipper_length_mm |
0 |
Adelie |
39.1 |
18.7 |
181.0 |
1 |
Adelie |
39.5 |
17.4 |
186.0 |
... |
... |
... |
... |
... |
342 |
Chinstrap |
50.8 |
19.0 |
210.0 |
343 |
Chinstrap |
50.2 |
18.7 |
198.0 |
344 rows × 4 columns
penguins >> select(_.contains("length"))
|
bill_length_mm |
flipper_length_mm |
0 |
39.1 |
181.0 |
1 |
39.5 |
186.0 |
... |
... |
... |
342 |
50.8 |
210.0 |
343 |
50.2 |
198.0 |
344 rows × 2 columns
Pandas comparison
import pandas as pd
from siuba.data import mtcars
from siuba import select, _
Click between tabs to compare code across siuba and pandas.
# keep cyl column
mtcars >> select(_.cyl)
# keep all *except* cyl column
mtcars >> select(-_.cyl)
# complex select, plus rename cyl to cylinder
mtcars >> select(_.cylinder == _.cyl, _.startswith("m"))
# keep cyl column
mtcars[["cyl"]]
# keep all *except* cyl column
mtcars.drop(["cyl"], axis=1)
# complex select, plus rename cyl to cylinder
cols = mtcars.columns
mtcars.loc[:, (cols == "cyl") | cols.str.startswith("m")] \
.rename({"cyl": "cylinder"})