[1]:
import pandas as pd
pd.set_option("display.max_rows", 5)

Select

This function lets you select specific columns of your data to keep. Each selection may include up to three pieces…

  • specifying column(s) to include or remove

  • excluding some specified columns

  • renaming a column

The documentation below will illustrate these pieces when specifying one column at a time, specifying multiple columns, or searching columns using functions like contains.

[2]:
from siuba import _, select
from siuba.data import mtcars

mtcars
[2]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

32 rows × 11 columns

Specifying one column at a time

Specify columns by name or position

The cleanest way to specify a column is to refer to it by name. By default, referring to a column will keep it.

[3]:
mtcars >> select(_.mpg, _.cyl)
[3]:
mpg cyl
0 21.0 6
1 21.0 6
... ... ...
30 15.0 8
31 21.4 4

32 rows × 2 columns

This approach ensures that you can easily rename, or exclude it from the data (shown in following sections). However, you can also refer to a column using a string, or its 0-indexed column position.

[4]:
# two other ways to keep the same columns
mtcars >> select(0, 1)
mtcars >> select("mpg", "cyl")
[4]:
mpg cyl
0 21.0 6
1 21.0 6
... ... ...
30 15.0 8
31 21.4 4

32 rows × 2 columns

Excluding columns

You can remove a column from the data by specifying it with a minus sign (-) in front of it. This action can be performed on multiple columns.

[5]:
# simple select with exclusion
mtcars >> select(-_.mpg, -_.cyl)
[5]:
disp hp drat wt qsec vs am gear carb
0 160.0 110 3.90 2.620 16.46 0 1 4 4
1 160.0 110 3.90 2.875 17.02 0 1 4 4
... ... ... ... ... ... ... ... ... ...
30 301.0 335 3.54 3.570 14.60 0 1 5 8
31 121.0 109 4.11 2.780 18.60 1 1 4 2

32 rows × 9 columns

Renaming columns

You can rename a specified column by using the equality operator (==). This operation takes the following form.

  • _.new_name == _.old_name

[6]:
# select with rename
mtcars >> select(_.miles_per_gallon == _.mpg, _.cyl)
[6]:
miles_per_gallon cyl
0 21.0 6
1 21.0 6
... ... ...
30 15.0 8
31 21.4 4

32 rows × 2 columns

Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…

  • select(_.a == _.x, _.b == _.y)

  • dict(a = "x", b = "y")

both create new entries named “a” and “b”.

However, keep in mind that pandas DataFrame.rename method uses the opposite approach.

Select a slice of columns

When the columns you want to select are adjacent to each other, you can select them using a special slicing syntax. This syntax takes the form…

  • _["start_col":"end_col"]

where “start_col” and “end_col” can be any of the three methods to specify a column: _.some_col, “some_col”, or its position number.

[7]:
mtcars >> select(_["mpg": "hp"])
[7]:
mpg cyl disp hp
0 21.0 6 160.0 110
1 21.0 6 160.0 110
... ... ... ... ...
30 15.0 8 301.0 335
31 21.4 4 121.0 109

32 rows × 4 columns

Note that when position number is used to slice columns, the columns you specify are exactly the ones you would be from indexing the DataFrame.columns attribute.

[8]:
print(mtcars.columns[0:4])

mtcars >> select(_[0:4])
Index(['mpg', 'cyl', 'disp', 'hp'], dtype='object')
[8]:
mpg cyl disp hp
0 21.0 6 160.0 110
1 21.0 6 160.0 110
... ... ... ... ...
30 15.0 8 301.0 335
31 21.4 4 121.0 109

32 rows × 4 columns

Finally, columns selected through slicing can be excluded using the minus operator (-).

[9]:
mtcars >> select(-_["mpg": "hp"])
[9]:
drat wt qsec vs am gear carb
0 3.90 2.620 16.46 0 1 4 4
1 3.90 2.875 17.02 0 1 4 4
... ... ... ... ... ... ... ...
30 3.54 3.570 14.60 0 1 5 8
31 4.11 2.780 18.60 1 1 4 2

32 rows × 7 columns

Searching with methods like startswith or contains

The final, most flexible way to specify columns is to use any of the methods on the DataFrame.columns.str attribute. This is done by calling any of these methods in a siu expression (e.g. _.startswith('a')).

[10]:
# prints columns that contain the letter d
columns = mtcars.columns
print(columns[columns.str.contains('d')])

# uses the same method to select only these columns
mtcars >> select(_.contains('d'))
Index(['disp', 'drat'], dtype='object')
[10]:
disp drat
0 160.0 3.90
1 160.0 3.90
... ... ...
30 301.0 3.54
31 121.0 4.11

32 rows × 2 columns

As with the other approaches of specifying columns, you can also choose to exclude them.

[11]:
mtcars >> select(-_.contains('d'))
[11]:
mpg cyl hp wt qsec vs am gear carb
0 21.0 6 110 2.620 16.46 0 1 4 4
1 21.0 6 110 2.875 17.02 0 1 4 4
... ... ... ... ... ... ... ... ... ...
30 15.0 8 335 3.570 14.60 0 1 5 8
31 21.4 4 109 2.780 18.60 1 1 4 2

32 rows × 9 columns

There are many string methods that can be accessed from DataFrame.colname.str. See their pandas docs, or their docstrings (e.g. help(mtcars.cyl.str.contains)) for more information.

For convenience, the names of these methods are listed below.

[12]:
str_methods = dir(mtcars.columns.str)
str_useful = [x for x in str_methods if not x.startswith("_")]

print(str_useful)
['capitalize', 'casefold', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'fullmatch', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill']

Edit page on github here. Interactive version: Binder badge