Datetime operations (.dt)

Warning

This section is in the draft phase.

Overview

This page covers how to work with dates and times in siuba. siuba works by using pandas methods, either by calling them directly, or translating them to SQL.

import pandas as pd

from siuba import _, count, mutate
from siuba.siu import call

Using datetime methods (.dt)

siuba uses pandas methods, so can use any of the datetime methods it makes available, like .dt.month_name().

df_dates = pd.DataFrame({
    "dates": pd.to_datetime(["2021-01-02", "2021-02-03"]),
    "raw": ["2023-04-05 06:07:08", "2024-05-06 07:08:09"],
})


df_dates.dates.dt.month_name()
0     January
1    February
Name: dates, dtype: object

Use in a verb

df_dates >> count(month = _.dates.dt.month_name())
month n
0 February 1
1 January 1

You can call functions like pd.to_datetime() using siuba’s call().

res = df_dates >> mutate(parsed = call(pd.to_datetime, _.raw))

res.parsed
0   2023-04-05 06:07:08
1   2024-05-06 07:08:09
Name: parsed, dtype: datetime64[ns]

Notice that this creates a new datetime column by calling pd.to_datetime(df.raw).

Creating datetime columns

There are roughly two ways to create a datetime column in pandas:

  • pd.to_datetime() which takes a range of inputs.
  • pd.Series.astype("datetime64[ns]") method call.

The pd.to_datetime() function is flexible, and can also take a list of datetimes or a Series.

dt_index = pd.to_datetime(["2021-01-02 01:02:03", "2022-02-03 04:05:06"])
dt_index
DatetimeIndex(['2021-01-02 01:02:03', '2022-02-03 04:05:06'], dtype='datetime64[ns]', freq=None)

Note that sometimes the result is not a Series. For example, the above object is a DatetimeIndex. Generally, everything is easier after wrapping it in a pandas Series.

ser_times = pd.Series(dt_index)

On the other hand, the .astype() method is a simple way to convert a series to a datetime.

pd.Series(["2021-01-02 03:04:05"]).astype("datetime64[ns]")
0   2021-01-02 03:04:05
dtype: datetime64[ns]

The pandas time series docs discuss in exquisite detail the intricacies of different datetime objects, and how they’re created! The rest of this page will just use pandas Series to look at datetime operations.

Change granularity with floor_date()

Siuba has an experimental function called floor_date() for rounding down to a specific unit of time (e.g. the week, the day, or the hour).

import pandas as pd

from siuba.experimental.datetime import floor_date, ceil_date

dates = pd.to_datetime(["2021-01-01 01:02:03", "2021-01-08 01:02:03"])

Basics

The floor_date() functions takes two arguments:

  • A column to round down (e.g. a pandas Series).
  • A datetime unit to round to (e.g. “MS” for “Month Start”; see the pandas unit alias doc)

For example, the code below rounds dates down to the nearest week.

floor_date(dates, "W")
DatetimeIndex(['2021-01-01', '2021-01-08'], dtype='datetime64[ns]', freq=None)

Datetime unit options

There are a lot of useful time units, such as “MS” for the start of a month. Below is a table of some of the most useful ones.

human speak pandas alias
second S
minute M
hour H
day D
week W
month M
bimonth 2M
quarter Q
season
halfyear
year Y
# month start
floor_date(dates, "MS")
DatetimeIndex(['2021-01-01', '2021-01-01'], dtype='datetime64[ns]', freq=None)

Round up with ceil_date()

The counterpart function ceil_date() rounds up to the specified unit of time.

# round up to month end
ceil_date(dates, "M")
DatetimeIndex(['2021-01-31', '2021-01-31'], dtype='datetime64[ns]', freq=None)

Preserving input type

Note that pandas has multiple formats for representing datetime:

  • Timestamp: directly representing points in time.
  • Period: representing time as number of spans of some time unit from a reference point (e.g. 120 months from Jan, 1970).
x = ["2021-01-01 01:02:03", "2021-02-03 01:02:03"]

dt_index = pd.DatetimeIndex(x)
floor_date(dt_index, "W")
DatetimeIndex(['2021-01-01', '2021-02-03'], dtype='datetime64[ns]', freq=None)
# note freq="m" refers to minute frequency
per_index = pd.PeriodIndex(x, freq="m")
floor_date(per_index, "W")
PeriodIndex(['2020-12-28/2021-01-03', '2021-02-01/2021-02-07'], dtype='period[W-SUN]')

Note that the "W" stands for week.

Unit start vs unit end

Note that the units we discussed here all referred to “unit start”. This is a bit tricky, so might be explained best in an example using month units.

  • month start: “2021-02-03” becomes “2021-02-01”
  • month end: “2021-02-03” becomes “2021-01-31”

For most time units pandas has a unit end version, such as:

  • “M”: month end (vs “MS” for month start)
  • “Y”: year end (vs “YS” for year start)
  • “Q”: quarter end (vs “QS” for quarter start)

It’s a bit confusing that “M” stands for month end, but “D” stands for day start. In general, time units at the day level or finer grain only do unit start, so be careful with units!

Learning more