Reshape tables πŸ“

import pandas as pd
from siuba import _, spread, gather

costs = pd.DataFrame({
    'id': [1,2],
    'price_x': [.1, .2],
    'price_y': [.4, .5],
    'price_z': [.7, .8]
})

costs
id price_x price_y price_z
0 1 0.1 0.4 0.7
1 2 0.2 0.5 0.8

Gather data long

# selecting each variable manually
costs >> gather('measure', 'value', _.price_x, _.price_y, _.price_z)

# selecting variables using a slice
costs >> gather('measure', 'value', _["price_x":"price_z"])

# selecting by excluding id
costs >> gather('measure', 'value', -_.id)
id measure value
0 1 price_x 0.1
1 2 price_x 0.2
... ... ... ...
4 1 price_z 0.7
5 2 price_z 0.8

6 rows Γ— 3 columns

Spread data wide

(costs
    >> gather('measure', 'value', -_.id)
    >> spread('measure', 'value')
)
id price_x price_y price_z
0 1 0.1 0.4 0.7
1 2 0.2 0.5 0.8

Experimental pivot_wider, pivot_longer

Warning

These functions are thoroughly tested, but currently experimental. The sections below describe their basic use. See their reference page for more examples.

from siuba.experimental.pivot import pivot_wider, pivot_longer
from siuba import Fx

pivot_wider()

If there would be multiple entries per cell in the spread wide data, then the spread() function raises an error.

This is shown below, where there are duplicate entries where id=1 and measure="a".

df = pd.DataFrame({
    "id": [1, 1, 2],
    "measure": ["a", "a", "b"],
    "value": [8, 9, 10]
})

df >> spread("measure", "value")
ValueError: Index contains duplicate entries, cannot reshape

Use the pivot_wider() function to deal with this situation.

df >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn=tuple)
id a b
0 1 (8, 9) NaN
1 2 NaN (10,)

Notice that the top-left entry is a list of two values, (8, 9). The values_fn argument is able to reduce those values down to one.

For example, by taking the mean.

df >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn="mean")
id a b
0 1 8.5 NaN
1 2 NaN 10.0

Note that values_fn= may be a lambda function.

Note

If you’d rather use pandas’s pivot_table() method, siuba can pipe to it using this syntax:

(df
    >> _.pivot_table(...)
)

Where you would replace ... with your arguments. See flexible piping for more details.

pivot_longer()

Use pivot_longer() to stack columns of data, turning them into rows.

wide = pd.DataFrame({"id": [1, 2], "x": [5, 6], "y": [7, 8]})

wide
id x y
0 1 5 7
1 2 6 8

Notice that this data has two columns (x and y) that you might want to stack on top of each other.

df >> pivot_longer(~_.id, names_to="variable", values_to="number")
id variable number
0 1 measure a
0 1 value 8
... ... ... ...
2 2 measure b
2 2 value 10

6 rows Γ— 3 columns

Note that in the code above, we do the following:

  • Use ~_.id selects every columns except id for stacking.
  • Set column names to be stored in a new variable column.
  • Set column values to be stored in a new value column.

Specifying aggregations

Currently, the most general way to specify an aggregation in functions like pivot_wider() is to use a function in siuba.ops. These functions all correspond to pandas methods, and many have translations to SQL.

from siuba import ops

df >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn=ops.mean)
id a b
0 1 8.5 NaN
1 2 NaN 10.0

This allows your code to also work with SQL:

from siuba import _, tbl, show_query
from sqlalchemy import create_engine
from siuba.experimental.pivot import pivot_wider

engine = create_engine("duckdb:///:memory:")

q = (tbl(engine, "df", df)
    >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn=ops.mean)
    >> show_query()
)
SELECT df_1.id AS id, avg(CASE WHEN (df_1.measure = 'a') THEN df_1.value END) AS a, avg(CASE WHEN (df_1.measure = 'b') THEN df_1.value END) AS b 
FROM df AS df_1 GROUP BY df_1.id
/opt/hostedtoolcache/Python/3.9.15/x64/lib/python3.9/site-packages/duckdb_engine/__init__.py:229: DuckDBEngineWarning: duckdb-engine doesn't yet support reflection on indices
  warnings.warn(