Reshape tables 📝

import pandas as pd
from siuba import _, spread, gather

costs = pd.DataFrame({
    'id': [1,2],
    'price_x': [.1, .2],
    'price_y': [.4, .5],
    'price_z': [.7, .8]
})

costs

	id	price_x	price_y	price_z
0	1	0.1	0.4	0.7
1	2	0.2	0.5	0.8

Gather data long

# selecting each variable manually
costs >> gather('measure', 'value', _.price_x, _.price_y, _.price_z)

# selecting variables using a slice
costs >> gather('measure', 'value', _["price_x":"price_z"])

# selecting by excluding id
costs >> gather('measure', 'value', -_.id)

	id	measure	value
0	1	price_x	0.1
1	2	price_x	0.2
...	...	...	...
4	1	price_z	0.7
5	2	price_z	0.8

6 rows × 3 columns

Spread data wide

(costs
    >> gather('measure', 'value', -_.id)
    >> spread('measure', 'value')
)

	id	price_x	price_y	price_z
0	1	0.1	0.4	0.7
1	2	0.2	0.5	0.8

Experimental pivot_wider, pivot_longer

Warning

These functions are thoroughly tested, but currently experimental. The sections below describe their basic use. See their reference page for more examples.

from siuba.experimental.pivot import pivot_wider, pivot_longer
from siuba import Fx

`pivot_wider()`

If there would be multiple entries per cell in the spread wide data, then the spread() function raises an error.

This is shown below, where there are duplicate entries where id=1 and measure="a".

df = pd.DataFrame({
    "id": [1, 1, 2],
    "measure": ["a", "a", "b"],
    "value": [8, 9, 10]
})

df >> spread("measure", "value")

ValueError: Index contains duplicate entries, cannot reshape

Use the pivot_wider() function to deal with this situation.

df >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn=tuple)

	id	a	b
0	1	(8, 9)	NaN
1	2	NaN	(10,)

Notice that the top-left entry is a list of two values, (8, 9). The values_fn argument is able to reduce those values down to one.

For example, by taking the mean.

df >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn="mean")

	id	a	b
0	1	8.5	NaN
1	2	NaN	10.0

Note that values_fn= may be a lambda function.

Note

If you’d rather use pandas’s pivot_table() method, siuba can pipe to it using this syntax:

(df
    >> _.pivot_table(...)
)

Where you would replace ... with your arguments. See flexible piping for more details.

`pivot_longer()`

Use pivot_longer() to stack columns of data, turning them into rows.

wide = pd.DataFrame({"id": [1, 2], "x": [5, 6], "y": [7, 8]})

wide

	id	x	y
0	1	5	7
1	2	6	8

Notice that this data has two columns (x and y) that you might want to stack on top of each other.

df >> pivot_longer(~_.id, names_to="variable", values_to="number")

	id	variable	number
0	1	measure	a
0	1	value	8
...	...	...	...
2	2	measure	b
2	2	value	10

6 rows × 3 columns

Note that in the code above, we do the following:

Use ~_.id selects every columns except id for stacking.
Set column names to be stored in a new variable column.
Set column values to be stored in a new value column.

Specifying aggregations

Currently, the most general way to specify an aggregation in functions like pivot_wider() is to use a function in siuba.ops. These functions all correspond to pandas methods, and many have translations to SQL.

from siuba import ops

df >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn=ops.mean)

	id	a	b
0	1	8.5	NaN
1	2	NaN	10.0

This allows your code to also work with SQL:

from siuba import _, tbl, show_query
from sqlalchemy import create_engine
from siuba.experimental.pivot import pivot_wider

engine = create_engine("duckdb:///:memory:")

q = (tbl(engine, "df", df)
    >> pivot_wider(_.id, names_from=_.measure, values_from=_.value, values_fn=ops.mean)
    >> show_query()
)

SELECT df_1.id AS id, avg(CASE WHEN (df_1.measure = 'a') THEN df_1.value END) AS a, avg(CASE WHEN (df_1.measure = 'b') THEN df_1.value END) AS b 
FROM df AS df_1 GROUP BY df_1.id

/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/duckdb_engine/__init__.py:229: DuckDBEngineWarning: duckdb-engine doesn't yet support reflection on indices
  warnings.warn(