from siuba import (
_,
inner_join, left_join, full_join, right_join,
semi_join, anti_join
)
= pd.DataFrame({'id': [1,2,3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
lhs = pd.DataFrame({'id': [1,2,4], 'val': ['rhs.1', 'rhs.2', 'rhs.3']}) rhs
Join tables
Joins allow you to combine data from two tables, using two steps:
- determining matches in values between a set of specified columns.
- merging columns from matching rows into a new table.
The number of ways joins can be used is surprisingly deep, and provides an important foundation for most data analyses!
lhs
id | val | |
---|---|---|
0 | 1 | lhs.1 |
1 | 2 | lhs.2 |
2 | 3 | lhs.3 |
rhs
id | val | |
---|---|---|
0 | 1 | rhs.1 |
1 | 2 | rhs.2 |
2 | 4 | rhs.3 |
Syntax
Like other siuba verbs, joins can be used in two ways: by directly passing both data as arguments, or by piping.
# directly passing data
="id")
inner_join(lhs, rhs, on
# piping
>> inner_join(_, rhs, on="id") lhs
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | 2 | lhs.2 | rhs.2 |
Note that when used in a pipe, the first argument to the join must be _
, to represent the data.
How joins work
This page will use graphs to illustrate how joins work. They show the values being joined on for the left-hand side on the rows, and those for the right-hand side on the columns.
For example, consider the inner join from the last section:
="id") inner_join(lhs, rhs, on
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | 2 | lhs.2 | rhs.2 |
Here is a graph for this inner join:
Notice that 1, 2, 3
come from the lhs.id
column, and 1, 2, 4
from rhs.id
.
There are two steps to any join:
- Determine matches. The colored squares mark new rows that will be created. These are determined by looking at every pair of values from the left- and right-hand sides, and checking whether they’re equal. This means that there are 9 checks in total (one for each square in the grid).
- Merge rows. For each match, a row is created that has columns from both tables.
The next two sections on this page discuss different types of joins:
- Mutating joins: what should happen if a row or column doesn’t have any matches?
- Filtering joins: using matching rules without merging to filter the left-hand table.
The remaining four sections discuss important situations related to matching: duplicates, missing values, using multiple columns, and expressions.
Mutating joins
Up to this point we’ve looked at an inner join. We’ll call this a mutating join, because it puts the columns from both tables into the final result. In this section, we’ll look at two more important joins: left join and full join.
The diagram below shows these three mutating joins.
Left and full joins define a new matching rule for rows of data that don’t have any matches:
Left join. Include rows from the left-hand table that didn’t have any matches. This is done by matching left-hand rows with no matches, to a “dummy” right-hand row of all missing values. This is signified in the graph by the circled NA column.
Full join: Similar to left join, but extended to include unmatched rows from both tables. Notice that the graph for full join has two “dummy circled NA” entries, one to match rows, and one to match columns.
Inner join
="id") inner_join(lhs, rhs, on
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | 2 | lhs.2 | rhs.2 |
Outer joins
="id") left_join(lhs, rhs, on
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | 2 | lhs.2 | rhs.2 |
2 | 3 | lhs.3 | NaN |
="id") full_join(lhs, rhs, on
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | 2 | lhs.2 | rhs.2 |
2 | 3 | lhs.3 | NaN |
3 | 4 | NaN | rhs.3 |
Filtering joins
Filtering joins use some of the same logic to determine matches as a join, but only to filter (remove) rows from the left-hand table.
We’ll use the data from the previous section to look at filtering joins.
lhs
id | val | |
---|---|---|
0 | 1 | lhs.1 |
1 | 2 | lhs.2 |
2 | 3 | lhs.3 |
rhs
id | val | |
---|---|---|
0 | 1 | rhs.1 |
1 | 2 | rhs.2 |
2 | 4 | rhs.3 |
Use semi_join()
to keep rows in the left-hand table that would have a match:
="id") semi_join(lhs, rhs, on
id | val | |
---|---|---|
0 | 1 | lhs.1 |
1 | 2 | lhs.2 |
Use anti_join()
to keep the rows that a semi_join()
would drop—the rows with no matches:
="id") anti_join(lhs, rhs, on
id | val | |
---|---|---|
2 | 3 | lhs.3 |
Duplicate matches
Mutating joins create a new row for each match between value pairs. This means that a join can duplicate rows from the left- or right-hand data.
For example, consider the situation where the left- and right-hand data both have the value 2
multiple times in the columns they’re joining on.
This is shown in the graph above—for an inner join, where both tables contain the value 2
twice. In this case, it results in 4 matches, which will each produce a row in the final result.
Here is the code corresponding to the graph above.
import pandas as pd
= pd.DataFrame({
lhs_dupes "id": [1, 2, 2, 3],
"val": ["lhs.1", "lhs.2", "lhs.3", "lhs.4"]
})
= pd.DataFrame({
rhs_dupes "id": [1, 2, 2, 4],
"val": ["rhs.1", "rhs.2", "rhs.3", "rhs.4"]
})
lhs_dupes
id | val | |
---|---|---|
0 | 1 | lhs.1 |
1 | 2 | lhs.2 |
2 | 2 | lhs.3 |
3 | 3 | lhs.4 |
rhs_dupes
id | val | |
---|---|---|
0 | 1 | rhs.1 |
1 | 2 | rhs.2 |
2 | 2 | rhs.3 |
3 | 4 | rhs.4 |
="id") inner_join(lhs_dupes, rhs_dupes, on
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | 2 | lhs.2 | rhs.2 |
2 | 2 | lhs.2 | rhs.3 |
3 | 2 | lhs.3 | rhs.2 |
4 | 2 | lhs.3 | rhs.3 |
NA handling
By default, missing values are matched to each other in joins.
This is shown in the left join below, where each table has a single NA
value.
Note the purple mark indicating a match between the NA
values. Importantly, literal NA
values here are different from the circled dummy NA (top right), which indicates the special matching rule for left joins.
Here is the code for the example:
import pandas as pd
= pd.DataFrame({"id": [1, pd.NA, 3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
lhs_na = pd.DataFrame({"id": [1, pd.NA, 2], 'val': ['rhs.1', 'rhs.2', 'rhs.3']})
rhs_na ="id") left_join(lhs_na, rhs_na, on
id | val_x | val_y | |
---|---|---|---|
0 | 1 | lhs.1 | rhs.1 |
1 | <NA> | lhs.2 | rhs.2 |
2 | 3 | lhs.3 | NaN |
Match on multiple columns
Joins can be performed by comparing multiple pairs of columns to each other— such as the source
and id
columns in the data below.
= pd.DataFrame({
lhs_multi "source": ["a", "a", "b"],
"id": [1, 2, 1],
"val": ["lhs.1", "lhs.2", "lhs.3"]
})
= pd.DataFrame({
rhs_multi "source": ["a", "b", "c"],
"id": [1, 1, 1],
"val": ["rhs.1", "rhs.2", "rhs.3"]
})
lhs_multi
source | id | val | |
---|---|---|---|
0 | a | 1 | lhs.1 |
1 | a | 2 | lhs.2 |
2 | b | 1 | lhs.3 |
rhs_multi
source | id | val | |
---|---|---|---|
0 | a | 1 | rhs.1 |
1 | b | 1 | rhs.2 |
2 | c | 1 | rhs.3 |
In this case, a match happens when source
matches AND id
matches—as shown by the composite values (source, id)
in the graph below.
Using a list of columns
Use the on=
argument with a list of columns, in order to join on multiple columns.
=["source", "id"]) inner_join(lhs_multi, rhs_multi, on
source | id | val_x | val_y | |
---|---|---|---|---|
0 | a | 1 | lhs.1 | rhs.1 |
1 | b | 1 | lhs.3 | rhs.2 |
Using a dictionary of columns
Use the on=
argument with a dictionary to join pairs of columns with different names.
from siuba import rename
= rename(rhs_multi, some_source = "source")
rhs_multi2
"source": "some_source", "id": "id"}) inner_join(lhs_multi, rhs_multi2, {
source | id | val_x | some_source | val_y | |
---|---|---|---|---|---|
0 | a | 1 | lhs.1 | a | rhs.1 |
1 | b | 1 | lhs.3 | b | rhs.2 |
Match on expressions
Some siuba backends—like those that execute SQL—can join by expressions for determining matches.
For example, the graph below shows an inner join, where a match occurs when the left-hand value is less than or equal to the right-hand value.
Notice that the value 1
on the left-hand side matches everything on the right-hand side (top row).
SQL backend sql_on
argument.
Here is a full example, using sqlite:
from sqlalchemy import create_engine
from siuba.sql import LazyTbl
= create_engine("sqlite:///:memory:")
engine
= pd.DataFrame({'id': [1,2,3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
lhs = pd.DataFrame({'id': [1,2,4], 'val': ['rhs.1', 'rhs.2', 'rhs.3']})
rhs
"lhs", engine, index=False)
lhs.to_sql("rhs", engine, index=False)
rhs.to_sql(
= LazyTbl(engine, "lhs")
tbl_sql_lhs = LazyTbl(engine, "rhs")
tbl_sql_rhs
inner_join(
tbl_sql_lhs,
tbl_sql_rhs,= lambda lhs, rhs: lhs.val <= rhs.val
sql_on )
# Source: lazy query # DB Conn: Engine(sqlite:///:memory:) # Preview:
id_x | val_x | id_y | val_y | |
---|---|---|---|---|
0 | 1 | lhs.1 | 1 | rhs.1 |
1 | 1 | lhs.1 | 2 | rhs.2 |
2 | 1 | lhs.1 | 4 | rhs.3 |
3 | 2 | lhs.2 | 1 | rhs.1 |
4 | 2 | lhs.2 | 2 | rhs.2 |
# .. may have more rows
Note that the function passed to sql_on
takes sqlalchemy columns for the left- and right-hand sides, so currently limited to what can be done in sqlalchemy.