```
from siuba import (
_,
inner_join, left_join, full_join, right_join,
semi_join, anti_join
)
= pd.DataFrame({'id': [1,2,3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
lhs = pd.DataFrame({'id': [1,2,4], 'val': ['rhs.1', 'rhs.2', 'rhs.3']}) rhs
```

# Join tables

Joins allow you to combine data from two tables, using two steps:

**determining matches**in values between a set of specified columns.**merging columns**from matching rows into a new table.

The number of ways joins can be used is surprisingly deep, and provides an important foundation for most data analyses!

` lhs`

id | val | |
---|---|---|

0 | 1 | lhs.1 |

1 | 2 | lhs.2 |

2 | 3 | lhs.3 |

` rhs`

id | val | |
---|---|---|

0 | 1 | rhs.1 |

1 | 2 | rhs.2 |

2 | 4 | rhs.3 |

## Syntax

Like other siuba verbs, joins can be used in two ways: by directly passing both data as arguments, or by piping.

```
# directly passing data
="id")
inner_join(lhs, rhs, on
# piping
>> inner_join(_, rhs, on="id") lhs
```

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | 2 | lhs.2 | rhs.2 |

Note that when used in a pipe, the first argument to the join **must be _**, to represent the data.

## How joins work

This page will use graphs to illustrate how joins work. They show the values being joined on for the left-hand side on the rows, and those for the right-hand side on the columns.

For example, consider the inner join from the last section:

`="id") inner_join(lhs, rhs, on`

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | 2 | lhs.2 | rhs.2 |

Here is a graph for this inner join:

Notice that `1, 2, 3`

come from the `lhs.id`

column, and `1, 2, 4`

from `rhs.id`

.

There are two steps to any join:

**Determine matches**. The colored squares mark new rows that will be created. These are determined by looking at every pair of values from the left- and right-hand sides, and checking whether theyâ€™re equal. This means that there are 9 checks in total (one for each square in the grid).**Merge rows**. For each match, a row is created that has columns from both tables.

The next two sections on this page discuss different types of joins:

- Mutating joins: what should happen if a row or column doesnâ€™t have any matches?
- Filtering joins: using matching rules without merging to filter the left-hand table.

The remaining four sections discuss important situations related to matching: duplicates, missing values, using multiple columns, and expressions.

## Mutating joins

Up to this point weâ€™ve looked at an inner join. Weâ€™ll call this a mutating join, because it puts the columns from both tables into the final result. In this section, weâ€™ll look at two more important joins: left join and full join.

The diagram below shows these three mutating joins.

Left and full joins define a new matching rule for rows of data that donâ€™t have any matches:

**Left join**. Include rows from the left-hand table that didnâ€™t have any matches. This is done by matching left-hand rows with no matches, to a â€śdummyâ€ť right-hand row of all missing values. This is signified in the graph by the circled NA column.

**Full join**: Similar to left join, but extended to include unmatched rows from both tables. Notice that the graph for full join has two â€śdummy circled NAâ€ť entries, one to match rows, and one to match columns.

### Inner join

`="id") inner_join(lhs, rhs, on`

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | 2 | lhs.2 | rhs.2 |

### Outer joins

`="id") left_join(lhs, rhs, on`

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | 2 | lhs.2 | rhs.2 |

2 | 3 | lhs.3 | NaN |

`="id") full_join(lhs, rhs, on`

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | 2 | lhs.2 | rhs.2 |

2 | 3 | lhs.3 | NaN |

3 | 4 | NaN | rhs.3 |

## Filtering joins

Filtering joins use some of the same logic to determine matches as a join, but only to filter (remove) rows from the left-hand table.

Weâ€™ll use the data from the previous section to look at filtering joins.

` lhs`

id | val | |
---|---|---|

0 | 1 | lhs.1 |

1 | 2 | lhs.2 |

2 | 3 | lhs.3 |

` rhs`

id | val | |
---|---|---|

0 | 1 | rhs.1 |

1 | 2 | rhs.2 |

2 | 4 | rhs.3 |

Use `semi_join()`

to keep rows in the left-hand table that would have a match:

`="id") semi_join(lhs, rhs, on`

id | val | |
---|---|---|

0 | 1 | lhs.1 |

1 | 2 | lhs.2 |

Use `anti_join()`

to keep the rows that a `semi_join()`

would dropâ€”the rows with no matches:

`="id") anti_join(lhs, rhs, on`

id | val | |
---|---|---|

2 | 3 | lhs.3 |

## Duplicate matches

Mutating joins create a new row for each match between value pairs. This means that a join can duplicate rows from the left- or right-hand data.

For example, consider the situation where the left- and right-hand data both have the value `2`

multiple times in the columns theyâ€™re joining on.

This is shown in the graph aboveâ€”for an inner join, where both tables contain the value `2`

twice. In this case, it results in 4 matches, which will each produce a row in the final result.

Here is the code corresponding to the graph above.

```
import pandas as pd
= pd.DataFrame({
lhs_dupes "id": [1, 2, 2, 3],
"val": ["lhs.1", "lhs.2", "lhs.3", "lhs.4"]
})
= pd.DataFrame({
rhs_dupes "id": [1, 2, 2, 4],
"val": ["rhs.1", "rhs.2", "rhs.3", "rhs.4"]
})
```

` lhs_dupes`

id | val | |
---|---|---|

0 | 1 | lhs.1 |

1 | 2 | lhs.2 |

2 | 2 | lhs.3 |

3 | 3 | lhs.4 |

` rhs_dupes`

id | val | |
---|---|---|

0 | 1 | rhs.1 |

1 | 2 | rhs.2 |

2 | 2 | rhs.3 |

3 | 4 | rhs.4 |

`="id") inner_join(lhs_dupes, rhs_dupes, on`

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | 2 | lhs.2 | rhs.2 |

2 | 2 | lhs.2 | rhs.3 |

3 | 2 | lhs.3 | rhs.2 |

4 | 2 | lhs.3 | rhs.3 |

## NA handling

By default, missing values are matched to each other in joins.

This is shown in the left join below, where each table has a single `NA`

value.

Note the purple mark indicating a match between the `NA`

values. Importantly, literal `NA`

values here are different from the circled dummy NA (top right), which indicates the special matching rule for left joins.

Here is the code for the example:

```
import pandas as pd
= pd.DataFrame({"id": [1, pd.NA, 3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
lhs_na = pd.DataFrame({"id": [1, pd.NA, 2], 'val': ['rhs.1', 'rhs.2', 'rhs.3']})
rhs_na ="id") left_join(lhs_na, rhs_na, on
```

id | val_x | val_y | |
---|---|---|---|

0 | 1 | lhs.1 | rhs.1 |

1 | <NA> | lhs.2 | rhs.2 |

2 | 3 | lhs.3 | NaN |

## Match on multiple columns

Joins can be performed by comparing multiple pairs of columns to each otherâ€” such as the `source`

and `id`

columns in the data below.

```
= pd.DataFrame({
lhs_multi "source": ["a", "a", "b"],
"id": [1, 2, 1],
"val": ["lhs.1", "lhs.2", "lhs.3"]
})
= pd.DataFrame({
rhs_multi "source": ["a", "b", "c"],
"id": [1, 1, 1],
"val": ["rhs.1", "rhs.2", "rhs.3"]
})
```

` lhs_multi`

source | id | val | |
---|---|---|---|

0 | a | 1 | lhs.1 |

1 | a | 2 | lhs.2 |

2 | b | 1 | lhs.3 |

` rhs_multi`

source | id | val | |
---|---|---|---|

0 | a | 1 | rhs.1 |

1 | b | 1 | rhs.2 |

2 | c | 1 | rhs.3 |

In this case, a match happens when `source`

matches AND `id`

matchesâ€”as shown by the composite values `(source, id)`

in the graph below.

### Using a list of columns

Use the `on=`

argument with a list of columns, in order to join on multiple columns.

`=["source", "id"]) inner_join(lhs_multi, rhs_multi, on`

source | id | val_x | val_y | |
---|---|---|---|---|

0 | a | 1 | lhs.1 | rhs.1 |

1 | b | 1 | lhs.3 | rhs.2 |

### Using a dictionary of columns

Use the `on=`

argument with a dictionary to join pairs of columns with different names.

```
from siuba import rename
= rename(rhs_multi, some_source = "source")
rhs_multi2
"source": "some_source", "id": "id"}) inner_join(lhs_multi, rhs_multi2, {
```

source | id | val_x | some_source | val_y | |
---|---|---|---|---|---|

0 | a | 1 | lhs.1 | a | rhs.1 |

1 | b | 1 | lhs.3 | b | rhs.2 |

## Match on expressions

Some siuba backendsâ€”like those that execute SQLâ€”can join by expressions for determining matches.

For example, the graph below shows an inner join, where a match occurs when the left-hand value is less than or equal to the right-hand value.

Notice that the value `1`

on the left-hand side matches everything on the right-hand side (top row).

### SQL backend `sql_on`

argument.

Here is a full example, using sqlite:

```
from sqlalchemy import create_engine
from siuba.sql import LazyTbl
= create_engine("sqlite:///:memory:")
engine
= pd.DataFrame({'id': [1,2,3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
lhs = pd.DataFrame({'id': [1,2,4], 'val': ['rhs.1', 'rhs.2', 'rhs.3']})
rhs
"lhs", engine, index=False)
lhs.to_sql("rhs", engine, index=False)
rhs.to_sql(
= LazyTbl(engine, "lhs")
tbl_sql_lhs = LazyTbl(engine, "rhs")
tbl_sql_rhs
inner_join(
tbl_sql_lhs,
tbl_sql_rhs,= lambda lhs, rhs: lhs.val <= rhs.val
sql_on )
```

# Source: lazy query # DB Conn: Engine(sqlite:///:memory:) # Preview:

id_x | val_x | id_y | val_y | |
---|---|---|---|---|

0 | 1 | lhs.1 | 1 | rhs.1 |

1 | 1 | lhs.1 | 2 | rhs.2 |

2 | 1 | lhs.1 | 4 | rhs.3 |

3 | 2 | lhs.2 | 1 | rhs.1 |

4 | 2 | lhs.2 | 2 | rhs.2 |

# .. may have more rows

Note that the function passed to `sql_on`

takes sqlalchemy columns for the left- and right-hand sides, so currently limited to what can be done in sqlalchemy.