pandas is everywhere in python data analysis. The siuba library builds on this incredible work by using pandas Series methods as its reference implementation. This means that you use the pandas methods you’ve already learned!
Note how you can debug both pieces of code by running and inspecting df.a.mean().
While pandas is an incredibly powerful API, its syntax can get quite cumbersome.
```python
(my_data
    .assign(avg = lambda d: d.x.mean())  # create new column
    .loc[lambda d: d.x != 3]             # filter out some rows
)
```
```
   g  x  avg
0  a  1  2.0
1  a  2  2.0
```
Notice how much of this code is writing the word lambda.
Like other ports of the popular R library dplyr, such as dplython, siuba offers a simple, flexible way to work on tables of data.
Pipe syntax
The pipe syntax allows you to import table functions (verbs), rather than having 300+ methods on your DataFrame.
```python
# actions can be imported individually
from siuba import _, mutate, arrange

# they can be combined using a pipe
my_data >> mutate(y = _.x + 1) >> arrange(_.g, -_.x)
```
```
   g  x  y
1  a  2  3
0  a  1  2
2  b  3  4
```
Lazy expressions
Using lazy expressions saves you from repeating the name of your DataFrame over and over.
```python
# rather than repeat the name of your data, you can use lazy expressions ---
my_data_frame = pd.DataFrame({'a': [1, 2, 3]})

# bad
my_data_frame["b"] = my_data_frame["a"] + 1
my_data_frame["c"] = my_data_frame["b"] + 2

# good
my_data_frame >> mutate(b = _.a + 1, c = _.b + 2)
```
```
   a  b  c
0  1  2  4
1  2  3  5
2  3  4  6
```
No reset_index
Notice that siuba's mutate takes a DataFrame and returns a DataFrame. Moreover, it doesn't stick columns onto the index, so you don't need to call reset_index all the time.
A common place where reset_index is called is after a pandas grouped aggregation.
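A sketch of that pattern in plain pandas, using a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# the grouped aggregation moves the group keys into the index...
agg = df.groupby('g').agg(avg=('x', 'mean'))

# ...so reset_index is needed to get 'g' back as an ordinary column
flat = agg.reset_index()
```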
Note that g_cyl (a pandas grouped DataFrame) does not have an assign method, and requires passing the operation you want to perform ("mean") as a string to .transform().
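In plain pandas that pattern looks something like this (a stand-in `cars` DataFrame, since the `g_cyl` setup is not shown in this excerpt):

```python
import pandas as pd

cars = pd.DataFrame({'cyl': [4, 4, 6], 'hp': [90, 110, 150]})
g_cyl = cars.groupby('cyl')

# a grouped DataFrame has no .assign(); the operation is passed
# to .transform() as a string instead
cars['demeaned'] = cars['hp'] - g_cyl['hp'].transform('mean')
```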
Fast and flexible
Fast grouped operations
Consider some data (students) where 2,000 students have each completed 10 courses, and received a score on each course.
```python
# fast grouped operations
import numpy as np
import pandas as pd

np.random.seed(123)
students = pd.DataFrame({
    'student_id': np.repeat(np.arange(2000), 10),
    'course_id': np.random.randint(1, 20, 20000),
    'score': np.random.randint(1, 100, 20000),
})
```
```
CPU times: user 1.76 ms, sys: 199 µs, total: 1.96 ms
Wall time: 1.81 ms
```
This is because siuba’s lazy expressions let it optimize grouped operations.
However, dplython is over 100x slower in this case, because it uses the slower pandas DataFrame.apply() method under the hood.
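The difference boils down to which pandas code path runs per group. A pandas-only sketch of the two approaches (illustrative, not siuba's or dplython's actual internals):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
students = pd.DataFrame({
    'student_id': np.repeat(np.arange(2000), 10),
    'score': np.random.randint(1, 100, 20000),
})

# fast: one vectorized grouped transform across all groups
fast = students.groupby('student_id')['score'].transform('mean')

# slow: a Python function invoked once per group
slow = students.groupby('student_id')['score'].apply(lambda s: s.mean())
```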
```python
# set up code for timing
from dplython import X, DplyFrame, sift, group_by as dply_group_by

g_students2 = DplyFrame(students) >> dply_group_by(X.student_id)
```
```sql
SELECT cars.cyl, cars.mpg, cars.hp, cars.hp - avg(cars.hp) OVER (PARTITION BY cars.cyl) AS demeaned
FROM cars
```
Abstract syntax trees
This is made possible because siuba represents lazy expressions with abstract syntax trees. Fast grouped operations and SQL queries are just the beginning. This allows people to produce a whole range of interesting tools!
Siuba’s lazy expressions are built from two classes: Symbolic and Call.
Symbolic is used to quickly create lazy expressions.
Each black box in the printout above is a Call. Calls are the pieces that represent the underlying operations. They have methods to inspect and transform them.
```python
call = strip_symbolic(sym)

# get column names used in the lazy expression
call.op_vars(attr_calls = False)
```