tidypyspark python package provides a minimal, pythonic wrapper around the pyspark sql dataframe API in tidyverse flavor.
- With the accessor ts, apply tidypyspark methods where both input and output are mostly pyspark dataframes.
- Consistent 'verbs' (select, arrange, distinct, ...)
Also see tidypandas: A grammar of data manipulation for pandas inspired by tidyverse
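The accessor idea can be illustrated without Spark. The following is a minimal pure-Python sketch of the general accessor pattern, not tidypyspark's actual implementation: the classes Table, Verbs and Accessor are hypothetical stand-ins, and only the attribute name ts and the verb names select and distinct come from the description above. It shows why verbs chain: each verb returns a new table, which exposes ts again.

```python
class Table:
    """Hypothetical stand-in for a dataframe: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)


class Verbs:
    """Namespace of verbs; every verb returns a new Table."""

    def __init__(self, table):
        self._table = table

    def select(self, columns):
        # Keep only the requested columns in each row.
        return Table({c: row[c] for c in columns} for row in self._table.rows)

    def distinct(self):
        # Drop duplicate rows while preserving first-seen order.
        seen, unique = set(), []
        for row in self._table.rows:
            key = tuple(row.items())
            if key not in seen:
                seen.add(key)
                unique.append(row)
        return Table(unique)


class Accessor:
    """Descriptor that builds the verb namespace for the instance it is read from."""

    def __init__(self, namespace):
        self._namespace = namespace

    def __get__(self, instance, owner=None):
        return self._namespace if instance is None else self._namespace(instance)


# Attach the namespace under the attribute name `ts` (assumption: illustration only).
Table.ts = Accessor(Verbs)

table = Table([
    {"species": "Adelie", "bill_length_mm": 39.1},
    {"species": "Adelie", "bill_length_mm": 39.5},
])
result = table.ts.select(["species"]).ts.distinct()
print(result.rows)
```

Because select and distinct both hand back a Table, the chain keeps going through ts at every step, which is the same shape as the pyspark examples in this README.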
# assumes that a pyspark session is active and available as spark
from tidypyspark import ts
import pyspark.sql.functions as F
from tidypyspark.datasets import get_penguins_path
pen = spark.read.csv(get_penguins_path(), header = True, inferSchema = True)
(pen.ts.add_row_number(order_by = 'bill_depth_mm')
    .ts.mutate({'cumsum_bl': F.sum('bill_length_mm')},
               by = 'species',
               order_by = ['bill_depth_mm', 'row_number'],
               range_between = (-float('inf'), 0)
               )
    .ts.select(['species', 'bill_length_mm', 'cumsum_bl'])
    ).show(5)
+-------+--------------+------------------+
|species|bill_length_mm|         cumsum_bl|
+-------+--------------+------------------+
| Adelie|          32.1|              32.1|
| Adelie|          35.2| 67.30000000000001|
| Adelie|          37.7|105.00000000000001|
| Adelie|          36.2|141.20000000000002|
| Adelie|          33.1|             174.3|
+-------+--------------+------------------+
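To make the window semantics concrete, here is a minimal pure-Python sketch of a per-group running sum over rows sorted by an order key. It is not how tidypyspark or Spark compute the result. The helper running_sum_by_group is hypothetical, the five bill lengths are taken from the output above, and the row_number values are made up to serve as a unique order key.

```python
from collections import defaultdict
from itertools import accumulate


def running_sum_by_group(rows, by, order_by, value, out):
    """Return new rows with a per-group cumulative sum of `value`.

    Hypothetical helper for illustration: it mimics a window partitioned by
    `by`, ordered by `order_by`, spanning from the start of the partition to
    the current row. It assumes the order key is unique within each group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)

    result = []
    for group_rows in groups.values():
        ordered = sorted(group_rows, key=lambda r: r[order_by])
        totals = accumulate(r[value] for r in ordered)
        for row, total in zip(ordered, totals):
            result.append({**row, out: total})
    return result


# bill_length_mm values come from the output above; row_number is assumed.
rows = [
    {"species": "Adelie", "row_number": 1, "bill_length_mm": 32.1},
    {"species": "Adelie", "row_number": 2, "bill_length_mm": 35.2},
    {"species": "Adelie", "row_number": 3, "bill_length_mm": 37.7},
    {"species": "Adelie", "row_number": 4, "bill_length_mm": 36.2},
    {"species": "Adelie", "row_number": 5, "bill_length_mm": 33.1},
]
cumsum = running_sum_by_group(
    rows, by="species", order_by="row_number",
    value="bill_length_mm", out="cumsum_bl",
)
for row in cumsum:
    print(row["species"], row["bill_length_mm"], row["cumsum_bl"])
```

One plausible reason the example orders by both bill_depth_mm and row_number is tie-breaking: with a range-based frame, rows that share the same order value would all receive the same total, and a unique row number avoids that.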
- tidypyspark code:
(pen.ts.select(['species','bill_length_mm','bill_depth_mm', 'flipper_length_mm'])
    .ts.pivot_longer('species', include = False)
    ).show(5)
+-------+-----------------+-----+
|species|             name|value|
+-------+-----------------+-----+
| Adelie|   bill_length_mm| 39.1|
| Adelie|    bill_depth_mm| 18.7|
| Adelie|flipper_length_mm|  181|
| Adelie|   bill_length_mm| 39.5|
| Adelie|    bill_depth_mm| 17.4|
+-------+-----------------+-----+
- equivalent pyspark code:
stack_expr = '''
    stack(3, 'bill_length_mm', `bill_length_mm`,
             'bill_depth_mm', `bill_depth_mm`,
             'flipper_length_mm', `flipper_length_mm`)
    as (`name`, `value`)
    '''
pen.select('species', F.expr(stack_expr)).show(5)
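What both versions compute can be sketched in plain Python on a list of dict rows. This is only an illustration of the wide-to-long reshaping, not the library's code: the helper pivot_longer_rows is hypothetical, and the input uses the two bill columns of the first two rows shown in the output above.

```python
def pivot_longer_rows(rows, id_cols):
    """Hypothetical helper: turn every non-id column into a (name, value) row."""
    long_rows = []
    for row in rows:
        # Identifier columns are repeated on every output row.
        ids = {col: row[col] for col in id_cols}
        for name, value in row.items():
            if name not in id_cols:
                long_rows.append({**ids, "name": name, "value": value})
    return long_rows


wide = [
    {"species": "Adelie", "bill_length_mm": 39.1, "bill_depth_mm": 18.7},
    {"species": "Adelie", "bill_length_mm": 39.5, "bill_depth_mm": 17.4},
]
long_rows = pivot_longer_rows(wide, id_cols=["species"])
for row in long_rows:
    print(row)
```

Each input row yields one output row per pivoted column, so the long table has (number of rows) x (number of pivoted columns) rows, which is also what the literal 3 in the stack expression expresses for three columns.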
tidypyspark relies on the amazing pyspark library and spark ecosystem.
pip install tidypyspark
- On github: https://github.com/talegari/tidypyspark
- On pypi: https://pypi.org/project/tidypyspark
- website: https://talegari.github.io/tidypyspark/