A domain-specific language for transforming tabular data.
- Example program
- TNL Concepts
- CLI options
- Error checking
- Features that may be implemented at some point
- Current project status
- Development
- Built-in maps
- add [number]
- mult [number]
- divide [number]
- power [number]
- auto_inc
- round [number]
- max [column_selector|number] [column_selector|number]
- min [column_selector|number] [column_selector|number]
- mean [column_selector|number] [column_selector|number]
- replace_last [string] [string]
- trim
- slice [integer] [integer]
- title
- upper
- lower
- remove_prefix [string]
- remove_suffix [string]
- concat [string|column_selector] [string|column_selector] [string|column_selector]
- format [format_string]
Given the input
'date' | 'name' | 'producer' |
---|---|---|
'2019-10-5' | ' parasite ' | 'Kwak Sin-ae; Bong Joon-ho ' |
'2018-09-11' | ' green book ' | 'Jim Burke; Charles B. Wessler; Brian Currie; Peter Farrelly; Nick Vallelonga' |
'2017-08-31' | ' the shape of water ' | 'Guillermo del Toro; J. Miles Dale' |
'2016-09-02' | ' moonlight ' | 'Adele Romanski; Dede Gardner; Jeremy Kleiner' |
the tnl program:
transform Movies {
headers {
'date' -> 'Year'
'name' -> 'Title'
'producer' -> 'Producer(s)'
}
values {
['Year'] -> slice 0 4
['Title'] -> trim | title | replace 'Of' 'of'
['Producer(s)'] -> {
| trim
| replace ';' ','
| replace_last ',' ', and'
}
}
}
would produce
'Year' | 'Title' | 'Producer(s)' |
---|---|---|
'2019' | 'Parasite' | 'Kwak Sin-ae, and Bong Joon-ho' |
'2018' | 'Green Book' | 'Jim Burke, Charles B. Wessler, Brian Currie, Peter Farrelly, and Nick Vallelonga' |
'2017' | 'The Shape of Water' | 'Guillermo del Toro, and J. Miles Dale' |
'2016' | 'Moonlight' | 'Adele Romanski, Dede Gardner, and Jeremy Kleiner' |
This example can be tried from this repo with: tnl sample/movies.tnl sample/movies.csv
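For readers who think in Python, the three value pipelines above can be approximated with plain string methods. This is only a sketch of the observable behavior shown in the example, not TNL's implementation: the function names `replace_last`, `map_year`, `map_title`, and `map_producers` are hypothetical, and `str.strip()` (which removes all surrounding whitespace) stands in for `trim`.

```python
def replace_last(text: str, old: str, new: str) -> str:
    """Replace only the final occurrence of `old` in `text`."""
    head, sep, tail = text.rpartition(old)
    # rpartition returns ('', '', text) when `old` does not occur.
    return head + new + tail if sep else text


def map_year(value: str) -> str:
    # ['Year'] -> slice 0 4
    return value[0:4]


def map_title(value: str) -> str:
    # ['Title'] -> trim | title | replace 'Of' 'of'
    return value.strip().title().replace("Of", "of")


def map_producers(value: str) -> str:
    # ['Producer(s)'] -> trim | replace ';' ',' | replace_last ',' ', and'
    return replace_last(value.strip().replace(";", ","), ",", ", and")


print(map_year("2017-08-31"))                 # 2017
print(map_title(" the shape of water "))      # The Shape of Water
print(map_producers("Guillermo del Toro; J. Miles Dale"))
# Guillermo del Toro, and J. Miles Dale
```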
Right now, header and value transforms must exist inside a named `transform`
block, and inside `headers` and `values` blocks. This is temporary. In the
future, the program above will be expressible as:
'date' -> 'Year'
'name' -> 'Title'
'producer' -> 'Producer(s)'
['Year'] -> slice 0 4
['Title'] -> trim | title | replace 'Of' 'of'
['Producer(s)'] -> {
| trim
| replace ';' ','
| replace_last ',' ', and'
}
To map headers, we use the syntax `'from' -> 'to'`, where the left-hand side is
a string and the right-hand side is an arbitrary map pipeline (in this case it
just returns `'to'`), and the header `from` is mapped to the header `to`.
To map values, we use the syntax `['from'] -> 'to'`, where the square brackets
`[]` represent a "column selector". In this case, all the values in the `from`
column are mapped to `to`.
The left-hand side of the `->` can also be a header pattern or column selector
pattern, where a pattern is expressed as `/regex/`. Here are some examples:
/.*/ -> trim
[/.*date.*/] -> '2021-01-01'
This applies `trim` (deletes leading and trailing spaces) to each header, and
takes any column whose header contains `date` and maps all of its values to
`2021-01-01`.
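As an illustration of how a pattern might pick out headers, here is a small Python sketch. It assumes the regex must match the whole header (which would explain why "contains date" is spelled `.*date.*` in the example); the README does not state the matching rule, and `select_headers` is a hypothetical helper, not part of TNL.

```python
import re
from typing import List


def select_headers(pattern: str, headers: List[str]) -> List[str]:
    """Return the headers that the pattern matches in full."""
    compiled = re.compile(pattern)
    # Assumption: whole-string matching (fullmatch), not substring search.
    return [h for h in headers if compiled.fullmatch(h)]


headers = ["id", "start date", "end date", "name"]
print(select_headers(r".*", headers))        # ['id', 'start date', 'end date', 'name']
print(select_headers(r".*date.*", headers))  # ['start date', 'end date']
```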
The right-hand side of the `->` is a single- or multi-line pipeline. This is a
sequence of map calls, where data flows left to right, top to bottom. Right now,
TNL only supports a set of built-in maps (enumerated at the bottom of this page).
Maps in a pipeline are separated by `|`, where a leading `|` is optional. A
single-line pipeline does not have curly braces `{}`, and the pipeline ends at
the end of the line. Multi-line pipelines (syntactic sugar) must be
written in curly braces `{...}`. Different maps take different types of
arguments (`numbers`, `integers`, `strings`, `patterns`, `column_selectors`,
etc.). Some can be applied to both `headers` and `values`.
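The left-to-right data flow of a pipeline can be pictured as ordinary function composition. The sketch below is a mental model only, with hypothetical names (`Map`, `run_pipeline`); it does not reflect how the TNL interpreter is structured internally.

```python
from functools import reduce
from typing import Callable, Sequence

Map = Callable[[str], str]


def run_pipeline(maps: Sequence[Map], value: str) -> str:
    """Feed `value` through each map in order, left to right."""
    return reduce(lambda acc, fn: fn(acc), maps, value)


def remove_mr_prefix(s: str) -> str:
    # Stand-in for: remove_prefix 'MR. '
    return s[len("MR. "):] if s.startswith("MR. ") else s


# Rough equivalent of:  trim | upper | remove_prefix 'MR. '
pipeline = [str.strip, str.upper, remove_mr_prefix]
print(run_pipeline(pipeline, "  mr. fox "))  # FOX
```

Order matters: placing `remove_mr_prefix` before `str.upper` would leave the lowercase `mr. ` prefix in place.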
Comments are written with a leading `#`.
There are a few CLI options:
tnl [src_file] [csv_file] -h
- prints
tnl [src_file] [csv_file] --print-tokens
- prints the tokens recognized by the tnl lexer
tnl [src_file] [csv_file] --print-ast
- prints the ast nodes constructed by the tnl parser (with indentation)
tnl [src_file] [csv_file] --print-code
- prints the source code back out as a "pretty print"
tnl [src_file] [csv_file] --check
- runs semantic analysis and the type checker
tnl [src_file] [csv_file] --compile TARGET
- (not implemented yet) outputs an executable program
- e.g. one planned option is to generate Python pandas code that
performs the desired tabular data transformation:
tnl src.tnl data.csv --compile pandas
tnl [src_file] [csv_file] --interpret
- this is the default CLI mode
- this is the same as what gets executed when no flag is provided:
tnl src.tnl data.csv
- Detect use of an unrecognized built-in map:
transform T { headers { 'hello' -> hello 'world' } }
produces:
Unrecognized map 'hello'.
- Detect invalid format string (see format in the built-in maps section):
transform T { headers { 'hello' -> format ' {planet' } }
produces:
Invalid format string (expected '}' before end of string).
- Detect invalid regex pattern:
transform T {
headers {
# would likely need to be /.*/
/*/ -> 'world'
}
}
produces:
Invalid regex pattern /*/.
- More to be implemented ...
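Two of the checks listed above can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the code TNL uses: `check_regex` and `check_format_string` are hypothetical names, regex validity is delegated to Python's `re` module, and the format check only looks for a `{` that is never closed.

```python
import re
from typing import Optional


def check_regex(pattern: str) -> Optional[str]:
    """Return an error message if `pattern` is not a valid regex, else None."""
    try:
        re.compile(pattern)
    except re.error:
        return f"Invalid regex pattern /{pattern}/."
    return None


def check_format_string(fmt: str) -> Optional[str]:
    """Return an error message if a '{' is never closed, else None."""
    i = 0
    while i < len(fmt):
        if fmt[i] == "\\":  # skip an escaped character such as \{ or \}
            i += 2
            continue
        if fmt[i] == "{":
            close = fmt.find("}", i + 1)
            if close == -1:
                return "Invalid format string (expected '}' before end of string)."
            i = close
        i += 1
    return None


print(check_regex("*"))                  # Invalid regex pattern /*/.
print(check_regex(".*"))                 # None
print(check_format_string(" {planet"))   # Invalid format string (expected '}' before end of string).
print(check_format_string("{} greeting"))  # None
```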
- remove need to nest maps in the `transform`, `headers`, and `values` blocks
- unary operators
- binary operators
- conditionals
- type checking
- more semantic analysis checks
- semantics for creating new columns
- semantics for deleting a column
- semantics for deleting rows
- semantics for adding rows
- compilation (e.g. to pandas code)
- potentially add an IR for optimization and ease of generating code (right now execution traverses the AST)
- variable definitions
- built-in testing support
- library of date built-in functions
- type casting
- variadic arguments for certain maps
- AST and/or IR optimizations
The project is considered pre-alpha. Many components are likely to change.
- Currently implemented using Python 3.9
- Running the tests:
pytest
- Type checking:
mypy --strict tnl
- Linting:
flake8 .
Add a number to each value in a column.
TNL program:
transform Test {
values {
['b'] -> add 2
}
}
csv before:
a,b
1,2
3,4
csv after:
a,b
1,4
3,6
Multiply each value in a column by a number.
TNL program:
transform Test {
values {
['b'] -> mult 3
}
}
csv before:
a,b
1,2
3,4
csv after:
a,b
1,6
3,12
Divide each value in a column by a number.
TNL program:
transform Test {
values {
['b'] -> divide 2
}
}
csv before:
a,b
1,2
3,4
csv after:
a,b
1,1
3,2
Raise each value in a column to a power.
TNL program:
transform Test {
values {
['b'] -> power 3
}
}
csv before:
a,b
1,2
3,4
csv after:
a,b
1,8
3,64
Assign a column to `1, 2, 3, ..., n-1, n` where `n` is the number of rows in
the input data.
TNL program:
transform Test {
values {
['idx'] -> auto_inc
}
}
csv before:
idx,a,b
placeholder,1,2
placeholder,3,4
placeholder,5,6
placeholder,7,8
csv after:
idx,a,b
1,1,2
2,3,4
3,5,6
4,7,8
Round each value in a column to a given number of decimal places.
TNL program:
transform Test {
values {
['c'] -> ['b'] | round 1
['b'] -> round 0
}
}
csv before:
a,b,c
1,1.5,placeholder
3,4.4,placeholder
csv after:
a,b,c
1,2.0,1.5
3,4.0,4.4
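The output above is consistent with Python's built-in `round(x, ndigits)`, which returns a float when `ndigits` is given. Assuming (the README does not confirm this) that `round` delegates to that built-in, ties would follow Python's round-half-to-even rule. The helper `round_column` below is hypothetical and only demonstrates the Python behavior.

```python
from typing import List


def round_column(values: List[str], decimals: int) -> List[float]:
    """Parse each CSV cell as a float and round it with Python's built-in."""
    return [round(float(v), decimals) for v in values]


print(round_column(["1.5", "4.4"], 0))  # [2.0, 4.0]
print(round_column(["1.5", "4.4"], 1))  # [1.5, 4.4]
print(round_column(["0.5", "2.5"], 0))  # [0.0, 2.0]  ties go to the even neighbour
```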
Return the larger of two arguments, compared row by row. Each argument is either a column selector or a number, so a column can be compared with another column or with a number, and two numbers can be compared directly.
TNL program:
transform Test {
values {
['c'] -> max ['a'] ['b']
['b'] -> max ['b'] 3
['a'] -> max 3 5
}
}
csv before:
a,b,c
1,2,placeholder
5,4,placeholder
csv after:
a,b,c
5,3,2
5,4,5
Return the smaller of two arguments, compared row by row. Each argument is either a column selector or a number, so a column can be compared with another column or with a number, and two numbers can be compared directly.
TNL program:
transform Test {
values {
['c'] -> min ['a'] ['b']
['b'] -> min ['b'] 3
['a'] -> min 3 5
}
}
csv before:
a,b,c
1,2,placeholder
5,4,placeholder
csv after:
a,b,c
3,2,1
3,3,4
Return the mean of two arguments, computed row by row. Each argument is either a column selector or a number, so a column can be averaged with another column or with a number, and two numbers can be averaged directly.
TNL program:
transform Test {
values {
['c'] -> mean ['a'] ['b']
['b'] -> mean ['a'] 5
['a'] -> mean 1 7
}
}
csv before:
a,b,c
1,2,placeholder
3,4,placeholder
csv after:
a,b,c
4.0,3.0,1.5
4.0,4.0,3.5
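The way `max`, `min`, and `mean` accept either a column selector or a number can be modeled by repeating a plain number once per row before combining. The Python sketch below reproduces the `mean` results above; `apply_binary`, `Arg`, and the use of a column name string in place of a column selector are illustrative assumptions, not TNL internals.

```python
from typing import Callable, Dict, List, Union

Column = List[float]
Arg = Union[str, float]  # a column name (stand-in for a column selector) or a number


def apply_binary(fn: Callable[[float, float], float],
                 left: Arg, right: Arg,
                 table: Dict[str, Column]) -> Column:
    """Apply `fn` row by row, repeating a plain number for every row."""
    n_rows = len(next(iter(table.values())))

    def resolve(arg: Arg) -> Column:
        return table[arg] if isinstance(arg, str) else [float(arg)] * n_rows

    return [fn(a, b) for a, b in zip(resolve(left), resolve(right))]


def mean(x: float, y: float) -> float:
    return (x + y) / 2


table = {"a": [1.0, 3.0], "b": [2.0, 4.0]}
print(apply_binary(mean, "a", "b", table))  # [1.5, 3.5]
print(apply_binary(mean, "a", 5, table))    # [3.0, 4.0]
print(apply_binary(mean, 1, 7, table))      # [4.0, 4.0]
print(apply_binary(max, "b", 3, table))     # [3.0, 4.0]
```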
Replace the last occurrence of a substring in a string with another string.
TNL program:
transform Test {
headers {
'a;b;c' -> {
| replace ';' '; '
| replace_last '; ' '; and '
}
}
values {
['a; b; and c'] -> replace_last 'a' 'b'
}
}
csv before:
idx,a;b;c
1,aaaabac
2,aabc
csv after:
idx,a; b; and c
1,aaaabbc
2,abbc
Remove leading and trailing spaces from a header, or from all string values in a column.
TNL program:
transform Test {
headers {
/(\\s+.*)|(.*\\s+)/ -> trim
}
}
csv before:
a , b , c,d
1,2,3,4
5,6,7,8
csv after:
a,b,c,d
1,2,3,4
5,6,7,8
Return a subset of a string using 0-based indexing.
TNL program:
transform Test {
headers {
'idx' -> 'Idx'
'Year-Month-Day' -> slice 0 4
}
values {
['Year'] -> slice 0 4
}
}
csv before:
idx,Year-Month-Day
1,2020-01-01
2,2019-02-15
3,2017-08-02
csv after:
Idx,Year
1,2020
2,2019
3,2017
Uppercase the first letter in each word.
TNL program:
transform Test {
headers {
'idx' -> title
'message' -> title
}
values {
['Message'] -> title
}
}
csv before:
idx,message
1,hello world
2,hello mars
3,hello andromeda
csv after:
Idx,Message
1,Hello World
2,Hello Mars
3,Hello Andromeda
Uppercase all letters in a string.
TNL program:
transform Test {
headers {
/b|d/ -> upper
}
}
csv before:
a,b,c,d
1,2,3,4
5,6,7,8
csv after:
a,B,c,D
1,2,3,4
5,6,7,8
Lowercase all letters in a string.
TNL program:
transform Test {
headers {
'B' -> lower
}
values {
['b'] -> lower
}
}
csv before:
A,B
HELLO,WORLD
HELLO,MARS
csv after:
A,b
HELLO,world
HELLO,mars
Remove the specified string prefix from a string.
TNL program:
transform Test {
headers {
'noisea' -> remove_prefix 'noise'
}
values {
['a'] -> remove_prefix 'noise'
}
}
csv before:
noisea,b
noisehello,world
noisehello,mars
csv after:
a,b
hello,world
hello,mars
Remove the specified string suffix from a string.
TNL program:
transform Test {
headers {
'anoise' -> remove_suffix 'noise'
}
values {
['a'] -> remove_suffix 'noise'
}
}
csv before:
anoise,b
hellonoise,world
hellonoise,mars
csv after:
a,b
hello,world
hello,mars
Join strings together. Right now this map always takes 3 arguments. At some point it will support a variable number of arguments.
TNL program:
transform Test {
headers {
'placeholder' -> concat 'hello' ' ' 'message'
}
values {
['hello message'] -> concat ['a'] ' ' ['b']
}
}
csv before:
a,b,placeholder
hello,world,placeholder
hello,mars,placeholder
csv after:
a,b,hello message
hello,world,hello world
hello,mars,hello mars
Replace any instance of a `{}` pattern in a string with values passed
into this map. Multiple `{}` can be used in a single format string.
Curly bracket literals can still be used in a string, but they must be
escaped (`\{` or `\}`).
TNL program:
transform Test {
headers {
'planet' -> format '{} greeting'
}
values {
[/.*planet.*/] -> format 'hello {}'
}
}
csv before:
idx,planet
1,earth
2,mars
csv after:
idx,planet greeting
1,hello earth
2,hello mars
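The substitution and escaping rules can be sketched in Python with a single regex pass. This assumes that every unescaped `{}` receives the same piped value, which matches the example above but is not spelled out in the description; `tnl_format` is a hypothetical name.

```python
import re


def tnl_format(fmt: str, value: str) -> str:
    """Substitute `value` for every unescaped {} and unescape \\{ and \\}."""

    def substitute(match: "re.Match[str]") -> str:
        token = match.group(0)
        # An escaped brace becomes the bare brace; {} becomes the value.
        return token[1] if token.startswith("\\") else value

    # Escaped braces are listed first so they are consumed before {} is tried.
    return re.sub(r"\\[{}]|\{\}", substitute, fmt)


print(tnl_format("{} greeting", "planet"))  # planet greeting
print(tnl_format("hello {}", "earth"))      # hello earth
print(tnl_format(r"\{{}\}", "x"))           # {x}
```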