Commit

Wrote section Final touches
TheRoniOne committed Nov 10, 2021
1 parent c42890f commit 28a7950
Showing 1 changed file with 77 additions and 11 deletions.
88 changes: 77 additions & 11 deletions docs/src/man/workflow_tips.md
@@ -5,7 +5,7 @@
Usually you will start by having a [Tables.jl](https://github.com/JuliaData/Tables.jl) implementation loaded with the data you want to work with, so your
next step could be to use a non-mutating `Cleaner` function to start your `Cleaner` workflow.

"""jldoctest start
```jldoctest start
julia> using DataFrames: DataFrame
julia> using Cleaner
@@ -30,10 +30,11 @@ julia> ct = polish_names(df)
└───────────────┴────────────────────┘
"""
```

After that, you can decide whether to continue using non-mutating functions or start using mutating ones.
"""jldoctest

```jldoctest start
julia> ct |> compact_columns |> reinfer_schema
┌────────────────────┐
│ another_weird_name │
@@ -78,7 +79,7 @@ julia> ct
└────────────────────┘
"""
```

Depending on what you are trying to do, one could be a better option than the other. For example,
if you need to keep copies of the data in order to do different transformations between copies, using non-mutating
@@ -94,7 +95,7 @@ There is also the option to build a `CleanTable` from any Tables.jl implementati
even the data stored in the original table, as the `CleanTable` constructor has a keyword argument `copycols` that can be
set to false to use the original columns directly at your own risk.

"""jldoctest start
```jldoctest start
julia> ct = CleanTable(df; copycols=false) |> polish_names! |> compact_columns!
┌────────────────────┐
│ another_weird_name │
@@ -129,13 +130,13 @@ julia> df
2 │ missing 4
3 │ missing 3
"""
```

The complete opposite approach would be to use one of the ROT (returning original type) function variants (e.g. `polish_names_ROT`),
which takes any table as input, applies its transformation to a copy of it, and then returns a new table of the same type as
the source table.

"""jldoctest start
```jldoctest start
julia> df |> polish_names_ROT
3×2 DataFrame
Row │ some_bad_name another_weird_name
@@ -145,7 +146,7 @@ julia> df |> polish_names_ROT
2 │ missing 4
3 │ missing 3
"""
```

## Looking for performance

@@ -223,7 +224,7 @@ the original table type. For these cases we have the convenient ROT function vari
by applying the transformation on a new `CleanTable` with copied columns and return a new table based on the result but having it be
of the original source type.

"""jldoctest convenience; setup = :(using Cleaner; using DataFrames: DataFrame)
```jldoctest convenience; setup = :(using Cleaner; using DataFrames: DataFrame)
julia> df = DataFrame("A" => [missing, missing, missing], "B" => [4, 'x', 6])
3×2 DataFrame
Row │ A B
@@ -249,7 +250,7 @@ julia> df3 = row_as_names_ROT(df2, 2)
─────┼─────
1 │ 6
"""
```

It's not recommended to use more than 2 ROT functions in a workflow, as they are the least performant and most allocating function variants.
Each time a ROT function is called, it first creates a `CleanTable` with copied columns to work with, then applies the
Expand All @@ -261,4 +262,69 @@ garbage collector as compared by using an alternative workflow.

## Final touches

TODO
After applying all the `CleanTable` functions you need, you will probably want the result as another table type to continue your workflow.
In these cases, you can call the constructor of your desired table type to build a new table from the output or, if you
are not sure whether your desired table type has a constructor that accepts other table implementations, you can use the `materializer` function
from [Tables.jl](https://github.com/JuliaData/Tables.jl), which we conveniently export for you.

```jldoctest final; setup = :(using Cleaner; using DataFrames: DataFrame)
julia> df = DataFrame("A" => [missing, missing, missing], "B" => [4, 'x', 6])
3×2 DataFrame
 Row │ A        B
     │ Missing  Any
─────┼──────────────
   1 │ missing  4
   2 │ missing  x
   3 │ missing  6

julia> ct = compact_columns(df);

julia> row_as_names!(ct, 2);

julia> DataFrame(ct)
1×1 DataFrame
 Row │ x
     │ Any
─────┼─────
   1 │ 6

julia> materializer(df)(ct)
1×1 DataFrame
 Row │ x
     │ Any
─────┼─────
   1 │ 6
```

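If you are looking for maximum performance, some table types also let you call their constructor so that it uses the original columns directly,
which avoids some extra allocations. The resulting table then shares its columns with the `CleanTable`, so mutating one also changes the other.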

```jldoctest final
julia> df2 = DataFrame(ct; copycols=false)
1×1 DataFrame
 Row │ x
     │ Any
─────┼─────
   1 │ 6

julia> df2.x[1] = 3
3

julia> df2
1×1 DataFrame
 Row │ x
     │ Any
─────┼─────
   1 │ 3

julia> ct
┌─────┐
│   x │
│ Any │
├─────┤
│   3 │
└─────┘
```
