# DataConvenience
An eclectic collection of convenience functions for your data manipulation needs.
## Data
### Sampling with `sample`
You can conveniently sample the rows of a `DataFrame` with the `sample` method:
```
df = DataFrame(a=1:10)
# sample 10 rows
sample(df, 10)
```
```
# sample 10% of rows
sample(df, 0.1)
```
```
# sample 1/10 of rows
sample(df, 1//10)
```
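As a rough illustration of what the three call forms mean, the sketch below maps a count or a fraction to a number of rows and draws that many row indices without replacement. It uses only the standard library. The helper names `nrows_to_sample` and `sample_rows` are hypothetical, and the rounding rule for fractions is an assumption, not the package's documented behaviour.

```julia
using Random

# Hypothetical helper: turn a count or a fraction into a number of rows.
nrows_to_sample(n::Integer, k::Integer) = min(k, n)
# Assumption: fractions are rounded to the nearest whole row, with at least one row.
nrows_to_sample(n::Integer, frac::Real) = max(1, round(Int, n * frac))

# Draw `k` distinct row indices out of `1:n`.
function sample_rows(rng::AbstractRNG, n::Integer, spec)
    k = nrows_to_sample(n, spec)
    return randperm(rng, n)[1:k]
end

rng = MersenneTwister(42)
idx_count = sample_rows(rng, 10, 10)     # a count: all 10 rows, in random order
idx_frac  = sample_rows(rng, 10, 0.1)    # a Float64 fraction: 1 row
idx_ratio = sample_rows(rng, 10, 1//10)  # a Rational fraction: 1 row
```

Dispatching on `Integer` versus `Real` is what lets `10` mean "ten rows" while `0.1` and `1//10` mean "a tenth of the rows".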
### Faster sorting for DataFrames
The `fsort` function sorts `DataFrame`s faster than the `sort` function, but in ascending order only. E.g.
```julia
using DataConvenience
using DataFrames
df = DataFrame(col = rand(1_000_000), col1 = rand(1_000_000), col2 = rand(1_000_000))
fsort(df, :col) # sort by `:col`
fsort(df, [:col1, :col2]) # sort by `:col1` and `:col2`
fsort!(df, :col) # sort in-place by `:col`
fsort!(df, [:col1, :col2]) # sort in-place by `:col1` and `:col2`
```
```julia
df = DataFrame(col = rand(1_000_000), col1 = rand(1_000_000), col2 = rand(1_000_000))
using BenchmarkTools
fsort_1col = @belapsed fsort($df, :col) # sort by `:col`
fsort_2col = @belapsed fsort($df, [:col1, :col2]) # sort by `:col1` and `:col2`
sort_1col = @belapsed sort($df, :col) # sort by `:col`
sort_2col = @belapsed sort($df, [:col1, :col2]) # sort by `:col1` and `:col2`
using Plots
bar(["DataFrames.sort 1 col", "DataFrames.sort 2 cols", "DataConvenience.fsort 1 col", "DataConvenience.fsort 2 cols"],
[sort_1col, sort_2col, fsort_1col, fsort_2col],
title="DataFrames sort performance comparison",
label = "seconds")
```
### Clean column names with `cleannames!`
Somewhat similar to R's `janitor::clean_names`: `cleannames!(df)` cleans the column names of a `DataFrame`.
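To give a feel for what "cleaning" a name can mean, here is a minimal sketch in plain Julia. The rule it applies (lower-casing, collapsing runs of non-alphanumeric characters into `_`, trimming underscores) and the helper name `clean_name` are assumptions for illustration; the rules `cleannames!` actually applies may differ.

```julia
# Hypothetical cleaning rule, not necessarily the one `cleannames!` applies:
# lower-case the name, turn runs of non-alphanumeric characters into "_",
# and trim underscores from both ends.
function clean_name(name::AbstractString)
    s = lowercase(strip(name))
    s = replace(s, r"[^a-z0-9]+" => "_")
    return String(strip(s, '_'))
end

cleaned = clean_name.(["First Name", "Total (USD)", "already_clean"])
```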
### Nesting of `DataFrame`s
Sometimes, nesting is more convenient than using `GroupedDataFrame`s:
```
using DataFrames
df = DataFrame(
a = rand(1:8, 1000),
b = rand(1:8, 1000),
c = rand(1:8, 1000),
)
nested_df = nest(df, :a, :nested_df)
```
To unnest use `unnest(nested_df, :nested_df)`.
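The idea behind nesting and unnesting can be sketched without any packages. Below, a vector of `NamedTuple`s stands in for a `DataFrame` and a `Dict` stands in for the nested column; both are assumptions made for the illustration and say nothing about how `nest` stores its result.

```julia
# Rows as NamedTuples stand in for a DataFrame here (an assumption for the sketch).
rows = [(a = 1, b = 10), (a = 2, b = 20), (a = 1, b = 30)]

# Type of the rows kept inside each nested group (everything except `a`).
BRow = NamedTuple{(:b,), Tuple{Int}}

# "Nest": one entry per distinct value of `a`, holding the remaining columns.
nested = Dict{Int, Vector{BRow}}()
for r in rows
    push!(get!(nested, r.a, BRow[]), (b = r.b,))
end

# "Unnest": flatten back to one row per original observation.
unnested = [(a = k, b = r.b) for k in sort(collect(keys(nested))) for r in nested[k]]
```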
### One hot encoding
```
a = DataFrame(
player1 = ["a", "b", "c"],
player2 = ["d", "c", "a"]
)
# does not modify a
onehot(a, :player1)
# modifies a
onehot!(a, :player1)
```
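For readers new to the term, one hot encoding replaces a column with one indicator per distinct value. The sketch below shows the idea on a plain vector; `onehot_columns` is a hypothetical helper, and the names and layout of the columns that `onehot` adds to the `DataFrame` are not shown here.

```julia
# Hypothetical helper: one Bool indicator vector per distinct level.
function onehot_columns(col::AbstractVector)
    levels = sort(unique(col))
    return Dict(level => (col .== level) for level in levels)
end

player1 = ["a", "b", "c"]
encoded = onehot_columns(player1)
indicator_a = encoded["a"]  # true where player1 is "a"
```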
### CSV Chunk Reader
You can read a CSV file in chunks and apply logic to each chunk. The type of each column is inferred by `CSV.read`.
```julia
using DataFrames
using CSV
df = DataFrame(a = rand(1_000_000), b = rand(Int8, 1_000_000), c = rand(Int8, 1_000_000))
filepath = tempname()*".csv"
CSV.write(filepath, df)
for (i, chunk) in enumerate(CsvChunkIterator(filepath))
println(i)
print(describe(chunk))
end
```
The chunk iterator uses `CSV.read` parameters, so you can pass in `type` and `types` to dictate the type of each column, e.g.
```julia
# read all columns as String
for (i, chunk) in enumerate(CsvChunkIterator(filepath, types=String))
println(i)
print(describe(chunk))
end
```
```julia
# read a three-column CSV where the column types are String, Int, Float32
for chunk in CsvChunkIterator(filepath, types=[String, Int, Float32])
print(describe(chunk))
end
```
**Note:** Different chunks MAY have different column types.
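One plausible reason is that types are inferred from the rows each chunk happens to contain. The sketch below mimics this with a toy inference rule on raw text. The function `infer_type` and its rule are hypothetical and much simpler than what `CSV.read` does.

```julia
# Hypothetical inference rule for one column of raw text:
# Int if every field parses as Int, else Float64 if every field parses
# as a float, else String.
function infer_type(fields)
    all(f -> tryparse(Int, f) !== nothing, fields) && return Int
    all(f -> tryparse(Float64, f) !== nothing, fields) && return Float64
    return String
end

raw_column = ["1", "2", "3", "4", "5.5", "6"]

# Split the column into chunks of three values and infer each chunk on its own.
chunk_types = [infer_type(chunk) for chunk in Iterators.partition(raw_column, 3)]
```

Here the first chunk is inferred as `Int` and the second as `Float64`, even though both come from the same column. Passing `types` explicitly, as in the examples above, avoids this.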
## Statistics & Correlations
### Canonical Correlation
`canonicalcor` computes the first component of Canonical Correlation.
```
x = rand(100, 5)
y = rand(100, 5)
canonicalcor(x, y)
```
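For reference, the first canonical correlation is the largest correlation achievable between a linear combination of the columns of `x` and a linear combination of the columns of `y`. A minimal sketch of one standard way to compute it, using only the standard library, follows. The function name `first_canonical_cor` is hypothetical and this is not necessarily how `canonicalcor` is implemented.

```julia
using LinearAlgebra, Statistics, Random

# Centre the columns, orthonormalise each block with QR, and take the
# largest singular value of Qx' * Qy.
# Assumes full column rank and more rows than columns.
function first_canonical_cor(x::AbstractMatrix, y::AbstractMatrix)
    xc = x .- mean(x, dims = 1)
    yc = y .- mean(y, dims = 1)
    qx = Matrix(qr(xc).Q)   # thin Q factor: orthonormal basis of the columns of xc
    qy = Matrix(qr(yc).Q)
    return maximum(svdvals(qx' * qy))
end

Random.seed!(1)
x = rand(100, 5)
y = rand(100, 5)
r = first_canonical_cor(x, y)
```

With a single column on each side, this reduces to the absolute value of the ordinary Pearson correlation.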
### Correlation for `Bool`
`cor(x::Bool, y)` allows you to treat `Bool` as 0/1 when computing correlation.
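Treating `Bool` as 0/1 is equivalent to converting the flags to numbers before calling the ordinary Pearson correlation. A minimal sketch with the `Statistics` standard library, using made-up data:

```julia
using Statistics

flags  = [true, false, true, true, false]
scores = [2.0, 1.0, 3.0, 2.5, 0.5]

# Map true/false to 1.0/0.0 explicitly, then use the ordinary Pearson correlation.
r_bool = cor(Float64.(flags), scores)
```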
### Correlation for `DataFrames`
`dfcor(df::AbstractDataFrame, cols1=names(df), cols2=names(df), verbose=false)`
Computes correlations in a `DataFrame` between one set of columns, `cols1`, and
another set, `cols2`. The correlation is computed for every pair in the Cartesian
product of `cols1` and `cols2`.
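The Cartesian-product idea can be sketched with the standard library alone. In the sketch below a `NamedTuple` of vectors stands in for a `DataFrame`, and the result layout (`col1`, `col2`, `r`) is an assumption for illustration, not the output format of `dfcor`.

```julia
using Statistics

# A NamedTuple of vectors stands in for a DataFrame (an assumption for the sketch).
cols = (a = [1.0, 2.0, 3.0, 4.0],
        b = [2.0, 4.0, 6.0, 8.0],
        c = [4.0, 3.0, 2.0, 1.0])

cols1 = [:a, :b]
cols2 = [:b, :c]

# One row per (col1, col2) pair in the Cartesian product.
pairwise = [(col1 = c1, col2 = c2, r = cor(cols[c1], cols[c2]))
            for c1 in cols1 for c2 in cols2]
```

Two columns on each side give four pairs: `(:a, :b)`, `(:a, :c)`, `(:b, :b)` and `(:b, :c)`.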
## Miscellaneous
### `@replicate`
`@replicate code times` will run `code` multiple times e.g.
```julia
@replicate 10 8
```
### StringVector
`StringVector(v::CategoricalVector{String})` - Convert `v::CategoricalVector` efficiently to `WeakRefStrings.StringVector`
### Faster count missing
There is a `count_missing` function:
```julia
x = Vector{Union{Missing, Int}}(undef, 10_000_000)
cmx = count_missing(x) # this is faster
cmx2 = countmissing(x) # this is faster
cimx = count(ismissing, x) # the approach available in Base
cmx == cimx # true
```
There is also the `count_non_missing` function; `countnonmissing` is its synonym.
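The relationship between the two counts can be checked with a plain loop. The function `count_missing_loop` below is a hypothetical reference implementation for illustration only; it makes no claim about how the package achieves its speed.

```julia
# Hypothetical reference implementation: a plain loop over the vector.
function count_missing_loop(x::AbstractVector)
    n = 0
    for v in x
        n += v === missing   # adds 1 when the element is missing, 0 otherwise
    end
    return n
end

x_small = Union{Missing, Int}[1, missing, 3, missing, missing]
n_missing = count_missing_loop(x_small)
n_non_missing = length(x_small) - n_missing
```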