add transform data functionality #26
Changes from 9 commits
@@ -19,6 +19,24 @@ function initialize_embedding(graph::AbstractMatrix{T}, n_components, ::Val{:ran
    return [20 .* rand(T, n_components) .- 10 for _ in 1:size(graph, 1)]
end

"""
    initialize_embedding(graph::AbstractMatrix{<:Real}, ref_embedding::AbstractMatrix{T<:AbstractFloat}) -> embedding

Initialize an embedding of points corresponding to the columns of the `graph`, by taking weighted means of
the columns of `ref_embedding`, where weights are values from the rows of the `graph`.

The resulting embedding will have shape `(size(ref_embedding, 1), length(query_inds))`, where `size(ref_embedding, 1)`
is the number of components (dimensions) of the `reference embedding`, and `length(query_inds)` is the number of
samples in the resulting embedding. Its elements will have type T.
"""
function initialize_embedding(graph::AbstractMatrix{<:Real}, ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
    embed = [zeros(T, size(ref_embedding, 1)) for _ in 1:size(graph, 2)]
Review comment: You never use these zero vectors, so this is a bit unnecessary. To create an array with the correct type, you could do something like …
    for (i, col) in enumerate(eachcol(graph))
Review comment: You can simplify this expression, since this is just matrix multiplication: … Double check it, but it should be equivalent. Also, you'll want …
Reply: good catch
        embed[i] = vec(sum(transpose(col) .* ref_embedding, dims=2) ./ (sum(col) + eps(zero(T))))
    end
    return embed
end
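As an aside, the reviewer's point that this per-column loop is just a matrix product can be checked directly. The sketch below is not part of the diff; the helper name `initialize_embedding_matmul` is hypothetical, and the per-column normalization and vector-of-vectors return value are only meant to mirror the loop above.

```julia
# Sketch only (hypothetical helper, not part of the package): each embedded query
# point is the weighted mean of the columns of `ref_embedding`, with weights taken
# from that query's column of `graph`. That is `ref_embedding * graph`, followed by
# dividing each column by the sum of its weights.
function initialize_embedding_matmul(graph::AbstractMatrix{<:Real},
                                     ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
    embed = ref_embedding * graph                           # (n_components, n_queries)
    embed = embed ./ (sum(graph, dims=1) .+ eps(zero(T)))   # eps guard mirrors the loop version
    return [collect(col) for col in eachcol(embed)]         # vector of vectors, as in the loop version
end
```

Up to floating-point rounding this should agree with the loop implementation in the diff.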

"""
    spectral_layout(graph, embed_dim) -> embedding

@@ -45,6 +63,13 @@ function spectral_layout(graph::SparseMatrixCSC{T},
    return convert.(T, layout)
end


# The optimize_embedding methods have parameters that are ::AbstractVector{<:AbstractVector{T}}.
# AbstractVector{<:AbstractVector{T}} allows arguments to be views of some other array/matrix
# rather than just vectors themselves (so we can avoid copying the model.data and instead just
# create a view to satisfy our reshaping needs). For example, calling collect(eachcol(X)) creates
# an Array of SubArrays, and SubArray is not an Array, but SubArray <: AbstractArray.
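To make that remark concrete, here is a small standalone snippet (not part of the diff) showing that `collect(eachcol(X))` produces `SubArray` views that satisfy the `AbstractVector{<:AbstractVector}` constraint without copying the matrix:

```julia
X = rand(Float64, 3, 5)        # 3 components, 5 samples
cols = collect(eachcol(X))     # Vector of SubArray views into X; no data is copied

@assert !(cols isa Vector{Vector{Float64}})                  # elements are views, not Vectors
@assert cols isa AbstractVector{<:AbstractVector{Float64}}   # but SubArray <: AbstractVector
cols[1][1] = 0.0               # writing through a view mutates X itself
@assert X[1, 1] == 0.0
```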

"""
    optimize_embedding(graph, embedding, n_epochs, initial_alpha, min_dist, spread, gamma, neg_sample_rate) -> embedding

@@ -53,22 +78,71 @@ Optimize an embedding by minimizing the fuzzy set cross entropy between the high
# Arguments
- `graph`: a sparse matrix of shape (n_samples, n_samples)
- `embedding`: a vector of length (n_samples,) of vectors representing the embedded data points
- `n_epochs`: the number of training epochs for optimization
- `n_epochs`::Integer: the number of training epochs for optimization
- `initial_alpha`: the initial learning rate
- `gamma`: the repulsive strength of negative samples
- `neg_sample_rate::Integer`: the number of negative samples per positive sample
"""
function optimize_embedding(graph,
                            embedding,
                            n_epochs,
                            embedding::AbstractVector{<:AbstractVector{<:AbstractFloat}},
Review comment: I'm not sure we need these type constraints. I actually think this package is overly type constrained (that's mostly my doing), when it could be much more generic.
Reply: So one issue was having the …
Reply: If you have no preference, I will just remove the original …
                            n_epochs::Integer,
                            initial_alpha,
                            min_dist,
                            spread,
                            gamma,
                            neg_sample_rate,
                            _a=nothing,
                            _b=nothing)
    return optimize_embedding(
        graph,
        embedding,
        embedding,
        n_epochs,
        initial_alpha,
        min_dist,
        spread,
        gamma,
        neg_sample_rate,
        _a,
        _b,
        move_ref=true
    )
end

"""
    optimize_embedding(graph, query_embedding, ref_embedding, n_epochs, initial_alpha, min_dist, spread, gamma, neg_sample_rate, _a=nothing, _b=nothing; move_ref=false) -> embedding

Optimize an embedding by minimizing the fuzzy set cross entropy between the high and low dimensional simplicial sets using stochastic gradient descent.
Optimize "query" samples with respect to "reference" samples.

# Arguments
- `graph`: a sparse matrix of shape (n_samples, n_samples)
- `query_embedding`: a vector of length (n_samples,) of vectors representing the embedded data points to be optimized
- `ref_embedding`: a vector of length (n_samples,) of vectors representing the embedded data points to optimize against
- `n_epochs`::Integer: the number of training epochs for optimization
- `initial_alpha`: the initial learning rate
- `gamma`: the repulsive strength of negative samples
- `neg_sample_rate`: the number of negative samples per positive sample
- `_a`: this controls the embedding. If the actual argument is `nothing`, this is determined automatically by `min_dist` and `spread`.
- `_b`: this controls the embedding. If the actual argument is `nothing`, this is determined automatically by `min_dist` and `spread`.

# Keyword Arguments
- `move_ref::Bool = false`: if true, also improve the embeddings in `ref_embedding`, else fix them and only improve embeddings in `query_embedding`.
"""
function optimize_embedding(graph,
                            query_embedding::AbstractVector{<:AbstractVector{T}},
                            ref_embedding::AbstractVector{<:AbstractVector{T}},
                            n_epochs::Integer,
                            initial_alpha,
                            min_dist,
                            spread,
                            gamma,
                            neg_sample_rate,
                            _a=nothing,
                            _b=nothing;
                            move_ref::Bool=false) where {T <: AbstractFloat}
    a, b = fit_ab(min_dist, spread, _a, _b)
    self_reference = query_embedding === ref_embedding

    alpha = initial_alpha
    for e in 1:n_epochs
@@ -77,42 +151,44 @@ function optimize_embedding(graph,
                j = rowvals(graph)[ind]
                p = nonzeros(graph)[ind]
                if rand() <= p
                    sdist = evaluate(SqEuclidean(), embedding[i], embedding[j])
                    sdist = evaluate(SqEuclidean(), query_embedding[i], ref_embedding[j])
                    if sdist > 0
                        delta = (-2 * a * b * sdist^(b-1))/(1 + a*sdist^b)
                    else
                        delta = 0
                    end
                    @simd for d in eachindex(embedding[i])
                        grad = clamp(delta * (embedding[i][d] - embedding[j][d]), -4, 4)
                        embedding[i][d] += alpha * grad
                        embedding[j][d] -= alpha * grad
                    @simd for d in eachindex(query_embedding[i])
                        grad = clamp(delta * (query_embedding[i][d] - ref_embedding[j][d]), -4, 4)
                        query_embedding[i][d] += alpha * grad
                        if move_ref
                            ref_embedding[j][d] -= alpha * grad
                        end
                    end

                    for _ in 1:neg_sample_rate
                        k = rand(1:size(graph, 2))
                        i != k || continue
                        sdist = evaluate(SqEuclidean(), embedding[i], embedding[k])
                        k = rand(eachindex(ref_embedding))
                        if i == k && self_reference
                            continue
                        end
                        sdist = evaluate(SqEuclidean(), query_embedding[i], ref_embedding[k])
                        if sdist > 0
                            delta = (2 * gamma * b) / ((1//1000 + sdist)*(1 + a*sdist^b))
                        else
                            delta = 0
                        end
                        @simd for d in eachindex(embedding[i])
                        @simd for d in eachindex(query_embedding[i])
                            if delta > 0
                                grad = clamp(delta * (embedding[i][d] - embedding[k][d]), -4, 4)
                                grad = clamp(delta * (query_embedding[i][d] - ref_embedding[k][d]), -4, 4)
                            else
                                grad = 4
                            end
                            embedding[i][d] += alpha * grad
                            query_embedding[i][d] += alpha * grad
                        end
                    end

                end
            end
        end
        alpha = initial_alpha*(1 - e//n_epochs)
    end

    return embedding
    return query_embedding
end
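Taken together, the two new methods support a transform-style workflow: embed new query points against an existing, fixed reference embedding. Below is a rough usage sketch, not taken from the PR and not a claim about how the package's public API wraps these internals; the toy `graph` and `ref_embedding` values are placeholders, the hyperparameter literals are arbitrary, and it assumes the methods from this diff (plus their dependencies such as `fit_ab` and Distances.jl) are in scope.

```julia
using SparseArrays

# Toy inputs for illustration only: a reference embedding of 4 points in 2 dimensions,
# and a sparse membership matrix relating 3 query points (columns) to the 4 references (rows).
ref_embedding = rand(Float64, 2, 4)
graph = sparse([1, 2, 3, 4, 2], [1, 1, 2, 3, 3], [0.9, 0.1, 1.0, 0.7, 0.3], 4, 3)

query_embedding = initialize_embedding(graph, ref_embedding)    # weighted-mean starting positions
ref_cols = collect(eachcol(ref_embedding))                      # views are fine, per the comment above

query_embedding = optimize_embedding(graph, query_embedding, ref_cols,
                                     300,   # n_epochs
                                     1.0,   # initial_alpha
                                     0.1,   # min_dist
                                     1.0,   # spread
                                     1.0,   # gamma
                                     5;     # neg_sample_rate
                                     move_ref=false)             # keep the reference embedding fixed
```

With `move_ref=false` the reference points never move, which is what makes this usable for transforming new data against an already-fitted model.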
Review comment: Should replace `length(query_inds)` with `size(graph, 2)`, since `query_inds` isn't an argument that's passed to this. Also note that the returned value of this function isn't a matrix, but a vector of vectors (intentionally).
Reply: ah, my bad, forgot to update this documentation