Hi, thanks for this awesome package.
Is there a way to provide initial count values before starting to fit?
I know that in agglomerative clustering each observation normally starts as its own cluster, so each count value starts at 1.
However, I would like to experiment with different initial distributions.
Is this possible with your current implementation?
Yeah-nah, I have kind of done it in my recent paper Clustering with minimum spanning trees: How good can it be? (https://arxiv.org/abs/2303.05679), using some code from genieclust, but with custom modifications.
There is also now the GIc algorithm implementation available (https://genieclust.gagolewski.com/genieclust_gic.html) – it starts from the intersection of the partitions returned by Genie with different thresholds. Similar idea, but it still does not allow for an arbitrary starting partition.
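For reference, calling GIc looks roughly like this (a minimal sketch based on the documented genieclust.GIc class; the data and the threshold values below are purely illustrative, so check the documentation linked above for the exact parameters):

import numpy as np
import genieclust

X = np.random.randn(100, 2)  # toy data, just for illustration

# GIc intersects the partitions produced by Genie at several Gini thresholds
# and then continues merging from that intersection.
gic = genieclust.GIc(n_clusters=3, gini_thresholds=[0.1, 0.3, 0.5, 0.7])
labels = gic.fit_predict(X)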
Anyway, there is no public API in the package that would allow that; you would have to modify src/c_genie.h, but good luck with that.
For your own experiments, I think it would be easier to implement Genie from scratch in Python (the algorithm is quite simple if you do not care about its time complexity) and then modify it accordingly. All the necessary bits are available: genieclust.inequity.gini_index, genieclust.internal.DisjointSets or genieclust.internal.GiniDisjointSets, and genieclust.internal.mst_from_distance.
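To make that concrete, here is a rough (and deliberately slow) sketch of what such a from-scratch Genie with a custom initial count vector could look like. It uses a plain Python union-find instead of genieclust.internal.DisjointSets, and it assumes mst_from_distance returns the edge weights plus an (n-1, 2) array of endpoint indices; double-check those signatures against your genieclust version.

import numpy as np
import genieclust

def genie_with_initial_counts(X, n_clusters, gini_threshold=0.3, initial_counts=None):
    # Initial cluster sizes: all ones by default, or a user-supplied vector.
    n = X.shape[0]
    counts = (np.ones(n, dtype=int) if initial_counts is None
              else np.asarray(initial_counts, dtype=int).copy())

    # Minimum spanning tree of the pairwise distances:
    # mst_d holds the n-1 edge weights, mst_i the (n-1, 2) endpoint indices.
    mst_d, mst_i = genieclust.internal.mst_from_distance(X, metric="euclidean")
    edges = list(np.argsort(mst_d))      # remaining MST edges, cheapest first

    parent = np.arange(n)                # plain union-find instead of DisjointSets
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    k = n
    while k > n_clusters:
        roots = {find(i) for i in range(n)}
        sizes = np.array([counts[r] for r in roots], dtype=float)
        if genieclust.inequity.gini_index(sizes) > gini_threshold:
            # Economic step: cheapest edge touching a cluster of minimal size.
            smallest = sizes.min()
            pick = next(e for e in edges
                        if counts[find(mst_i[e, 0])] == smallest
                        or counts[find(mst_i[e, 1])] == smallest)
        else:
            pick = edges[0]              # ordinary single-linkage step
        edges.remove(pick)
        a, b = find(mst_i[pick, 0]), find(mst_i[pick, 1])
        counts[a] += counts[b]           # custom counts drive the Gini check
        parent[b] = a
        k -= 1

    return np.array([find(i) for i in range(n)])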
Let me know how motivated/confident/adventurous you feel about the above; maybe I can help with something.
I think I have what I need now by using this (dumb) workaround:
I simply repeat the rows of my feature matrix according to the counts in a distribution vector, then run Genie on this expanded matrix instead and ignore the first merging level in the linkage tree.
Here is my code using PyTorch:
import torch
import genieclust

embeddings = ...    # feature matrix, shape (n, d)
distribution = ...  # vector of positive ints, shape (embeddings.shape[0],)
# repeat each row according to its count, then cluster the expanded matrix
emb_repeated = torch.repeat_interleave(embeddings, distribution, dim=0)
g = genieclust.Genie(compute_full_tree=True).fit(emb_repeated)
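If you need one label per original observation rather than per repeated row, you can take the label of the first repeat of each row (assuming the fitted Genie object exposes the sklearn-style labels_ attribute); this mapping is an addition to the snippet above, not part of the original comment:

# index of each row's first repeat in the expanded matrix
first_idx = torch.cumsum(distribution, dim=0) - distribution
labels_original = g.labels_[first_idx.numpy()]  # one label per original observation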