Semi-metric dissimilarities in gclust and differing branch lengths #81
Yes, any symmetric dissimilarity matrix will do the trick - the triangle inequality is not necessary. You can pass such a matrix directly.
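A minimal sketch in R of what this could look like (assuming a hypothetical site-by-species abundance table `comm`, and that `genieclust::gclust()` accepts a precomputed `dist` object; check the package documentation for the exact interface):

```r
# Sketch: Genie on a precomputed Bray-Curtis dissimilarity matrix.
# 'comm' is a hypothetical site-by-species abundance table.
library(vegan)        # vegdist() for Bray-Curtis
library(genieclust)   # gclust()

d_bray <- vegdist(comm, method = "bray")  # semi-metric dissimilarities
h <- gclust(d_bray)                       # returns an 'hclust'-like object
plot(h)                                   # inspect the dendrogram
cutree(h, k = 4)                          # e.g., extract 4 clusters
```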
Genie doesn't merge clusters in increasing order (with respect to the distances) - sometimes it combines small clusters that are farther away from each other. In such a case the dendrogram would be a mess, so I needed to adjust the merge heights heuristically. This is the so-called lack of ultrametricity, and it is actually not specific to Genie; e.g., centroid linkage in R's built-in hclust() exhibits it as well. The left part of the dendrogram still makes sense (the last, large groups - the coarsest level of granularity), and I would stick to that.
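To illustrate the same phenomenon outside Genie (a sketch on arbitrary random data), centroid linkage in base R's hclust() can produce inversions, i.e. merge heights that are not monotonically increasing:

```r
# Sketch: inversions (non-monotone merge heights) with centroid linkage
# in base R's hclust(); the data here are arbitrary random points.
set.seed(123)
x <- matrix(rnorm(60), ncol = 2)
hc <- hclust(dist(x)^2, method = "centroid")  # centroid linkage expects squared distances
any(diff(hc$height) < 0)                      # TRUE indicates an inversion occurred for this sample
```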
@gagolews Thank you for addressing those issues, that makes complete sense! I remember seeing that you did a comparison of standard validity measures in a paper, but I have not had a chance to read it yet. Are any of them worthwhile for comparing Genie vs. hclust performance? Because right now my interpretation is based only on my prior knowledge of the sites.
People tend to use the Silhouette, the Caliński-Harabasz, and the Dunn index (most often), but, from the perspective of the paper you mentioned [DOI:10.1016/j.ins.2021.10.004] [preprint], I wouldn't recommend relying on any of these measures. 😬 The clusterings they promote are not necessarily valid...
@gagolews Would using something like the cophenetic correlation, plus a scatterplot to visualize the relationship, be a reasonable way to measure how well a chosen linkage method represents the dissimilarity matrix? I read about it last week and it seems like a sensible approach.
It's definitely worth a try!
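A sketch of what that could look like in R (reusing the hypothetical `d_bray` and `h` objects from the earlier sketch):

```r
# Sketch: cophenetic correlation between the input dissimilarities and the
# distances implied by the dendrogram.
coph <- cophenetic(h)                      # cophenetic distances from the tree
cor(as.vector(d_bray), as.vector(coph))    # closer to 1 = better preservation of the input
plot(as.vector(d_bray), as.vector(coph),
     xlab = "Original dissimilarity", ylab = "Cophenetic distance")
abline(0, 1, lty = 2)                      # reference line for visual comparison
```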
@mike-kratz I may be late, but have you checked SIMPROF? I believe it can be used to drill down the dendrogram and test whether each cluster has significant internal multivariate structure.
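A sketch of how that is often done in R, assuming the clustsig package's simprof() interface (with the same hypothetical `comm` abundance table as above):

```r
# Sketch: SIMPROF via the 'clustsig' package (interface assumed; see its docs).
library(clustsig)
sp <- simprof(comm, method.cluster = "average",
              method.distance = "braycurtis")  # similarity-profile permutation test
simprof.plot(sp)  # dendrogram with significant vs. non-significant structure marked
```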
Hi there!
I work with ecological data, in particular microbial ecology, and we cannot use Euclidean distances for comparing community dissimilarities (whether via cluster analysis, PCoA, or NMDS), because Euclidean distances perform poorly when datasets contain many zeros, which is almost always the case with microbial sequencing data. We tend to use the Bray-Curtis dissimilarity (also known as percentage similarity), which is semi-metric and does not obey the triangle inequality. Would genieclust work with this type of dissimilarity matrix?
Also, when I used genieclust on my environmental data (for which Euclidean distances are fine, since it has no double zeros), the branch heights were very different from the original Euclidean pairwise distances in the input matrix; i.e., the dendrogram suggested the groups were more similar (in Euclidean terms) than the original matrix indicates, whereas hierarchical clustering with "average" linkage reproduced the original values more accurately. See below:
Genie clust dendrogram
Standard hierarchical clustering with average linkage
Snapshot of the original Euclidean dissimilarity matrix (notice that most pairwise dissimilarities are greater than 1, whereas the Genie dendrogram shows most of the branch lengths around 1)
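For reference, a sketch of the kind of comparison described above (`env` stands in for the environmental table, which is hypothetical here):

```r
# Sketch: compare merge heights from Genie and average linkage against the
# original Euclidean distances ('env' is a hypothetical environmental table).
d_euc <- dist(env)                           # original pairwise Euclidean distances
h_gen <- genieclust::gclust(d_euc)           # Genie (heights adjusted heuristically)
h_avg <- hclust(d_euc, method = "average")   # UPGMA for comparison
range(d_euc)                                 # scale of the raw dissimilarities
range(h_gen$height)                          # vs. Genie merge heights
range(h_avg$height)                          # vs. average-linkage merge heights
```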
Thank you for your help,
Mike