Semi-metric dissimilarities in gclust and differing branch lengths #81

Open
mike-kratz opened this issue Jun 7, 2023 · 6 comments

@mike-kratz

Hi there!

I work with ecological data, in particular microbial ecology, and we cannot use Euclidean distance for comparing community dissimilarities (whether using cluster analysis, PCoA, or NMDS), since Euclidean dissimilarities perform poorly when datasets have many zeroes, which is almost always the case with microbial sequencing data. We tend to use Bray-Curtis dissimilarity (also known as percentage similarity), which is semi-metric and does not obey the triangle inequality. Would genieclust not work for this type of dissimilarity matrix?

Also, when I used genieclust on my environmental data, which is fine to use Euclidean distances for since it does not have double-zeroes, the branch height was very different from the original Euclidean pairwise distances shown in the output matrix, i.e., it showed that groups had more Euclidean similarity than in the original input matrix, while hierarchical clustering with "average" linkage tended to show the original values more accurately. See below:

genieclust dendrogram
image

Standard hierarchical clustering with average linkage
image

Snapshot of the original Euclidean dissimilarity matrix (notice that most pairwise dissimilarities are greater than 1, but the Genie dendrogram shows most of the branch lengths at around 1)
image

Thank you for your help,

Mike

@gagolews
Owner

gagolews commented Jun 8, 2023

> I work with ecological data, in particular microbial ecology, and we cannot use Euclidean distance for comparing community dissimilarities (whether using cluster analysis, PCoA, or NMDS), since Euclidean dissimilarities perform poorly when datasets have many zeroes, which is almost always the case with microbial sequencing data. We tend to use Bray-Curtis dissimilarity (also known as percentage similarity), which is semi-metric and does not obey the triangle inequality. Would genieclust not work for this type of dissimilarity matrix?

Yes, any symmetric dissimilarity matrix will do the trick - triangle inequality is not necessary.

You can pass affinity="precomputed" and a distance matrix to Genie (Python) or an object of S3 class dist to gclust (R).
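
As a minimal sketch of the precomputed-matrix route, the snippet below builds a Bray-Curtis matrix with SciPy on a made-up three-sample abundance table (hypothetical data, not from this thread) and checks that the matrix is symmetric even though it violates the triangle inequality. The final Genie call is left as a comment; it assumes the `affinity="precomputed"` argument mentioned above, so check the argument name against your installed genieclust version.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical abundance table: 3 samples (rows) x 2 taxa (columns).
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Pairwise Bray-Curtis dissimilarities as a full square matrix.
D = squareform(pdist(X, metric="braycurtis"))

# The matrix is symmetric, which is the stated requirement.
symmetric = np.allclose(D, D.T)

# The triangle inequality fails here: d(0,1) > d(0,2) + d(2,1).
violates_triangle = D[0, 1] > D[0, 2] + D[2, 1]

print(D)
print("symmetric:", symmetric, "| violates triangle inequality:", violates_triangle)

# Assumed usage, not executed here (requires genieclust):
# import genieclust
# labels = genieclust.Genie(n_clusters=2, affinity="precomputed").fit_predict(D)
```

In R, the analogous step would be to hand a `dist` object to `gclust`, as noted above.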

> Also, when I used genieclust on my environmental data, which is fine to use Euclidean distances for since it does not have double-zeroes, the branch height was very different from the original Euclidean pairwise distances shown in the output matrix, i.e., it showed that groups had more Euclidean similarity than in the original input matrix, while hierarchical clustering with "average" linkage tended to show the original values more accurately. See below:

Genie doesn't merge clusters in increasing order of distance - sometimes it combines small clusters that are farther away from each other. In such a case, the raw dendrogram would be a mess, so I needed to adjust its heights heuristically. This is known as a lack of the ultrametricity property and is actually not specific to Genie; e.g., centroid-based linkage in R's built-in hclust also produces degenerate dendrograms.

The left part of the dendrogram still makes sense (the last, large groups - the coarsest level of granularity), and I would stick to that.
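
To illustrate the non-monotone merge heights described above without involving Genie itself, here is a small sketch using SciPy's centroid linkage on three made-up points (hypothetical data). The second merge happens at a lower height than the first, which is exactly the kind of inversion that makes a dendrogram hard to draw.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, is_monotonic

# Three hypothetical points forming a nearly equilateral triangle.
X = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, 0.9],
])

# Centroid linkage on raw observations (Euclidean distance).
Z = linkage(X, method="centroid")

# Column 2 of Z holds the merge heights, in merge order.
heights = Z[:, 2]
print("merge heights:", heights)

# The first two points merge at distance 1.0; their centroid (0.5, 0)
# is then closer to the third point than that, so the heights decrease.
print("monotonic:", is_monotonic(Z))
```

Any plotting routine has to cope with such inversions somehow, which is why the displayed branch heights need not match the original pairwise distances.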

@mike-kratz
Author

@gagolews Thank you for addressing those issues, that makes complete sense! I remember seeing that you did a comparison of standard validity measures in a paper, but I have not had a chance to read it. Are any of them worthwhile for comparing gclust vs hclust performance? Right now my interpretation is based only on my prior knowledge of the sites.

@gagolews
Owner

People tend to use the Silhouette, the Caliński-Harabasz, and the Dunn index (most often), but, from the perspective of the paper you mentioned [DOI:10.1016/j.ins.2021.10.004] [preprint] I wouldn't recommend relying on any of these measures. 😬 The clusterings they promote are not necessarily valid...

@mike-kratz
Author

@gagolews Would using something like the cophenetic correlation, plus a scatterplot to visualize the relationship, be a reasonable way to measure how well a chosen linkage/method represents the dissimilarity matrix? I read about that last week and it seems to be a reasonable method.

@gagolews
Owner

It's definitely worth a try!
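
The cophenetic correlation check discussed above can be sketched with SciPy as follows. This is only an illustration on randomly generated data (a hypothetical stand-in for an environmental table) with average linkage; it does not use genieclust, and since Genie's dendrogram heights are adjusted heuristically, the value for a Genie tree would reflect those adjusted heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Hypothetical data: 20 sites x 4 environmental variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

# Condensed vector of original pairwise Euclidean distances.
d = pdist(X, metric="euclidean")

# Hierarchical clustering with average linkage on those distances.
Z = linkage(d, method="average")

# c: cophenetic correlation coefficient;
# coph: cophenetic distances, in the same condensed order as d.
c, coph = cophenet(Z, d)
print("cophenetic correlation:", round(float(c), 3))

# The pairs (d[i], coph[i]) are what one would show in the scatterplot.
```

Plotting `d` against `coph` (for instance with matplotlib) gives the scatterplot mentioned in the question; points far from the diagonal are pairs whose dendrogram height misrepresents their original dissimilarity.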

@HtheChemist

@mike-kratz I may be late, but have you checked SIMPROF? I believe it could be used to drill down the dendrogram and check whether each cluster's structure is multivariate or not.


3 participants