This is the repository for the study "A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations".
It contains all execution and analysis modules.
BBC news
Topic models and embeddings:
- VSM (gensim)
- VSM (gensim + tf-idf-Weighting)
- LSI (gensim)
- LSI (gensim + Linear Combined)
- LSI (gensim + tf-idf-Weighting)
- LSI (gensim + tf-idf-Weighting + Linear Combined)
- NMF (gensim)
- NMF (gensim + Linear Combined)
- NMF (gensim + tf-idf-Weighting)
- NMF (gensim + tf-idf-Weighting + Linear Combined)
- LDA (gensim)
- LDA (gensim + Linear Combined)
- Doc2Vec (gensim)
- BERT_all_mpnet_base_v2, the best-ranked sentence embedding model
- the second-best-ranked sentence embedding model
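Dimensionality reduction techniques:
- one with a single investigated hyperparameter: n_iter
- one with two investigated hyperparameters, including n
- one with three investigated hyperparameters, including learning_rate and perplexity
- one with two investigated hyperparameters, including min_dist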
Local Stability Metrics
- Trustworthiness
- Continuity
- Mean Relative Rank Errors
- Local Continuity Meta-Criterion
- Label Preservation
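As an illustration of the first metric in this list, the following pure-Python sketch computes trustworthiness between a reference point set and a layout, using the standard rank-based definition. It is not taken from the repository: the function names are hypothetical, Euclidean distances are an assumption, and which two point sets the study compares is not specified here.

```python
import math


def _ranked_neighbors(points, i):
    """Return the indices of all other points, nearest to point i first."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness(reference, layout, k):
    """Rank-based trustworthiness of `layout` with respect to `reference`.

    Penalizes points that are among the k nearest neighbors in `layout`
    but not in `reference`, weighted by how far away they are in `reference`.
    """
    n = len(reference)
    if n != len(layout):
        raise ValueError("both point sets must contain the same number of points")
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2 for the normalization to hold")

    penalty = 0
    for i in range(n):
        ref_order = _ranked_neighbors(reference, i)
        ref_rank = {j: rank for rank, j in enumerate(ref_order, start=1)}
        ref_knn = set(ref_order[:k])
        for j in _ranked_neighbors(layout, i)[:k]:
            if j not in ref_knn:
                penalty += ref_rank[j] - k

    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Toy data (assumption): six points on a line compared with themselves.
line = [(float(x), 0.0) for x in range(6)]
print(trustworthiness(line, line, k=2))  # 1.0, all neighborhoods are preserved
```

A value of 1.0 means that no false neighbors were introduced; lower values indicate that the layout places points close together that are far apart in the reference.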
Global Stability Metrics
- Pearson's Correlation
- Spearman's Rank Correlation
- Cluster Ordering
Class Separation Metric
- Absolute Difference Distance Consistency
Further scores:
- Silhouette Coefficient
- Rotation from Procrustes Analysis
- Heatmaps
- Binary Tests
- Correlation Tests
We have written our code for an Ubuntu 22.04 system. It requires the following packages:
- openjdk-19-jdk
- ant
- python3-minimal
- python3.10-full
- python3-pip
- git
- Rscript from r-base-core
Please install these via
> sudo apt install openjdk-19-jdk ant python3-minimal python3.10-full python3-pip git r-base-core
> pip3 install numpy==1.23.5
> pip3 install -r requirements.txt
> python3 -m spacy download en_core_web_sm
> python3 -m spacy download en_core_web_lg
For postprocessing, we also need ggplot2. Please install it by executing:
> R
> > install.packages("ggplot2")
and answering yes at every prompt.
## Run
### Parameter Generator
> python3 > parameters.csv
The benchmark consists of repeated calls to the main script over a wide range of parameters (see parameter generator), such as this call:
> python3 --perplexity_tsne 30 --n_iter_tsne 1000 --learning_rate auto --n_neighbors_umap 15 --min_dist_umap 0.1 --max_iter_mds 300 --dataset_name 20_newsgroups --topic_model lsi_tfidf --res_file_name ./results/20_newsgroups/results_perplexity_tsne_30_n_iter_tsne_1000_learning_rate_auto_n_neighbors_umap_15_min_dist_umap_0.1_max_iter_mds_300_dataset_name_20_newsgroups_topic_model_lsi_tfidf.csv
For replication, we recommend first testing a single command like the one above. Running the full benchmark will most likely require a compute cluster and about two weeks. Further calls can be produced with the parameter generator (see above).
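The result file name in the example call appears to concatenate every flag name and value after a `results_` prefix, inside a directory named after the dataset. The following Python sketch reproduces that pattern for a single parameter set. The helpers `build_res_file_name` and `build_cli_args` are hypothetical and not part of the repository, and the script name is deliberately left out.

```python
def build_res_file_name(params, results_root="./results"):
    """Build a result path that follows the naming pattern of the example call.

    Assumption: the file name is "results_" followed by all flag names and
    values joined with underscores, in the order the flags are passed.
    """
    stem = "_".join(f"{key}_{value}" for key, value in params.items())
    return f"{results_root}/{params['dataset_name']}/results_{stem}.csv"


def build_cli_args(params):
    """Turn a parameter dict into a flat list of command-line arguments."""
    args = []
    for key, value in params.items():
        args += [f"--{key}", str(value)]
    args += ["--res_file_name", build_res_file_name(params)]
    return args


# Parameter values copied from the example call above.
example_params = {
    "perplexity_tsne": 30,
    "n_iter_tsne": 1000,
    "learning_rate": "auto",
    "n_neighbors_umap": 15,
    "min_dist_umap": 0.1,
    "max_iter_mds": 300,
    "dataset_name": "20_newsgroups",
    "topic_model": "lsi_tfidf",
}

print(build_res_file_name(example_params))
```

Keeping the full parameter set in the file name makes each result file self-describing, which helps when collecting results from many cluster jobs.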
After your runs have finished, we recommend checking which jobs finished and which did not. The result files are then copied to a directory called res_files_only, where the results can be collected. Thereafter, the results can be analyzed with one of the four analysis scripts, among them experiment_2_stability_hyperparameters and experiment_3_stability_randomness. The script experiment_5_high_dimensional_similarity calculates the stability metrics for the standard corpus compared to a jittered corpus.
So the standard workflow for [path_to_results_of_dataset] with, e.g., 10 parallel jobs, is:
> python3 [path_to_results_of_dataset]/random_seed_0/jitter_amount_0.0 --n_jobs 10
> python3 [path_to_results_of_dataset] --n_jobs 10
> python3 [path_to_results_of_dataset] --n_jobs 10
> python3 [path_to_results_of_dataset] --n_jobs 10 --random_seed 42
> python3
This will create result CSV files in your current working directory. Please note that [path_to_results_of_dataset] should not end with a path separator, so that os.path.basename works properly. In addition, these analyses may take a while (depending on the dataset and your machine, from a couple of seconds up to 5 minutes per comparison).
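The reason for avoiding a trailing separator is that `os.path.basename` returns an empty string for such paths. The short sketch below shows the behavior on a POSIX system and one way to normalize a path beforehand; the helper name `dataset_dir_name` is hypothetical, and the assumption is that the scripts use the last path component as the dataset directory name.

```python
import os


def dataset_dir_name(results_path):
    """Return the last path component, even if the path ends with a separator."""
    # normpath strips a trailing separator, so basename sees the directory name.
    normalized = os.path.normpath(results_path)
    return os.path.basename(normalized)


# With a trailing separator, basename yields an empty string.
print(repr(os.path.basename("results/20_newsgroups/")))  # ''

# Without it, or after normalization, the directory name is returned.
print(os.path.basename("results/20_newsgroups"))  # 20_newsgroups
print(dataset_dir_name("results/20_newsgroups/"))  # 20_newsgroups
```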
In addition, to get a short report of the covered analysis results, perform some sanity checks, and obtain one large file with all results per experiment, you may want to run:
> python3 perform_sanity_checks [path_to_experiment_results_from_calls_mentioned_above]
To execute this step you need ggplot2. Please refer to "Setup" for instructions on how to install this package. For aesthetic reasons, we used ggplot2 to postprocess the scatter plots in the paper. This is done via a dedicated script. If you executed the call under "ML Processing", the corresponding postprocessing call would be:
> python3 --base_path results/20_newsgroups --dataset_name 20_newsgroups
Afterwards, you can find the new scatter plots in the "Analysis_Visualization" directory.
> docker build . -t python-ml_batch:latest projections_benchmark --build-arg PLATFORM=amd64
> docker run python-ml_batch python3 --perplexity_tsne 40 --n_iter_tsne 6000 --dataset_name reuters --res_file_name ./results/reuters/results_perplexity_tsne_40_n_iter_tsne_6000_dataset_name_reuters.csv
Additionally, mounts and workdir need to be set accordingly.
> ./
All of our results obtained by the method described above can be found in result_files/results_[experiment_name].zip. After unzipping, the results for each dataset can be found in results_experiment_[experiment_number]_[dataset_name].csv. In addition, a summary report is placed in the respective experiment directory.
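To get a quick overview of such an archive without unzipping it manually, a sketch like the following can list the contained CSV files together with their header rows. It uses only the Python standard library; the function name `read_result_headers` is hypothetical, UTF-8 encoding is an assumption, and the actual column names depend on the experiment and are not assumed here.

```python
import csv
import io
import zipfile


def read_result_headers(zip_path):
    """Map each CSV file inside the archive to its header row."""
    headers = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue  # skip reports and other non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                headers[name] = next(csv.reader(text), [])
    return headers


# Usage (the archive name follows the pattern described above):
# for name, columns in read_result_headers("result_files/results_[experiment_name].zip").items():
#     print(name, columns)
```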
You may find all layout .npy files (NumPy files) under: