doc-topic distr. #19

mhbodell · 2021-04-20T08:08:55Z

The output saved by "save_doc_theta_estimate = true" has the wrong dimensions, and the output also shows counts rather than proportions.

This is what the README.txt file says:

Save the a file with document topic theta estimates (will not include zeros)
Unlike Phi means which are sampled with thinning, theta means is just a simple
average of the topic counts in the last iteration divided by the number of
tokens in the document thus there is not theta_burnin or theta_thinning

save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv

I have a model with 200 topics, but the doc_theta_means file has 400 columns, with one row per document. Why is the number of columns double the number of topics in the model?
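For reference, the normalization that the README describes (topic counts in the last iteration divided by the number of tokens in the document) can be sketched as below. This is only an illustration of the arithmetic, not the library's code: the class `ThetaEstimate` and the method `thetaFromCounts` are hypothetical names, and the input is assumed to be one document's per-topic token counts.

```java
class ThetaEstimate {
    // Hypothetical helper: counts[k] is the number of tokens in one
    // document currently assigned to topic k.
    static double[] thetaFromCounts(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] theta = new double[counts.length];
        if (total == 0) {
            return theta; // empty document: leave all proportions at 0
        }
        for (int k = 0; k < counts.length; k++) {
            theta[k] = (double) counts[k] / total;
        }
        return theta;
    }

    public static void main(String[] args) {
        // A document with 4 tokens spread over 3 topics.
        double[] theta = thetaFromCounts(new int[] {3, 0, 1});
        for (double t : theta) {
            System.out.println(t);
        }
    }
}
```

Under this definition each row has one value per topic and sums to 1 for a non-empty document, so a 200-topic model would be expected to give 200 columns.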

Config file:

configs = Spalias
no_runs = 1

[Spalias]
title = PCPLDA
description = 200 topics with alpha 0.2 and extended priorlist
dataset = data/fb_politics_news.txt
scheme = spalias_priors
seed = 1904
topics = 200
alpha = 0.2
beta = 0.01
iterations = 1500
rare_threshold = 0
batches = 4
topic_batches = 4
topic_interval = 500
start_diagnostic = 200
debug = 0
#log_type_topic_density = true
log_document_density = true
log_phi_density = true
phi_mean_filename = phi-mean.csv
phi_mean_burnin = 20
phi_mean_thin = 5
stoplist = nsc-test/PartiallyCollapsedLDA-8.4.0/stoplist-empty.txt
save_vocabulary = true
vocabulary_filename = lda_vocab.txt
topic_prior_filename = wfw/bash/priors/k200_v7.txt
keep_connecting_punctuation = true
log_topic_indicators = true
save_sampler = false
save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
save_phi_mean = true

I am attaching an image of part of the output so you can see what it looks like.

rebeckahw · 2022-10-07T13:26:55Z

The problem seems to stem from WriteASCIIDoubleMatrix. Decimal numbers are written with commas both as decimal separators and column separators. This adds an extra column for each printed value and every other column gets the value 0.
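The collision described here can be reproduced in isolation with a small sketch. It does not use the library's WriteASCIIDoubleMatrix; `CommaCollisionDemo`, `writeRow`, and `countColumns` are hypothetical names, and the Swedish locale is an assumption chosen because it uses a comma as its decimal separator.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class CommaCollisionDemo {
    // Format each value with the given locale and join with the separator.
    static String writeRow(double[] row, Locale locale, String sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(sep);
            }
            sb.append(String.format(locale, "%.4f", row[i]));
        }
        return sb.toString();
    }

    // Count the columns a reader would see when splitting on the separator.
    static int countColumns(String line, String sep) {
        return line.split(Pattern.quote(sep), -1).length;
    }

    public static void main(String[] args) {
        double[] theta = {0.25, 0.75};
        Locale swedish = Locale.forLanguageTag("sv-SE"); // assumed locale
        String collided = writeRow(theta, swedish, ",");
        String clean = writeRow(theta, Locale.ROOT, ",");
        System.out.println(collided + " has " + countColumns(collided, ",") + " columns");
        System.out.println(clean + " has " + countColumns(clean, ",") + " columns");
    }
}
```

With the Swedish locale the two-value row is written as `0,2500,0,7500`, which a comma-splitting reader sees as four columns. Because proportions are below 1, the integer part in every other column is 0, which matches the pattern reported above.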

lejon · 2022-10-16T14:21:48Z

Yes, I noticed this bug too and have a fix in 9.2.0 for parts of the problem, but I will have to double-check whether this is also solved by that fix...

lejon · 2022-10-16T14:28:03Z

9.2.0 should solve this problem

rebeckahw · 2022-10-19T09:44:49Z

The test for WriteASCIIDoubleMatrix now passes, but the problem unfortunately remains for me.
It may be caused by the method formatDouble in LDAUtils.java:

		String formatString = "%." + noDigits + "f";
		return String.format(formatString, d);

since String.format() without an explicit Locale argument uses the default locale (which for me is SE), and that locale writes the decimal separator as a comma.
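One possible way around this, sketched under the assumption that the output should always use a dot as decimal separator, is to pass an explicit locale to String.format(). The class name `LocaleSafeFormat` is hypothetical, and this is not necessarily how the library does or will solve it.

```java
import java.util.Locale;

class LocaleSafeFormat {
    // Hypothetical variant of formatDouble that pins Locale.ROOT, so the
    // result does not depend on the JVM's default locale.
    static String formatDouble(double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(Locale.ROOT, formatString, d);
    }

    public static void main(String[] args) {
        Locale previous = Locale.getDefault();
        try {
            // Simulate a machine whose default locale is Swedish.
            Locale.setDefault(Locale.forLanguageTag("sv-SE"));
            System.out.println(String.format("%.4f", 0.25)); // default locale
            System.out.println(formatDouble(0.25, 4));       // pinned locale
        } finally {
            Locale.setDefault(previous); // restore the original default
        }
    }
}
```

Pinning the locale keeps the decimal separator fixed but does not by itself address a user-selected column separator, so the two settings would still need to be checked against each other.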

lejon · 2022-10-19T14:24:50Z

Yes, it is due to the locale, and it is a bit of a mess now, unfortunately. The combination of Locale and the possibility of selecting a separator makes it complicated. I'll have a look and see if I can redesign it into a better solution.
