doc-topic distr. #19

Open
mhbodell opened this issue Apr 20, 2021 · 5 comments

Comments

@mhbodell

The output saved by "save_doc_theta_estimate = true" has the wrong dimensions, and it also shows counts rather than proportions.

This is what the README.txt file says:

Save the a file with document topic theta estimates (will not include zeros)
Unlike Phi means which are sampled with thinning, theta means is just a simple
average of the topic counts in the last iteration divided by the number of
tokens in the document thus there is not theta_burnin or theta_thinning

save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
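
As a rough illustration of the README text above, the sketch below turns one document's topic counts into proportions by dividing by the number of tokens in the document. It is not the library's code: the class name `ThetaEstimateSketch` and the method `toProportions` are hypothetical, and it only shows what a correctly written row of the theta file would be expected to contain (one value per topic, summing to 1).

```java
import java.util.Arrays;

// Hypothetical helper, not part of PartiallyCollapsedLDA.
final class ThetaEstimateSketch {

    /** Converts one document's topic counts into proportions that sum to 1. */
    static double[] toProportions(int[] topicCounts) {
        int tokens = Arrays.stream(topicCounts).sum();
        double[] theta = new double[topicCounts.length];
        if (tokens == 0) {
            return theta; // empty document: leave every proportion at 0
        }
        for (int k = 0; k < topicCounts.length; k++) {
            theta[k] = (double) topicCounts[k] / tokens;
        }
        return theta;
    }

    public static void main(String[] args) {
        // A toy document with 4 tokens spread over 3 topics.
        int[] counts = {3, 0, 1};
        System.out.println(Arrays.toString(toProportions(counts))); // [0.75, 0.0, 0.25]
    }
}
```

With 200 topics, each row would then hold 200 values, which is the dimension the question below expects.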

I have a model with 200 topics, but the doc_theta_means file has 400 columns and one row per document. Why is the number of columns double the number of topics in the model?

Config file:

configs = Spalias
no_runs = 1

[Spalias]
title = PCPLDA
description = 200 topics with alpha 0.2 and extended priorlist
dataset = data/fb_politics_news.txt
scheme = spalias_priors
seed = 1904
topics = 200
alpha = 0.2
beta = 0.01
iterations = 1500
rare_threshold = 0
batches = 4
topic_batches = 4
topic_interval = 500
start_diagnostic = 200
debug = 0
#log_type_topic_density = true
log_document_density = true
log_phi_density = true
phi_mean_filename = phi-mean.csv
phi_mean_burnin = 20
phi_mean_thin = 5
stoplist = nsc-test/PartiallyCollapsedLDA-8.4.0/stoplist-empty.txt
save_vocabulary = true
vocabulary_filename = lda_vocab.txt
topic_prior_filename = wfw/bash/priors/k200_v7.txt
keep_connecting_punctuation = true
log_topic_indicators = true
save_sampler = false
save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
save_phi_mean = true

I am attaching a screenshot of part of the output so you can see what it looks like.

[Image: Screen Shot 2021-04-20 at 10 06 37]

@rebeckahw

The problem seems to stem from WriteASCIIDoubleMatrix. Decimal numbers are written with commas both as decimal separators and column separators. This adds an extra column for each printed value and every other column gets the value 0.
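
The effect described above can be reproduced without the library. The sketch below uses two hard-coded example lines (not real output from the sampler) and counts how many fields a comma-splitting reader sees; the class name `CommaCollisionSketch` and the method `fieldCount` are hypothetical.

```java
// Hypothetical demonstration, not part of PartiallyCollapsedLDA.
final class CommaCollisionSketch {

    /** Counts the fields a reader sees when it splits a line on commas. */
    static int fieldCount(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        // Two proportions written with '.' as the decimal separator.
        String pointDecimals = "0.2500,0.7500";
        // The same two proportions written with ',' as the decimal separator.
        String commaDecimals = "0,2500,0,7500";

        System.out.println(fieldCount(pointDecimals)); // 2
        System.out.println(fieldCount(commaDecimals)); // 4

        // Every other field is the integer part, which is 0 for proportions below 1.
        System.out.println(String.join(" | ", commaDecimals.split(",", -1))); // 0 | 2500 | 0 | 7500
    }
}
```

One possible reading, not confirmed in the thread, is that the remaining fields (the fractional digits, such as 2500) are what looked like counts in the original report.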

@lejon
Owner

lejon commented Oct 16, 2022

Yes, I noticed this bug too and have a fix for parts of the problem in 9.2.0, but I will have to double-check whether that fix also solves this...

@lejon
Owner

lejon commented Oct 16, 2022

9.2.0 should solve this problem

@rebeckahw

The test for WriteASCIIDoubleMatrix now passes, but the problem unfortunately remains for me.
It might be caused by the method formatDouble in LDAUtils.java:

		String formatString = "%." + noDigits + "f";
		return String.format(formatString, d);

since String.format() depends on the default locale (which for me is SE)
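
A minimal sketch of how the locale dependence could be removed, assuming the goal is to always write '.' as the decimal separator: `String.format` has an overload that takes an explicit `Locale`. The class `LocaleFormatSketch` and both method names are hypothetical and are not the actual signatures in LDAUtils.java; the Swedish locale is passed explicitly only to reproduce the symptom on any machine.

```java
import java.util.Locale;

// Hypothetical sketch, not the actual LDAUtils implementation.
final class LocaleFormatSketch {

    /** Formats with a fixed locale, so the decimal separator is always '.'. */
    static String formatDoubleFixed(double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(Locale.ROOT, formatString, d);
    }

    /** Formats with a caller-supplied locale, to show how the output changes. */
    static String formatDoubleIn(Locale locale, double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(locale, formatString, d);
    }

    public static void main(String[] args) {
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println(formatDoubleFixed(0.25, 4));       // 0.2500
        System.out.println(formatDoubleIn(swedish, 0.25, 4)); // 0,2500
    }
}
```

Whether a fixed locale is the right choice depends on whether users are meant to be able to pick the decimal separator, which the thread leaves open.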

@lejon
Owner

lejon commented Oct 19, 2022

Yes, it is due to the locale, and it is unfortunately a bit of a mess right now: the combination of Locale and the option to select the separator makes it complicated. I'll have a look and see if I can redesign this into a better solution.
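
One way to picture the interaction between locale and separator is a check that detects when the two collide before anything is written. This is only a sketch under that assumption, not the redesign mentioned above; the class `SeparatorCheckSketch` and the method `collides` are hypothetical names.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical sketch, not part of PartiallyCollapsedLDA.
final class SeparatorCheckSketch {

    /** Returns true if the locale's decimal separator occurs in the column separator. */
    static boolean collides(Locale locale, String columnSeparator) {
        char decimal = DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
        return columnSeparator.indexOf(decimal) >= 0;
    }

    public static void main(String[] args) {
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println(collides(Locale.ROOT, ",")); // false
        System.out.println(collides(swedish, ","));     // true
        System.out.println(collides(swedish, ";"));     // false
    }
}
```

A writer could use such a check to reject or warn about an ambiguous combination instead of silently producing doubled columns.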


3 participants