
Line count of text file differs from prediction result count #344

Closed
chuangys opened this issue Oct 24, 2017 · 12 comments

@chuangys

My text file contains 1537584 lines, but after running prediction there are only 1537490 results. I've already checked that the file doesn't contain any empty lines. Can someone explain why?

Besides, is it possible to add a document number to the test file? That way, even if the counts don't match, I can still map the results back to the raw test file.
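
One way to sanity-check the input before prediction is a short script along these lines. It is only a diagnostic sketch, not something built into fastText, and test.txt is a placeholder for the actual test file:

    # Diagnostic sketch: count the lines in the test file and flag anything
    # fastText might treat differently, such as empty lines or stray '\r'
    # characters. "test.txt" is a placeholder name.
    with open("test.txt", "rb") as f:
        data = f.read()

    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # a trailing newline produces one empty trailing element

    print("total lines:", len(lines))
    print("empty lines:", sum(1 for line in lines if line.strip() == b""))
    print("lines with carriage returns:", sum(1 for line in lines if b"\r" in line))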

@cpuhrsch
Contributor

Hello @chuangys,

Thank you for your post. This used to be an error in the past; have you checked out the latest version of fastText?

Thank you,
Christian

@chuangys
Author

@cpuhrsch In fact, I'm running in a Windows environment and ran into many issues at first. Eventually I found a fastText executable built by "xiami" on GitHub, and it works fine. He built it in May 2017, so I'm not sure whether it reflects the latest code, but it's the only way I can run fastText right now.
Besides, do you know of any wrapper built for the Windows environment?

@chuangys
Author

chuangys commented Oct 26, 2017

@cpuhrsch After downloading the latest source code and building the exe with MinGW, I can run it now, but it still shows the same problem: I input 1537584 lines for prediction and get back only 1537490 result lines.

@jazoom

jazoom commented Dec 8, 2017

Recently I started seeing a similar issue. I'm seeing 667 results when only 600 lines are fed in. The results have a few portions that look like this:

__label__home_&_garden 0.994563
__label__books 0.999752
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__video 0.984672
__label__electronics 0.852069

It doesn't seem right that there should be a number greater than 1.

This has been happening for a week or so. Source is built fresh on every push via CI.

There are 57 of these seemingly duplicated, erroneous "books" labels, which doesn't quite account for the extra lines, but it's close enough and silly enough to make me think it's part of the problem.
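
A short script can at least locate where those runs sit in the prediction output; a rough sketch, assuming the predictions were saved one "__label__... score" line per input line to a file called predictions.txt (a placeholder name):

    # Rough sketch: find runs of identical consecutive labels in the saved
    # prediction output and report where each long run starts, to narrow
    # down which part of the input produced it. "predictions.txt" is a
    # placeholder name.
    from itertools import groupby

    with open("predictions.txt") as f:
        labels = [line.split()[0] for line in f if line.strip()]

    index = 0
    for label, group in groupby(labels):
        run = len(list(group))
        if run >= 5:  # arbitrary threshold for a "suspicious" repetition
            print(f"line {index + 1}: {label} repeated {run} times")
        index += run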

@cpuhrsch
Contributor

Hello @chuangys and @jazoom,

Is it feasible to share a portion, or ideally the entirety, of these datasets so that I can reproduce the issue on my end? For our test data (see tests/fetch_test_data.sh) these issues have not occurred and all our integration tests pass cleanly.

Thanks,
Christian

@jazoom

jazoom commented Dec 19, 2017

@cpuhrsch Sure. I can probably get you the trained model too if you can suggest a good way to privately send you a 1GB file. Maybe I can email you a link to a file on Dropbox or something?

@cpuhrsch
Contributor

Hello @jazoom,

If you're comfortable with this, you could post the link here. I'll then let you know once I've got the data, and you can invalidate the link to prevent additional traffic.

Thanks,
Christian

@jazoom

jazoom commented Dec 23, 2017

@cpuhrsch I was able to track down the problem to some sneaky carriage returns. Sorry for the confusion.
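
For reference, stripping the carriage returns before prediction only takes a couple of lines; a minimal sketch, with test.txt and test_clean.txt as placeholder file names:

    # Minimal sketch: drop every carriage return so each document is exactly
    # one '\n'-terminated line. File names are placeholders.
    with open("test.txt", "rb") as src, open("test_clean.txt", "wb") as dst:
        dst.write(src.read().replace(b"\r", b""))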

@cpuhrsch
Contributor

Hello @jazoom,

Thank you for resolving this! I'm happy to hear that this wasn't on our end. I'm going to close this issue now as it appears to be resolved, but please feel encouraged to reopen it if this isn't the case.

Thanks,
Christian

@jazoom

jazoom commented Dec 23, 2017

No problem @cpuhrsch. I should mention that I still see "1.00001" as a confidence value for some results. It doesn't make sense to me, but perhaps it's intentional? I built a fresh model with new data and it also gives some results above 1.

@cpuhrsch
Contributor

cpuhrsch commented Dec 24, 2017

Hello @jazoom,

Thank you for noticing this. We'll need to have access to the data or you'll need to reproduce this for one of your test datasets in order for me to be able to investigate this. Ideally you'd also be able to reproduce this within a docker image so that we can be sure that we're in the same environment.

Having said that, if the value is limited to about 1.00001 there should be no need to worry. We add 1e-5 to the argument of std_log (which is used for prediction) in order to deal with very small values. Otherwise, please open a separate issue so that we can keep the topics clearly separated.
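
As a quick numeric check (assuming the reported score is the exponential of that std_log value):

    import math

    # Assumption: the printed score is exp(std_log(p)) with
    # std_log(x) = log(x + 1e-5), as described above.
    p = 1.0                               # a fully confident probability
    score = math.exp(math.log(p + 1e-5))
    print(score)                          # ~1.00001, matching the reported value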

Thanks,
Christian

@jazoom

jazoom commented Dec 24, 2017

It's not bothering me since I just take it to mean essentially 100% confident. I just wasn't sure if your team knew about it.
