
Line count of text file differs from prediction result count #344

Closed
chuangys opened this issue Oct 24, 2017 · 12 comments

@chuangys

My text file contains 1537584 lines, but after running prediction there are only 1537490 results. I've already checked that the file doesn't contain any empty lines. Can someone explain why?

Besides, is it possible to add a document number to the test file? That way, even if the counts don't match, I can still map the results back to the raw test file.
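
One way to sanity-check the input before prediction is a short script along these lines. It is only a diagnostic sketch, not something built into fastText, and test.txt is a placeholder for the actual test file:

    # Diagnostic sketch: count the lines in the test file and flag anything
    # fastText might treat differently, such as empty lines or stray '\r'
    # characters. "test.txt" is a placeholder name.
    with open("test.txt", "rb") as f:
        data = f.read()

    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # a trailing newline produces one empty trailing element

    print("total lines:", len(lines))
    print("empty lines:", sum(1 for line in lines if line.strip() == b""))
    print("lines with carriage returns:", sum(1 for line in lines if b"\r" in line))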

@cpuhrsch
Contributor

Hello @chuangys,

Thank you for your post. This used to be an error in the past; have you checked out the latest version of fastText?

Thank you,
Christian

@chuangys
Author

@cpuhrsch In fact, I'm running in a Windows environment and ran into many issues at first. Eventually I found a fastText executable built by "xiami" on GitHub, and it works fine. He built it in May 2017, so I'm not sure whether it reflects the latest code, but it's the only way I can run fastText right now.
Besides, do you know of any wrapper built for the Windows environment?

@chuangys
Author

chuangys commented Oct 26, 2017

@cpuhrsch After downloading the latest source code and building the exe with MinGW, I can run it now, but it still shows the same problem: I input 1537584 lines for prediction and get back only 1537490 result lines.

@jazoom

jazoom commented Dec 8, 2017

Recently I started seeing a similar issue. I'm seeing 667 results when only 600 lines are fed in. The results have a few portions that look like this:

__label__home_&_garden 0.994563
__label__books 0.999752
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__books 1.00001
__label__video 0.984672
__label__electronics 0.852069

It doesn't seem right that there should be a number greater than 1.

This has been happening for a week or so. Source is built fresh on every push via CI.

There are 57 of these seemingly duplicated, erroneous "books" labels, which doesn't quite account for the extra lines, but it's close enough and silly enough to make me think it's part of the problem.
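
A short script can at least locate where those runs sit in the prediction output; a rough sketch, assuming the predictions were saved one "__label__... score" line per input line to a file called predictions.txt (a placeholder name):

    # Rough sketch: find runs of identical consecutive labels in the saved
    # prediction output and report where each long run starts, to narrow
    # down which part of the input produced it. "predictions.txt" is a
    # placeholder name.
    from itertools import groupby

    with open("predictions.txt") as f:
        labels = [line.split()[0] for line in f if line.strip()]

    index = 0
    for label, group in groupby(labels):
        run = len(list(group))
        if run >= 5:  # arbitrary threshold for a "suspicious" repetition
            print(f"line {index + 1}: {label} repeated {run} times")
        index += run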

@cpuhrsch
Contributor

Hello @chuangys and @jazoom,

Is it feasible to share a portion, or ideally the entirety, of these datasets so that I can reproduce the issue on my end? For our test data (see tests/fetch_test_data.sh) these issues have not occurred and all our integration tests pass cleanly.

Thanks,
Christian

@jazoom

jazoom commented Dec 19, 2017

@cpuhrsch Sure. I can probably get you the trained model too if you can suggest a good way to privately send you a 1GB file. Maybe I can email you a link to a file on Dropbox or something?

@cpuhrsch
Contributor

Hello @jazoom,

If you're comfortable with this, you could post the link here. I'll then let you know once I've got the data, and you can invalidate the link to prevent additional traffic.

Thanks,
Christian

@jazoom

jazoom commented Dec 23, 2017

@cpuhrsch I was able to track down the problem to some sneaky carriage returns. Sorry for the confusion.
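
For reference, stripping the carriage returns before prediction only takes a couple of lines; a minimal sketch, with test.txt and test_clean.txt as placeholder file names:

    # Minimal sketch: drop every carriage return so each document is exactly
    # one '\n'-terminated line. File names are placeholders.
    with open("test.txt", "rb") as src, open("test_clean.txt", "wb") as dst:
        dst.write(src.read().replace(b"\r", b""))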

@cpuhrsch
Contributor

Hello @jazoom,

Thank you for resolving this! I'm happy to hear that this wasn't on our end. I'm going to close this issue now as it appears to be resolved, but please feel encouraged to reopen it if this isn't the case.

Thanks,
Christian

@jazoom

jazoom commented Dec 23, 2017

No problem @cpuhrsch. I should mention that I still see "1.00001" as a confidence value for some results. It doesn't make sense to me, but perhaps it's intentional? I built a fresh model with new data and it also gives some results above 1.

@cpuhrsch
Contributor

cpuhrsch commented Dec 24, 2017

Hello @jazoom,

Thank you for noticing this. We'll need to have access to the data or you'll need to reproduce this for one of your test datasets in order for me to be able to investigate this. Ideally you'd also be able to reproduce this within a docker image so that we can be sure that we're in the same environment.

Having said that, if the value is limited to about 1.00001 there should be no need to worry. We add 1e-5 to the argument of std_log (which is used for prediction) in order to deal with very small values. Otherwise, please open a separate issue so that we can keep the topics clearly separated.
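
As a quick numeric check (assuming the reported score is the exponential of that std_log value):

    import math

    # Assumption: the printed score is exp(std_log(p)) with
    # std_log(x) = log(x + 1e-5), as described above.
    p = 1.0                               # a fully confident probability
    score = math.exp(math.log(p + 1e-5))
    print(score)                          # ~1.00001, matching the reported value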

Thanks,
Christian

@jazoom

jazoom commented Dec 24, 2017

It's not bothering me since I just take it to mean essentially 100% confident. I just wasn't sure if your team knew about it.
