[Question]: Bad Parsing #1656
Labels: question (Further information is requested)
Comments
I just saw that I cloned the repo around two weeks ago, and the parser was updated a few hours afterwards. I will have a look at it and write again.
After using the adapted script, the results got much better. However, the following points still need improvement:
KevinHuSh pushed a commit that referenced this issue on Jul 25, 2024
Halfknow pushed a commit to Halfknow/ragflow that referenced this issue on Nov 11, 2024
### What problem does this PR solve?
infiniflow#1407 infiniflow#1656
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Describe your problem
Hi there, during my testing it became more and more clear that something is quite wrong with the parsing/OCR method. For example, when I input a 30-page scientific paper with mostly default parameters (using Paper as the ChunkingStrategy in one run and Laws in another), the results were quite bad. An example image showing the chunk and the respective original part is attached.
I guess issue #1407 was already an attempt to fix this, but the parsing quality is still quite bad.
For many chunks, the beginning and end are usually quite bad, but I also noticed that a lot of chunks are entirely bad, no matter which method I use.