[Question]: Bad Parsing #1656

Closed
Said-Apollo opened this issue Jul 23, 2024 · 3 comments
Labels
question Further information is requested

Comments

@Said-Apollo

Describe your problem

Hi there, during my testing it became increasingly clear that something is wrong with the parsing/OCR method. For example, when inputting a 30-page scientific paper with default parameters (once with Paper as the chunking strategy and once with Laws), the results were quite bad. An example image showing a chunk and the corresponding original passage is attached.
I guess there was already an attempt to fix this in issue #1407, but it is still quite bad performance-wise.

image

For many chunks, the beginning and end are usually quite bad, but I also noticed that a lot of chunks are entirely bad, no matter which method I use.

Said-Apollo added the question label Jul 23, 2024
@Said-Apollo

I just saw that I cloned the repo around two weeks ago and that the parser was updated a few hours afterwards. I will have a look at it and write again.

@Said-Apollo

After using the updated script, the results got much better. However, the following points could still be improved:

  • For some chunks, most of the words are in lowercase, even though the original text contains many capital letters.

image

  • Next, it has no problem detecting the tables themselves, but it struggles to "understand" their content.

image

  • Lastly, regarding equations: it does detect them, but I'm not sure whether it formats them correctly. How about using an equation-to-LaTeX converter instead (if one is not used already)?
    image

KevinHuSh pushed a commit that referenced this issue Jul 25, 2024
### What problem does this PR solve?

#1407 #1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
@Said-Apollo

Hi, this issue is still not completely fixed.
Look at the example below (German text of an EU law):
image

Upon closer inspection, the first three words on the left side ("lichen oder Auftragsverarbeiter", where "lichen" is part of the word "Verantwortlichen") appear again in the third row on the right side (marked in yellow).
It's as if the sentences/words are mixed up.
I updated the DeepDoc code last week, which improved the overall quality, but it is still not good enough.
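
One way this kind of mixing can arise in a two-column layout is when detected text boxes are ordered purely top-to-bottom, so that lines from the left and right columns interleave. Below is a minimal sketch of that effect, assuming hypothetical `(x_left, y_top, text)` boxes and a hypothetical fixed page width; it is only an illustration and not DeepDoc's actual code or data model.

```python
# Hypothetical text boxes: (x_left, y_top, text). Illustration only,
# not DeepDoc's data model.
boxes = [
    (50, 100, "L1"),
    (320, 100, "R1"),
    (50, 120, "L2"),
    (320, 120, "R2"),
]


def naive_order(boxes):
    # Sort top-to-bottom, then left-to-right: rows from both columns interleave.
    return [text for _, _, text in sorted(boxes, key=lambda b: (b[1], b[0]))]


def column_aware_order(boxes, page_width=600):
    # Assign each box to a column by its left edge (assumed two equal columns),
    # then read the left column fully before the right column.
    mid = page_width / 2
    ordered = sorted(boxes, key=lambda b: (b[0] >= mid, b[1], b[0]))
    return [text for _, _, text in ordered]


print(naive_order(boxes))
print(column_aware_order(boxes))
```

If the chunks look like the output of `naive_order`, the reading order rather than the OCR itself may be the culprit.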

Halfknow pushed a commit to Halfknow/ragflow that referenced this issue Nov 11, 2024
### What problem does this PR solve?

infiniflow#1407 infiniflow#1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)