I'm working with OCR'ed scans of historical documents where often the blocks of text have been rotated by a small amount (usually less than 5°) during the scanning process.
If the columns were originally printed straight, column detection along with rotation detection yields parallelograms. If the columns were printed wonky, then some other kind of polygon results from detecting the block of text.
So, what I'd like is to be able to specify (SVG-style) a list of coordinates [(x₀,y₀), (x₁,y₁), … (xₙ,yₙ)] that specify a closed polygon, and then to be able to select only the [characters|words] that fall [fully|partially] within that polygon as per pdfplumber's current tools for cropboxes.
An alternative might be providing a bitmap mask the same shape as the page - I think that I could reasonably easily use a third-party SVG-rendering package to generate such a thing.
Hi @pseudomonas, and thanks for the intriguing suggestion. Do you have any interest in developing a PR for this feature? If so, I'd be happy to discuss a general strategy with you.
I can give it a try. I see that there are various packages that provide an "is point within polygon" test, so I could probably hack together something using the .filter method that tests each candidate object against the polygon. I'm not sure what performance would be like, or what you think about extra dependencies.
For my initial project, I found I could get away with just increasing the size of the boxes a little bit to allow for rotation, and then filtering any stray characters out of the output later.
Thanks, @pseudomonas! Given the niche-ness of this feature, I'm reluctant to add another required dependency, but I could see adding an optional dependency for this — something like:
    def within_path(self, svg_style_path: list[tuple[int, int]]) -> DerivedPage:
        try:
            import name_of_dependency
        except ImportError:
            sys.stderr.write("Please install name_of_dependency to use .within_path; exiting.\n")
            exit()
        [actual logic]
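For anyone wanting to try the .filter idea from this thread before any such method exists, here is a minimal, dependency-free sketch. It is not pdfplumber's implementation. The helper names point_in_polygon and make_polygon_test are hypothetical, and the sketch assumes each object is a dict with x0, x1, top and bottom keys in the same top-left-origin coordinates as the polygon. "fully" is approximated as all four bounding-box corners lying inside the polygon, and "partially" as at least one corner lying inside. Both are approximations: a concave polygon, or a polygon edge that clips a box without enclosing a corner, can give the wrong answer.

```python
from typing import Callable, Sequence

Point = tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd ray-casting test; the polygon is treated as implicitly closed."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through (x, y) matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def make_polygon_test(
    polygon: Sequence[Point], mode: str = "fully"
) -> Callable[[dict], bool]:
    """Build a predicate suitable for passing to a .filter-style method.

    mode="fully":     every bounding-box corner must be inside the polygon.
    mode="partially": at least one bounding-box corner must be inside.
    """
    if mode not in ("fully", "partially"):
        raise ValueError("mode must be 'fully' or 'partially'")
    combine = all if mode == "fully" else any

    def test(obj: dict) -> bool:
        corners = [
            (obj["x0"], obj["top"]),
            (obj["x1"], obj["top"]),
            (obj["x1"], obj["bottom"]),
            (obj["x0"], obj["bottom"]),
        ]
        return combine(point_in_polygon(cx, cy, polygon) for cx, cy in corners)

    return test


# A slightly sheared column outline, as in the parallelogram case above.
column = [(10, 0), (110, 5), (110, 105), (10, 100)]

inside_char = {"x0": 20, "x1": 30, "top": 20, "bottom": 30}
edge_char = {"x0": 5, "x1": 15, "top": 50, "bottom": 60}
outside_char = {"x0": 200, "x1": 210, "top": 50, "bottom": 60}

fully = make_polygon_test(column, mode="fully")
partially = make_polygon_test(column, mode="partially")
```

With a real page, the predicate could presumably be passed straight to the existing filter method, along the lines of page.filter(make_polygon_test(column)).extract_text(). Performance is one Python-level polygon test per corner per object, so a large page with a many-sided polygon may be slow. A library such as Shapely would be the natural optional dependency if exact box/polygon intersection is needed.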