Error while extracting non-English (Arabic) tables: reversed text #1159
Comments
Thanks for noting this. Could you share the PDF? That will make it easier to diagnose the issue and suggest solutions.
@jsvine I'm waiting...
Thanks for your patience, @iiAmeer, and thank you for your July 16 response above. This is indeed something I'd like
@jsvine It's actually easy (I think): since the module loops over every character, just try to convert each str to int, so that when it's an integer it won't be reversed. pdfplumber is my favourite PDF management tool, so please do your best <3.
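The suggestion above could be sketched roughly as follows. This is only an illustration, not pdfplumber's implementation: the helper name `reverse_keep_digits` is hypothetical, and it works on whole digit runs with a regular expression instead of converting single characters to int. It also assumes that plain code-point reversal is good enough, which is not a full Unicode bidirectional algorithm. The example uses English letters in place of Arabic, as in the issue description.

```python
import re


def reverse_keep_digits(text):
    """Reverse a string, but keep every run of digits in its original order.

    Hypothetical helper for illustration only; not part of pdfplumber.
    """
    # A full reversal restores the letter order but also flips the digits...
    flipped = text[::-1]
    # ...so flip each digit run back to its original order.
    return re.sub(r"\d+", lambda m: m.group()[::-1], flipped)


print(reverse_keep_digits("eoj 2024"))  # prints "2024 joe"
```

Reversing the whole string also reverses the word order, which matches what a right-to-left line extracted in visual order would need.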
Fixing issue jsvine#1159, which was opened by me (:
I encountered the same issue with Arabic content, where some strings contain both text and numbers. While the numbers are displayed correctly, the text appears reversed. To address this, I created a function as a workaround.

import re

def fix_arabic_with_numbers(text):
    """
    First reverses characters in each word, then reverses word order.
    Numbers remain unchanged.
    """
    if not isinstance(text, str):
        return text
    # Convert multiple spaces to single space and strip
    text = ' '.join(text.split())
    # Split into words
    words = text.split(' ')
    # Process each word - reverse characters but keep numbers
    fixed_words = []
    for word in words:
        # Split word into number and non-number parts
        parts = re.findall(r'\d+|[^\d]+', word)
        fixed_parts = []
        for part in parts:
            if part.isdigit():
                fixed_parts.append(part)  # Keep numbers as is
            else:
                fixed_parts.append(part[::-1])  # Reverse characters in text
        fixed_words.append(''.join(fixed_parts))
    # Reverse the order of words
    fixed_words = fixed_words[::-1]
    # Join words back together with spaces
    return ' '.join(fixed_words)

I hope that languages with RTL (Right-to-Left) will have support in the future.
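A workaround function like the one above would typically be applied to every cell returned by `extract_tables`. The sketch below shows one way to do that. The helper names `fix_table_cells` and `extract_fixed_tables` and the `fix_cell` parameter are hypothetical, and the code assumes a table is a list of rows whose cells are strings or `None`.

```python
def fix_table_cells(table, fix_cell):
    """Apply fix_cell to every string cell of a table (a list of rows).

    Non-string cells such as None are passed through unchanged.
    """
    return [
        [fix_cell(cell) if isinstance(cell, str) else cell for cell in row]
        for row in table
    ]


def extract_fixed_tables(pdf_path, fix_cell):
    """Extract all tables from a PDF and post-process every cell.

    Hypothetical wrapper; pdfplumber is imported here so that
    fix_table_cells stays usable without it installed.
    """
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [
            fix_table_cells(table, fix_cell)
            for page in pdf.pages
            for table in page.extract_tables()
        ]
```

For example, `extract_fixed_tables("example.pdf", fix_arabic_with_numbers)` would return the tables with each cell passed through the workaround, where `example.pdf` is a placeholder path.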
I have a problem: when I use the extract_tables function, I get reversed Arabic text. For example, if the actual name is joe, I get eoj. (The text is Arabic, but I wrote the example in English so you can understand.) No screenshots are needed; the problem is clear.
Environment