Error While extracting Non-English tables (Arabic) reversed Text #1159

Open
iiAmeer opened this issue Jun 23, 2024 · 8 comments

iiAmeer commented Jun 23, 2024

I have a problem: when I use the extract_tables function, I get reversed Arabic text. For example, if the actual name is "joe", I get "eoj". (The real text is Arabic; I wrote it in English so you can understand.) I don't think screenshots are needed, since the problem is clear.

Environment

  • pdfplumber version: latest
  • Python version: 3.11
  • OS: Windows (later linux)

iiAmeer added the bug label Jun 23, 2024

jsvine (Owner) commented Jun 25, 2024

Thanks for noting this. Could you share the PDF? That will make it easier to diagnose the issue and suggest solutions.

iiAmeer (Author) commented Jun 25, 2024

087_ثانوية الإمام علي بن ابي طالب الأهلية للبنات (1).pdf

This is the file I used. If you want my code and a preview of the run, please wait until tomorrow.

IMG_20240626_092555.jpg

jsvine (Owner) commented Jul 2, 2024

Thanks for providing the PDF. Ignoring the numbers for now, and instead focusing on the text ... does the output of this look more like what you want?:

import pandas as pd
# `page` is a pdfplumber Page object, e.g. pdfplumber.open(path).pages[0]
table = page.extract_table(dict(text_char_dir_render="rtl"))
pd.DataFrame(table, columns=None).fillna("")
Screenshot 2024-07-01 at 11 09 20 PM

iiAmeer (Author) commented Jul 16, 2024

Yes, exactly. BUT the numbers are IMPORTANT and must not be reversed. Note that the second column (13 in your viewer) is an index, so it can't be reversed. Please apply the reversal only to RTL languages such as Arabic, Aramaic, Azeri, Dhivehi/Maldivian, Hebrew, Kurdish (Sorani), Persian/Farsi, Urdu, etc.
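
One way to act on "reverse only RTL text" is to check whether a cell contains any strongly right-to-left characters before touching it. The sketch below is only an illustration and is not part of pdfplumber: `contains_rtl` and `RTL_CLASSES` are hypothetical names, and the check relies on the standard library's `unicodedata.bidirectional`, which reports the Unicode bidirectional class of a character.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters:
# "R" (for example Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}

def contains_rtl(text):
    """Return True if any character in `text` is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(contains_rtl("joe"))     # Latin letters only
print(contains_rtl("12345"))   # digits only
print(contains_rtl("علي 13"))  # Arabic letters plus digits
```

A cell made up only of digits or Latin letters would be left alone under this check, so an index column like the one mentioned above would not be reversed.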

iiAmeer (Author) commented Aug 10, 2024

@jsvine I'm waiting...

jsvine (Owner) commented Aug 10, 2024

Thanks for your patience, @iiAmeer, and thank you for your July 16 response above. This is indeed something I'd like pdfplumber to handle well. It just happens to be a bit tricky.

iiAmeer (Author) commented Aug 14, 2024

@jsvine it's actually easy (I think). Since the module loops over every character, it could just try to convert each str to int, and when the value is an integer, not reverse it. pdfplumber is my favourite PDF management tool, so please do your best <3.
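
The idea of leaving numbers alone can be sketched outside pdfplumber as plain string post-processing. This is a variation on the suggestion above, not pdfplumber's internals: instead of calling `int()` per character, it reverses the whole string and then flips each run of digits back. The function name `reverse_keep_numbers` is hypothetical, and the pattern assumes numbers are digit runs optionally joined by `.` or `,`.

```python
import re

# A number is assumed to be digit runs optionally joined by "." or ",".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def reverse_keep_numbers(text):
    """Reverse `text`, then restore every number to its original digit order."""
    reversed_text = text[::-1]
    return NUMBER.sub(lambda m: m.group(0)[::-1], reversed_text)

print(reverse_keep_numbers("eoj 13"))
print(reverse_keep_numbers("total 12.5"))
```

This handles only the simple case. Text that mixes RTL and LTR words, or numbers with other separators, would need more careful treatment, which may be part of why the problem is described as tricky.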

iiAmeer added a commit to iiAmeer/pdfplumber that referenced this issue Aug 15, 2024
Fixing issue jsvine#1159 that opend by me (:

moamen270 commented

I encountered the same issue with Arabic content, where some strings contain both text and numbers. While the numbers are displayed correctly, the text appears reversed. To address this, I created a function as a workaround.

import re

def fix_arabic_with_numbers(text):
    """
    First reverses characters in each word, then reverses word order.
    Numbers remain unchanged.
    """
    if not isinstance(text, str):
        return text
    
    # Convert multiple spaces to single space and strip
    text = ' '.join(text.split())
    
    # Split into words
    words = text.split(' ')
    
    # Process each word - reverse characters but keep numbers
    fixed_words = []
    for word in words:
        # Split word into number and non-number parts
        parts = re.findall(r'\d+|[^\d]+', word)
        fixed_parts = []
        for part in parts:
            if part.isdigit():
                fixed_parts.append(part)  # Keep numbers as is
            else:
                fixed_parts.append(part[::-1])  # Reverse characters in text
        fixed_words.append(''.join(fixed_parts))
    
    # Reverse the order of words
    fixed_words = fixed_words[::-1]
    
    # Join words back together with spaces
    return ' '.join(fixed_words)

I hope that languages with RTL (Right-to-Left) will have support in the future.
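
A workaround like the one above operates on a single string, while a table extracted with `extract_table` is a list of rows whose cells may be `None` when empty. As a minimal sketch, a cell-level fixer can be applied to a whole table as shown here. The names `fix_table` and `demo_fix` and the sample data are made up for illustration; in practice the fixer passed in would be something like `fix_arabic_with_numbers`.

```python
def fix_table(table, fix_cell):
    """Apply `fix_cell` to every string cell; leave None (empty) cells untouched."""
    return [
        [fix_cell(cell) if isinstance(cell, str) else cell for cell in row]
        for row in table
    ]

# Stand-in data shaped like an extracted table: a list of rows of cells.
sample = [["eoj", "13"], [None, "yram"]]

# Stand-in fixer for illustration only: reverse cells that are not purely digits.
def demo_fix(cell):
    return cell if cell.isdigit() else cell[::-1]

print(fix_table(sample, demo_fix))
```

Keeping the fixer as a parameter means the table-walking code does not change if the string-level workaround is later replaced by built-in RTL support.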
