Add optional encoding and errors parameters to LanguageModel constructor #26

Open
serhanaya opened this issue Mar 13, 2023 · 0 comments

Currently, the LanguageModel constructor in wordninja.py opens the word file with gzip.open() and offers no way to specify the file's encoding. Users whose word files are not UTF-8 encoded may therefore hit decoding errors when using the wordninja package.
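
A minimal reproduction sketch of the failure, assuming the constructor reads the gzipped bytes and decodes them with Python's default codec (UTF-8). The word list and file name here are made up for illustration:

```python
import gzip
import os
import tempfile

# Hypothetical word list with non-ASCII words, stored as Latin-1 (not UTF-8).
path = os.path.join(tempfile.mkdtemp(), "words_latin1.txt.gz")
with gzip.open(path, "wb") as f:
    f.write("café\nnaïve\nword".encode("latin-1"))

# Assumed read pattern: binary read, then decode with the default codec.
with gzip.open(path) as f:
    raw = f.read()

try:
    raw.decode()  # bytes.decode() defaults to encoding='utf-8', errors='strict'
    decode_failed = False
except UnicodeDecodeError:
    # The Latin-1 byte 0xE9 ("é") is not a valid UTF-8 sequence here.
    decode_failed = True

print("decode failed:", decode_failed)
```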

To address this, I propose adding an optional encoding parameter to the __init__ function in wordninja.py so that callers can specify the encoding of the word file. I also suggest adding an optional errors parameter so that users can control how decoding errors are handled.

Here's an example of what the modified function could look like:

def __init__(self, word_file, encoding='utf-8', errors='strict'):
    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    with gzip.open(word_file) as f:
        words = f.read().decode(encoding=encoding, errors=errors).split()
    self._wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    self._maxword = max(len(x) for x in words)

With these optional parameters, users can specify the encoding and error handling for the word file when they create a LanguageModel instance, so files in other encodings work without modifying the source code. The defaults ('utf-8' and 'strict') are the same as Python's own defaults for bytes.decode().
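
To show the intended behavior without patching the package, here is a rough stand-alone sketch. load_wordcost is a hypothetical helper that only mirrors the body of the proposed __init__; it is not part of wordninja, and the Latin-1 word file is made up for the example:

```python
import gzip
import os
import tempfile
from math import log


def load_wordcost(word_file, encoding="utf-8", errors="strict"):
    """Hypothetical helper mirroring the proposed __init__ body."""
    with gzip.open(word_file) as f:
        words = f.read().decode(encoding=encoding, errors=errors).split()
    # Same Zipf-style cost as in the proposal: earlier words are cheaper.
    wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
    maxword = max(len(x) for x in words)
    return wordcost, maxword


# Made-up Latin-1 word file for illustration.
path = os.path.join(tempfile.mkdtemp(), "words_latin1.txt.gz")
with gzip.open(path, "wb") as f:
    f.write("café\nnaïve\nword".encode("latin-1"))

# Passing the real encoding decodes the words correctly.
wordcost, maxword = load_wordcost(path, encoding="latin-1")

# Keeping UTF-8 but relaxing error handling swaps bad bytes for U+FFFD.
lenient, _ = load_wordcost(path, errors="replace")
```

With the proposed change, the equivalent call on the real class would presumably look like LanguageModel(path, encoding='latin-1'). Note that errors='replace' avoids the exception but changes the affected words, so passing the correct encoding is the better choice when it is known.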

I plan to submit a pull request with these changes. Please let me know if there are any concerns or suggestions for improvement.

Thank you.
