Currently, the LanguageModel constructor in the wordninja.py file opens the word file using gzip.open() without any option to specify the file encoding. This means that users who have word files with non-UTF-8 encoding may encounter decoding errors when using the wordninja package.
To address this issue, I propose modifying the __init__ function in the wordninja.py file to include an optional encoding parameter that can be used to specify the encoding of the word file. Additionally, I suggest adding an optional errors parameter to allow users to customize how decoding errors are handled.
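For context, here is a minimal sketch of how the standard `errors` handlers of `bytes.decode()` behave when the bytes do not match the chosen codec. The sample word is made up for illustration and is not taken from any wordninja word list.

```python
# "café" encoded as Latin-1; the lone 0xE9 byte is not valid UTF-8.
raw = "café".encode("latin-1")

# errors='strict' (the default) raises on the invalid byte.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")  # substitutes U+FFFD
correct = raw.decode("latin-1")                   # matching codec, nothing lost
```

Only the matching codec recovers the original word; `ignore` and `replace` avoid the exception but alter the text, which is why exposing both `encoding` and `errors` seems useful.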
Here's an example of what the modified function could look like:
    def __init__(self, word_file, encoding='utf-8', errors='strict'):
        # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
        with gzip.open(word_file) as f:
            words = f.read().decode(encoding=encoding, errors=errors).split()
        self._wordcost = dict((k, log((i+1)*log(len(words)))) for i, k in enumerate(words))
        self._maxword = max(len(x) for x in words)
By adding these optional parameters, users can specify the encoding and error handling behavior of the word file when they create a LanguageModel instance, allowing them to use files in different encodings without having to modify the source code.
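As a rough, self-contained illustration of the intended usage, the sketch below applies the same loading logic to a small Latin-1 encoded word file. `load_word_costs` is a hypothetical stand-in for the proposed constructor body, not part of wordninja, and the three-word list and temporary file are invented for the example.

```python
import gzip
import os
import tempfile
from math import log


def load_word_costs(word_file, encoding="utf-8", errors="strict"):
    """Hypothetical helper mirroring the proposed __init__ body."""
    with gzip.open(word_file) as f:
        words = f.read().decode(encoding=encoding, errors=errors).split()
    # Same Zipf-style cost formula as in the proposed constructor.
    wordcost = {k: log((i + 1) * log(len(words))) for i, k in enumerate(words)}
    maxword = max(len(x) for x in words)
    return wordcost, maxword


# Write a tiny gzip-compressed word list encoded as Latin-1 (made-up data).
path = os.path.join(tempfile.mkdtemp(), "words.txt.gz")
with gzip.open(path, "wb") as f:
    f.write("café\nnaïve\nüber\n".encode("latin-1"))

# With the default UTF-8 decoding, this file cannot be read.
try:
    load_word_costs(path)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# Passing the actual encoding loads the words intact.
wordcost, maxword = load_word_costs(path, encoding="latin-1")
```

Defaulting to `encoding='utf-8'` and `errors='strict'` keeps the behavior unchanged for callers who pass only `word_file`.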
I plan to submit a pull request with these changes. Please let me know if there are any concerns or suggestions for improvement.
Thank you.