unicode on Windows #100

Closed
schaeferc-si opened this issue Mar 9, 2017 · 4 comments
Labels: bug (A product defect that needs fixing), P1 (High priority issues to be scheduled in the upcoming release)

Comments


schaeferc-si commented Mar 9, 2017

Dev Effort

1D

Description

I'm running fido 1.3.4 on Ubuntu and on Windows 10. On Ubuntu, fido handles non-ASCII file names correctly. On Windows, fido fails to process files and directories whose names contain non-ASCII Unicode characters (presumably stored as UTF-16).

schaeferc-si (Author) commented

Additional details... Here's output for one run:

FIDO v1.3.4 (formats-v88.xml, container-signature-20160121.xml, format_extensions.xml)
OK,198,fmt/394,"DS_store file (MAC)","DS_Store",6148,"R:\Repository\Accessions\2008-2013\09-228_R\source\box_7\China_Constructs-Song_Dong\D_1.DS_Store","None","signature"
bad repeat interval
bad repeat interval
FIDO: Error in identify_file: Path is R:\Repository\Accessions\2008-2013\09-228_R\source\box_7\China_Constructs-Song_Dong\D_1????
FIDO: Error in identify_file: Path is R:\Repository\Accessions\2008-2013\09-228_R\source\box_7\China_Constructs-Song_Dong\D_1?????
FIDO: Processed 1 files in 290.67 msec, 3 files/sec

The first directory ('????') is 人民建筑.
The second directory ('?????') is 尹秀珍讲座.

In Fido.list_files, converting 'root' to unicode before the call to 'os.walk(root)' seems to allow the walk to proceed; however, there is still an issue when outputting results with the default handler 'handle_matches'.
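One plausible reading of the '????' in the paths above is that the directory names went through a lossy conversion to a legacy single-byte code page before fido saw them. The sketch below only models that effect with Python's own codecs; the choice of `cp1252` and the helper name `lossy_roundtrip` are assumptions for illustration and are not taken from the report or from fido.

```python
# Illustration only: models a lossy conversion to a legacy code page.
# "cp1252" is an assumed code page; the report does not say which one was active.

def lossy_roundtrip(name, codepage="cp1252"):
    """Encode a name to a code page, replacing unmappable characters with '?'."""
    return name.encode(codepage, errors="replace").decode(codepage)

first = lossy_roundtrip("人民建筑")     # four characters, none representable in cp1252
second = lossy_roundtrip("尹秀珍讲座")  # five characters, none representable in cp1252
ascii_only = lossy_roundtrip("D_1")    # ASCII names survive unchanged
```

Under that assumption, the two names collapse to four and five question marks respectively, which matches the lengths seen in the error lines. Once the original characters are replaced this way, the resulting path no longer names a real directory.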

bitsgalore (Member) commented

FWIW, I remember we once had the same problems in jpylyzer, which were solved as follows (if I remember correctly):

Path walk

Instead of root = os.path.normpath(root), use root = unicode(root, 'utf-8') (but this is only needed in Python 2.x; in Python 3.x root = os.path.normpath(root) should work fine).
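The path-walk suggestion can be sketched roughly as follows. This is a minimal sketch, not fido's or jpylyzer's actual code: `to_text` and `iter_files` are hypothetical helper names, and UTF-8 is assumed as the encoding of any byte-string root. Because `bytes` is an alias of `str` in Python 2, the same check covers both major versions.

```python
import os

def to_text(path, encoding="utf-8"):
    """Return path as a text string, decoding it first if it is a byte string.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(path, bytes):
        return path.decode(encoding)
    return path

def iter_files(root):
    """Yield the full path of every file below root, always as text strings.

    Hypothetical stand-in for a directory walker; os.walk returns text names
    when it is given a text root.
    """
    root = os.path.normpath(to_text(root))
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Passing a text root is the point of the change: `os.walk` returns names of the same string type as its argument, so the non-ASCII directory names come back intact instead of as '?' placeholders.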

Output

For the outputting it is important that the encoding is explicitly set. In jpylyzer we use this:

import codecs
import sys

# Wrap stdout in a writer that encodes all output as UTF-8
if sys.version.startswith("2"):
    out = codecs.getwriter("UTF-8")(sys.stdout)
elif sys.version.startswith("3"):
    out = codecs.getwriter("UTF-8")(sys.stdout.buffer)

And then write to stdout using something like this:

out.write(whatever)

See also the code in jpylyzer.py.
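For the output side, the same idea can be wrapped in a small function so that it can be exercised against an in-memory stream. This is a sketch under assumptions: `utf8_writer` and `write_line` are hypothetical names, and nothing here is taken from fido's `handle_matches`.

```python
import codecs
import io
import sys

def utf8_writer(stream=None):
    """Return a writer that encodes text as UTF-8 onto a byte stream.

    Hypothetical helper. With no argument it targets stdout: Python 3 exposes
    the underlying byte stream as sys.stdout.buffer, while Python 2's
    sys.stdout accepts bytes directly.
    """
    if stream is None:
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    return codecs.getwriter("utf-8")(stream)

def write_line(out, text):
    """Write one line of text through the UTF-8 writer."""
    out.write(text + "\n")

# Exercise the writer against an in-memory byte buffer instead of the console.
buffer = io.BytesIO()
write_line(utf8_writer(buffer), "D_1\\人民建筑")
encoded = buffer.getvalue()
```

Setting the encoding explicitly means the output no longer depends on the console's default code page. On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is another way to get a similar effect.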

ghost added the bug label (A product defect that needs fixing) on Mar 13, 2019
ghost added the P1 label (High priority issues to be scheduled in the upcoming release) on Mar 13, 2019
ghost added this to the v1.4.0-m4 milestone on Mar 13, 2019

sromkey commented Mar 21, 2019

As @bitsgalore has also pointed out, it might be best to address this after the Python 3 upgrade rather than before.

carlwilson removed this from the v1.4.0-m4 milestone on May 5, 2020
carlwilson added this to the v1.6 milestone on May 5, 2020
carlwilson (Member) commented

Closed by #200 in v1.6.

6 participants