Missing Transliterators on Some Systems #376

Closed
billdenney opened this issue Apr 9, 2020 · 10 comments

Comments

@billdenney

This is somewhat related to #305 and #269

In sfirke/janitor#365, we are getting errors due to the fact that on some systems the Any-ASCII transliterator is not available. Something similar appears to be an issue on Solaris (https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/janitor-00check.html).

We are trying to do a very broad translation in a locale-independent way to ASCII using:

stringi::stri_trans_general(replaced_names, id = "Greek-Latin;Latin-ASCII;Accents-Any;Any-ASCII")

As I started tracking down the issue, I found that I don't have a transliterator on my system called "Any-ASCII", but the code works for me.

Is there a reason why not having "Any-ASCII" in stri_trans_list() would cause an error on one system and not cause an error on another system?

FYI, the code I used to find that I don't have "Any-ASCII" is below:

library(stringi)

all_trans <- stringi::stri_trans_list()
desired_trans <- c("Greek-Latin", "Latin-ASCII", "Accents-Any", "Any-ASCII")
available_trans <- intersect(desired_trans, all_trans)
if (!identical(desired_trans, available_trans)) {
  warning("Not all translations to ASCII are available.  Results may differ when run on a different system.")
}
trans_id <- paste(available_trans, collapse=";")
stringi::stri_trans_general(replaced_names, id = trans_id)

I'm running stringi version 1.4.6 from CRAN.

@gagolews
Owner

gagolews commented Apr 9, 2020

It depends on whether the ICU library installed on your system ships with this transliterator.

I'd recommend installing via:
install.packages("stringi", configure.args="--disable-pkg-config")
which will build ICU from sources
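
A minimal sketch for checking which ICU a given stringi installation ended up with, assuming the `ICU.version` and `ICU.system` fields returned by `stri_info()` are present in the installed stringi version:

```r
library(stringi)

# stri_info() reports details about the ICU library stringi is linked against.
info <- stri_info()

# Version of ICU in use.
print(info$ICU.version)

# TRUE if the system ICU is used, FALSE if ICU was built together with stringi
# (for example after installing with --disable-pkg-config).
print(info$ICU.system)
```

If `ICU.system` is `TRUE`, the set of available transliterators depends on how the system ICU was packaged.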

@billdenney
Author

@gagolews, Thanks for the suggestion. As noted in the comment linked just above here, that fixed it!

As we don't have the ability to control CRAN (that I know of) and this doesn't happen by default across systems, is there a way that "--disable-pkg-config" could somehow become default for building stringi?

If that is not feasible, would you recommend detection code like what I suggested above, or is there another method you would suggest to ensure that the transliterators are set up correctly?
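
One possible approach to the detection question, as a rough sketch rather than anything stringi provides: since the code above ran on a system where "Any-ASCII" was absent from stri_trans_list(), probing each ID with a real call may be more reliable than checking the list. The helpers `trans_id_works()` and `build_trans_id()` are hypothetical names introduced for illustration, and the sketch assumes `stri_trans_general()` signals an error (or a warning) for an ID that ICU cannot resolve.

```r
library(stringi)

# Hypothetical helper: TRUE if ICU accepts the transliterator ID on this system.
trans_id_works <- function(id) {
  tryCatch({
    stri_trans_general("a", id = id)
    TRUE
  },
  error = function(e) FALSE,
  warning = function(w) FALSE)
}

# Hypothetical helper: keep only the steps that work, warn about the rest,
# and join the survivors into a compound ID.
build_trans_id <- function(steps) {
  ok <- vapply(steps, trans_id_works, logical(1))
  if (!all(ok)) {
    warning("Transliterators not usable on this system: ",
            paste(steps[!ok], collapse = ", "))
  }
  paste(steps[ok], collapse = ";")
}

trans_id <- build_trans_id(
  c("Greek-Latin", "Latin-ASCII", "Accents-Any", "Any-ASCII")
)
```

The resulting `trans_id` can then be passed as the `id` argument of `stri_trans_general()`. Results may still differ between systems when a step is dropped, which is why the sketch warns instead of failing silently.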

As a tangential question, is the set that we have listed the best way to translate from almost any writing system to ASCII in the most readable way possible, such as converting non-Latin letters (e.g. "δ" to "d"), accented characters (e.g. "ś" to "s"), and combination characters (e.g. "œ" to "oe")? (I looked, but I don't see that in the documentation. I'm happy to open this question as a separate issue if that is preferred.)

@gagolews
Owner

This is the default for the CRAN Windows and OS X binary builds.

Linux -- we had a long discussion a long time ago and decided not to change this default.

However, this transliterator is indeed an important one. I guess I could make it obligatory. If it's not present, I will require that icu4c be built from sources.

@gagolews
Owner

gagolews commented Apr 15, 2020

Before that happens, though, I need to do some "research"; maybe Any-ASCII was deprecated or something.

By the way, why is Any-Latin; Latin-ASCII not good enough for your task?

@billdenney
Author

While working on this, I settled on Greek-Latin, Any-Latin, Latin-ASCII as the best for my needs. That said, I do think that something to ensure all are available is best.
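
As an illustration only, the chain named here can be applied to the sample characters from earlier in the thread. This is a sketch assuming all three transliterators are available in the ICU that stringi is linked against; the input vector `x` is a made-up example.

```r
library(stringi)

# Sample inputs mentioned earlier in the thread: "δ", "ś", "œ"
# (written as Unicode escapes so the script is encoding-independent).
x <- c("\u03b4", "\u015b", "\u0153")

# Apply the chain in order: Greek to Latin, any remaining script to Latin,
# then Latin to plain ASCII.
out <- stri_trans_general(x, id = "Greek-Latin;Any-Latin;Latin-ASCII")
print(out)

# Check that nothing outside ASCII is left.
print(all(stri_enc_isascii(out)))
```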

@gagolews
Copy link
Owner

gagolews commented Apr 15, 2020

Any-ASCII is not amongst the "legal" transliterators at:
https://github.com/unicode-org/icu/tree/master/icu4c/source/data/translit

Maybe that was an alias for something?

@billdenney
Author

Hmm, I can't imagine having made it up. Maybe it was in an older version that I was using and was deprecated away?

I made a quick attempt to look for it in older versions, but I couldn't find it there. If you want to close the issue, I can reopen if I find something similar again.

@gagolews
Owner

I also remember that one, but on my Ubuntu 20.04 beta, it's not there.

@gagolews
Owner

gagolews commented May 4, 2021

(thread inactive for >= 12 months; closing)

(note that as per #401 stringi is now shipped with ICU 69.1, so the above might be fixed now)

gagolews closed this as completed May 4, 2021