[text-to-speech] Spaces in text are encoded as + #635

arthurfabre · 2017-04-03T21:07:46Z

Steps to reproduce:
TSS.synthesize("Hello Bob", Voice.EN_LISA);
Expected behavior:
Audio for "Hello Bob"
Actual behavior:
Audio for "Hello+Bob"
JDK version: OpenJDK 1.8.0_121
java-sdk version: 3.7.1

Since commit 7d9bbd7, which was meant to resolve #602, the text is URL-encoded before being passed to okhttp.

Unfortunately, RequestUtils.encode calls URLEncoder.encode(), which performs form-encoding (spaces become +) instead of %-encoding. okhttp then applies proper %-encoding on top, turning the + into %2B, so a request to synthesize "Hi Bob" becomes:

https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize?text=Hi%2BBob&voice=en-US_LisaVoice&accept=audio/l16;%20rate%3D48000
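The double encoding can be reproduced with the standard library alone. This is a minimal sketch, not the SDK's or okhttp's actual code: `formEncode` mirrors the `URLEncoder.encode` call, while `percentEncode` is a hypothetical stand-in for the client's second pass that escapes every byte outside the RFC 3986 unreserved set.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Form-encoding (application/x-www-form-urlencoded): a space becomes '+'.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always available
        }
    }

    // Hypothetical stand-in for the HTTP client's percent-encoding pass:
    // every byte outside the RFC 3986 unreserved set becomes %XX.
    static String percentEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved) {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Hi Bob";
        String formEncoded = formEncode(text);
        System.out.println(formEncoded);                // Hi+Bob
        System.out.println(percentEncode(formEncoded)); // Hi%2BBob (encoded twice)
        System.out.println(percentEncode(text));        // Hi%20Bob (encoded once)
    }
}
```

Encoding the raw text only once yields `%20`, which the service decodes back to a space; encoding the form-encoded string a second time yields `%2B`, which decodes to a literal `+`.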

Issue #602 seems to be caused by the é character being encoded as UTF-8 by okhttp (0xC3 0xA9) but decoded as ASCII by the backend, hence the BadRequestException: 'ascii' codec can't decode byte 0xc3 error.
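The byte-level mismatch is easy to check locally. The sketch below only illustrates the suspected failure mode; the backend's actual decoding logic is not visible from here, and `decodesAsAscii` is a hypothetical helper.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class AsciiDecodeDemo {

    // Returns true if the bytes decode cleanly as 7-bit US-ASCII.
    static boolean decodesAsAscii(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.US_ASCII.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // é
        System.out.printf("0x%02X 0x%02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF); // 0xC3 0xA9
        System.out.println(decodesAsAscii(utf8)); // false: 0xC3 is not valid ASCII
        System.out.println(decodesAsAscii("e".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```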


arthurfabre · 2017-04-03T23:19:03Z

Actually, it seems the 400 - Bad Request error only occurs when a semicolon is followed by a Unicode character. curl also receives a Bad Request with this:

curl -X GET -u "user:pass" --output test-get.wav "https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize?voice=es-ES_EnriqueVoice&text=;%C3%A9" -w "%{http_code}"

However a POST request works successfully:

curl -X POST -u "user:pass" --header "Content-Type: application/json" --data '{"text":";é"}' --output test-post.wav "https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize?voice=es-ES_EnriqueVoice" -w "%{http_code}"

Semicolons are a reserved character in URLs, and escaping them works as expected (both with the java-sdk and with curl). I'm not sure why the presence of a semicolon causes the backend to decode the rest of the text as ASCII. URLEncoder.encode() is not the correct solution: it form-encodes rather than %-encodes, and okhttp then encodes the result once more.

I suppose switching to POST requests would be a suitable workaround, or modifying RequestBuilder to escape semicolons properly.
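As a rough illustration of the POST workaround, the text can travel in a JSON body, so no URL escaping applies to it at all. This is only a sketch: `jsonEscape` and `buildBody` are hypothetical helpers, not part of the java-sdk, and a real change would more likely use the SDK's existing JSON handling.

```java
import java.nio.charset.StandardCharsets;

public class PostBodySketch {

    // Minimal JSON string escaping: quotes, backslashes, control characters.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Builds the same payload shape as the curl POST example: {"text":"..."}
    static String buildBody(String text) {
        return "{\"text\":\"" + jsonEscape(text) + "\"}";
    }

    public static void main(String[] args) {
        String body = buildBody(";\u00e9"); // semicolon followed by é
        // The bytes sent with Content-Type: application/json would be UTF-8.
        byte[] payload = body.getBytes(StandardCharsets.UTF_8);
        System.out.println(body);
        System.out.println(payload.length + " bytes");
    }
}
```

The semicolon and the é pass through untouched here, which matches the curl POST above succeeding where the GET fails.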

wturosz · 2017-04-04T18:57:54Z

Similar issue reported to IBM customer support:
TTS inserts '+' in converted audio
[DEBUG] Temperature today will be 45 fahrenheits and the wind speed will be 3 miles per hour
Apr 03, 2017 8:20:22 PM okhttp3.internal.platform.Platform log
INFO: --> GET https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize?text=Temperature%2Btoday%2Bwill%2Bbe%2B45%2Bfahrenheits%2Band%2Bthe%2Bwind%2Bspeed%2Bwill%2Bbe%2B3%2Bmiles%2Bper%2Bhour&voice=en-US_MichaelVoice&accept=audio/wav http/1.1
Apr 03, 2017 8:20:22 PM okhttp3.internal.platform.Platform log
INFO: <-- 200 OK

The IBM Java Bluemix SDK version 3.7.1 includes a jar file called "text-to-speech-3.7.1.jar", which ultimately makes the following encode call:

public static String encode(String content) {
    try {
        return URLEncoder.encode(content, "UTF-8");
    } catch (final UnsupportedEncodingException e) {
        throw new AssertionError(e);
    }
}

The URLEncoder.encode(content, "UTF-8") call is what adds the plus signs to the text.

germanattanasio added the bug label Apr 4, 2017

germanattanasio assigned blakesteve Apr 4, 2017

This was referenced Apr 4, 2017

[text to speech] Special Character failure resolution #638

Merged

[Text-to-Speech] failed with combine of special characters, e.g Spanish + ";" #602

Closed

germanattanasio closed this as completed in 5ae0d24 Apr 7, 2017

germanattanasio mentioned this issue Apr 7, 2017

v3.7.2 #645

Merged
