Fix saving unicode MAM messages. #1748

kzemek · 2018-03-02T11:55:24Z

This PR allows storing and retrieving Unicode messages in MAM. It fixes a few places to handle Unicode properly, and it sidesteps the escaping that some databases require by using prepared queries for synchronous writes as well (previously they were used only for asynchronous writes).

arcusfelis · 2018-03-02T14:22:00Z

src/mod_mam_muc_odbc_arch.erl

@@ -138,6 +138,8 @@ archive_message(_Result, Host, MessID, RoomID,
    try
        archive_message_unsafe(Host, MessID, RoomID, FromNick, Packet)
    catch _Type:Reason ->
+            ?ERROR_MSG("event=archive_message_failed reason='~p' stacktrace=~p",


We also want to see MessID, RoomID and FromNick in the logs for debugging.
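
One possible shape for such a log line, as a sketch only. The field names `mess_id`, `room_id` and `from_nick`, the module name and the helper `format_failure/5` are assumptions for illustration; the real code would presumably pass the same format string and argument list to `?ERROR_MSG` instead of `io_lib:format/2`.

```erlang
-module(archive_log_sketch).
-export([format_failure/5]).

%% Builds a key=value log line that carries the identifiers needed to
%% find the failing message, next to the reason and the stacktrace.
%% All field names here are hypothetical, not taken from the PR.
format_failure(MessID, RoomID, FromNick, Reason, Stacktrace) ->
    lists:flatten(
      io_lib:format("event=archive_message_failed mess_id=~p room_id=~p "
                    "from_nick=~p reason='~p' stacktrace=~p",
                    [MessID, RoomID, FromNick, Reason, Stacktrace])).
```

Keeping every value as a separate `key=value` pair keeps the line greppable by any of the identifiers.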

arcusfelis · 2018-03-02T14:22:29Z

src/mod_mam_odbc_arch.erl

@@ -212,6 +212,8 @@ archive_message(Result, Host, MessID, UserID,
        do_archive_message(Result, Host, MessID, UserID,
                           LocJID, RemJID, SrcJID, Dir, Packet)
    catch _Type:Reason ->
+            ?ERROR_MSG("event=archive_message_failed reason='~p' stacktrace=~p",


Same here: the extra fields are good to have.

arcusfelis · 2018-03-02T14:34:28Z

test.disabled/ejabberd_tests/tests/mam_SUITE.erl

+save_unicode_messages(Config) ->
+    P = ?config(props, Config),
+    F = fun(Alice, Bob) ->
+                escalus:send(Alice, escalus_stanza:chat_to(Bob, <<"Hi! this is an unicode character ȥ"/utf8>>)),


There should be a comment naming the character, with a link to its code point.

LATIN SMALL LETTER Z WITH HOOK
http://www.fileformat.info/info/unicode/char/0225/index.htm

arcusfelis · 2018-03-02T14:34:49Z

test.disabled/ejabberd_tests/tests/mam_SUITE.erl

+    P = ?config(props, Config),
+    F = fun(Alice, Bob) ->
+                escalus:send(Alice, escalus_stanza:chat_to(Bob, <<"Hi! this is an unicode character ȥ"/utf8>>)),
+                escalus:send(Alice, escalus_stanza:chat_to(Bob, <<"this is another one ȸ"/utf8>>)),


LATIN SMALL LETTER DB DIGRAPH

arcusfelis · 2018-03-02T15:03:42Z

test.disabled/ejabberd_tests/tests/mam_SUITE.erl

+                [Msg3] = respond_messages(Res2),
+                #forwarded_message{message_body = Body3} = parse_forwarded_message(Msg3),
+                ?assert_equal(<<"this is another one ȸ"/utf8>>, Body3),
+


Don't forget to test characters that require surrogate pairs when encoded in UTF-16.
For example:
𐀀 (Unicode Linear B Syllable B008 A)

unicode:characters_to_binary([65536]).
<<240,144,128,128>>

And emoticons:
https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)
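
To make the surrogate point concrete, here is a minimal sketch that reports how many bytes a code point occupies in UTF-8 and in UTF-16. The module and function names are hypothetical and not part of the PR.

```erlang
-module(unicode_widths).
-export([describe/1, needs_surrogates/1]).

%% True when the code point lies outside the Basic Multilingual Plane,
%% so it needs a surrogate pair in UTF-16 and four bytes in UTF-8.
needs_surrogates(CodePoint) when is_integer(CodePoint) ->
    CodePoint > 16#FFFF.

%% Returns {Utf8Bytes, Utf16Bytes, NeedsSurrogates} for one code point.
describe(CodePoint) ->
    Utf8 = unicode:characters_to_binary([CodePoint], unicode, utf8),
    Utf16 = unicode:characters_to_binary([CodePoint], unicode, {utf16, big}),
    {byte_size(Utf8), byte_size(Utf16), needs_surrogates(CodePoint)}.
```

In the shell, `unicode_widths:describe(16#0225)` (the ȥ already used in the test) returns `{2,2,false}`, while `unicode_widths:describe(16#10000)` and `unicode_widths:describe(16#1F600)` (an emoticon) both return `{4,4,true}`.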

kzemek · 2018-03-02

The tested characters are already multibyte, and the fix and test are about storing Unicode characters, not about search capabilities (whose behaviour is currently defined by the backend, not by our implementation).

arcusfelis · 2018-03-02

At least in MySQL, emoticons were broken without the utf8mb4 encoding (while lower multibyte code points worked just fine), so it's a real case.
If we have a test with them, then we would be sure that it works with all of our backends.

Search is a completely different topic...
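
The MySQL case comes down to whether a message body contains any code point above U+FFFF, since those are the ones that take four bytes in UTF-8. A small sketch of such a check, useful for example when picking test bodies, follows. The module and function names are hypothetical.

```erlang
-module(mb4_check).
-export([has_supplementary/1]).

%% True when the UTF-8 binary contains at least one code point above
%% U+FFFF, i.e. one that is encoded as four bytes in UTF-8.
%% Crashes with function_clause on input that is not valid UTF-8.
has_supplementary(<<>>) ->
    false;
has_supplementary(<<C/utf8, _/binary>>) when C > 16#FFFF ->
    true;
has_supplementary(<<_/utf8, Rest/binary>>) ->
    has_supplementary(Rest).
```

With this, `mb4_check:has_supplementary(<<"this is another one ȸ"/utf8>>)` is `false`, whereas `mb4_check:has_supplementary(<<240,159,152,128>>)` (U+1F600) is `true`, which is the kind of body the existing tests do not yet cover.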

codecov-io · 2018-03-07T14:19:06Z

Codecov Report

Merging #1748 into master will decrease coverage by 0.03%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master    #1748      +/-   ##
==========================================
- Coverage    74.6%   74.57%   -0.04%     
==========================================
  Files         283      283              
  Lines       26578    26569       -9     
==========================================
- Hits        19829    19814      -15     
- Misses       6749     6755       +6

| Impacted Files | Coverage Δ |
|---|---|
| src/mod_mam_riak_timed_arch_yz.erl | `87.96% <100%> (ø)` ⬆️ |
| src/rdbms/mongoose_rdbms_pgsql.erl | `90.9% <100%> (ø)` ⬆️ |
| src/rdbms/mongoose_rdbms_odbc.erl | `84.78% <40%> (-8.08%)` ⬇️ |
| src/mod_mam_odbc_arch.erl | `89.34% <75%> (-0.82%)` ⬇️ |
| src/mod_mam_muc_odbc_arch.erl | `82.12% <80%> (-0.94%)` ⬇️ |
| src/mod_mam_utils.erl | `87.41% <83.33%> (+0.04%)` ⬆️ |
| src/rdbms/mongoose_rdbms_mysql.erl | `86.95% <0%> (-4.35%)` ⬇️ |
| src/rdbms/mongoose_rdbms.erl | `71.22% <0%> (-1.44%)` ⬇️ |
| src/ejabberd_c2s.erl | `84.86% <0%> (+0.15%)` ⬆️ |

... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7be65c4...161ea50. Read the comment docs.

arcusfelis

good

kzemek force-pushed the fix-unicode-mam branch from 2075fee to 25388fb Compare March 2, 2018 14:10

arcusfelis reviewed Mar 2, 2018

kzemek force-pushed the fix-unicode-mam branch from 25388fb to 0d20253 Compare March 2, 2018 14:59

arcusfelis reviewed Mar 2, 2018

fenek added the WIP 🚧 label Mar 5, 2018

kzemek force-pushed the fix-unicode-mam branch 3 times, most recently from 2cea6a8 to 9624cdc Compare March 6, 2018 16:42

Fix storing unicode message bodies in MAM.

161ea50

kzemek force-pushed the fix-unicode-mam branch from 9624cdc to 161ea50 Compare March 7, 2018 10:42

kzemek added waiting-for-review and removed WIP 🚧 labels Mar 7, 2018

esl deleted a comment from fenek Mar 12, 2018

arcusfelis approved these changes Mar 12, 2018

arcusfelis added ready and removed waiting-for-review labels Mar 12, 2018

kzemek merged commit b076e4a into master Mar 13, 2018

kzemek deleted the fix-unicode-mam branch March 13, 2018 10:03

fenek added this to the 3.0.0 milestone Mar 14, 2018

kzemek mentioned this pull request Mar 16, 2018

[#17433] Saving unicode messages in Mysql #1459

Closed

kzemek mentioned this pull request Jul 19, 2018

Some unicode characters are not stored in MAM for Postgres #1703

Closed
