Parse Markdown in mdbook-xgettext #449

djmitche · 2023-02-22T23:19:50Z

This uses a full Markdown parser to parse the book contents into messages for translation.

The existing functionality broke messages on double-newlines (\n\n), otherwise returning the text exactly as found in the original file. The updated approach still takes string slices from the original file, but uses the markdown parser to better identify the breaks between messages.
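
To make the limitation of the old behavior concrete, here is a minimal sketch of double-newline splitting. The function name `split_on_blank_lines` is a hypothetical stand-in, not the code that was in `i18n-helpers/src/lib.rs`; it only illustrates why a fenced code block containing a blank line ends up in two messages.

```rust
/// Splits `text` into messages at the literal "\n\n" sequence and returns
/// slices of the original text. Hypothetical stand-in for the old
/// extraction logic, written only to illustrate its behavior.
fn split_on_blank_lines(text: &str) -> Vec<&str> {
    text.split("\n\n")
        // Drop stray newlines left over at the edges of each chunk.
        .map(|chunk| chunk.trim_matches('\n'))
        // Three or more newlines in a row would produce empty chunks.
        .filter(|chunk| !chunk.is_empty())
        .collect()
}
```

With an input such as a heading followed by a code block that has a blank line between two functions, this yields three messages: the heading, the first half of the code block, and the second half.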

Status: work in progress

Fixes #318.

mgeisler · 2023-02-23T11:16:27Z

This looks like a great start, thanks for working on it!

djmitche · 2023-02-24T23:04:44Z

A bit of experimentation with the existing translations shows no difference for some of them, especially de and da. I suspect that these are just out of date? I do see differences for ko and pt-BR.

My thinking is that I will run a msgmerge and then update the .po file by hand to where the resulting output is the same as it currently is on main. But, I suspect that this is going to drop a lot of translations (probably almost everything for de and da) that have not yet been updated due to other changes in the text that have landed on main.

I'm not sure what the other options are, though. What do you think?

mgeisler · 2023-02-25T15:43:44Z

> A bit of experimentation with the existing translations shows no difference for some of them, especially de and da. I suspect that these are just out of date? I do see differences for ko and pt-BR.

Yes, the German and Danish translations are very incomplete right now. The only full translations are Korean and Portuguese.

> My thinking is that I will run a msgmerge and then update the .po file by hand to where the resulting output is the same as it currently is on main. But, I suspect that this is going to drop a lot of translations (probably almost everything for de and da) that have not yet been updated due to other changes in the text that have landed on main.

Yeah, if the translations are out of date, there's not much we can do about it.

However, I was not thinking that you would msgmerge anything into the existing translations. Instead, I was thinking that you would do something like this:

1. Use the msgid fields to reassemble the original English input to the .po file. This should be possible since the current .po files are complete: they only leave out blank lines.
2. Do the same with the msgstr fields to get a (partially) translated file.
3. Run the new extract_msgs function on both files. This ought to give you pairs of messages which you can use as msgid and msgstr for a .po file.

The effect of this would be a .po file where the code blocks are no longer split. It should also give you a .po file where headings can be saved without #, and where list items can be saved item-by-item.
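
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `reassemble`, `toy_extract`, and `pair_messages` are hypothetical names, and `toy_extract` is only a toy stand-in for the real `extract_msgs` (it splits on blank lines but keeps fenced code blocks together). The Danish string in the usage note is made up.

```rust
/// Rebuilds a document from message texts by joining them with blank lines.
fn reassemble(messages: &[&str]) -> String {
    messages.join("\n\n")
}

/// Toy stand-in for the new extractor: splits on blank lines, but keeps
/// fenced code blocks (``` ... ```) together in one message.
fn toy_extract(text: &str) -> Vec<&str> {
    let mut messages = Vec::new();
    let mut in_fence = false;
    let mut start = 0; // byte offset where the current message begins
    let mut offset = 0; // byte offset of the current line
    for line in text.split_inclusive('\n') {
        if line.trim_start().starts_with("```") {
            in_fence = !in_fence;
        }
        if line.trim().is_empty() && !in_fence {
            // A blank line outside a fence ends the current message.
            let msg = text[start..offset].trim_end();
            if !msg.is_empty() {
                messages.push(msg);
            }
            start = offset + line.len();
        }
        offset += line.len();
    }
    let last = text[start..].trim_end();
    if !last.is_empty() {
        messages.push(last);
    }
    messages
}

/// Runs `extract` on both documents and pairs the results up as
/// (msgid, msgstr). Returns `None` when the documents yield a different
/// number of messages, i.e. when their Markdown structure differs.
fn pair_messages<'a, F>(
    english: &'a str,
    translated: &'a str,
    extract: F,
) -> Option<Vec<(&'a str, &'a str)>>
where
    F: Fn(&'a str) -> Vec<&'a str>,
{
    let ids = extract(english);
    let strs = extract(translated);
    if ids.len() != strs.len() {
        return None;
    }
    Some(ids.into_iter().zip(strs).collect())
}
```

Reassembling the old messages `"# Welcome"`, the first half of a code block, and its second half, then doing the same for a translation starting with `"# Velkommen"`, gives two pairs: the heading pair and one pair holding the whole code block. The `None` case corresponds to the caveat in the next paragraph about translations whose structure differs from the original.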

However, this only works when the translated paragraphs have the same Markdown structure as the original text. @jooyunghan indicated that this is not always the case for Korean. @jiyongp, @rastringer, @hugojacob, and @ronaldfw: what do you think about this?

See also the discussion in #318.

djmitche · 2023-03-02T23:54:27Z

There are a lot of complications and edge cases in trying to follow this clever approach:

* the .po file contains a concatenation of all messages, with duplicates removed, so only sort of in order
* concatenating the msgids leads Markdown to sometimes combine lists from multiple files (e.g., where a header is duplicated and thus omitted)
* there are a lot of occurrences of <details> and </details> in the .po file that cause spurious differences; fixing these causes lots more cases of merging lists
* there are places where lists are separated by \n \n and not \n\n, which causes different parses
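
The last point can be shown with a small sketch. Both function names are hypothetical. The first splits on the literal `\n\n` sequence; the second treats any whitespace-only line as blank, which is how CommonMark defines a blank line.

```rust
/// Splits on the literal "\n\n" sequence only.
fn split_literal(text: &str) -> Vec<&str> {
    text.split("\n\n").collect()
}

/// Counts blocks of consecutive non-blank lines, treating a line that
/// contains only whitespace as blank.
fn count_blocks(text: &str) -> usize {
    let mut count = 0;
    let mut in_block = false;
    for line in text.lines() {
        if line.trim().is_empty() {
            in_block = false;
        } else if !in_block {
            in_block = true;
            count += 1;
        }
    }
    count
}
```

For the input `"* a\n \n* b"`, `split_literal` returns a single chunk while `count_blocks` reports two blocks, so the two approaches disagree on where messages begin and end.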

I've pushed a commit with my helper in it, after having manually edited po/ko.po enough that it ran successfully. You can replicate what I'm seeing with

cargo run -p i18n-helpers --bin helper
bash compare.sh

Here's what's happening:

* The helper binary takes in po/ko.po and reconstructs a large Markdown file from the msgids it finds there, including only messages which actually have a translation. It then re-parses that Markdown with the new parser and uses the offsets of the old and new parses to generate a new set of messages by concatenating msgstr entries from the old catalog. The edits to po/ko.po in this commit are there to line up the boundaries properly and avoid the bail! case. It then writes the result out to po/ko-new.po.
* The compare.sh script generates a full Markdown output of both the original (from origin/main) translation and the new (po/ko-new.po) translation, and diffs the two.
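
The offset-based regrouping described above might look roughly like the sketch below. This is not the helper from the commit: the name `regroup`, the error type, and the choice to join translations with a blank line are all assumptions made for illustration. The `Err` branch plays the role of the bail! case.

```rust
use std::ops::Range;

/// Builds a translation for each new message by concatenating the
/// translations of the old messages it covers.
///
/// `old` holds (byte range, msgstr) pairs in document order and `new`
/// holds the byte ranges found by the new parser. Returns an error when
/// a new range cuts through an old message, since the boundaries then
/// do not line up.
fn regroup(
    old: &[(Range<usize>, &str)],
    new: &[Range<usize>],
) -> Result<Vec<String>, String> {
    let mut result = Vec::new();
    for new_range in new {
        let mut parts = Vec::new();
        for (old_range, msgstr) in old {
            let inside =
                new_range.start <= old_range.start && old_range.end <= new_range.end;
            let disjoint =
                old_range.end <= new_range.start || new_range.end <= old_range.start;
            if inside {
                parts.push(*msgstr);
            } else if !disjoint {
                return Err(format!(
                    "new message {:?} splits old message {:?}",
                    new_range, old_range
                ));
            }
        }
        // Assumption: old messages were separated by one blank line.
        result.push(parts.join("\n\n"));
    }
    Ok(result)
}
```

If two old messages held the two halves of a code block and the new parser reports one range spanning both, the result contains a single translation made of both halves. A new range that ends in the middle of an old message produces an error instead.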

The result is that lots of things aren't translated anymore, for various fiddly reasons suggested above. Surprisingly, some things are newly translated, for example

18,21c15
< * Experience with Java, Go, Python, JavaScript...: You get the same memory safety
<   as in those languages, plus a similar high-level language feeling. In addition
<   you get fast and predictable performance like C and C++ (no garbage collector)
<   as well as access to low-level hardware (should you need it)
---
> * Java, Go, Python, JaveScript: 이 언어들과 동일한 메모리 안정성과 함께, '하이레벨'언어의 느낌을 느낄 수 있습니다. 거기에 더해, 가비지 컬렉터가 없는 C/C++와 유사한 수준의 빠르고 예측 가능한 성능을 기대할 수 있습니다. 그리고 필요한 경우 저수준 하드웨어를 다루는 코드로 작성할 수 있습니다.

I feel like this is a never-ending struggle, and maybe what I've got is good enough? What do you think?

djmitche · 2023-03-02T23:54:54Z

(and, I see there are conflicts in po/ko.po, suggesting my work is bitrotting, too!)

mgeisler · 2023-03-03T16:42:53Z

> the .po file contains a concatenation of all messages, with duplicates removed, so only sort of in order
>
> concatenating the msgids leads Markdown to sometimes combine lists from multiple files (e.g., where a header is duplicated and thus omitted)

Right! I was conceptually thinking that you would do this on a per-file basis. So for a given foo.md file, you can find all msgid entries that are relevant for this file. That should let you reconstruct the English version of foo.md at the time the xx.po file was last touched. From that you can get a foo-xx.md file by doing the same with the msgstr entries.

> I feel like this is a never-ending struggle, and maybe what I've got is good enough? What do you think?

Yes, that's a good point... this should not turn into a marathon project where everything must be perfect. When I look at the differences between the messages.pot file before and after, most changes look exactly like what I would expect:

Here we have a code block which was split before and is now joined into a single message.

The only other big change is the bullet points in the lists:

Here individual bullet points have been turned into one big message. If they would be split into individual messages, then the output would be just like before (and thus safe for the existing translations).

I would suggest implementing this: splitting lists into individual bullet points. That transformation alone sounds like something that can be safely executed on all msgid and msgstr fields today in a lossless fashion. You could even detect when a msgid results in n messages and the corresponding msgstr becomes m != n messages (then just mark the m new messages fuzzy so that the translators can quickly proofread them).
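
A minimal sketch of this suggestion, under stated assumptions: `split_list_items`, `Entry`, and `split_entry` are hypothetical names, item detection is a toy check for lines starting with `* ` or `- ` rather than a real Markdown parse, and the handling of the m != n case (pair positionally, let the last msgid absorb surplus translated items) is one possible choice that the comment above does not specify.

```rust
/// Splits a Markdown list message into one message per top-level item.
/// An item starts at a line beginning with "* " or "- "; other lines
/// (such as indented continuation lines) stay with the current item.
fn split_list_items(message: &str) -> Vec<String> {
    let mut items: Vec<String> = Vec::new();
    for line in message.lines() {
        let is_bullet = line.starts_with("* ") || line.starts_with("- ");
        if is_bullet || items.is_empty() {
            items.push(line.to_string());
        } else if let Some(last) = items.last_mut() {
            last.push('\n');
            last.push_str(line);
        }
    }
    items
}

/// One entry of the rewritten catalog.
#[derive(Debug, PartialEq)]
struct Entry {
    msgid: String,
    msgstr: String,
    fuzzy: bool,
}

/// Splits one (msgid, msgstr) pair into per-item entries. When the two
/// sides yield a different number of items, every entry is marked fuzzy.
fn split_entry(msgid: &str, msgstr: &str) -> Vec<Entry> {
    let ids = split_list_items(msgid);
    let strs = split_list_items(msgstr);
    let n = ids.len();
    let fuzzy = n != strs.len();
    ids.into_iter()
        .enumerate()
        .map(|(i, msgid)| {
            let msgstr = if i + 1 == n {
                // Assumption: the last msgid absorbs surplus translated items.
                strs.get(i..).map(|rest| rest.join("\n")).unwrap_or_default()
            } else {
                strs.get(i).cloned().unwrap_or_default()
            };
            Entry { msgid, msgstr, fuzzy }
        })
        .collect()
}
```

When both sides have the same number of items, the entries come out one-to-one and none is fuzzy. When the translation has more or fewer items, nothing is dropped, but every resulting entry is flagged for proofreading.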

With that, we should have a drop-in replacement: the code blocks will lose their translation, but we don't have a lot of those strings (I looked for relevant strings in msggrep -K -e ';' po/messages.pot | msggrep -K -e '//' | msggrep -K -v -e '// ANCHOR' and found ~50 translatable comments in total).

Does that approach sound doable?

djmitche · 2023-03-06T18:44:46Z

> Right! I was conceptually thinking that you would do this on a per-file basis. So for a given foo.md file, you can find all msgid entries that are relevant for this file.

Last week, I was thinking this was impossible due to the deduplication, but the `#:` comments do, indeed, contain enough data to do this. And in fact, fixing this results in a surprisingly good result, according to the reproduction steps above. So, I think I will update the other translations as I've done for ko, and then leave it at that.
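
As an illustration of how the `#:` comments allow per-file reconstruction despite deduplication, here is a sketch that groups msgids by source file. It assumes the usual gettext reference format of space-separated `file:line` pairs; the function names and the file names in the usage note are hypothetical, and this is not the helper's actual code.

```rust
use std::collections::BTreeMap;

/// Parses the references in a `#:` comment line, e.g.
/// "#: src/foo.md:1 src/bar.md:12", into (file, line) pairs.
/// References that do not end in ":<number>" are skipped.
fn parse_references(comment: &str) -> Vec<(String, u32)> {
    comment
        .strip_prefix("#:")
        .unwrap_or("")
        .split_whitespace()
        .filter_map(|reference| {
            let (file, line) = reference.rsplit_once(':')?;
            Some((file.to_string(), line.parse().ok()?))
        })
        .collect()
}

/// Groups msgids by source file, ordered by line number within each file.
/// `entries` holds (reference comment, msgid) pairs from the catalog.
/// A deduplicated message with several references lands in every file.
fn group_by_file(entries: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut by_file: BTreeMap<String, Vec<(u32, String)>> = BTreeMap::new();
    for (comment, msgid) in entries {
        for (file, line) in parse_references(comment) {
            by_file.entry(file).or_default().push((line, msgid.to_string()));
        }
    }
    by_file
        .into_iter()
        .map(|(file, mut msgs)| {
            msgs.sort();
            (file, msgs.into_iter().map(|(_, msgid)| msgid).collect())
        })
        .collect()
}
```

A heading referenced as `#: src/foo.md:1 src/bar.md:1` then appears in the reconstructed message list of both files, which avoids the list-merging problem caused by a duplicated heading being omitted.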

There are a few reasons to not want to delve into the per-list-item parse:

* The formatting issues I mentioned in Extract text more carefully in mdbook-xgettext #318.
* Some list items are separated into paragraphs (\n\n) while some are not. And some are separated by \n \n.
* Doing this well gets into reformatting text - removing newlines, leading space, etc. - and that's a whole bunch of complexity that I don't want to add to this PR.

djmitche · 2023-03-06T20:02:02Z

OK, I've finished all four languages. There are still some discrepancies, but only a handful.

mgeisler

Wow, I think this looks great!

Please take out the commit with the helper — unless it's somehow useful in the future when we do further transformations?

I agree with you that we should get this merged so that we can start streamlining the PO files to the new parsing.

* Parse Markdown to support translation. This upgrades from just splitting Markdown files on double-newlines, to using a Markdown parser to break them into more appropriate chunks. The upshot is that code samples are all in one message, lists are bundled together, and generally it should be easier to translate.
* [WIP] helper to update po files for new translation
* process synthetic input file-by-file
* review comments
* remove temporary code
* fix msgfmt lints

djmitche marked this pull request as draft February 22, 2023 23:19

djmitche mentioned this pull request Feb 22, 2023

Extract text more carefully in mdbook-xgettext #318

Closed

mgeisler reviewed Feb 23, 2023

djmitche force-pushed the issue318 branch 2 times, most recently from 8071fb4 to 5e730f4 Compare February 23, 2023 22:45

mgeisler mentioned this pull request Feb 24, 2023

Show Markdown diff as unified diff #454

Merged

djmitche force-pushed the issue318 branch from 5e730f4 to a3fc5cf Compare February 24, 2023 22:47

djmitche force-pushed the issue318 branch from 916c404 to 38ec934 Compare March 6, 2023 18:40

djmitche force-pushed the issue318 branch from 38ec934 to d79531a Compare March 6, 2023 20:00

djmitche marked this pull request as ready for review March 6, 2023 20:01

djmitche requested review from rastringer, hugojacob, jiyongp, jooyunghan, ronaldfw and fechu as code owners March 6, 2023 20:01

mgeisler reviewed Mar 7, 2023

mgeisler approved these changes Mar 7, 2023

mgeisler changed the title from "Parse markdown in mdbook-xgettext" to "Parse Markdown in mdbook-xgettext" Mar 7, 2023

djmitche added 5 commits March 7, 2023 17:48

[WIP] helper to update po files for new translation

5db27d0

process synthetic input file-by-file

6b2bcdf

review comments

309bf8c

remove temporary code

effed39

djmitche force-pushed the issue318 branch from d79531a to effed39 Compare March 7, 2023 17:50

fix msgfmt lints

15defa3

djmitche enabled auto-merge (squash) March 7, 2023 18:02

djmitche merged commit ba28dd2 into google:main Mar 7, 2023

mgeisler mentioned this pull request Mar 15, 2023

Normalize Markdown in .pot files google/mdbook-i18n-helpers#19

Closed
