Parse Markdown in mdbook-xgettext #449

Merged on Mar 7, 2023 (6 commits)

Conversation

djmitche
Collaborator

This uses a full Markdown parser to parse the book contents into messages for translation.

The existing functionality split the text into messages at double newlines (\n\n) and otherwise returned the text exactly as found in the original file. The updated approach still takes string slices from the original file, but uses the Markdown parser to better identify the breaks between messages.
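
As a rough illustration of the old behaviour, here is a minimal sketch that assumes a hypothetical `split_on_blank_lines` helper (it is not the actual code in i18n-helpers). It shows why a fenced code block containing a blank line ends up in two separate messages when the text is split on double newlines.

```rust
/// Hypothetical stand-in for the old behaviour: split a document into
/// messages at double newlines, returning slices of the original text.
fn split_on_blank_lines(document: &str) -> Vec<&str> {
    document
        .split("\n\n")
        // Drop stray newlines left over from runs of three or more.
        .map(|chunk| chunk.trim_matches('\n'))
        .filter(|chunk| !chunk.is_empty())
        .collect()
}
```

Calling this on a heading followed by a code block with a blank line between two functions yields three messages: the heading, the first half of the code block, and the second half with the closing fence.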

Status: work in progress

Fixes #318.

@djmitche djmitche marked this pull request as draft February 22, 2023 23:19
@mgeisler
Collaborator

This looks like a great start, thanks for working on it!

@djmitche
Collaborator Author

A bit of experimentation with the existing translations shows no difference for some of them, especially de and da. I suspect that these are just out of date? I do see differences for ko and pt-BR.

My thinking is that I will run a msgmerge and then update the .po file by hand to where the resulting output is the same as it currently is on main. But, I suspect that this is going to drop a lot of translations (probably almost everything for de and da) that have not yet been updated due to other changes in the text that have landed on main.

I'm not sure what the other options are, though. What do you think?

@mgeisler
Collaborator

> A bit of experimentation with the existing translations shows no difference for some of them, especially de and da. I suspect that these are just out of date? I do see differences for ko and pt-BR.

Yes, the German and Danish translations are very incomplete right now. The only full translations are Korean and Portuguese.

> My thinking is that I will run a msgmerge and then update the .po file by hand to where the resulting output is the same as it currently is on main. But, I suspect that this is going to drop a lot of translations (probably almost everything for de and da) that have not yet been updated due to other changes in the text that have landed on main.

Yeah, if the translations are out of date, there's not much we can do about it.

However, I was not thinking that you would msgmerge anything into the existing translations. Instead, I was thinking that you would do something like this:

  1. Use the msgid fields to reassemble the original English input to the .po file. This should be possible since the current .po files are complete: they only leave out blank lines.
  2. Do the same with the msgstr fields to get a (partially) translated file.
  3. Run the new extract_msgs function on both files. This ought to give you pairs of messages which you can use as msgid and msgstr for a .po file.
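
The three steps above could be sketched roughly as follows. This is only an illustration: `pair_messages` and `toy_extract` are hypothetical names, and `extract` stands in for the new `extract_msgs` function, whose real signature is not shown in this thread.

```rust
/// Toy extractor used for illustration only: one message per paragraph.
fn toy_extract(text: &str) -> Vec<&str> {
    text.split("\n\n").filter(|m| !m.is_empty()).collect()
}

/// Pair up messages extracted from the reassembled English text and the
/// reassembled translated text. `extract` is a stand-in for the new
/// `extract_msgs` function.
///
/// Returns `None` when the two texts yield a different number of messages,
/// which means their Markdown structure differs and positional pairing
/// would be unsafe.
fn pair_messages<'a>(
    english: &'a str,
    translated: &'a str,
    extract: fn(&str) -> Vec<&str>,
) -> Option<Vec<(&'a str, &'a str)>> {
    let msgids = extract(english);
    let msgstrs = extract(translated);
    if msgids.len() != msgstrs.len() {
        return None;
    }
    Some(msgids.into_iter().zip(msgstrs).collect())
}
```

The `None` case corresponds to the caveat in the next paragraph: pairing by position only works when both texts have the same Markdown structure.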

The effect of this would be a .po file where the code blocks are no longer split. It should also give you a .po file where headings can be saved without #, and where list items can be saved item-by-item.

However, this only works when the translated paragraphs have the same Markdown structure as the original text. @jooyunghan indicated that this is not always the case for Korean. @jiyongp, @rastringer, @hugojacob, and @ronaldfw what do you think about this?

See also the discussion in #318.

@djmitche
Collaborator Author

djmitche commented Mar 2, 2023

There are a lot of complications and edge cases in trying to follow this clever approach:

  • the .po file contains a concatenation of all messages, with duplicates removed, so only sort of in order
  • concatenating the msgids leads Markdown to sometimes combine lists from multiple files (e.g., where a header is duplicated and thus omitted)
  • there are a lot of occurrences of <details> and </details> in the .po file that cause spurious differences; fixing these causes lots more cases of merging lists
  • there are places where lists are separated by \n \n and not \n\n, which causes different parses
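
The last point can be illustrated with a small sketch. Both function names are hypothetical and neither is the code from this PR. The first function splits on the literal string "\n\n", as the old extraction did; the second treats any whitespace-only line as a separator, which is closer to how a Markdown parser sees blank lines.

```rust
/// Count the chunks produced by splitting on the exact string "\n\n".
fn naive_chunks(text: &str) -> usize {
    text.split("\n\n").filter(|c| !c.trim().is_empty()).count()
}

/// Count chunks when any whitespace-only line is treated as a separator.
fn blank_line_chunks(text: &str) -> usize {
    let mut count = 0;
    let mut in_chunk = false;
    for line in text.lines() {
        if line.trim().is_empty() {
            // A line with only spaces or tabs ends the current chunk.
            in_chunk = false;
        } else if !in_chunk {
            in_chunk = true;
            count += 1;
        }
    }
    count
}
```

For the input "* one\n \n* two" the first function sees one chunk while the second sees two, which is the kind of mismatch that makes old and new message boundaries disagree.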

I've pushed a commit with my helper in it, after having manually edited po/ko.po enough that it ran successfully. You can replicate what I'm seeing with

cargo run -p i18n-helpers --bin helper
bash compare.sh

Here's what's happening:

  • The helper binary takes in po/ko.po and reconstructs a large Markdown file from the msgids it finds there, including only messages which actually have a translation. It then re-parses that Markdown with the new parser and uses the offsets of the old and new parses to generate a new set of messages by concatenating msgstrs from the old catalog. The edits to po/ko.po in this commit are there to line up the boundaries properly and avoid the bail! case. It then writes the result out to po/ko-new.po.
  • The compare.sh script generates a full markdown output of both the original (from origin/main) translation and the new (po/ko-new.po) translation, and diffs the two.
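
The offset-based regrouping described in the first bullet might look roughly like the sketch below. It is not the helper from the commit: `reconstruct` and `regroup` are hypothetical names, the blank-line separator is an assumption, and the error return plays the role of the `bail!` case.

```rust
/// Byte range of a message within the reconstructed document.
type Span = (usize, usize);

/// Join the old msgids with blank lines and record where each one landed.
fn reconstruct(old: &[(&str, &str)]) -> (String, Vec<Span>) {
    let mut document = String::new();
    let mut spans = Vec::new();
    for (msgid, _) in old {
        if !document.is_empty() {
            document.push_str("\n\n");
        }
        let start = document.len();
        document.push_str(msgid);
        spans.push((start, document.len()));
    }
    (document, spans)
}

/// For each new span, concatenate the msgstrs of the old messages it covers.
/// Fails if an old message straddles a new message boundary.
fn regroup(
    old: &[(&str, &str)],
    old_spans: &[Span],
    new_spans: &[Span],
) -> Result<Vec<String>, String> {
    let mut result = Vec::new();
    for &(new_start, new_end) in new_spans {
        let mut parts = Vec::new();
        for (&(start, end), (_, msgstr)) in old_spans.iter().zip(old) {
            let inside = start >= new_start && end <= new_end;
            let outside = end <= new_start || start >= new_end;
            if inside {
                parts.push(*msgstr);
            } else if !outside {
                return Err(format!(
                    "old message at {}..{} straddles new message at {}..{}",
                    start, end, new_start, new_end
                ));
            }
        }
        result.push(parts.join("\n\n"));
    }
    Ok(result)
}
```

Here `new_spans` would come from running the new parser over the reconstructed document. When the old and new boundaries do not line up, the function returns an error instead of guessing, which is why the manual edits to po/ko.po were needed.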

The result is that lots of things aren't translated anymore, for various fiddly reasons suggested above. Surprisingly, some things are newly translated, for example

18,21c15
< * Experience with Java, Go, Python, JavaScript...: You get the same memory safety
<   as in those languages, plus a similar high-level language feeling. In addition
<   you get fast and predictable performance like C and C++ (no garbage collector)
<   as well as access to low-level hardware (should you need it)
---
> * Java, Go, Python, JaveScript: 이 언어들과 동일한 메모리 안정성과 함께, '하이레벨'언어의 느낌을 느낄 수 있습니다. 거기에 더해, 가비지 컬렉터가 없는 C/C++와 유사한 수준의 빠르고 예측 가능한 성능을 기대할 수 있습니다. 그리고 필요한 경우 저수준 하드웨어를 다루는 코드로 작성할 수 있습니다.

I feel like this is a never-ending struggle, and maybe what I've got is good enough? What do you think?

@djmitche
Collaborator Author

djmitche commented Mar 2, 2023

(and, I see there are conflicts in po/ko.po, suggesting my work is bitrotting, too!)

@mgeisler
Collaborator

mgeisler commented Mar 3, 2023

>   • the .po file contains a concatenation of all messages, with duplicates removed, so only sort of in order
>   • concatenating the msgids leads Markdown to sometimes combine lists from multiple files (e.g., where a header is duplicated and thus omitted)

Right! I was conceptually thinking that you would do this on a per-file basis. So for a given foo.md file, you can find all msgid entries that are relevant for this file. That should let you reconstruct the English version of foo.md at the time the xx.po file was last touched. From that you can get a foo-xx.md file by doing the same with the msgstr entries.

> I feel like this is a never-ending struggle, and maybe what I've got is good enough? What do you think?

Yes, that's a good point... this should not turn into a marathon project where everything must be perfect. When I look at the differences between the messages.pot file before and after, most changes look exactly like what I would expect:

image

Here we have a code block which was split before and is now joined into a single message.

The only other big change is the bullet points in the lists:

image

Here individual bullet points have been turned into one big message. If they were split into individual messages, then the output would be just like before (and thus safe for the existing translations).

I would suggest implementing this: splitting lists into individual bullet points. That transformation alone sounds like something that can be safely executed on all msgid and msgstr fields today in a lossless fashion. You could even detect when a msgid results in n messages and the corresponding msgstr becomes m != n messages (then just mark the m new messages fuzzy so that the translators can quickly proofread them).
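
A hedged sketch of that idea follows. `split_list_items`, `Entry`, and `split_entry` are hypothetical names, only top-level `* ` and `- ` markers are recognised, and the handling of the n != m case (folding surplus translated items into the last entry, leaving missing ones empty) is just one possible policy, not something settled in this thread.

```rust
/// Split a Markdown list into its top-level items. Indented continuation
/// lines stay with the preceding item. Text without a list marker is
/// returned as a single message.
fn split_list_items(message: &str) -> Vec<String> {
    let mut items: Vec<String> = Vec::new();
    for line in message.lines() {
        let starts_item = line.starts_with("* ") || line.starts_with("- ");
        if starts_item || items.is_empty() {
            items.push(line.to_string());
        } else if let Some(last) = items.last_mut() {
            last.push('\n');
            last.push_str(line);
        }
    }
    items
}

/// One regenerated PO entry; `fuzzy` asks translators to proofread it.
#[derive(Debug, PartialEq)]
struct Entry {
    msgid: String,
    msgstr: String,
    fuzzy: bool,
}

/// Split one (msgid, msgstr) pair into per-item entries. When the item
/// counts differ, every resulting entry is marked fuzzy.
fn split_entry(msgid: &str, msgstr: &str) -> Vec<Entry> {
    let ids = split_list_items(msgid);
    let mut strs = split_list_items(msgstr);
    let fuzzy = ids.len() != strs.len();
    if strs.len() > ids.len() {
        // Assumed policy: fold surplus translated items into the last slot.
        let surplus = strs.split_off(ids.len()).join("\n");
        if let Some(last) = strs.last_mut() {
            last.push('\n');
            last.push_str(&surplus);
        }
    }
    // Assumed policy: missing translations become empty strings.
    strs.resize(ids.len(), String::new());
    ids.into_iter()
        .zip(strs)
        .map(|(msgid, msgstr)| Entry { msgid, msgstr, fuzzy })
        .collect()
}
```

When both sides split into the same number of items, the entries are paired positionally and left unmarked, which matches the lossless case described above.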

With that, we should have a drop-in replacement: the code blocks will lose their translation, but we don't have a lot of those strings (I looked for relevant strings with `msggrep -K -e ';' po/messages.pot | msggrep -K -e '//' | msggrep -K -v -e '// ANCHOR'` and found ~50 translatable comments in total).

Does that approach sound doable?

@djmitche
Collaborator Author

djmitche commented Mar 6, 2023

> Right! I was conceptually thinking that you would do this on a per-file basis. So for a given foo.md file, you can find all msgid entries that are relevant for this file.

Last week, I was thinking this was impossible due to the deduplication, but the `#:` comments do, indeed, contain enough data to do this. And in fact, fixing this gives a surprisingly good result, according to the reproduction steps above. So, I think I will update the other translations as I've done for ko, and then leave it at that.
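
To make the per-file idea concrete, here is a minimal sketch that groups msgids by the source file named in their `#:` reference comments. It assumes references of the form `file:line` separated by whitespace, and `parse_references` and `group_by_file` are hypothetical names rather than functions from the helper.

```rust
use std::collections::BTreeMap;

/// Parse a PO reference comment such as `#: src/a.md:12 src/b.md:3` into
/// (file, line) pairs. Lines that are not reference comments yield an
/// empty list.
fn parse_references(comment: &str) -> Vec<(String, usize)> {
    let rest = match comment.strip_prefix("#:") {
        Some(rest) => rest,
        None => return Vec::new(),
    };
    rest.split_whitespace()
        .filter_map(|reference| {
            let (file, line) = reference.rsplit_once(':')?;
            Some((file.to_string(), line.parse().ok()?))
        })
        .collect()
}

/// Group msgids by source file, ordered by line number, so that each file
/// can be reassembled separately even though the catalog is deduplicated.
/// Each input entry is a (reference comment, msgid) pair.
fn group_by_file<'a>(entries: &[(&str, &'a str)]) -> BTreeMap<String, Vec<&'a str>> {
    let mut by_file: BTreeMap<String, Vec<(usize, &'a str)>> = BTreeMap::new();
    for (comment, msgid) in entries {
        // A deduplicated message is listed under every file that uses it.
        for (file, line) in parse_references(comment) {
            by_file.entry(file).or_default().push((line, *msgid));
        }
    }
    by_file
        .into_iter()
        .map(|(file, mut msgs)| {
            msgs.sort_by_key(|(line, _)| *line);
            let ordered: Vec<&'a str> = msgs.into_iter().map(|(_, msgid)| msgid).collect();
            (file, ordered)
        })
        .collect()
}
```

A message that appears in two files carries both references, so it is placed into both reconstructed files at the right position.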

There are a few reasons to not want to delve into the per-list-item parse:

  • The formatting issues I mentioned in Extract text more carefully in mdbook-xgettext #318.
  • Some list items are separated into paragraphs (\n\n) while some are not. And some are separated by \n \n.
  • Doing this well gets into reformatting text - removing newlines, leading space, etc. - and that's a whole bunch of complexity that I don't want to add to this PR.

@djmitche djmitche marked this pull request as ready for review March 6, 2023 20:01
@djmitche
Collaborator Author

djmitche commented Mar 6, 2023

OK, I've finished all four languages. There are still some discrepancies, but only a handful.

Collaborator

@mgeisler mgeisler left a comment

Wow, I think this looks great!

Please take out the commit with the helper — unless it's somehow useful in the future when we do further transformations?

I agree with you that we should get this merged so that we can start streamlining the PO files to the new parsing.

@mgeisler mgeisler changed the title from "Parse markdown in mdbook-xgettext" to "Parse Markdown in mdbook-xgettext" Mar 7, 2023
djmitche added 5 commits March 7, 2023 17:48
This upgrades from just splitting Markdown files on double-newlines, to
using a Markdown parser to break them into more appropriate chunks. The
upshot is that code samples are all in one message, lists are bundled
together, and generally it should be easier to translate.
@djmitche djmitche enabled auto-merge (squash) March 7, 2023 18:02
@djmitche djmitche merged commit ba28dd2 into google:main Mar 7, 2023
NoahDragon pushed a commit to wnghl/comprehensive-rust that referenced this pull request Jul 19, 2023
* Parse Markdown to support translation.

* [WIP] helper to update po files for new translation

* process synthetic input file-by-file

* review comments

* remove temporary code

* fix msgfmt lints