Normalize Markdown in `.pot` files #19

mgeisler · 2023-03-10T11:28:08Z

When mdbook-xgettext extracts translatable text, it would be great if it could normalize the strings. This would make it possible for us to reformat the entire course without fearing that the translations get destroyed while doing so.

The normalization would take Markdown like this

# This is a heading

This is another heading
=======================

A _little_
paragraph.

```rust,editable
fn main() {
    println!("Hello world!");
}
```

* First
* Second

and turn it into these messages in the .pot file:

"This is a heading" (atx heading is stripped)
"This is another heading" (setext heading is stripped)
"A _little_ paragraph." (soft-wrapped lines are unfolded)
"fn main() {\n println!("Hello world!");\n}" (info string is stripped, we should instead use a #, flag)
"First" (bullet point extracted individually)
"Second"

Like in google/comprehensive-rust#318, we should do this in a step-by-step fashion and make sure to apply the transformations to the existing translations. It would also be good if we have a way to let translators update their not-yet-submitted translations.
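To make the proposed normalization concrete, here is a minimal sketch in Python. It is not how mdbook-xgettext is implemented: the function name `extract_messages` is hypothetical, and it handles only the constructs from the example above (atx and setext headings, soft-wrapped paragraphs, fenced code blocks, and flat bullet lists) with line-based matching instead of a real Markdown parser.

```python
import re

FENCE = "`" * 3  # the three-backtick code fence marker


def extract_messages(markdown: str) -> list[str]:
    """Return normalized messages for a small subset of Markdown."""
    messages = []
    paragraph = []  # lines of the paragraph currently being collected
    code = None     # lines of the fenced code block being collected, if any

    def flush_paragraph():
        if paragraph:
            # Unfold soft-wrapped lines into a single line.
            messages.append(" ".join(paragraph))
            paragraph.clear()

    for line in markdown.splitlines():
        if code is not None:
            if line.startswith(FENCE):
                messages.append("\n".join(code))  # closing fence: emit the block
                code = None
            else:
                code.append(line)
        elif line.startswith(FENCE):
            flush_paragraph()
            code = []  # opening fence: the info string is dropped
        elif not line.strip():
            flush_paragraph()  # a blank line ends the paragraph
        elif paragraph and re.fullmatch(r"=+|-+", line.strip()):
            flush_paragraph()  # setext underline: the text above is the heading
        elif m := re.match(r"#{1,6}\s+(.*)", line):
            flush_paragraph()
            messages.append(m.group(1))  # atx heading: strip the leading '#'
        elif m := re.match(r"[*+-]\s+(.*)", line):
            flush_paragraph()
            messages.append(m.group(1))  # each bullet becomes its own message
        else:
            paragraph.append(line.strip())
    flush_paragraph()
    return messages
```

Because the wrapping and the block markers never reach the messages, re-wrapping a paragraph or switching heading styles in the source would leave the extracted strings unchanged in this sketch.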

mgeisler · 2023-03-13T08:20:03Z

When writing this, it occurred to me that we can get to the end-state of nicely formatted Markdown by running a formatter on

The .md files
The msgid fields in all .po files
The msgstr fields in all .po files

If we do this in lock-step, we ought to end up with a lossless process. I was experimenting with this over the weekend: the msgfilter program can process the msgstr fields, but we need our own little helper to do the same with the msgid fields.
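The lock-step idea can be sketched as follows. This is only a model under stated assumptions: `reformat` is a stand-in for a real formatter such as dprint, the dictionary-based catalog is a hypothetical in-memory representation of a PO file, and none of the function names come from msgfilter or mdbook-i18n-helpers.

```python
def reformat(text: str) -> str:
    """Stand-in formatter: put each paragraph on a single line."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def reformat_catalog(entries: list[dict]) -> list[dict]:
    """Apply the same formatter to the msgid and msgstr of every entry."""
    return [
        {
            **entry,
            "msgid": reformat(entry["msgid"]),
            "msgstr": reformat(entry["msgstr"]),
        }
        for entry in entries
    ]


def lookup(entries: list[dict], source_text: str):
    """Find the translation for source text that went through the same formatter."""
    formatted = reformat(source_text)
    for entry in entries:
        if entry["msgid"] == formatted:
            return entry["msgstr"]
    return None


catalog = reformat_catalog(
    [{"msgid": "A _little_\nparagraph.", "msgstr": "Ein _kleiner_\nAbsatz."}]
)
# The formatted source text still finds its translation.
translation = lookup(catalog, "A _little_\nparagraph.")
```

The point of the sketch is that the lookup keeps working only because the source text and the msgid fields pass through the same function; formatting just one of them would orphan the translation.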

I was using https://dprint.dev/ for the formatting since it seems fast and extensible. It can format code blocks inside Markdown which is something I really want 😄

jooyunghan · 2023-03-13T09:54:20Z

Hey Martin,
While updating Korean last week, I found a few missing statements in msgstr entries. Is it from any recent updates of xgettext?

mgeisler · 2023-03-15T14:09:23Z

While updating Korean last week, I found a few missing statements in msgstr entries. Is it from any recent updates of xgettext?

A few entries were lost and changed as part of google/comprehensive-rust#449 — @djmitche would know the details 😄

djmitche · 2023-03-15T14:11:23Z

Yeah, that was a somewhat lossy process. Hopefully those appear in fuzzy or old messages?

djmitche · 2023-03-15T15:29:48Z

@mgeisler I'm not sure what you mean about the formatter. Can you describe that in more detail?

mgeisler · 2023-03-16T11:49:00Z

Can you describe that in more detail?

Sure! My thinking is that we can safely format the Markdown files if we know that it won't create more/fewer entries in the PO files when mdbook-xgettext is executed on the formatted files.

To avoid losing translations, we'll then run the same formatting on all msgid fields.

We can already safely run the formatter on the msgstr entries today with msgfilter and that might be something we should encourage translators to do to get smaller diffs.

djmitche · 2023-04-05T13:49:57Z

I don't think I'm going to get a chance to work on this issue soon.

simonsan · 2023-04-06T14:11:45Z

I was using https://dprint.dev/ for the formatting since it seems fast and extensible. It can format code blocks inside Markdown which is something I really want 😄

https://rust-lang.github.io/mdBook/format/mdbook.html?highlight=code#inserting-runnable-rust-files

I'm still thinking if it wouldn't be better to use rustfmt in that case and just refactor the markdown files to embed the code from rs files. This would make it possible to have a crate next to the documentation/tutorial and be able to check, fix, fmt and test it. Need to think about further implications. But it would at least separate the code from the text – with its own dis-/advantages for sure - which could make it easier to translate the text that is needed. because right now I see the code blocks showing up as well in the translations.

djmitche · 2023-04-06T15:27:29Z

I see the appeal, but I'm also wary of making it too easy to add too much code to a slide. I think a wide-open .rs file in a text editor invites adding extra lines.

simonsan · 2023-04-09T04:35:21Z

Yesterday, I started translating rust-unofficial/patterns to German with cloud-translate, and I think the initial state could be much better with Markdown normalization: I often end up manually going over the entries because \r and \n mess up the English grammar for auto-translation. I think this feature would be really nice to have and would make adoption much easier, especially if you have set the max-line-length to 80 ... 😅

mgeisler · 2023-04-09T21:36:53Z

I don't think I'm going to get a chance to work on this issue soon.

That's alright, I'll look at it in the background and see what I can come up with.

simonsan · 2023-04-10T08:46:15Z

@mgeisler I'm even hesitant, do you think I should wait a bit with rust-unofficial/patterns#359 and reinitialize the de.po-file when this feature is implemented?

Because so far I translated 30% of the book, it took me a few hours already. And it feels a bit inefficient and super fiddly to just move two words in a sentence in the right place (due to \r \n), because the rest is somewhat fine.

mgeisler · 2023-04-10T09:18:13Z

@mgeisler I'm even hesitant, do you think I should wait a bit with rust-unofficial/patterns#359 and reinitialize the de.po-file when this feature is implemented?

I would not wait for this feature. My plan is to make it easy to move to a new version while making sure that the existing translations continue to work. People (like yourself!) have put a lot of work into translations and we need to preserve this.

Concretely, I'm planning on providing a small program which people can run on the existing translations to normalize the Markdown found in them. So if the messages.pot file or a xx.po file contains

#: src/welcome.md:10
msgid ""
"* Give you a comprehensive understanding of the Rust syntax and language.\n"
"* Enable you to modify existing programs and write new programs in Rust.\n"
"* Show you common Rust idioms."
msgstr ""

Then the tool will turn the msgid field into three, one for each bullet point:

#: src/welcome.md:10
msgid "Give you a comprehensive understanding of the Rust syntax and language."
msgstr ""

#: src/welcome.md:11
msgid "Enable you to modify existing programs and write new programs in Rust."
msgstr ""

#: src/welcome.md:12
msgid "Show you common Rust idioms."
msgstr ""

The msgstr field will be split the same way, but in this case it was empty. You should upgrade to a new version of mdbook-i18n-helpers and run this program on the existing translations in a single step. Translators can run the same program on their in-progress work.
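The splitting step can be illustrated with a short Python sketch. It is not the planned tool: `split_entry`, the `source`/`msgid`/`msgstr` dictionary layout, and the rule of advancing the line number by one per bullet are assumptions made for this example, and only flat, single-line bullets are handled.

```python
import re

BULLET = re.compile(r"^[*+-]\s+")


def split_entry(entry: dict) -> list[dict]:
    """Split a bullet-list entry into one entry per bullet.

    Entries that are not simple bullet lists, or whose translation has a
    different number of lines than the original, are returned unchanged.
    """
    ids = entry["msgid"].split("\n")
    # An empty translation is split into the same number of empty strings.
    strs = entry["msgstr"].split("\n") if entry["msgstr"] else [""] * len(ids)
    is_list = all(BULLET.match(line) for line in ids)
    if not is_list or len(ids) != len(strs):
        return [entry]
    path, line_no = entry["source"].rsplit(":", 1)
    return [
        {
            "source": f"{path}:{int(line_no) + offset}",
            "msgid": BULLET.sub("", msgid),
            "msgstr": BULLET.sub("", msgstr),
        }
        for offset, (msgid, msgstr) in enumerate(zip(ids, strs))
    ]


entry = {
    "source": "src/welcome.md:10",
    "msgid": (
        "* Give you a comprehensive understanding of the Rust syntax and language.\n"
        "* Enable you to modify existing programs and write new programs in Rust.\n"
        "* Show you common Rust idioms."
    ),
    "msgstr": "",
}
parts = split_entry(entry)
```

Returning the entry unchanged when the shapes differ is a deliberately conservative choice in this sketch: a translation that cannot be split safely is better left for a human to review than split incorrectly.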

it feels a bit inefficient and super fiddly to just move two words in a sentence in the right place (due to \r \n), because the rest is somewhat fine.

I'm not sure why you have to fiddle with individual words like this? Markdown doesn't care about single newlines, so you can break your paragraphs any way you like in the msgstr fields. So your de.po file can look like this if you want:

msgid "Hallo, I am a little text."
msgstr ""
"Hallo,\n"
"ich bin\n"
"ein kleiner\n"
"Text."

This will result in intermediary Markdown looking like

Hallo,
ich bin
ein kleiner
Text.

and this in turn renders into HTML exactly the same way as if the Markdown had been

Hallo, ich bin ein kleiner Text.

A different way to put this is that the translators are building up a full Markdown document, but doing it paragraph by paragraph. This implies that the translators must obey the Markdown formatting rules. If you mis-translate the above to

msgid "Hallo, I am a little text."
msgstr ""
"Hallo,\n\n"
"ich bin\n"
"ein kleiner\n"
"Text."

Then mdbook sees this Markdown

Hallo,

ich bin
ein kleiner
Text.

Now you have two paragraphs in the final book because of how Markdown uses empty lines to separate paragraphs:

<p>Hallo,</p>

<p>ich bin
ein kleiner
Text.</p>

Does that help to explain things?
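The paragraph rule described above can be checked with a tiny Python model. This is a simplification for illustration only, not what mdbook does internally: the hypothetical `render_paragraphs` treats the input as plain paragraphs and ignores every other Markdown construct.

```python
import re


def render_paragraphs(markdown: str) -> str:
    """Wrap each blank-line-separated block in <p> tags."""
    blocks = re.split(r"\n\s*\n", markdown.strip())
    return "\n\n".join(f"<p>{block}</p>" for block in blocks)


# Single newlines stay inside one paragraph ...
one = render_paragraphs("Hallo,\nich bin\nein kleiner\nText.")
# ... while an empty line (two newlines in a row) starts a new paragraph.
two = render_paragraphs("Hallo,\n\nich bin\nein kleiner\nText.")
```

The single newlines survive inside the `<p>` element, but a browser renders them as ordinary whitespace, which is why the wrapped and unwrapped versions look the same in the final book.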

simonsan · 2023-04-10T10:28:07Z

I'm not sure why you have to fiddle with individual words like this? Markdown doesn't care about single newlines, so you can break your paragraphs any way you like in the msgstr fields.

Ah, maybe I wrote it in a confusing way. Because of the line breaks, the text was sent to cloud-translate in exactly that form and it didn't translate well, because the breaks disrupted the grammar of the sentence. Maybe it could be an issue with cloud-translate then?

This sentence is a good one as it usually\n
fits into a single line.

Would get translated by cloud-translate to something like this (abstract):

Dieser Satz ist ein guter, weil er normalerweise\n
passt in eine einzige Zeile.

It wouldn't factor in the second line of the translation as it only translates line by line, it seems?

So my assumption is, that if I would restart and regenerate the translation process, and send everything again via cloud-translate after formatting it would translate it better than now, ending up in less work overall.

mgeisler · 2023-04-13T11:51:58Z

It wouldn't factor in the second line of the translation as it only translates line by line, it seems?

Aha... thanks for explaining, now I understand what you mean!

I don't actually know how or if the line break influences the translations done with cloud-translate... in any case, I think removing the newlines would be something best left for that project.

So my assumption is, that if I would restart and regenerate the translation process, and send everything again via cloud-translate after formatting it would translate it better than now, ending up in less work overall.

From speaking to translators of the Rust course, I think people have had very mixed experiences with using cloud-translate: it gets a lot of things wrong because of the specialized context. Perhaps it works better for larger books, I'm not sure.

Feel free to open an issue about this over in https://github.com/mgeisler/cloud-translate — we could perhaps have the tool strip out things like \n and other formatting characters.

Before, we would extract text based on the byte offsets in the original document. As a consequence of this, the extracted text would look precisely like the original: the Markdown was copied directly from the original. In particular, text from a block quote would contain the leading ‘>’ characters and paragraphs in list items would contain leading whitespace.

Now, we instead extract text by grouping the Markdown parse events into those which should be translated and those which should be skipped. We use this in two ways:

- When extracting messages in ‘mdbook-xgettext’, we turn the translatable events back into Markdown. The structure of the document (headings, lists, block quotes, …) is no longer present in the extracted messages: only the text content itself is extracted.

- When translating, we replace the sequence of translatable events with the events from the translation. We do this while leaving the structure of the document unchanged.

The result of this is a much more robust system: editing one list item no longer impacts adjacent list items, and moving a paragraph into a block quote no longer changes the paragraph.

As a side effect of how we turn events into messages, links are now all expanded. This makes the messages larger, but it removes a common source of errors where ‘[foo][1]’ would end up pointing to the wrong location if the reference link was updated.

Part of #19.
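The event-grouping approach described in this change can be sketched in Python. Everything here is an assumption made for illustration: the `(kind, value)` tuples only loosely imitate a Markdown parser's event stream, the function names are hypothetical, and the translation is inserted as a single text event instead of being parsed back into events.

```python
from itertools import groupby

# Block-level tags carry document structure and are never translated.
BLOCK_TAGS = {"heading", "paragraph", "list", "item", "blockquote"}


def is_translatable(event) -> bool:
    kind, value = event
    if kind in ("text", "softbreak"):
        return True
    return value not in BLOCK_TAGS  # inline start/end events such as emphasis


def group_events(events):
    """Split the stream into runs of translatable and structural events."""
    return [(flag, list(run)) for flag, run in groupby(events, key=is_translatable)]


def to_message(events) -> str:
    """Turn a run of translatable events back into Markdown text."""
    parts = []
    for kind, value in events:
        if kind == "text":
            parts.append(value)
        elif kind == "softbreak":
            parts.append(" ")  # source wrapping is normalized to a space
        elif value == "emphasis":
            parts.append("_")
    return "".join(parts)


def extract(events) -> list[str]:
    return [to_message(run) for flag, run in group_events(events) if flag]


def translate(events, catalog: dict):
    """Replace translatable runs and leave the structural events untouched."""
    out = []
    for flag, run in group_events(events):
        if flag:
            message = to_message(run)
            out.append(("text", catalog.get(message, message)))
        else:
            out.extend(run)
    return out


# Events for a block quote containing "A _little_" / "paragraph." on two lines.
quote = [
    ("start", "blockquote"), ("start", "paragraph"),
    ("text", "A "), ("start", "emphasis"), ("text", "little"), ("end", "emphasis"),
    ("softbreak", None), ("text", "paragraph."),
    ("end", "paragraph"), ("end", "blockquote"),
]
messages = extract(quote)
```

Since the block quote events sit outside the translatable run, moving the paragraph in or out of the quote would not change the extracted message in this model.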

This change makes the extracted messages ignore any wrapping done for readability of the Markdown source. So "This is a paragraph." wrapped across several source lines and "This is a paragraph." written on a single line now become the same message in the PO file. This makes it possible for people to freely reformat the source files, without having to worry about invalidating existing translations. Part of #19.

The dprint formatter is a flexible system which will use sandboxed WebAssembly formatters to format our code (mostly: it calls out to `rustfmt` for Rust code). A particularly interesting feature is that dprint can format Rust code blocks in the Markdown files. However, before we turn that on, we need to have a way to normalize the Markdown text as it is extracted[1]. That is so that the work put into the translations is kept after the reformatting.

[1]: google/mdbook-i18n-helpers#19

mgeisler · 2023-05-30T06:49:59Z

I wanted to add a fuzz test to ensure that #25 doesn't "invent" new Markdown events. However, this is proving more difficult than I thought since the underlying pulldown-cmark-to-cmark library isn't completely round-tripping the Markdown input. See Byron/pulldown-cmark-to-cmark#55 for the discussion.

mgeisler · 2023-08-23T11:45:31Z

This was fixed with the 0.2.0 release! 🚀 Please remember to run mdbook-i18n-normalize on our PO files to get the benefits of this.

simonsan · 2023-08-23T15:11:19Z

This was fixed with the 0.2.0 release! 🚀 Please remember to run mdbook-i18n-normalize on our PO files to get the benefits of this.

Great! Thank you for the continuous work on mdbook-i18n-helpers! <3

mgeisler · 2023-08-23T15:55:12Z

@simonsan, thanks, you're very welcome! If you have a project which uses it, please add it to the README!

simonsan · 2023-08-24T01:37:36Z

@simonsan, thanks, you're very welcome! If you have a project which uses it, please add it to the README!

Yes, will do. Currently, I have rust-unofficial/patterns set up to use an older version and am working on another project more eagerly. When I have time to set up the German translation of it completely, I will for sure add it to the readme!

mgeisler assigned djmitche Mar 10, 2023

mgeisler transferred this issue from google/comprehensive-rust Apr 5, 2023

djmitche removed their assignment Apr 5, 2023

This was referenced Apr 5, 2023

Translation: Newer simplified Chinese rust-unofficial/patterns#292

Open

Translations rust-unofficial/patterns#345

Closed

mgeisler self-assigned this Apr 9, 2023

mgeisler mentioned this issue May 1, 2023

Implement fine-grained extraction of translatable text #25

Merged

mgeisler mentioned this issue May 1, 2023

Normalize soft breaks to space #27

Merged

mgeisler mentioned this issue May 26, 2023

Missing word in welcome day speaker notes google/comprehensive-rust#699

Merged

mgeisler mentioned this issue May 27, 2023

Format files with dprint google/comprehensive-rust#711

Merged

mgeisler mentioned this issue May 27, 2023

Add round-trip fuzz test Byron/pulldown-cmark-to-cmark#55

Draft

mgeisler added the enhancement New feature or request label Jul 13, 2023

mgeisler closed this as completed Aug 23, 2023

mgeisler mentioned this issue Apr 22, 2024

Eliminate horizontal scrolling in rust code blocks google/comprehensive-rust#2012

Open
