Implement fine-grained extraction of translatable text #25

mgeisler · 2023-05-01T07:15:46Z

Before, we would extract text based on the byte offsets in the original document. As a consequence of this, the extracted text would look precisely like the original: the Markdown was copied directly from the original. In particular, text from a block quote would contain the leading ‘>’ characters and paragraphs in list items would contain leading whitespace.

Now, we instead extract text by grouping the Markdown parse events into those which should be translated and those which should be skipped. We use this in two ways:

- When extracting messages in ‘mdbook-xgettext’, we turn the translatable events back into Markdown. The structure of the document (headings, lists, block quotes, …) is no longer present in the extracted messages: only the text content itself is extracted.
- When translating, we replace the sequence of translatable events with the events from the translation. We do this while leaving the structure of the document unchanged.
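
The grouping step can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Event` enum is a toy stand-in for the parser's event type, and `Group`, `is_translatable` and `group_events` are hypothetical names. The rule that only inline content is translatable is likewise an assumption made for the example.

```rust
/// Toy stand-in for a Markdown parse event (hypothetical; the real code
/// works on the Markdown parser's own event type).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    /// Start of a structural element such as "list-item" or "block-quote".
    Start(&'static str),
    /// End of a structural element.
    End(&'static str),
    /// Plain inline text.
    Text(String),
    /// Emphasized inline text.
    Emphasis(String),
}

/// A run of consecutive events that are all translated or all skipped.
#[derive(Debug, PartialEq)]
struct Group {
    translate: bool,
    events: Vec<Event>,
}

/// Assumption for this sketch: inline content is translatable and
/// structural events are skipped.
fn is_translatable(event: &Event) -> bool {
    matches!(event, Event::Text(_) | Event::Emphasis(_))
}

/// Split the event stream into alternating translate/skip groups.
fn group_events(events: &[Event]) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    for event in events {
        let translate = is_translatable(event);
        // Extend the current group if it has the same kind as this event.
        let extend = groups.last().map_or(false, |g| g.translate == translate);
        if extend {
            groups.last_mut().unwrap().events.push(event.clone());
        } else {
            groups.push(Group {
                translate,
                events: vec![event.clone()],
            });
        }
    }
    groups
}
```

With groups like these, extraction would render only the `translate` groups back to Markdown, and translation would swap the events of a `translate` group for the events parsed from the translated text while copying the skipped groups through unchanged.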

The result of this is a much more robust system: editing one list item no longer impacts adjacent list items, and moving a paragraph into a block quote no longer changes the paragraph.

As a side effect of how we turn events into messages, links are now all expanded. This makes the messages larger, but it removes a common source of errors where ‘[foo][1]’ would end up pointing to the wrong location if the reference link was updated.
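
To illustrate what "expanded" means here, the sketch below turns a reference-style link into an inline link. It is only a toy model: `Link` and `expand` are hypothetical names, and the explicit lookup table is an assumption standing in for the resolution that the Markdown parser performs before the events are turned into messages.

```rust
use std::collections::HashMap;

/// Toy model of a link as written in the source (hypothetical type).
enum Link<'a> {
    /// `[text](url)`
    Inline { text: &'a str, url: &'a str },
    /// `[text][label]`, resolved via a separate reference definition.
    Reference { text: &'a str, label: &'a str },
}

/// Render a link in its expanded (inline) form. Returns `None` when a
/// reference label has no definition.
fn expand(link: &Link, definitions: &HashMap<&str, &str>) -> Option<String> {
    match link {
        Link::Inline { text, url } => Some(format!("[{}]({})", text, url)),
        Link::Reference { text, label } => definitions
            .get(*label)
            .map(|url| format!("[{}]({})", text, url)),
    }
}
```

Because the destination travels with the message, a later change to the reference definition shows up as a changed message instead of silently redirecting an existing translation.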

Part of #19.

djmitche

Very nice! How will you handle migrating existing translations?

djmitche · 2023-05-01T13:53:30Z

src/lib.rs

+    let new_state = cmark_resume_with_options(
+        events.clone(),
+        String::new(),
+        state.clone(),
+        options.clone(),
+    )
+    .unwrap();
+
+    // Block quotes and lists add padding to the state. This is
+    // reflected in the rendered Markdown. We want to capture the
+    // Markdown without the padding to remove the effect of these
+    // structural elements.
+    let state_without_padding = state.map(|state| State {
+        padding: Vec::new(),
+        ..state
+    });
+    cmark_resume_with_options(events, &mut markdown, state_without_padding, options).unwrap();


It's not clear to me why this calls cmark_resume_with_options twice.

Is the idea to return an accurate state (new_state) but return markdown rendered without the padding?

mgeisler · May 1, 2023

> Is the idea to return an accurate state (new_state) but return markdown rendered without the padding?

Yes, precisely! The padding is the "> " and list indents and I'm trying to avoid putting that into the .po file.
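
For readers following the thread, the two-pass pattern can be shown in isolation. This is a toy sketch under stated assumptions: the `State` struct, `render` and `render_group` here are hypothetical stand-ins and not the types or functions used in the diff. Only the shape of the idea is the same: one pass to obtain the accurate state, a second pass with the padding cleared to obtain the text.

```rust
/// Toy renderer state (hypothetical; much simpler than the real one).
#[derive(Debug, Clone, Default, PartialEq)]
struct State {
    /// Prefixes added by enclosing structure, e.g. "> " for a block quote.
    padding: Vec<String>,
    /// How many lines have been rendered so far.
    lines_rendered: usize,
}

/// Toy renderer: writes each line prefixed by the current padding and
/// returns the updated state.
fn render(lines: &[&str], out: &mut String, state: Option<State>) -> State {
    let mut state = state.unwrap_or_default();
    let prefix = state.padding.concat();
    for line in lines {
        out.push_str(&prefix);
        out.push_str(line);
        out.push('\n');
        state.lines_rendered += 1;
    }
    state
}

/// Render the same input twice: once for the state, once for the text.
fn render_group(lines: &[&str], state: Option<State>) -> (String, State) {
    // First pass: throw the output away and keep the accurate state,
    // padding included, so later groups continue from the right place.
    let new_state = render(lines, &mut String::new(), state.clone());

    // Second pass: clear the padding so the captured Markdown is free of
    // "> " prefixes and list indentation.
    let state_without_padding = state.map(|state| State {
        padding: Vec::new(),
        ..state
    });
    let mut markdown = String::new();
    render(lines, &mut markdown, state_without_padding);

    (markdown, new_state)
}
```

The cost is rendering every group twice; the benefit is that the state handed to the next group is exactly what a normal render would have produced.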

mgeisler · 2023-05-01T20:02:10Z

> Very nice!

Thanks!

> How will you handle migrating existing translations?

My next step is to write a little normalization tool. It should be enough to go through a .po file and run both the msgid and msgstr fields through the extract_messages function. Hopefully we end up with n messages from the msgid and n messages from the msgstr, which gives us n new pairs. If we get a different number of messages, we can mark some (or all) of the pairs as fuzzy in the .po file.
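
A rough sketch of that pairing logic, to make the idea concrete. Everything here is an assumption for illustration: `Entry` and `normalize` are hypothetical names, `extract_messages_stub` merely splits on blank lines in place of the real extract_messages function, and marking every pair fuzzy on a count mismatch is just one of the possible policies.

```rust
/// One normalized catalog entry (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct Entry {
    msgid: String,
    msgstr: String,
    fuzzy: bool,
}

/// Stand-in for extract_messages: splits the text on blank lines.
fn extract_messages_stub(text: &str) -> Vec<String> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Split an old msgid/msgstr pair into new, finer-grained pairs.
fn normalize(msgid: &str, msgstr: &str) -> Vec<Entry> {
    let ids = extract_messages_stub(msgid);
    let strs = extract_messages_stub(msgstr);
    // If the counts differ we cannot trust the positional pairing, so
    // every resulting pair is flagged as fuzzy.
    let fuzzy = ids.len() != strs.len();
    let mut strs = strs.into_iter();
    ids.into_iter()
        .map(|msgid| Entry {
            msgid,
            // Missing translations become empty strings. Surplus
            // translations are dropped in this sketch.
            msgstr: strs.next().unwrap_or_default(),
            fuzzy,
        })
        .collect()
}
```

A real tool would also need to carry over comments, source references and existing flags for each entry, which this sketch ignores.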

mgeisler requested a review from djmitche May 1, 2023 07:16

djmitche approved these changes May 1, 2023

mgeisler force-pushed the fine-grained-extraction branch from a82ef13 to 107484c Compare May 1, 2023 20:15

mgeisler enabled auto-merge May 1, 2023 20:16

mgeisler merged commit 44b4b46 into main May 1, 2023

mgeisler deleted the fine-grained-extraction branch May 1, 2023 20:16

mgeisler mentioned this pull request May 30, 2023

Normalize Markdown in .pot files #19

Closed

mgeisler mentioned this pull request Sep 17, 2023

Add support for translation comments #63

Open
