Allow auto-linkification of non-standard schemas without calling mdurl.decode #183

Open
black-puppydog opened this issue Jan 5, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@black-puppydog

black-puppydog commented Jan 5, 2022

Description / Summary

I propose to allow the unmodified handling of link text during auto-linkification.
Think something like this:

def normalize_message_sigil(obj, match):
    old_url = match.url
    match.url = urllib.parse.quote(old_url, safe="")
    match.text = f"%{old_url[1:6]}..."
    # proposed flag (does not exist yet): "don't touch match.text, render it as is in the HTML"
    match.safe_decode = False

md.linkify.add("%", {"validate": message_regex, "normalize": normalize_message_sigil})

Value / benefit

I launched this as a discussion before but after thinking about it a little more I don't see a workaround for this.

I'm trying to implement some custom extensions to markdown for the scuttlebutt markdown flavour as implemented in ssb-markdown which is the JS implementation and relies on markdown-it. Hence it makes sense to me to make the re-implementation using markdown-it-py. 🙂

One of the key features of ssb is that messages are referenced by ids like this: %9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256
(feeds have an @ identifier, and blobs a & so they may have similar issues, but let's talk about message ids only for the sake of this discussion)

Anyway, so these message ids should be linked to urls like this:

 <a href="#/msg/%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">
  %9eJYI...
 </a>

Note that the link text is an abbreviated version of the full id, but still begins with a % sigil.
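As a rough illustration of the target markup, both the href and the abbreviated link text can be derived from the raw id with the standard library alone. The `#/msg/` prefix is taken from the example above; the helper name `render_message_link` is made up for this sketch and is not part of markdown-it-py or ssb-markdown.

```python
import html
import urllib.parse

MESSAGE_ID = "%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256"


def render_message_link(message_id: str) -> str:
    """Hypothetical helper: build the desired <a> element for a message id."""
    # safe="" forces "/" to be percent-encoded too; "%" becomes "%25".
    href = "#/msg/" + urllib.parse.quote(message_id, safe="")
    # The visible text keeps the literal "%" sigil and the first five id characters.
    text = f"{message_id[:6]}..."
    return f'<a href="{html.escape(href)}">{html.escape(text)}</a>'


link = render_message_link(MESSAGE_ID)
print(link)
```

This is only the desired end result written out by hand; the difficulty described in this issue is getting the linkify pipeline to leave the link text alone.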

So I have this regex to match message IDs:

MESSAGE_SIGIL_REGEX = r'[a-zA-Z0-9+/=]{44}\.sha256'
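A quick standalone check of the pattern, assuming (as I understand linkify-it's schema validators) that the regex is applied to the text following the `%` prefix rather than to the whole id. The variable names `tail` and `matched_length` are just for this sketch.

```python
import re

MESSAGE_SIGIL_REGEX = r'[a-zA-Z0-9+/=]{44}\.sha256'
message_regex = re.compile(f"^{MESSAGE_SIGIL_REGEX}")

message_id = "%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256"

# Drop the "%" sigil: the validator is assumed to see only what follows the prefix.
tail = message_id[1:]
m = message_regex.match(tail)

# 44 base64 characters plus ".sha256" (7 characters) gives 51.
matched_length = m.end() if m else 0
print(matched_length)
```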

To automatically linkify these ids I set the % character up as a schema:

# this is in the main rendering method
md.linkify.add("%", {"validate": message_regex, "normalize": normalize_message_sigil})

def normalize_message_sigil(obj, match):
  old_url = match.url
  match.url = urllib.parse.quote(old_url, safe="")
  match.text = f"{old_url[:6]}..."

The problem I have with this is that once matched by linkify, the link text that results is actually interpreted as a url-encoded string, i.e. the %9e gets decoded to a (non-displayable) character.
The resulting link isn't exactly what I had hoped for:

  <a href="#/msg/%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">
   �JYI...
  </a>
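The same effect can be reproduced outside markdown-it-py with the standard library's percent-decoder. This uses `urllib.parse.unquote` as a stand-in, not `mdurl.decode` itself, so it only illustrates the general mechanism: `%9e` is read as the single byte 0x9E, which is not valid UTF-8 on its own and is replaced with U+FFFD.

```python
import urllib.parse

link_text = "%9eJYI..."

# unquote() treats "%9e" as a percent-escape for byte 0x9E. A lone 0x9E is
# not valid UTF-8, so the default errors="replace" substitutes U+FFFD.
decoded = urllib.parse.unquote(link_text)
print(repr(decoded))
```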

I've stepped through this for a while now and I haven't figured out yet whether this is a bug or just me holding it wrong...
The resulting text gets put through state.md.normalizeLinkText here:

urlText = state.md.normalizeLinkText(urlText)

That function in turn passes the whole thing through mdurl.decode(mdurl.format(parsed), mdurl.DECODE_DEFAULT_CHARS + "%"):

return mdurl.decode(mdurl.format(parsed), mdurl.DECODE_DEFAULT_CHARS + "%")

And I don't see any way to prevent it from doing so...

But I thought I could just try to replace the % with %25 and let mdurl.decode replace it back to %. Alas, if I try that, it indeed produces %259eJYI... as the output. Not what I wanted...
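To make the interaction easier to see, here is a simplified model of a percent-decoder with an exclusion list. It is a sketch written for this discussion, not mdurl's source, and `decode_sketch` is a made-up name; it assumes that characters in the exclusion list are left in their percent-encoded form, which is consistent with the output described above.

```python
import re

_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_sketch(text: str, exclude: str = "") -> str:
    """Percent-decode text, keeping escapes for characters in `exclude`."""

    def _decode_run(m: re.Match) -> str:
        raw = bytes.fromhex(m.group(0).replace("%", ""))
        out = []
        pending = bytearray()  # non-ASCII bytes waiting to be decoded as UTF-8
        for byte in raw:
            if byte < 0x80:
                out.append(pending.decode("utf-8", errors="replace"))
                pending.clear()
                ch = chr(byte)
                # Excluded characters stay percent-encoded.
                out.append(f"%{byte:02X}" if ch in exclude else ch)
            else:
                pending.append(byte)
        out.append(pending.decode("utf-8", errors="replace"))
        return "".join(out)

    return _ESCAPE_RUN.sub(_decode_run, text)


plain = decode_sketch("%9eJYI...", exclude="%")          # sigil read as an escape
escaped = decode_sketch("%259eJYI...", exclude="%")      # "%25" is protected
unprotected = decode_sketch("%259eJYI...", exclude="")   # "%25" decodes to "%"
print(repr(plain), escaped, unprotected)
```

Under this model, neither input yields `%9eJYI...` while `%` is in the exclusion list: the bare sigil is consumed as an escape, and the pre-escaped `%25` is never decoded back.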

Now, I realize I could just generate the text to escape the % into something like &percnt;, but the result is then that the & sign is escaped into &amp;percnt;9eJYI... which is also not quite what I want...
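The double escaping is what any HTML escaper does to a literal `&`. The snippet below uses the standard library's `html.escape` as a stand-in for the renderer's own escaping step, so it shows the mechanism rather than markdown-it-py's exact code path.

```python
import html

link_text = "&percnt;9eJYI..."

# The escaper cannot tell that "&percnt;" was meant as an entity,
# so the leading "&" is itself escaped to "&amp;".
escaped_text = html.escape(link_text)
print(escaped_text)
```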

So... is this an issue of usage? Is there something obvious I'm missing?

Implementation details

As I said in the beginning, I think this would best be signalled while setting up the schema. But I'm not sure how to do this cleanly, since the matches themselves are actually directly added to a linkify instance, not a class of markdown-it-py.
So assigning the flag for "raw/pass-through" mode to the match instance seems a bit iffy...

Tasks to complete

No response

@black-puppydog black-puppydog added the enhancement New feature or request label Jan 5, 2022
@welcome

welcome bot commented Jan 5, 2022

Thanks for opening your first issue here! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out EBP's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.

If your issue is a feature request, others may react to it, to raise its prominence (see Feature Voting).

Welcome to the EBP community! 🎉

@chrisjsewell
Member

Heya, perhaps @tsutsu3 (as the maintainer of linkify-it-py) and @hukkin would like to comment?

@hukkin
Contributor

hukkin commented Jan 5, 2022

There's a lot to take in here, but I'll start with a question: If there's a JS implementation, why not copy what it does? Are you trying to achieve something that the JS implementation does not do?

@black-puppydog
Author

black-puppydog commented Jan 5, 2022

Hey thanks folks for the quick replies!
Yeah, I did look at that, that's the reason I went with markdown-it-py 🙂

Thing is, they're doing pretty much what I (think) I am doing...
I'm looking at the JS code here where they call formatSigilText() which is simply this: return sigilText.replace(/^%/, '%25').slice(0, 8) + '...'
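For comparison, a direct Python port of that one-liner might look like the following. The function name `format_sigil_text` is just a snake_case rendering of the JS name for this sketch; it is not an existing markdown-it-py or ssb-markdown API.

```python
import re


def format_sigil_text(sigil_text: str) -> str:
    """Python port of the JS one-liner: escape a leading "%", keep 8 chars, add "..."."""
    return re.sub(r"^%", "%25", sigil_text)[:8] + "..."


raw = "%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256"
text = format_sigil_text(raw)
print(text)
```

For ids that start with `%`, this gives the same string as `f"%25{raw[1:6]}..."`.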

Inserting some console.log() calls in normalize() and printing the final rendered text, I get this:

// the match object after modification
Match {
  schema: '%',
  index: 14,
  lastIndex: 66,
  raw: '%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256',
  text: '%259eJYI...',
  url: '%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256'
}
result = "<p>Hey check out <a href="#/msg/%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">%9eJYI...</a> to see my mad ssb skillz!</p>"

So this is my minimal example to reproduce this:

import json
import re
import urllib.parse
from markdown_it import MarkdownIt

MESSAGE_SIGIL_REGEX = r'[a-zA-Z0-9+/=]{44}\.sha256'
message_regex = re.compile(f"^{MESSAGE_SIGIL_REGEX}")


def normalize_message_sigil(obj, match):
  old_url = match.url
  match.url = urllib.parse.quote(old_url, safe="")
  match.text = f"%25{old_url[1:6]}..."
  print(json.dumps(match.__dict__, indent=2))
  print()


md = MarkdownIt("js-default", {
  "typographer": True,
  "linkify": True,
  "breaks": True,
})
md.linkify.add("%", {"validate": message_regex, "normalize": normalize_message_sigil})


markdown_str = "Hey check out %9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256 it's epic"
print(md.render(markdown_str))

This generates the same kind of match:

{
  "schema": "%",
  "index": 14,
  "last_index": 66,
  "raw": "%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256",
  "text": "%259eJYI...",
  "url": "%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256"
}

But the result is this:

<p>Hey check out <a href="%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">%259eJYI...</a> it’s epic</p>

If I instead set match.text = f"{old_url[0:6]}..." (so, no % encoding), I get this:

<p>Hey check out <a href="%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">�JYI...</a> it’s epic</p>

It works fine if I edit my local markdown_it to call mdurl.decode() without the extra %, so like this:

# in markdown-it-py/markdown_it/common/normalize_url.py
return mdurl.decode(mdurl.format(parsed))

But I understand that this was actually introduced for a reason, so not sure how to proceed here...

Sorry for the infodump... it's a bit late here, this is all free-time stuff for me 😆
