
Add source position/maps to AST #1

Closed
robinst opened this issue Jul 23, 2015 · 22 comments · Fixed by #185

Comments

@robinst
Collaborator

robinst commented Jul 23, 2015

Having source positions is a useful feature for editors, as it allows linking blocks between the source and the rendered output. commonmark.js supports it; see the highlighted blocks in the preview of the dingus.

There is some code for adding source positions to blocks, but it's untested and not currently exposed.

robinst changed the title from "Finish and test source position support" to "Add source position/maps to AST" on Jul 27, 2015
@robinst
Collaborator Author

robinst commented Jul 27, 2015

See also commonmark/commonmark.js#29. We need to implement this for both blocks and inlines. And in the following case, having only a start and end position for the emphasized text isn't enough:

> *emphasized
> text*

robinst added a commit that referenced this issue Jul 28, 2015
This is probably not the right data structure, and it's untested. The option for the renderer was already removed earlier.

We can always resurrect the code when it's more clear what we want.
@vsch

vsch commented Aug 31, 2015

I am using pegdown as a parser for an IntelliJ IDEA plugin and had to make changes to use it for syntax highlighting. I can tell you that the more information you have in the AST linking back to the source, the better.

For inline emphasis style nodes (this includes strikethrough, bold, and italic) you need to be able to separate the 'marker' from the contained text. That way the text and the leading/trailing markers can be highlighted separately.

I also had to use a stack of all parents' source maps during post processing of the AST for syntax highlighting. Each child excludes its source map from the parents'. After all the children are processed the parent will have a source map that is a list of disjoint regions, or empty. This step also uses a configuration table to prevent some types of child regions from excluding regions in given types of parents. This is needed if you want to syntax highlight using merged attributes of the child and parent properties.
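
A rough sketch of that exclusion step, purely as an illustration of the algorithm described above (the Range and exclude names are hypothetical, not pegdown's or commonmark-java's API):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the post-processing step described above: each child's
    // range is subtracted from its parent's source map, leaving the parent with a
    // list of disjoint regions (possibly empty).
    class Range {
        final int start; // inclusive offset into the source
        final int end;   // exclusive offset into the source

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    class SourceMapUtil {
        // Subtract `child` from every region in `parent`, keeping the disjoint leftovers.
        static List<Range> exclude(List<Range> parent, Range child) {
            List<Range> result = new ArrayList<>();
            for (Range r : parent) {
                if (child.end <= r.start || child.start >= r.end) {
                    result.add(r); // no overlap, keep the region as-is
                    continue;
                }
                if (child.start > r.start) {
                    result.add(new Range(r.start, child.start)); // part before the child
                }
                if (child.end < r.end) {
                    result.add(new Range(child.end, r.end)); // part after the child
                }
            }
            return result;
        }
    }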

The case of commonmark/commonmark.js#29 is not handled at the moment, but it could be done by modifying pegdown so that it generates a PARENT_SOURCED node for all regions that were recognized by the parent node. In post-processing this would exclude the region from the immediate parent's source map and leave it for the other parents to highlight. This would require modifying the pegdown parser extensively, so I left it on the wish list.

@pcj
Contributor

pcj commented Sep 16, 2015

Any thoughts/progress on labeling nodes with source line numbers/column numbers? In my case I'm transforming Markdown into a different representation and would like to be able to cross-reference the position of nodes in the transformed representation back to the Markdown source. I've looked through the source code, but it wasn't obvious how to implement this.

@robinst
Collaborator Author

robinst commented Sep 16, 2015

@vsch: That's very valuable input, thanks! I hadn't even thought about needing to know the positions for the markers (*) and the content separately. Maybe that concept could be applied to block nodes too, e.g. > for a block quote is the marker, and the text to the right is the content. Same for 1. for an ordered list.

@robinst
Collaborator Author

robinst commented Sep 16, 2015

@pcj We should start looking at adding source positions for block nodes, and then go on to inline nodes. As for hints on how to implement this, see 80e22b3 where we removed the old support (which was a straight port from the JS version).

I haven't looked into this in great detail yet, but some thoughts:

  • When starting a new block, we know the current line number and index within the line. From that position to the end of the line can be added as a source range to the block. It seems like that should be the responsibility of BlockParserFactory, but maybe with some help from DocumentParser.
  • When continuing, we again need to know at which position in the line the block actually begins. This can be more complicated than "the same position as it started" because of lazy continuations.

What the data structure for source positions should look like is also interesting. Maybe something like List<SourceRange>, where SourceRange has lineNumber, beginColumn, endColumn (all 1-based, inclusive?), and maybe a type to distinguish between marker and content?
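
A minimal sketch of what such a data structure could look like, purely as an illustration of the idea in this comment (the names, the 1-based inclusive convention, and the MARKER/CONTENT distinction are all taken from the discussion, not from a committed API):

    // Sketch only: one possible shape for the source-position data discussed above.
    // Columns are 1-based and inclusive here, mirroring the comment; the Type enum
    // separates marker ranges (e.g. ">", "*") from content ranges.
    public class SourceRange {

        public enum Type { MARKER, CONTENT }

        private final int lineNumber;   // 1-based line in the document
        private final int beginColumn;  // 1-based, inclusive
        private final int endColumn;    // 1-based, inclusive
        private final Type type;

        public SourceRange(int lineNumber, int beginColumn, int endColumn, Type type) {
            this.lineNumber = lineNumber;
            this.beginColumn = beginColumn;
            this.endColumn = endColumn;
            this.type = type;
        }

        public int getLineNumber() { return lineNumber; }
        public int getBeginColumn() { return beginColumn; }
        public int getEndColumn() { return endColumn; }
        public Type getType() { return type; }
    }

A node would then carry a List<SourceRange> rather than a single range.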

@pcj
Contributor

pcj commented Sep 16, 2015

@robinst Curious: can you clarify in what circumstances a node requires a List<SourceRange> rather than just a single SourceRange? It probably relates to lazy continuations, which I'm not familiar with at the moment.

@colinodell

@pcj commonmark/commonmark.js#29 has a perfect example of this:

> *emphasized
> text*

That emphasis inline is not contiguous - it's interrupted by a > character, so it wouldn't really be accurate to have a single SourceRange of 1:3-2:7 (because 2:1-2:2 contains that blockquote indicator). You'd probably want two ranges instead: 1:3-1:13 and 2:3-2:7.
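
Using the hypothetical SourceRange sketched earlier, that would mean the emphasis node carries two ranges, one per line, skipping the blockquote markers:

    import java.util.Arrays;
    import java.util.List;

    // Illustration only, using the hypothetical SourceRange sketched above: the
    // emphasis node would span columns 3-13 on line 1 ("*emphasized") and
    // columns 3-7 on line 2 ("text*"), skipping the "> " markers.
    class EmphasisRangesExample {
        static final List<SourceRange> EMPHASIS_RANGES = Arrays.asList(
                new SourceRange(1, 3, 13, SourceRange.Type.CONTENT),
                new SourceRange(2, 3, 7, SourceRange.Type.CONTENT));
    }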

@pcj
Contributor

pcj commented Sep 16, 2015

@colinodell I see. Good explanation.

@radicaled

(arriving because I, too, am trying to transform the Markdown into a different representation)

What are people doing to work around this until source positions are back in?

@JFormDesigner

Are there any plans to implement this in the near future?

I'd like to use commonmark in Markdown Writer FX, but I need source positions for syntax highlighting.

@robinst
Collaborator Author

robinst commented Feb 24, 2016

I may look into this next week.

@JFormDesigner

@robinst many thx for your work on the 'issue-1-source-positions' branch 👍

I've re-implemented syntax highlighting in Markdown Writer FX using commonmark-java and it works very well. https://github.com/JFormDesigner/markdown-writer-fx/tree/commonmark

The only bug I've found so far is that line numbers are not correct after fenced code blocks. E.g. given

1 ```
2 foo
3 ```
4 bar

then the line number for 'bar' is 3 instead of 4. If there are more fenced code blocks, each one decreases the line numbers by one.

Hope you can implement source positions for inlines too. 😉

@robinst
Collaborator Author

robinst commented Mar 5, 2016

Thanks for giving it a try :)! Good catch, should be a simple fix.

Do you have any thoughts on SourceSpan? 1-based vs 0-based, inclusive lastColumn vs exclusive vs a length?

Inlines might need further changes to the parser APIs, so I want to do those before releasing what I currently have.

@JFormDesigner

Do you have any thoughts on SourceSpan? 1-based vs 0-based, inclusive lastColumn vs exclusive vs a length?

Well, I think I understand the reason why you used 1-based indices (for "compatibility" with commonmark.js), but it's unusual to have 1-based indices in Java APIs, isn't it?

And you already use 0-based indices (also for line numbers and columns) in other commonmark-java APIs (e.g. in the ParserState interface). Having 0-based indices in ParserState and 1-based indices in SourceSpan may result in wrong source positions if extension developers forget to add +1 when creating SourceSpans.

I would prefer 0-based indices and a length (or an exclusive lastColumn), because it is easier to work with.
With 1-based indices and an inclusive lastColumn you often have to add or subtract 1, which is not necessary with 0-based indices and a length.

1-based and inclusive lastColumn (current implementation):

beginIndex = firstColumn - 1
endIndex = lastColumn
length = lastColumn - firstColumn + 1

0-based and length (no +/- 1 necessary):

beginIndex = firstColumn
endIndex = firstColumn + length
length = length

0-based and exclusive lastColumn (no +/- 1 necessary):

beginIndex = firstColumn
endIndex = lastColumn
length = lastColumn - firstColumn
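
To make the trade-off concrete, here is how each convention maps onto Java's 0-based, end-exclusive String.substring (the firstColumn/lastColumn/length names follow the comment above; this is just an illustration, not an actual SourceSpan API):

    // Illustration only: extracting a span's text from its line under each convention.
    class ColumnConventions {

        // 1-based columns, inclusive lastColumn (the implementation at the time):
        static String oneBasedInclusive(String line, int firstColumn, int lastColumn) {
            return line.substring(firstColumn - 1, lastColumn);
        }

        // 0-based column plus length (no +/- 1 needed):
        static String zeroBasedWithLength(String line, int firstColumn, int length) {
            return line.substring(firstColumn, firstColumn + length);
        }

        // 0-based columns, exclusive lastColumn (no +/- 1 needed):
        static String zeroBasedExclusive(String line, int firstColumn, int lastColumn) {
            return line.substring(firstColumn, lastColumn);
        }
    }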

@radicaled

@JFormDesigner

Have you noticed any other problems with this branch? I'd like to migrate to commonmark-java (currently using pegdown), but I need source positions, so this has blocked me so far. I'm not using fenced code blocks, so I'm not worried about that particular bug.

@vsch

vsch commented Apr 17, 2016

@robinst, I agree with @JFormDesigner on the 0-based vs 1-based source indices.

I would keep the indices 0-based and the last column exclusive. That way there is no need to manipulate the indices when using them for substring/subsequence operations, since those are 0-based and exclusive in Java. Otherwise it will be a source of many trivial bugs, because it is counter-intuitive on the Java platform.

I am currently working on porting my idea-multimarkdown plugin to commonmark-java from pegdown because the latter has parser performance issues that cause the parser to hang on some markdown sources.

Rudimentary performance testing shows that commonmark-java is much faster and does not suffer from the aberrant parser performance.

I used the simple code from the intellij-markdown test suite that does a loop of 10 iterations for the warm-up and then computes an average of 100 runs to arrive at a result. The results blew me away:

File                pegdown-markdown    intellij-markdown    commonmark-java
VERSION.md             51.172913ms          6.886450ms          1.7238140ms
commonMarkSpec.md     685.310908ms        647.420365ms         42.146776ms
spec.txt              316.491533ms         35.156909ms          8.863662ms
hang-pegdown.md       673.086523ms          0.296050ms          0.086940ms
hang-pegdown2.md     1387.808977ms          0.294905ms          0.085369ms

Ratio of performances is stunning:

File                pegdown-markdown    intellij-markdown    commonmark-java
VERSION.md                 29.69                3.99                1.00
commonMarkSpec.md          16.26               15.36                1.00
spec.txt                   35.71                3.97                1.00
hang-pegdown.md          7741.97                3.41                1.00
hang-pegdown2.md        16256.59                3.45                1.00

  • VERSION.md is the version log file I use for idea-multimarkdown
  • commonMarkSpec.md is a 33k-line file used in the intellij-markdown test suite for performance evaluation.
  • spec.txt is the commonmark spec markdown file in the commonmark-java project
  • hang-pegdown.md is a file containing a single line of 17 characters [[[[[[[[[[[[[[[[[ which causes pegdown to go into a hyper-exponential parse time.
  • hang-pegdown2.md is a file containing a single line of 18 characters [[[[[[[[[[[[[[[[[[ which causes pegdown to go into a hyper-exponential parse time.

This is not an exhaustive performance test, but combined with the other advantages of commonmark-java over intellij-markdown, I have enough data to convince me to make commonmark-java my choice of parser for the MultiMarkdown plugin.

I look forward to having a better performing parser in the plugin.

@robinst
Collaborator Author

robinst commented Apr 17, 2016

@radicaled, the bug with fenced code blocks is already fixed on the branch.

Cool, you all convinced me that we should be using 0-based indexes, I'll change it :). As for endIndex vs length, I'm thinking of going with length. The reason is that I might want to add an offset (within the whole document) in addition to line/column as the start of a span. When we already have a length, there is no reason to add a corresponding endOffset.
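
A minimal sketch of a span along those lines, with 0-based line/column plus a length (the class shape and field names here are assumptions based on this comment, not the final API):

    // Sketch only: 0-based line/column plus a length, as discussed above. An offset
    // into the whole document could be added later without a corresponding end
    // offset, since the length already determines where the span stops.
    public class SourceSpan {

        private final int lineIndex;    // 0-based line within the document
        private final int columnIndex;  // 0-based column within the line
        private final int length;       // number of characters covered

        public SourceSpan(int lineIndex, int columnIndex, int length) {
            this.lineIndex = lineIndex;
            this.columnIndex = columnIndex;
            this.length = length;
        }

        public int getLineIndex() { return lineIndex; }
        public int getColumnIndex() { return columnIndex; }
        public int getLength() { return length; }
    }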

@vsch, wow, those are some nice stats indeed. Kudos to @jgm for commonmark.js!

Also, I wonder, do you need source positions for inline nodes as well?

@vsch

vsch commented Apr 17, 2016

@robinst, source positions for inline nodes are just as important for the purpose of syntax highlighting.

The IntelliJ plugin uses all nodes for creating the parse tree for syntax highlighting, completions and inspections.

I understand that this requirement is quite specific to my use of the parser in the plugin and I will most likely have to keep a customized fork for my use since most users of the parser will not need most of the features required for the plugin.

IntelliJ needs to have every character in the source appear in the token stream without any being skipped, including whitespace. Pegdown does not do this so I have extra code to fill in the blanks, which are only whitespace. I also do post pegdown parsing on nodes to synthesize nodes representing parts of a node.

Here is an example of using completions and refactoring that requires inline node source details. I find these features a must-have, since they make working with Markdown much easier and less error-prone:

[animated GIF: multimarkdownautomation1]

@robinst
Collaborator Author

robinst commented Jun 17, 2016

@vsch It would be good if you could implement all your additional needs on top as post-processing and not have to maintain a fork. I'm very interested in that, and am willing to extend commonmark-java to accommodate your needs.

Short progress update:

I started working on source positions for inline nodes. I've pushed a couple of commits that simplify InlineParserImpl a bit, clearing some obstacles.

The difficult part now is that inline parsing currently relies on executing regular expressions that can span multiple lines against a blob of text. With that, it would be hard (or ugly) to figure out the corresponding source positions when we have a match. So, my current plan is to change that so that we have a list of lines, and change the matching for multi-level constructs to work on those. Fun 🎉.
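
Purely as an illustration of that plan (the SourceLine/SourceLines names here are assumptions, not the eventual API): keep the block's content as individual lines with their positions, so that an index into the concatenated text can be traced back to a line and column.

    import java.util.List;

    // Illustrative sketch: a block's content kept as individual lines with their
    // source positions, instead of a single multi-line string.
    class SourceLine {
        final String content;  // text of the line (without block markers)
        final int lineIndex;   // 0-based line number in the document
        final int columnIndex; // 0-based column where the content starts

        SourceLine(String content, int lineIndex, int columnIndex) {
            this.content = content;
            this.lineIndex = lineIndex;
            this.columnIndex = columnIndex;
        }
    }

    class SourceLines {
        final List<SourceLine> lines;

        SourceLines(List<SourceLine> lines) {
            this.lines = lines;
        }

        // The concatenated content can still be produced for matching, but each
        // index into it can be mapped back to a (lineIndex, columnIndex) pair.
        String getContent() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < lines.size(); i++) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(lines.get(i).content);
            }
            return sb.toString();
        }
    }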

@vsch

vsch commented Jun 17, 2016

@robinst, it was impossible to implement a parser for my needs through the current extension API.

I love the current implementation's simplicity and elegance. That was one of the reasons I picked it as my parser of choice and in hindsight I am very satisfied with my decision. It is a pleasure to work with elegant code. I would recommend that you do not stray too far from the elegance of the current parsing approach.

I have implemented complete source tracking in my fork, but opted to preserve the original parsing approach, with a 35% to 40% loss of performance as the trade-off. For my use case it will still be about 17x faster on average than pegdown, and without the hanging during parsing of pathological input.

The plugin's parser has to be able to:

  • emulate parsing of other markdown dialects and implementations,
  • represent all source elements in the AST
  • handle a lot of extensions at the same time, efficiently
  • commonmark compliance is just one configurable option, not a requirement across all extensions.
  • ability to disable core parsers and delimiter processors to allow overrides in extensions or just not have those elements processed
  • many options to fine tune the parsing and rendering results

This requires deeper integration into the parsing process to allow an extension to parse the correct syntax and to do so without incurring unnecessary overhead. Since the parser will be used with a lot of extensions at the same time, it is not an option for each of them to jump through hoops and rely on post-processing to do its job. Otherwise, the benefit of a fast core parser will be lost in real use cases.

I looked forward to contributing to the original project but once I realized the extent of the needed changes it became clear that the two are addressing different needs and should really be different projects.

It makes no sense for those that need only HTML rendering capability to pay a 40% hit in performance and break API compatibility for source tracking or extensions that they don't need.

On my side, the schedule is very aggressive and the changes touch every aspect of the parser. I just don't have the time to consider backwards API compatibility this early in the process nor be able to organize the changes into one PR per feature. I routinely rework the API to fix issues exposed by the next extension I implement.

My fork is https://github.com/vsch/flexmark-java, and I think that I am close to completing the core API changes but those are famous last words before implementing the next extension.

The latest glitch was the need for a correct table parsing extension. You cannot do table parsing before inline processing; otherwise, your tables cannot contain | embedded in code spans, and escaping the pipe is not an option inside a code span. Likewise, you cannot do it after inline processing, because then you get pegdown's problem where an unclosed delimiter in a cell absorbs everything to the end of the block instead of being treated as just text within the cell. The only option is to allow extensions deeper modification of the inline processing.
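
As a made-up example of that first problem, in a table row like the following the | inside the code span is meant to be part of the cell's content, but a table parser that runs before inline parsing has no way of knowing that and would treat it as a column separator:

| Code    | Description               |
|---------|---------------------------|
| `a | b` | a pipe inside a code span |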

I added a universal options API to allow configuration of the parser, renderer, extensions and extension options in one uniform manner. Again, in my case it is a must. With lots of extensions, each having options that affect parsing behaviour, I needed a way of configuring everything with a single argument to the parser and renderer. Otherwise, mapping configuration options from the plugin's configuration UI to extension specific implementation would become a rat's nest of bugs.

I also enhanced the test utils to allow the expected AST to be tested, and to allow generating the full spec file with the actual HTML and AST used for the expected portions. I love the spec.txt format and wanted an easy way to generate test cases. I also added a per-test-case options mechanism so that a single test resource file can test optional settings. Even with a few new extensions, I got tired of generating multiple test classes just to test non-default options.

If someone has the time and the motivation, many of these enhancements would be a good addition to commonmark-java and an easy port. I will only have time to help and assist for the next two months or so. I am aiming for the pegdown replacement parser to be ready within the next two weeks, an early release of the plugin two weeks after that, and a final release before Aug 8. The schedule is very aggressive.

@adamvoss

I also have similar needs to @vsch. In simplified form, my use case is to render the markdown to plain text (intelligently removing Markdown syntax), run grammar checking on the text, then map the grammar errors back to the original location in the markdown file for display.

@robinst
Collaborator Author

robinst commented Sep 8, 2020

Support for source spans for block nodes is now implemented in master and will be included in the next release (see #185). I'm also looking into support for inline nodes.
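
For anyone who wants to try it, usage looks roughly like the sketch below (based on the API added in #185; exact package, method, and enum names may differ between versions, so treat this as an assumption rather than a reference):

    import java.util.List;

    import org.commonmark.node.Node;
    import org.commonmark.node.SourceSpan;
    import org.commonmark.parser.IncludeSourceSpans;
    import org.commonmark.parser.Parser;

    public class BlockSourceSpansExample {
        public static void main(String[] args) {
            // Opt in to source span tracking for block nodes.
            Parser parser = Parser.builder()
                    .includeSourceSpans(IncludeSourceSpans.BLOCKS)
                    .build();

            Node document = parser.parse("> *emphasized\n> text*\n");

            // The block quote (first child of the document) reports where it came from.
            Node blockQuote = document.getFirstChild();
            List<SourceSpan> spans = blockQuote.getSourceSpans();
            for (SourceSpan span : spans) {
                System.out.println(span);
            }
        }
    }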

robinst pushed a commit that referenced this issue Oct 21, 2024