-
Notifications
You must be signed in to change notification settings - Fork 298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add source position/maps to AST #1
Comments
See also commonmark/commonmark.js#29. We need to implement this for both blocks and inlines. And in the following case, having only a start and end position for the emphasized text isn't enough:
|
This is probably not the right data structure, and it's untested. The option for the renderer was already removed earlier. We can always resurrect the code when it's more clear what we want.
I am using pegdown as a parser for an IntelliJ IDEA plugin and had to make changes to use it for syntax highlighting. I can tell you that the more information you have in the AST linking back to the source the better. For inline emphasis style nodes (this includes strikethrough, bold, italic) you need to be able to separate the 'marker' from the contained text. That way the text and the lead/trailing marker can be highlighted separately. I also had to use a stack of all parents' source maps during post processing of the AST for syntax highlighting. Each child excludes its source map from the parents'. After all the children are processed the parent will have a source map that is a list of disjoint regions, or empty. This step also uses a configuration table to prevent some types of child regions from excluding regions in given types of parents. This is needed if you want to syntax highlight using merged attributes of the child and parent properties. The case of commonmark/commonmark.js#29, is not handled at the moment but can be done by modifying pegdown so that it generates a PARENT_SOURCED node for all regions that were recognized by the parent node. In the post processing this would exclude this region from the immediate parent's source map and leave it for the other parents to highlight. This would require modifying pegdown parser extensively, so I left it on the wish list. |
Any thoughts/progress on labeling nodes with source line numbers/column numbers? In my case I'm transforming Markdown into a different representation and would like to be able to cross-reference the position of nodes in the transformed representation back to the Markdown source. I've looked through the source code and didn't seem obvious how to implement it. |
@vsch: That's very valuable input, thanks! I hadn't even thought about needing to know the positions for the markers ( |
@pcj We should start looking at adding source positions for block nodes, and then go on to inline nodes. As for hints on how to implement this, see 80e22b3 where we removed the old support (which was straight port from the JS version). I haven't looked into this in great detail yet, but some thoughts:
How the data structure for source positions look is also interesting. Maybe something like |
@robinst Curious: can you clarify in what circumstances a node requires a |
@pcj commonmark/commonmark.js#29 has a perfect example of this:
That emphasis inline is not contiguous - it's interrupted by a |
@colinodell I see. Good explanation. |
(arriving because I, too, am trying to trying to transform the Markdown into a different representation) What are people doing to work around this until source positions are back in? |
Are there any plans to implement this in the near future? Like to use commonmark in Markdown Writer FX, but need source positions for syntax highlighting. |
I may look into this next week. |
@robinst many thx for your work on the 'issue-1-source-positions' branch 👍 I've re-implemented syntax highlighting in Markdown Writer FX using commonmark-java and it works very well. https://github.com/JFormDesigner/markdown-writer-fx/tree/commonmark The only bug I've found so far is that line numbers are not correct after fenced code blocks. E.g. if having
then line number for 'bar' is 3 instead of 4. If there are more fenced code blocks, then each one decreases line numbers by one. Hope you can implement source positions for inlines too. 😉 |
Thanks for giving it a try :)! Good catch, should be a simple fix. Do you have any thoughts on Inlines might need further changes to the parser APIs, so I want to do those before releasing what I currently have. |
Well, I think I understand the reason why you used 1-based indices (for "compatibility" with commonmark.js), but it's unusual to have 1-based indices in Java APIs, isn't it? And you already use 0-based indices (also for line numbers and columns) in other commonmark-java API (e.g. in interface ParserState). Having 0-based indices in ParserState and 1-based in SourceSpan may result in wrong source positions if extension developers forget to add +1 when creating SourceSpans. I would prefer 0-based indices and length (or exclusive lastColumn), because it is easier to work with. 1-based and inclusive lastColumn (current implementation):
0-based and length (no +/- 1 necessary):
0-based and exclusive lastColumn (no +/- 1 necessary):
|
Have you noticed any other problems with this branch? I'd like to migrate to commonmark-java (currently using pegdown), but I need source positions, so this has blocked me so far. I'm not using fenced codeblocks, so not worried about that particular bug. |
@robinst, I would agree with @JFormDesigner on the 0 vs 1 based source indices. I would keep the indices 0 bases and last column being exclusive. That way there is no need to manipulate the indices when using them for sub-string/sequence since these are 0 based and exclusive in Java. Otherwise it will be a source of many trivial bugs because it is counter-intuitive to the Java platform. I am currently working on porting my idea-multimarkdown plugin to commonmark-java from pegdown because the latter has parser performance issues that cause the parser to hang on some markdown sources. Rudimentary performance testing shows that commonmark-java is much faster and does not suffer from the aberrant parser performance. I used the simple code from intellij-markdown test suite that does a loop of 10 iterations for
Ratio of performances is stunning:
This is not an exhaustive performance test but combined with other advantages of I look forward to having a better performing parser in the plugin. |
@radicaled, the bug with fenced code blocks is already fixed on the branch. Cool, you all convinced me that we should be using 0 based indexes, I'll change it :). As for @vsch, wow, that are some nice stats indeed. Kudos to @jgm for commonmark.js! Also, I wonder, do you need source positions for inline nodes as well? |
@robinst, inline nodes are just as important for source positions for the purpose of syntax highlighting. The IntelliJ plugin uses all nodes for creating the parse tree for syntax highlighting, completions and inspections. I understand that this requirement is quite specific to my use of the parser in the plugin and I will most likely have to keep a customized fork for my use since most users of the parser will not need most of the features required for the plugin. IntelliJ needs to have every character in the source appear in the token stream without any being skipped, including whitespace. Pegdown does not do this so I have extra code to fill in the blanks, which are only whitespace. I also do post pegdown parsing on nodes to synthesize nodes representing parts of a node. Here are an example of using completions and refactoring that require inline node source details. I find these features a must have since it makes working with Markdown much easier and less error prone: |
@vsch It would be good if you could implement all your additional needs on top as post-processing and not have to maintain a fork. I'm very interested in that, and am willing to extend commonmark-java to accommodate your needs. Short progress update: I started working on source positions for inline nodes. I've pushed a couple of commits that simplify The difficult part now is that inline parsing currently relies on executing regular expressions that can span multiple lines against a blob of text. With that, it would be hard (or ugly) to figure out the corresponding source positions when we have a match. So, my current plan is to change that so that we have a list of lines, and change the matching for multi-level constructs to work on those. Fun 🎉. |
@robinst, it was impossible to implement a parser for my needs through the current extension API. I love the current implementation's simplicity and elegance. That was one of the reasons I picked it as my parser of choice and in hindsight I am very satisfied with my decision. It is a pleasure to work with elegant code. I would recommend that you do not stray too far from the elegance of the current parsing approach. I have implemented complete source tracking in my fork but have opted out to preserve the original parsing approach with a 35% to 40% loss of performance trade off. For my use case, it will still be about 17x faster on average than pegdown and without the hanging during parsing of pathological input. The plugin's parser has to be able to:
This requires deeper integration into the parsing process to allow an extension to parse the correct syntax and do it without incurring unnecessary overhead. Since, the parser will be used with a lot of extensions at the same time it is not an option for each of these to go through hoops and post processing to do its job. Otherwise, the benefit of a fast core parser will be lost in real use cases. I looked forward to contributing to the original project but once I realized the extent of the needed changes it became clear that the two are addressing different needs and should really be different projects. It makes no sense for those that need only HTML rendering capability to pay a 40% hit in performance and break API compatibility for source tracking or extensions that they don't need. On my side, the schedule is very aggressive and the changes touch every aspect of the parser. I just don't have the time to consider backwards API compatibility this early in the process nor be able to organize the changes into one PR per feature. I routinely rework the API to fix issues exposed by the next extension I implement. My fork is https://github.com/vsch/flexmark-java, and I think that I am close to completing the core API changes but those are famous last words before implementing the next extension. The latest glitch was the need to have correct table parsing extension. You cannot do table parsing before inline processing. Otherwise, your tables cannot contain I added a universal options API to allow configuration of the parser, renderer, extensions and extension options in one uniform manner. Again, in my case it is a must. With lots of extensions, each having options that affect parsing behaviour, I needed a way of configuring everything with a single argument to the parser and renderer. Otherwise, mapping configuration options from the plugin's configuration UI to extension specific implementation would become a rat's nest of bugs. I also enhanced the test utils to allow for expected AST to be tested and generation of the full spec file but with actual HTML and AST used for the expected portions. Love the spec.txt format and wanted an easy way to generate test cases. I also added an options mechanism per test case so that a single test resource file can test optional settings. Even with a few new extensions I got tired of generating multiple test classes just to test non-default options. If someone has the time and the motivation, many of these enhancements would be a good addition to commonmark-java and an easy port. I will only have time to help and assist for the next two months or so. I am aiming for pegdown replacement parser being ready within the next two weeks and early release of the plugin two weeks later with final release before Aug 8. The schedule is very aggressive. |
I also have similar needs to @vsch. In simplified form, my use case is to render the markdown to plain text (intelligently removing Markdown syntax), run grammar checking on the text, then map the grammar errors back to the original location in the markdown file for display. |
Support for source spans for block nodes is now implemented in master and will be included in the next release (see #185). I'm also looking into support for inline nodes. |
Having source positions is a useful feature for editors, as it allows linking blocks between the source and the rendered output. commonmark.js supports it, see highlighted blocks in preview of dingus.
There is some code for adding source positions to blocks, but it's untested and not currently exposed.
The text was updated successfully, but these errors were encountered: