Component: Parsing, Pretty-Printing #5259

Open
13 tasks
StachuDotNet opened this issue Jan 14, 2024 · 1 comment
Labels
ready-for-contribs There's work here that's relatively approachable, if you're interested!

Comments

@StachuDotNet
Member

StachuDotNet commented Jan 14, 2024

This Issue exists to collect many items that relate to Dark's parser(s), pretty-printer(s), name resolution, etc.

Here's our current state:

  • in dark-classic, we didn't have a parser used for user code
  • that said, we did have a hacky parser used internally, for running many tests stored in .dark test files
  • that parser was a simple wrapper around F#'s parser, and so our syntax was limited somewhat by what the 'upper' parser could handle

These are tasks currently available to be worked on:

  • plug tree-sitter-tests-formatter into our repository, to (auto-)format the test files at tree-sitter-darklang/test/corpus, and fail in CI upon seeing unformatted tree-sitter test files (note: this task is probably the lowest-hanging fruit here, with no blockers)
  • generally, expand the tree-sitter grammar to match our language
    • each expansion requires companion work in the Dark code that consumes the resultant tree-sitter nodes (at time of writing, parser.dark)
  • restrict usage of Builtins, so that only specific stdlib package functions may call upon them
  • support aliases to unambiguously refer to package items while also presenting succinct code
  • lots of formatting improvements
  • try building tree-sitter and tree-sitter-darklang together. we could consolidate some code, etc
  • get our parser to a point where it's usable easily by folks outside of ourselves
  • revisit Add semantic tokenization tests #5381 (comment)

Once the tree-sitter grammar and parser have 'caught up' with our full language:

  • throw away the F#-wrapper parser entirely

Once that is done, we can tackle the fun stuff:

  • add ! ? to language, to assist with ergonomic error-handling
  • refer to package items with a @paul.module1.module2-like syntax, rather than PACKAGE.Paul.Module1.Module2
  • prevent conflicts of type names
    • e.g. users shouldn't be allowed to define a List type
    • in addition to preventing conflicts with existing types, prevent conflicts with keywords and other reserved words as well (e.g. Set)
    • potentially something in the name resolver
    • or maybe we allow users to use whatever type names they want, and deal with things closer to how Unison does
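
As a rough illustration of two of the items above (the `@paul.module1.module2` shorthand and rejecting reserved type names), here is a minimal Python sketch. Everything in it is hypothetical: the function names, the `RESERVED_TYPE_NAMES` set, and the rule that each segment is capitalized are assumptions read off the examples in this issue, not anything from the Darklang codebase.

```python
# Hypothetical sketch; none of these names come from the Darklang codebase.

# Assumed set of names users may not redefine; the issue mentions List and Set.
RESERVED_TYPE_NAMES = {"List", "Set"}


def expand_package_ref(ref: str) -> str:
    """Expand '@paul.module1.module2' into 'PACKAGE.Paul.Module1.Module2'."""
    if not ref.startswith("@"):
        raise ValueError(f"not a package reference: {ref!r}")
    segments = ref[1:].split(".")
    if any(not segment for segment in segments):
        raise ValueError(f"empty segment in package reference: {ref!r}")
    # Assumption: each segment maps to the same name with its first letter upper-cased.
    return ".".join(["PACKAGE"] + [s[0].upper() + s[1:] for s in segments])


def check_type_name(name: str) -> None:
    """Reject a user-defined type whose name collides with a reserved one."""
    if name in RESERVED_TYPE_NAMES:
        raise ValueError(f"type name {name!r} is reserved")
```

A check like `check_type_name` could sit in the name resolver, as suggested above; the Unison-style alternative would drop the check entirely and disambiguate names some other way.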

All of these tasks are worth some discussion, either here or in Discord, before starting.

@StachuDotNet StachuDotNet added needs-review I plan on going through each of the issues and clarifying them -- this is to mark remaining issues and removed needs-review I plan on going through each of the issues and clarifying them -- this is to mark remaining issues labels Feb 14, 2024
@StachuDotNet StachuDotNet added the ready-for-contribs There's work here that's relatively approachable, if you're interested! label Mar 5, 2024
@StachuDotNet
Member Author

Copying this from some thoughts I posted on Discord recently:

tl;dr: is tree-sitter really the best tool for our parser, or should we reconsider writing a parser combinator thing in Darklang?

The way we're currently set up for the new/tree-sitter parser is:
A. write Darklang source code
B. use tree-sitter and tree-sitter-darklang to parse to tree-sitter's internal representation of the syntax tree
C. map that to a Dark type "ParsedNode," via a built-in function (the type:

type ParsedNode =
  {
    /// e.g., a node of `typ` `let_expression` has a child node with a `body` field name
    fieldName: Stdlib.Option.Option<String>

    /// e.g. `source_file`, `fn_decl`, `expression`, `let_expression`
    typ: String

    /// The text of this node as it was in the unparsed source code
    text: String

    /// Where in the source code is this node written/contained
    /// i.e. Line 1, Column 2 to Line 1, Column 5
    sourceRange: Range

    children: List<ParsedNode>
  }
; the builtin fn:
[ { name = fn "parserParseToSimplifiedTree" 0
)
D. map ParsedNode to WrittenTypes

those WrittenTypes are used:

  • to map to ProgramTypes, where relevant
  • to map to semantic tokens, for VS Code syntax highlighting
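
To make the C -> D step more concrete, here is a small Python sketch of what mapping a generic ParsedNode to a typed node involves. It mirrors the ParsedNode record above, but `LetExpr`, `find_field`, `to_let_expr`, and the `identifier` and `expr` field names are hypothetical stand-ins (only the `body` field name is mentioned above); the real mapping lives in Dark code and will differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Range:
    start: Tuple[int, int]  # (line, column)
    end: Tuple[int, int]    # (line, column)


@dataclass
class ParsedNode:
    """Python mirror of the ParsedNode record shown above."""
    field_name: Optional[str]
    typ: str
    text: str
    source_range: Range
    children: List["ParsedNode"] = field(default_factory=list)


@dataclass
class LetExpr:
    """Hypothetical stand-in for a single WrittenTypes case."""
    source_range: Range
    name: str
    value_text: str
    body_text: str


def find_field(node: ParsedNode, name: str) -> ParsedNode:
    """Return the child carrying the given field name, or fail loudly."""
    for child in node.children:
        if child.field_name == name:
            return child
    raise ValueError(f"{node.typ} node has no child with field {name!r}")


def to_let_expr(node: ParsedNode) -> LetExpr:
    """Map an untyped let_expression node to a typed LetExpr."""
    if node.typ != "let_expression":
        raise ValueError(f"expected let_expression, got {node.typ}")
    return LetExpr(
        source_range=node.source_range,
        # "identifier" and "expr" are assumed field names; "body" is from the text above.
        name=find_field(node, "identifier").text,
        value_text=find_field(node, "expr").text,
        body_text=find_field(node, "body").text,
    )
```

Every node type in the grammar needs a function of this shape, which is where much of the companion work per grammar change comes from.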

I've been questioning whether depending on tree-sitter for all of our parsing is a good idea.

An alternative would be to write the parser in Darklang instead, potentially as a wrapper around (or an equivalent of) Farkle or FParsec, via minimal Builtins.
(relevant links:

Here are some potential trade-offs to consider:

  • the current A->B step:
    • requires us to build tree-sitter as an .so, as well as our grammar's .so. This is all set up now, but takes a few seconds of time, esp CI time.
    • requires our cli app to be ~1MB larger, to package those .sos along with our exe
    • requires a fancy extract-and-load setup to use both of those at run-time ()
  • the current C->D step:
  • we're broadly missing out on immediate feedback throughout the process. We wait for the parser to be built, and have to follow each of those changes with ParsedNode -> WrittenTypes functions. And every grammar upgrade depends on a full build/release cycle, waiting for CI etc, to get things to users
  • I've no clear path forward on versioning the parser with our language in a reasonably seamless way, as opposed to an in-Dark solution that would allow us to properly version the parser fns like anything else in the package manager.
  • our current setup provides only one big parser for a 'file', but what if we want to allow/disallow different parseable things if we're parsing a Canvas, vs parsing a Script, etc. I've been hoping we'd figure out a proper solution for that eventually, but everything I've come up with so far feels like a hack (i.e. passing a 'header' to the tree-sitter grammar where we). I think the composability of a parser combinator would prepare us for these scenarios much better.
  • broadly, it feels like we're doing (more than) double-work: we're writing the grammar.js, which builds into a parser, and writing a bunch of "parser.dark code" to map that back to WrittenTypes.
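
The composability point can be sketched with a few hand-rolled combinators. This is a minimal Python illustration, not FParsec, Farkle, or anything in Dark: the combinator names, the toy `let` and `handler` syntax, and the `script_file` / `canvas_file` split are all assumptions made up for the example.

```python
import re
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]


def regex(pattern: str) -> Parser:
    compiled = re.compile(pattern)

    def parse(text: str, pos: int):
        match = compiled.match(text, pos)
        return (match.group(0), match.end()) if match else None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Run parsers one after another; fail if any of them fails."""
    def parse(text: str, pos: int):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def choice(*parsers: Parser) -> Parser:
    """Return the result of the first parser that succeeds."""
    def parse(text: str, pos: int):
        for parser in parsers:
            result = parser(text, pos)
            if result is not None:
                return result
        return None
    return parse


def many(parser: Parser) -> Parser:
    """Apply a parser zero or more times, stopping if it fails or consumes nothing."""
    def parse(text: str, pos: int):
        values = []
        while True:
            result = parser(text, pos)
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def tagged(tag: str, parser: Parser) -> Parser:
    """Wrap a parser's value as (tag, value)."""
    def parse(text: str, pos: int):
        result = parser(text, pos)
        return None if result is None else ((tag, result[0]), result[1])
    return parse


# Toy declarations (hypothetical syntax, not Darklang's).
ws = regex(r"\s*")
let_decl = tagged("let", seq(regex(r"let\s+"), regex(r"[a-z][A-Za-z0-9]*"),
                             regex(r"\s*=\s*"), regex(r"\d+"), ws))
handler_decl = tagged("handler", seq(regex(r"handler\s+"), regex(r"/[\w/]*"), ws))

# Two entry points built from the same pieces: only canvases accept handlers.
script_file = many(let_decl)
canvas_file = many(choice(let_decl, handler_decl))


def parse_all(parser: Parser, text: str):
    """Return the parsed value only if the parser consumed the whole input."""
    result = parser(text, 0)
    if result is None or result[1] != len(text):
        return None
    return result[0]
```

Here `parse_all(canvas_file, source)` accepts a file containing a `handler` line while `parse_all(script_file, source)` rejects it, without passing any kind of header to a single monolithic grammar.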

I suspect we'd still need a tree-sitter parser around, for highlighting and such in contexts outside of our VS Code plugin.

Am I forgetting a big reason why we chose tree-sitter rather than exploring writing a parser in Dark/F#?
Or maybe we've just learned more since and it makes sense to reconsider?
Maybe we're making ParsedNode -> WrittenTypes more complicated than it needs to be?

Paul's response:

As I recall, the reasons to use tree sitter:

  • performance
  • ability to adapt to use in existing syntax highlighting frameworks and therefore reuse the definition

I would add that parser combinator frameworks are, afaik, possibly not powerful enough for real programming languages. But I could be wrong on that note.

I don't think there's anything to do here, and we're close to a successful use of tree-sitter such that we'll be able to abandon our old F#-based parser, but I think it's worth reflecting more on whether we're doing the right thing fundamentally.
