No more `clang-format` #3376

MathiasMagnus · 2023-11-09T13:33:10Z

I am an HPC and GPGPU developer, not a compiler developer, so bear with me if this idea is nonsense; it has been in the back of my mind for a while now.

Code formatting has become such a hassle in the C++ world that it's often infuriating. If you don't enforce any formatting, things descend into bikeshedding and meaningless changes. If you do, you impose a great burden on your contributors: either they keep a specific Clang version on their system for stable formatting, or the version floats, which makes code render differently in different parts of a file if you use "Format changes only". Projects cook up commit hooks to check formatting so nobody burns CI time on trivial formatting mistakes, and people cook up CI jobs tailored to check formatting. There is SO MUCH effort going into this without any real value.

Take SVG, however, which just about nobody edits manually: people don't care or even know that it's XML, and yet editors present it in a consumable, workable fashion. Why can't source code be the same?

The language could be stored in some normalized serialization format (like MSVC's IPR has XPR), with the IDE or editor converting it to a human-readable format based on user preference. By storing the sources as XPR (let's go with this for a moment), version control and projects need not be concerned with how contributors like their source code to render. Storing XPR would also turbocharge semantic diff tools for various purposes, version control being the prime consumer.
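To make the "render by user preference" idea concrete, here is a minimal Python sketch. It is not XPR and not anything Carbon or MSVC provides: the flat token list, the `Style` settings, and the `render` function are all hypothetical names invented for illustration. The stored form carries no whitespace at all, and each reader picks the layout.

```python
from dataclasses import dataclass

# Hypothetical normalized form: a flat token list with no whitespace stored.
TOKENS = ["int", "main", "(", ")", "{", "return", "0", ";", "}"]


@dataclass
class Style:
    """Per-user rendering preferences (illustrative only)."""
    indent: str = "  "
    brace_on_new_line: bool = False


def needs_space(prev, tok):
    """Toy spacing rule: no space before ( ) ; , and none after (."""
    if prev is None or prev == "(":
        return False
    return tok not in {"(", ")", ";", ","}


def render(tokens, style):
    """Turn the stored tokens into text according to the reader's style."""
    lines, line, prev, depth = [], "", None, 0

    def emit_line():
        nonlocal line, prev
        if line:
            lines.append(style.indent * depth + line)
        line, prev = "", None

    for tok in tokens:
        if tok == "{":
            if style.brace_on_new_line:
                emit_line()
                line = "{"
            else:
                line += " {" if line else "{"
            emit_line()
            depth += 1
        elif tok == "}":
            emit_line()
            depth -= 1
            line = "}"
            emit_line()
        else:
            line += (" " if needs_space(prev, tok) else "") + tok
            prev = tok
            if tok == ";":
                emit_line()
    emit_line()
    return "\n".join(lines)


# Two contributors, one stored file, two renderings.
print(render(TOKENS, Style()))
print(render(TOKENS, Style(indent="    ", brace_on_new_line=True)))
```

Both calls read the same stored tokens, so nothing about either contributor's preferred layout would ever reach version control.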

This also has some interesting side effects, such as not having to store source location information: because a location depends on the configured rendering of the source code, error messages would refer to token ids, which would be translated to rendered line/column numbers. Lexing/parsing becomes much simpler (much like turning XPR into IPR). Also, saving ill-formed code would not be possible, as the serializer wouldn't know how to complete the transformation. (It could fall back to the textual format to save unfinished work, so it's a solvable problem.)
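The token-id idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration (the greedy line-wrapping renderer, the function names, the diagnostic format); the point is only that a diagnostic stored against a token id resolves to different line/column pairs depending on how the reader's tool lays the code out.

```python
def render_with_positions(tokens, max_width):
    """Lay tokens out greedily, wrapping at max_width columns.

    Returns (text, positions), where positions[i] is the 1-based
    (line, column) of token i in this particular rendering.
    """
    lines, positions = [""], []
    for tok in tokens:
        line = lines[-1]
        # Column offset where the token would start on the current line.
        start = len(line) + (1 if line else 0)
        if line and start + len(tok) > max_width:
            lines.append("")  # wrap: the token starts a new line
            start = 0
        elif line:
            lines[-1] += " "
        positions.append((len(lines), start + 1))
        lines[-1] += tok
    return "\n".join(lines), positions


def format_diagnostic(message, token_id, positions):
    """Translate a token-id diagnostic into this rendering's line:column."""
    line, col = positions[token_id]
    return f"{line}:{col}: error: {message}"


tokens = ["var", "total", "=", "price", "*", "qty", ";"]
QTY = 5  # the compiler reports against token id 5, not a line number

_, wide = render_with_positions(tokens, max_width=80)
_, narrow = render_with_positions(tokens, max_width=12)
print(format_diagnostic("unknown name", QTY, wide))
print(format_diagnostic("unknown name", QTY, narrow))
```

The same diagnostic lands on line 1 in the wide rendering and on line 2 in the narrow one, which is why the stored form itself has no use for line/column information.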

(Taking it to an arguable extreme, one could go as far as to write renderers/deserializers for the language that render it in a Python-like indentation-sensitive fashion or as a regular curly-brace-style language. I'm not saying one should go this far, but it's a possibility.)

I've long been interested in serialized data structures, and it's on my shortlist to write a SAX-enabled, coroutine-driven, EXI-capable XML parser in C++ just to learn coroutines properly. I would probably give EXI a try, as it was specifically meant for this: exchanging XML-like tree structures with as little storage as possible. (EXI isn't tied to XML; there exists EXI for JSON, and really, for any tree-like data serialization format.) EXI even has a schema-assisted optimized storage mode, which a versioned programming language with a fixed grammar could tap into. In the context of Carbon, I really wouldn't care what normalized serialization format it used. XML/EXI is just one option; it may as well just dump TokenizedBuffer to disk.

In short: I think it would be interesting to explore this space, where a programming language sticks with a non-textual representation as the default and has tooling assist with rendering it.

tkoeppe · 2023-11-09T15:49:56Z

This has been considered before, e.g. see https://www.stroustrup.com/gdr-bs-macis09.pdf.

1 reply

MathiasMagnus Nov 9, 2023
Author

I read this a little while back (and skimmed through it again), but I didn't find hints of using XPR as the primary form of storage. The paper talks about using it to drive tools, such as static analysis or style checkers. But that's exactly my point: there should be no style checkers (at least not for formatting style).

I wanted to play with IPR, and I even pimped its CMake scripts a little, only to find that the GitHub-hosted codebase is heavily out of date with its documentation and even the simplest doc examples didn't compile. It's very disenchanting to find out that the public version is some outdated snapshot of a closed-source version where the real development occurs.

So yes, I'm aware of IPR/XPR's goals, but there were no traces of even MS going anywhere with it, like MSVC having flags to directly ingest XPR or "preprocess" into it.

All I'm saying is that there's room for exploration here. (IMHO)

jonmeow · 2023-11-10T18:52:11Z

Maintainer

One of the Carbon language project's key goals is to support developer tooling. On one level, that's providing things like a toolchain, and eventually supporting CMake. Beyond that, it includes IDEs and other tooling that developers are familiar with and use daily.

While something like clang-format may feel like a nuisance, it's optional and it fits into this tooling ecosystem because it works with text. Text allows many tools to be used cross-language. By contrast, IPR/XPR would create a significant new requirement, preventing use of the existing text-oriented tooling ecosystem.

For example, there's a breadth of IDEs developers use for C++ (VSCode, CLion, Eclipse, vim, emacs, etc). There's been a lot of unification around LSP for cross-language support, but IPR/XPR would probably require new LSP features for saving/loading content. If an IDE relies too heavily on displaying the on-disk representation, it could also be infeasible to get it to support editing the human-readable format. In that case, developers would need to either switch to an IDE that supports Carbon or manually convert between IPR/XPR in order to edit files.

But IDEs are just one tool that developers use. There's viewing code, such as with cat or on a website. There's diffing, such as diff, git diff, GitHub reviews, Phabricator, Gerrit, etc. There's searching, such as grep, GitHub searches, Sourcegraph, etc. There are many different tools that developers use. With text, tools don't even need to know about programming languages.

For each such tool, IPR/XPR creates a choice: the tool gets Carbon-specific support (either from tool owners or Carbon contributors), Carbon provides an alternative tool (e.g., cat -> carbon-cat, diff -> carbon-diff, grep -> carbon-grep), or developers need to find an alternative. Even bespoke scripts would need to translate between IPR/XPR.

The paper tkoeppe linked says IPR/XPR provides a significant benefit for C++ in gcc. There aren't numbers in the paper, but I believe it: clang's C++ frontend is around two thirds of compile time (with the LLVM backend being the other third). Carbon's more efficient. The Carbon frontend should be less than a third of compile time, and LLVM backends more than two thirds. Within the frontend, checking is higher cost than lexing or parsing; IPR also just changes these steps to lex and parse the IPR. I'd expect no significant compile time performance benefit for Carbon.

The tooling to convert Carbon to a parsed format will exist: something similar can already be done with `carbon compile --phase=parse --dump-parse-tree`. A semantic diff tool could make use of the libraries behind that to lex and parse code. However, the Carbon language project is experimenting with several things, and we need to be thoughtful about each experiment's costs -- both for the Carbon language project, and for developers who would want to learn Carbon. We need to be careful about how much of the tooling ecosystem we take ownership of, and that probably means a clang-format equivalent.
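As a rough illustration of a token-level diff that ignores formatting, here is a Python sketch. It does not use Carbon's lexer or parser libraries: the regex lexer below is a toy stand-in (it would mishandle string literals and comments, for example), and `lex` and `token_diff` are hypothetical names. Only `re` and `difflib` from the Python standard library are assumed.

```python
import difflib
import re

# Toy lexer: identifiers, integer literals, or any single non-space character.
# A real tool would use the language's own lexer instead.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def lex(source):
    """Split source text into tokens, discarding all whitespace."""
    return TOKEN_RE.findall(source)


def token_diff(old, new):
    """Return the non-equal edit operations between two token streams."""
    a, b = lex(old), lex(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


original = "fn Add(a: i32, b: i32) -> i32 { return a + b; }"
reformatted = "fn Add(a: i32,\n       b: i32) -> i32\n{\n  return a + b;\n}"
edited = original.replace("a + b", "a - b")

print(token_diff(original, reformatted))  # formatting-only change
print(token_diff(original, edited))       # real change to one token
```

A formatting-only change produces an empty diff, while the edit to the operator shows up as a single replaced token. This works on plain text files, which is the point made above: the benefit does not require changing the on-disk format.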

1 reply

MathiasMagnus Nov 10, 2023
Author

Thank you @jonmeow for taking the time to write such a fleshed-out response.

I understand that a project doesn't have infinite resources and one has to consider cost/benefit before engaging with an experiment. I may have slightly underestimated the number of workflows and related tooling (by a factor of 2-3, but not an order of magnitude), but I didn't expect it to be zero work: LSP additions, a semantic diff tool, a convenient deserializer. Much of the tooling can build on other pieces; for example, a web renderer could trivially use carbon-cat (which is a cool name, btw :)).

It's good to know that 1/3 : 2/3 is the expected ratio of front-end vs. back-end in typical programs. The performance targets of parsing outlined by @chandlerc are also commendable. During lunch-time discussions with a colleague (about tooling, build servers vs. one-shot build systems) I was told there's no need for persisting data or keeping it in memory when you can compile this fast. I replied that it feels like hubris to know of a faster way to compile and not reach for it. Accelerating the front-end to Carbon levels opens the way for new use cases. Think of std::embed-ing the Vulkan API's XML and generating all the bindings via meta-classes, reflection, or whatever on every compilation. (All this in Carbon ofc, but soon it will be available in C++, albeit probably in a dreadfully slow fashion.) Is this use or abuse of said features? It's an interesting prospect to generate bindings inside the compiler, be that for Khronos APIs, Windows Runtime components' .winmd files, etc.

At the end of the day, one will need measurements to justify the costs of writing all the tooling. (All the more reason to write my XML parser.)
