Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COBOL: "simple" crafted parser #2076

Merged
merged 20 commits into from
Nov 29, 2020
Merged

Conversation

b4n
Copy link
Member

@b4n b4n commented Apr 9, 2019

Here is a proposal for a replacement COBOL parser. Disclaimer: I don't know COBOL. But still, this parser seems fairly nicer:

  • quite a lot faster (I couldn't find real large COBOL sources, but still seems to be more than 10× fatser on any input)
  • support for Free and Variable formats in addition to the Fixed format (currently with one parser name for each, and keeping Fixed format as the default)
  • support for continuation lines
  • scoped for data items (including handling of levels 66, 77 and 88)

If anybody is COBOL-literate, review of the test cases results is most welcome, as well as additional test cases, or any input.


For the story, the main reason why I wrote that is not a very good one: in the process of synchronizing Geany's tag parsing with U-CTags, we currently don't have the regex based infrastructure, but have 2 parsers that require it, including the COBOL one. So I decided to give rewriting the COBOL parser a shot, taking the opportunity to also get a little closer look at that language. While at it, I figured that it ought to be a significant improvement on the current one to be a reasonable change to merge here, so here we go.


PS: This includes a fully tokenizing parser in the history, but I decided to step back with a simpler line-based approach that fits COBOL not too badly, and is a lot less fooled by COBOL's trickyness.
I can leave it there for future inspiration if anybody wants, or I can squash it away, as preferred.

@coveralls
Copy link

coveralls commented Apr 9, 2019

Coverage Status

Coverage increased (+0.03%) to 87.082% when pulling 4361bab on b4n:parser/cobol-simple into 710529f on universal-ctags:master.

@masatake masatake self-requested a review April 9, 2019 23:18
@masatake
Copy link
Member

masatake commented Apr 9, 2019

Does Geany provide the way to choose one of CobolFree, CobolVariable, or Cobol(Fixed) parsers?
Alternative ways I think of:

  • Switching the parsers manual with introducing --param-Cobol:format=fixed|variable|free option.
  • Introducing a selector that chooses one of 3 before parsing.

ObjectiveC and Matlab parsers use the selector mechanism. A selector chooses one of them for an input file having ".m" as its extension.

I wonder which one is acceptable to Geany.
Geany may choose a parser by Geany side. So Geany just specifies one of the three. As far as I know, there is no way to pass the value of a per-parser parameter in Geany. So you will not be happy with introducing --param-Cobol:format=fixed|variable|free to switch the behavior of unified Cobol parser.

A selector may be acceptable because Geany may not use the selector mechanism; the selector is run in a very early stage of parsing.

@b4n
Copy link
Member Author

b4n commented Apr 10, 2019

Does Geany provide the way to choose one of CobolFree, CobolVariable, or Cobol(Fixed) parsers?

Not ATM. Geany doesn't even have this parser yet :)
However, if it's three distinct parsers, it's easy to make it selectable in Geany's side.

  • Switching the parsers manual with introducing --param-Cobol:format=fixed|variable|free option.

That probably makes sense for uctags, as it's basically the kind of options compilers have. However, how would that play with your other suggestion, auto-selection?

Introducing a selector that chooses one of 3 before parsing.

That would probably be nice, but I'm not sure how easy it is to properly discriminate those. Free format might be okayish as it's allowing code in columns 1-6, but actually in fixed format columns 1-6 are simply discarded so a silly programmer could put some code there that simply has no effect. Fixed vs. Variable is just whether content after the 72th (IIRC) column is ignored or not. Not sure whether that can be detected properly. However, the parser as it is now doesn't care much about stuff at the end of lines if it's not for continuation lines -- well, it does care if the interesting part of a line extends past the 72th column, but that's probably highly unusual.

So yeah that'd be nice, but I'm not sure I can think of a solid algorithm for that.

I wonder which one is acceptable to Geany. […]

Well, I don't think it's the right question here. So long as the chosen technique would work for ctags-as-a-library, it's fine. Before this PR there was no support for Free and Variable formats anyway, so it's acceptable for us if we don't get this right away.
I think however that it's very important that the library interface is unambiguous, so that a caller knows what will happen. In this regard, it's probably slightly more obvious to have one parser to call for each formats, but OTOH if a libuctags had a way to expose parser options, it'd probably be OK as well.

@masatake
Copy link
Member

However, if it's three distinct parsers, it's easy to make it selectable in Geany's side.
I see.

Could you give me more time to think about this topic? If there is a consistent and Geany-friendly way, I would like to merge three parsers into one. If there is no such way, at least we have to provide kinds tables for every three parsers. I assume that a kind table is not shared between parsers; each parser has its specific kind table (and role tables). Kinds of C and C++ parsers are synchronized but not shared.

I don't care ctags to have too many parsers. Making different things different is rather better.
However, I would like to avoid removing a name once exposed with --list-languages option.

@masatake
Copy link
Member

BTW, I have a question about English.
I wonder "crafter" parser is a misspelling of "crafted" parser. Am I wrong?
I'm looking for a simple word representing a parser written in C language manually.
optlib2c and packcc can generate a parser written in C automatically.
So "a parser written in C" doesn't specify
what I would like to say. I have used "crafted" parser in our documents and discussion.
I wonder this is good or not. Today, I see the word "crafter". Maybe what you wrote is what I would like to say. Do you think "crafter parser" is better than "crafted parser"?

@b4n
Copy link
Member Author

b4n commented Apr 11, 2019

Could you give me more time to think about this topic?

Sure, take the time you need. I wasn't sure myself as how to expose the three parsers, and kind of expected you'd tell me :)

I wonder "crafter" parser is a misspelling of "crafted" parser. Am I wrong?

You are right, that's a typo on my end. And I have no idea what a good wording for this would be, and I meant to use "crafted" simply because that's what I was seeing you use for this, and for a lack of a better idea I though it'd be clear in the uctags context at least.

@b4n b4n changed the title COBOL: "simple" crafter parser COBOL: "simple" crafted parser Apr 11, 2019
@masatake
Copy link
Member

@masatake
Copy link
Member

Thank you. I made the memorandum because I can use it as an example input when I make SQL parser as a guest parser run on COBOL parser as its host parser:-)

@b4n
Copy link
Member Author

b4n commented Apr 11, 2019

@masatake I added the test cases you pointed to, thanks!
It also showed a bug in the current version from master, where any * after column 6 would ignore the line, instead of if it's on column 7 precisely. The parser from this PR doesn't have the bug :) (which is unfair though, as it's trivial to fix in master).

@masatake
Copy link
Member

MEMO, Ignore this. http://itdoc.hitachi.co.jp/manuals/3000/30003D0800/GD080489.HTM (In Japanese).

@masatake
Copy link
Member

Could you see #2079?
After rethinking, kinds table can be shared between 3 (or 4 parsers).
Par parser specific datum are stored to kindControlBlock, not to kind(Definition) table.

@b4n
Copy link
Member Author

b4n commented Apr 19, 2019

Pushed 2 small fixups, nothing fancy (see the diff).

b4n added a commit to b4n/geany that referenced this pull request Apr 20, 2019
This fixes support for COBOL symbols after the recent breakage of regex
parsers, as well as introducing additional features and bug fixes.

Also import some of the tests.

universal-ctags/ctags#2076

Part of geany#2119.
@masatake
Copy link
Member

It seems that KIND_NULL is not used anywhere. I have to remove it in the future.

@codecov
Copy link

codecov bot commented Oct 30, 2020

Codecov Report

Merging #2076 (4361bab) into master (710529f) will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2076      +/-   ##
==========================================
+ Coverage   86.96%   86.99%   +0.03%     
==========================================
  Files         190      190              
  Lines       40350    40445      +95     
==========================================
+ Hits        35092    35187      +95     
  Misses       5258     5258              
Impacted Files Coverage Δ
parsers/cobol.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 710529f...4361bab. Read the comment docs.

@masatake
Copy link
Member

I'm focusing on this pull request.

@masatake masatake force-pushed the parser/cobol-simple branch from 23caab0 to 4361bab Compare November 28, 2020 22:21
@masatake
Copy link
Member

masatake commented Nov 29, 2020

NOTE for myself:

The COBOL families share their kind definitions. When I redesigned the way to manage kind definitions on the main side, I assumed kind definitions were given language by language; they were not shared between languages. Maybe this is about dynamically allocated kind definitions given via --kinddef-<LANG> option. I must revise this area.

#2076 (comment)

@masatake masatake merged commit 59e53e3 into universal-ctags:master Nov 29, 2020
@masatake
Copy link
Member

I'm very sorry to be late.
After long thinking, I found using a parameter like #2079 was not a good approach.

@masatake
Copy link
Member

>>SOURCE FORMAT directive can be handled with guest/host parser combination. I would like to try to add the code for supporting the directive.

@b4n
Copy link
Member Author

b4n commented Nov 29, 2020

I'm very sorry to be late.

Don't be, I was inactive for about one and a half years ;)

After long thinking, I found using a parameter like #2079 was not a good approach.

I don't know, I think it makes some sense from a CLI point of view as it's dialects of the same language, but I'm still unsure how it fits. Anyway, I believe this could be altered later, at least before next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants