Python: Handle diagnostics writing for `BuiltinModuleExtractable` #16940

RasmusWL · 2024-07-09T12:17:51Z

In the hopes this will fix a problem encountered in the wild (that I couldn't reproduce locally).

I couldn't find any tests that I could easily extend to test the new behavior either, so what I did was to locally alter BuiltinExtractor to always throw an exception. I observed this causes us to generate Exception: 'BuiltinModuleExtractable' object has no attribute 'path' in the logs.

With the commits from this PR, we now properly log the failure reason:

[2024-07-09 14:07:25] [build-stdout] [ERROR] [15] Failed to extract module _datetime: <exception-reason>
[2024-07-09 14:07:25] [build-stdout] [TRACEBACK] [15] "semmle/worker.py", line 263, in _extract_loop
[2024-07-09 14:07:25] [build-stdout] [TRACEBACK] [15] "semmle/extractors/super_extractor.py", line 34, in process
[2024-07-09 14:07:25] [build-stdout] [TRACEBACK] [15] "semmle/extractors/builtin_extractor.py", line 24, in process

and generate diagnostics such as (EDIT: updated to not include location after
354394d)

{"timestamp": "2024-07-12T13:40:41.728383", "source": {"id": "py/diagnostics/internal-error", "name": "Internal error in Python extractor", "extractorName": "python"}, "severity": "error", "plaintextMessage": "Internal error", "visibility": {"statusPage": false, "cliSummaryTable": false, "telemetry": true}, "attributes": {"traceback": ["\"semmle/worker.py\", line 263, in _extract_loop", "\"semmle/extractors/super_extractor.py\", line 34, in process", "\"semmle/extractors/builtin_extractor.py\", line 24, in process"], "args": ["<exception-reason>"]}}

I'm not 100% sure if the tooling will be happy with the fake "file" here, so want to confirm that before merging this PR.

EDIT: I forgot to mention I also tried inserting an exception in internal_error_message function, to verify that error handling also works 👍

[2024-07-09 14:15:07] [build-stdout] [WARN] [7] Failed to write diagnostics: <exception-reason>
[2024-07-09 14:15:07] [build-stdout] [TRACEBACK] [7] "semmle/worker.py", line 263, in _extract_loop
[2024-07-09 14:15:07] [build-stdout] [TRACEBACK] [7] "semmle/extractors/super_extractor.py", line 34, in process
[2024-07-09 14:15:07] [build-stdout] [TRACEBACK] [7] "semmle/extractors/builtin_extractor.py", line 24, in process
[2024-07-09 14:15:07] [build-stdout] [TRACEBACK] [7] "semmle/worker.py", line 290, in _extract_loop
[2024-07-09 14:15:07] [build-stdout] [TRACEBACK] [7] "semmle/logging.py", line 387, in internal_error_message

So we have the info in the logs if the diagnostics processing fails

sidshank

LGTM! I like the extra information being logged, and presumably the shift from unhandled exception to warning should allow extraction to continue (and not fail).

tausbn

Looks good to me. 👍
I'm still somewhat surprised that this error is occurring at all, but I think this should fix it.

molson504x · 2024-07-10T14:29:29Z

Hey @RasmusWL - chiming in because you'd commented you weren't sure how to test this out and couldn't reproduce locally. This is how we found the problem in the wild, and how I kind of diagnosed it as a hybrid between an environment problem and a python extractor problem. This client that we are rolling out CodeQL to is using AmazonLinux to build their runner images (yep, I know, that's where this becomes an environment issue), and AmazonLinux handles cryptography functions in a weird way because of FIPS compliance. I targeted the python hashlib library because of finding this in the debug output of CodeQL:

[2024-07-09 15:42:05] [build-stdout] [DEBUG] [24] Trying built-in extractor on module _blake2
Error: 7-09 15:42:05] [build-stdout] [ERROR] [24] Exception: 'BuiltinModuleExtractable' object has no attribute 'path'

I wrote a small python script which would try to create a blake2 hash, and ran it in python 3.9 in AL2, AL2023, and Ubuntu (Ubuntu was running 3.10). Here's the script:

from hashlib import blake2b
h = blake2b()
h.update(b'Hello world')
print(h.hexdigest())

On AmazonLinux, it failed because blake2b (and blake2s) don't exist in hashlib. On Ubuntu, it works fine and produced a hex digest. According to the docs hashlib should exist in 3.9, as should blake2b and blake2s. Weird, right?

In theory, running this same python script through CodeQL's Python Extractor in an AmazonLinux container and on Ubuntu should be a good sanity check to test this. AmazonLinux is available from dockerhub, so in theory you should be able to spin up a AL2 or AL2023 container locally and just run the CodeQL CLI there to reproduce the error.

Some of the internal tooling would not be too happy about this :D

RasmusWL · 2024-07-12T12:46:25Z

Thanks @molson504x, that's really valuable context 👍 When the change from this PR is applied, we will still have the same underlying problem with blake2, it just wont cause the extractor to crash. (and we'll investigate the underlying problem separately 👍)

oh, I forgot to actually submit this comment 🙈

molson504x · 2024-07-12T12:57:52Z

Thanks! Yeah I'm not saying blake2 was the only one, just the one I happened to notice and used as my test. My theory is amazonlinux is using some kind of fips crypto library that doesn't include, among others, blake2 algorithms which would explain the pathing problem.

RasmusWL added 4 commits July 9, 2024 13:47

Python: Always log errors before writing diagnostics

a8b976b

So we have the info in the logs if the diagnostics processing fails

Python: Gracefully handle exceptions in diagnostics writing

c1da2c1

Python: Handle diagnostics writing for BuiltinModuleExtractable

6b3625e

Python: Bump extractor version

60d1dc8

RasmusWL requested a review from a team as a code owner July 9, 2024 12:17

github-actions bot added the Python label Jul 9, 2024

sidshank previously approved these changes Jul 9, 2024

View reviewed changes

tausbn previously approved these changes Jul 9, 2024

View reviewed changes

Python: Don't use fake locations in diagnostics

354394d

Some of the internal tooling would not be too happy about this :D

RasmusWL dismissed stale reviews from tausbn and sidshank via 354394d July 12, 2024 11:38

RasmusWL requested a review from tausbn July 12, 2024 11:38

tausbn approved these changes Jul 12, 2024

View reviewed changes

RasmusWL merged commit 1de2943 into github:main Jul 12, 2024
8 checks passed

RasmusWL deleted the rasmuswl/BuiltinModuleExtractable branch July 12, 2024 12:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python: Handle diagnostics writing for `BuiltinModuleExtractable` #16940

Python: Handle diagnostics writing for `BuiltinModuleExtractable` #16940

RasmusWL commented Jul 9, 2024 •

edited

Loading

sidshank left a comment

tausbn left a comment

molson504x commented Jul 10, 2024 •

edited

Loading

RasmusWL commented Jul 12, 2024

molson504x commented Jul 12, 2024

Python: Handle diagnostics writing for BuiltinModuleExtractable #16940

Python: Handle diagnostics writing for BuiltinModuleExtractable #16940

Conversation

RasmusWL commented Jul 9, 2024 • edited Loading

sidshank left a comment

Choose a reason for hiding this comment

tausbn left a comment

Choose a reason for hiding this comment

molson504x commented Jul 10, 2024 • edited Loading

RasmusWL commented Jul 12, 2024

molson504x commented Jul 12, 2024

Python: Handle diagnostics writing for `BuiltinModuleExtractable` #16940

Python: Handle diagnostics writing for `BuiltinModuleExtractable` #16940

RasmusWL commented Jul 9, 2024 •

edited

Loading

molson504x commented Jul 10, 2024 •

edited

Loading