Add commits range option for `scan` in Git repositories #29

Coruscant11 · 2023-02-17T14:10:19Z

Hi 👋

A useful option in a secret scanner is the ability to scan a range of commits, for example through an option to `scan`.

In my case, we use scanners on very large repositories. Once a scan has been reported, future runs have no need to rescan previously scanned commits; only new commits are relevant. This saves a lot of time in large repositories.

Gitleaks has this feature, and Trufflehog too.

For example, a since_commits option could scan between a specific commit and HEAD. And why not an until_commits option as well?

Do you see any blocking issues for this enhancement?

😄


bradlarsen · 2023-02-17T17:12:22Z

Hi @Coruscant11. This is a good use case and a feature that would be nice to have.

One challenge with implementing this in Nosey Parker is that Git repos are not scanned commit-by-commit, but instead, all blobs found in the repository are scanned. (This Git scanning technique uncovers more things than going commit-by-commit.)

To add this feature to Nosey Parker, we would need to add some alternative Git enumeration mechanism that would walk commit-by-commit and only select blobs reachable from the desired set of commits. The current source for Git repo enumeration is here.
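As a rough illustration of what "only select blobs reachable from the desired set of commits" could mean, here is a minimal Python sketch that shells out to Git. It is not how Nosey Parker enumerates repositories; the function names `select_blobs` and `blobs_in_range` are hypothetical, and it assumes `git` is on the PATH. It relies on `git rev-list --objects SINCE..UNTIL`, which lists objects referenced by the commits in the range, and on `git cat-file --batch-check` to tell blobs apart from commits and trees.

```python
import subprocess


def select_blobs(batch_check_output: str) -> list[str]:
    """Keep only the object IDs whose type is 'blob'.

    Expects lines of the form '<object id> <object type>', as printed by
    `git cat-file --batch-check='%(objectname) %(objecttype)'`.
    """
    blobs = []
    for line in batch_check_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "blob":
            blobs.append(parts[0])
    return blobs


def blobs_in_range(repo: str, since: str, until: str = "HEAD") -> list[str]:
    """List blob IDs referenced by the commits in since..until (hypothetical helper)."""
    # `git rev-list --objects` prints commits, trees, and blobs; trees and
    # blobs are followed by a path, so keep only the first field of each line.
    rev_list = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", f"{since}..{until}"],
        check=True, capture_output=True, text=True,
    ).stdout
    object_ids = [line.split(" ", 1)[0] for line in rev_list.splitlines() if line]
    if not object_ids:
        return []
    # Ask Git for the type of each object so that only blobs are selected.
    batch = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objectname) %(objecttype)"],
        input="\n".join(object_ids) + "\n",
        check=True, capture_output=True, text=True,
    ).stdout
    return select_blobs(batch)
```

A call such as `blobs_in_range(".", "v1.0")` would return the blob IDs to scan; note that this approach only sees blobs reachable from commits, unlike scanning every blob in the repository.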

Another thing to consider is the CLI for this added feature. noseyparker scan currently takes a list of paths as inputs; these paths can be files or directories. Would the --since_commits COMMIT and --until_commits COMMIT options apply to all the specified paths? It might be better to extend the newly-added --git-url URL input specifier to accept not just an HTTPS URL but also a Git revision specifier. So it might look something like --git-url https://github.com/praetorian-inc/[email protected].
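To make the idea of a URL carrying a revision specifier concrete, here is a small Python sketch of how such an input could be split. The `URL@REVISION` syntax and the function name `split_url_and_revision` are assumptions for illustration only; this is not an existing Nosey Parker feature. The split is done on the path component so that an `@` in the `user@host` part of a URL is left alone.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def split_url_and_revision(spec: str) -> tuple[str, Optional[str]]:
    """Split a hypothetical 'URL@REVISION' input into (url, revision).

    Returns (spec, None) when no revision suffix is present.
    """
    parts = urlsplit(spec)
    # Only look for '@' in the path, never in the network location.
    path, sep, revision = parts.path.rpartition("@")
    if not sep or not revision:
        return spec, None
    return urlunsplit(parts._replace(path=path)), revision
```

With this shape, each input carries its own revision, which sidesteps the question of whether a global option applies to every path given on the command line.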

bradlarsen · 2023-02-17T17:21:25Z

Another related change to this that I'd like to make in Nosey Parker is to keep track of which inputs have already been scanned, and avoid rescanning them if possible.

Currently, noseyparker scan -d DATASTORE INPUT will completely enumerate and scan INPUT from scratch. Nosey Parker is fast, but for large repositories (like the Linux kernel, with 100+GB of blobs), it still takes a couple of minutes. However, simply enumerating contents goes pretty quickly, especially in Git repositories (e.g., the Linux kernel repo can be enumerated in 13-25 seconds, depending on filesystem cache). If Nosey Parker kept track of which blobs it had scanned and with which set of rules, it could avoid re-scanning things.
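A minimal sketch of that bookkeeping, in Python rather than Nosey Parker's own code: remember each (blob ID, rule set) pair that has been scanned, and skip pairs already seen. The names `ScanLedger` and `ruleset_fingerprint` are hypothetical, and an in-memory set stands in for whatever the datastore would actually persist.

```python
import hashlib


def ruleset_fingerprint(rules: dict[str, str]) -> str:
    """Hash rule names and patterns so that any rule change yields a new key."""
    digest = hashlib.sha256()
    for name in sorted(rules):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(rules[name].encode())
        digest.update(b"\0")
    return digest.hexdigest()


class ScanLedger:
    """Remembers which blobs were scanned with which rule set (illustrative only)."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def pending(self, blob_ids: list[str], rules: dict[str, str]) -> list[str]:
        """Return the blobs that have not yet been scanned with these rules."""
        fingerprint = ruleset_fingerprint(rules)
        return [b for b in blob_ids if (b, fingerprint) not in self._seen]

    def record(self, blob_ids: list[str], rules: dict[str, str]) -> None:
        """Mark the blobs as scanned with these rules."""
        fingerprint = ruleset_fingerprint(rules)
        self._seen.update((b, fingerprint) for b in blob_ids)
```

Keying on the rule set as well as the blob means that adding or editing a rule makes every blob pending again, which matches the "with which set of rules" condition above.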

I'm going to make a separate issue for this.

bradlarsen · 2023-02-17T17:31:28Z

See also: #30

Coruscant11 · 2023-02-17T18:32:59Z

> One challenge with implementing this in Nosey Parker is that Git repos are not scanned commit-by-commit, but instead, all blobs found in the repository are scanned. (This Git scanning technique uncovers more things than going commit-by-commit.)

That is what I thought. With other scanners that go commit by commit, some repos can take a few hours to scan, while noseyparker took only 15 seconds. The purpose of this issue is to save time, but if the scanner is that fast, it is not necessarily worth implementing this quickly.

But even so, a feature to scan a specific revision would be very nice, I think! And why not specify a datastore, as you said, to avoid duplicate scans. For the rare people who work on insanely huge repositories 😄

For the Git revision scan, here is my personal use case:

1. Scan the whole repository.
2. Fetch the list of all commit hashes.
3. Save the commit list somewhere to keep the history.
4. Maybe one week later, fetch only the newest commit hashes and have noseyparker scan those newest Git revisions.
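The steps above can be sketched in Python as follows. This is only an illustration of the workflow, assuming `git` is on the PATH; `list_commits` and `new_commits` are hypothetical helper names, and where the saved history lives (a file, a small API) is left open.

```python
import subprocess
from typing import Iterable


def list_commits(repo: str) -> list[str]:
    """Return every commit hash reachable from any ref, as printed by Git."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def new_commits(current: Iterable[str], already_scanned: Iterable[str]) -> list[str]:
    """Return the commits in `current` that are absent from the saved history.

    The order of `current` is preserved.
    """
    seen = set(already_scanned)
    return [commit for commit in current if commit not in seen]
```

On a later run, `new_commits(list_commits(repo), saved_history)` gives the revisions that still need scanning, and the saved history is then extended with them.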

Datastores are nice, but I also think that in some cases you do not want to rely on them too much, for example in CI/CD, when you do not know where your program will run. That is what I do at work: I have a very small API whose only role is to save the commit scan history, not the secrets.

But this scanner seems so fast that it becomes a much smaller problem.

I had a question: in some repositories, the scanner finds the same number of distinct matches on every run, but not the same number of total matches. Do you know why? I do not think it is an issue, but I was wondering why.

Either way, noseyparker's scan method seems awesome. Very fast and, as you said, it discovers way more things. 😄

bradlarsen · 2023-02-20T16:00:24Z

> I had a question: in some repositories, the scanner finds the same number of distinct matches on every run, but not the same number of total matches. Do you know why? I do not think it is an issue, but I was wondering why.

I think you're talking about the summary table? For example, from scanning Nosey Parker's repo itself, you get something like this:

 Rule                                                      Distinct Matches   Total Matches
────────────────────────────────────────────────────────────────────────────────────────────
 PEM-Encoded Private Key                                                 76             276
 bcrypt Hash                                                             32             226
 Generic API Key                                                         25             131
 md5crypt Hash                                                           23             953
 Generic Secret                                                          23             245
 AWS API Key                                                             17              95
 Microsoft Teams Webhook                                                 12              12
 Credentials in PsExec                                                   11              12
 Azure App Configuration Connection String                               10              36

The numbers here for each rule indicate how many times that rule matched across all the scanned inputs.

Distinct Matches is the number of distinct groups extracted from the rule's regex (e.g., 951bc382db9abad29c68634761dd6e19 from the input - 'API_KEY = "951bc382db9abad29c68634761dd6e19"' for Generic API Key). This number is more representative of the number of unique things found from scanning.

Total Matches, in contrast, is simply the total number of times that rule matched across all the scanned inputs, without any concern for the content of regex groups. If some secret appears in 10 different files, those will all be included in Total Matches, even though they are all the same.

Distinct Matches will never be greater than Total Matches.
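The difference between the two columns can be illustrated with a few lines of Python. This is a simplified sketch, not Nosey Parker's implementation: the function name `count_matches` and the pattern below are stand-ins invented for the example, not the real Generic API Key rule.

```python
import re


def count_matches(pattern: str, inputs: list[str]) -> tuple[int, int]:
    """Return (distinct matches, total matches) for one rule over all inputs.

    Distinct matches counts unique values of the first capture group;
    total matches counts every match, repeated or not.
    """
    regex = re.compile(pattern)
    groups = [m.group(1) for text in inputs for m in regex.finditer(text)]
    return len(set(groups)), len(groups)


# A simplified, made-up rule that captures a 32-character hex key.
PATTERN = r'API_KEY = "([0-9a-f]{32})"'

files = [
    'API_KEY = "951bc382db9abad29c68634761dd6e19"',
    'API_KEY = "951bc382db9abad29c68634761dd6e19"',  # same secret, another file
    'API_KEY = "00000000000000000000000000000000"',
]
distinct, total = count_matches(PATTERN, files)
```

Here the same key appearing in two inputs contributes two total matches but only one distinct match.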

Coruscant11 · 2023-02-20T22:53:17Z

Oh sorry, maybe I explained it badly.

I think an image will be clearer 😆

Across two runs, the total match counts are not the same in very large repositories.

bradlarsen · 2023-02-20T23:02:13Z

@Coruscant11 That is surprising.

If you run noseyparker summarize --datastore np.vegas multiple times, does it always report the same numbers, or do those change from run to run?

Coruscant11 · 2023-02-21T08:38:35Z

It seems to be related to the scan:

I will create an issue for this later 😄

bradlarsen · 2023-02-21T15:07:07Z

Yeah that doesn't look right! Thanks for reporting that. A separate issue would be perfect.

bradlarsen · 2023-02-22T13:35:14Z

@Coruscant11 I created a new issue for the strange behavior you see: #32

bradlarsen · 2023-03-24T22:12:39Z

I've heard it would also be useful to have an option to skip digging into Git history altogether. Noting that here.

Coruscant11 changed the title ~~Add commits range scans in Git repositories~~ Add commits range for scan in Git repositories Feb 17, 2023

bradlarsen added the enhancement New feature or request label Feb 17, 2023

bradlarsen mentioned this issue Feb 22, 2023

Repeated noseyparker scan invocations produce different results #32

Closed

bradlarsen mentioned this issue Mar 21, 2023

Improve GitHub repository enumeration with filtering mechanisms #40

Closed

bradlarsen added the content discovery Related to enumerating or specifying content to scan label Apr 5, 2023

bradlarsen mentioned this issue Dec 1, 2023

Use noseyparker in CI #98

Closed

bradlarsen changed the title ~~Add commits range for scan in Git repositories~~ Add commits range option for scan in Git repositories Dec 17, 2024
