Setting file.metadata.selection to none still results in files in the SBOM #2989

Open
tomersein opened this issue Jun 25, 2024 · 10 comments · May be fixed by #3132
Labels: bug (Something isn't working), files (relating to file nodes in the SBOM)

Comments

@tomersein
Contributor

What happened:
When I set selection to "none", I still get a "files" entry in the final JSON.
What you expected to happen:
With "none", the "files" entry should be removed.
Steps to reproduce the issue:
Use this config.yaml:

file:

   metadata: 
      # select which files should be captured by the file-metadata cataloger and included in the SBOM. 
      # Options include:
      #  - "all": capture all files from the search space
      #  - "owned-by-package": capture only files owned by packages
      #  - "none", "": do not capture any files
      # SYFT_FILE_METADATA_SELECTION env var
      selection: "none"

Scan an image or a directory.
Anything else we need to know?:
I did a quick check, and the issue appears to be in this function:
func toFile(s sbom.SBOM) []model.File
I think that with "none" it either shouldn't enter this function, or it should skip entries where all fields (metadata, digests, etc.) are empty.

Environment:

  • Output of syft version: 1.8.0
  • OS (e.g: cat /etc/os-release or similar): macOS
@tomersein tomersein added the bug Something isn't working label Jun 25, 2024
@kzantow
Contributor

kzantow commented Jun 25, 2024

I can confirm there seems to be something unexpected happening here:

SYFT_FILE_METADATA_SELECTION=none syft alpine:latest -o json

It results in a files section with no metadata or any other information such as digests:

  "files": [
    {
      "id": "a74cadfe8cda7a82",
      "location": {
        "path": "/bin/busybox",
        "layerID": "sha256:02f2bcb26af5ea6d185dcf509dc795746d907ae10c53918b6944ac85447a0c72"
      }
    },
   ...

For what it's worth: I think it might make sense for this flag to prevent metadata from being captured, rather than preventing files from being captured, and perhaps we should think about introducing a new configuration for the entire file section to disable all file data collection, e.g.:

file:
  # enable file cataloging
  enabled: true
  - or -
  selection: ...

  metadata:
    # select which files should be captured by the file-metadata cataloger and included in the SBOM. 
    # Options include:
    #  - "all": capture all files from the search space
    #  - "owned-by-package": capture only files owned by packages
    #  - "none", "": do not capture any files (env: SYFT_FILE_METADATA_SELECTION)
    selection: 'owned-by-package'
    
    # the file digest algorithms to use when cataloging files (options: "md5", "sha1", "sha224", "sha256", "sha384", "sha512") (env: SYFT_FILE_METADATA_DIGESTS)
    digests: 
      - 'sha1'
      - 'sha256'    
   ...

@kzantow kzantow moved this to Backlog in OSS Jun 25, 2024
@wagoodman wagoodman added the files relating to file nodes in the SBOM label Jul 2, 2024
@wagoodman wagoodman changed the title from ""none" under file selection in configuration doesn't work as expected" to "Setting file.metadata.selection to none still results in files in the SBOM" Nov 12, 2024
@wagoodman wagoodman moved this from Backlog to In Progress in OSS Nov 12, 2024
@wagoodman
Contributor

I agree that someone could interpret setting this to none as meaning that there should be no files, especially given the help wording. I agree with the file.enabled option here, which is the same as in the already-opened PR 👍.

@wagoodman wagoodman self-assigned this Nov 12, 2024
@wagoodman wagoodman moved this from In Progress to In Review in OSS Nov 12, 2024
@willmurphyscode
Contributor

I have some questions about the changes here:

  1. What about package catalogers that find files themselves? For example, the APK cataloger finds files that are owned by different APK packages. Should those get dropped?
  2. What about de-duping with ownership by file overlap between OS package manager packages and binary packages? If the answer to item 1 is "yes, drop those packages," what happens to this feature? Can we still detect the relationships?

@tomersein besides the surprise of the setting not doing what you expected, is there a reason you don't want a .files key in the output? I'm trying to understand the user motivation better to help me reason about the questions above.

@tomersein
Contributor Author

I actually want a way to disable this cataloger.
I am trying to optimize Syft's running time, and I noticed this behavior along the way.
I don't think this behavior (or bug) is a deal-breaker.

@willmurphyscode
Contributor

I actually want a way to disable this cataloger.

You mean the file metadata cataloger?

@tomersein
Contributor Author

Yes, but when I set "none" I still see results, which makes me think it is still running.

@willmurphyscode
Contributor

I think we want to agree on what Syft currently does before we decide to change it. For that reason, I made 3 SBOMs with the possible settings, like this:

SYFT_FILE_METADATA_SELECTION=none syft alpine:latest -o json > syft-none.json
SYFT_FILE_METADATA_SELECTION=owned-by-package syft alpine:latest -o json > syft-owned.json
SYFT_FILE_METADATA_SELECTION=all syft alpine:latest -o json > syft-all.json

Then we can use jq to compare the SBOMs and see what's different:

❯ jq '.files | length' syft-none.json
77
❯ jq '.files | length' syft-owned.json
77
❯ jq '.files | length' syft-all.json
517

So we can see here that when the file metadata cataloger is set to "all", it gets metadata for all the files in the image. But why do "syft-none" and "syft-owned" have the same number of files? It's because when Syft finds files that are owned by a package, it emits a relationship of type "contains" for that package, where the parent is the package and the child is the file.
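
To make that mechanism concrete, here is a minimal Go sketch of how file entries can appear without the file metadata cataloger. The types Coordinates, Relationship, and FileEntry and the function fileEntries are hypothetical simplifications written for this illustration; they are not Syft's actual types or code. The sketch assumes that every target of a "contains" relationship becomes a file entry, and that metadata is attached only when a cataloger recorded it.

```go
package main

// Coordinates, Relationship, and FileEntry are simplified stand-ins
// for illustration only; they are NOT Syft's real types.
type Coordinates struct {
	Path    string
	LayerID string
}

type Relationship struct {
	From string // owning package name (hypothetical simplification)
	To   Coordinates
	Type string // e.g. "contains"
}

type FileEntry struct {
	Location    Coordinates
	HasMetadata bool
}

// fileEntries returns one entry per distinct file that is either the
// target of a "contains" relationship or has recorded metadata.
// Relationship targets come first, in input order; files known only
// through metadata follow in unspecified (map iteration) order.
func fileEntries(rels []Relationship, metadata map[Coordinates]bool) []FileEntry {
	seen := map[Coordinates]bool{}
	var out []FileEntry
	add := func(c Coordinates) {
		if seen[c] {
			return
		}
		seen[c] = true
		// Reading a missing key (or a nil map) yields false.
		out = append(out, FileEntry{Location: c, HasMetadata: metadata[c]})
	}
	for _, r := range rels {
		if r.Type == "contains" {
			add(r.To)
		}
	}
	for c := range metadata {
		add(c)
	}
	return out
}
```

Under these assumptions, calling fileEntries with a nil metadata map (the "none" case) still returns one entry per distinct relationship target, each with HasMetadata set to false. That mirrors the entries that carry only id and location.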

So does setting SYFT_FILE_METADATA_SELECTION=none have no effect at all? Not quite; let's look at what keys are in the JSON:

# NONE
❯ jq -r '.files[] | keys | join("\n")' syft-none.json | sort | uniq
id
location
# OWNED
❯ jq -r '.files[] | keys | join("\n")' syft-owned.json | sort | uniq
digests
executable
id
location
metadata
# ALL
❯ jq -r '.files[] | keys | join("\n")' syft-all.json | sort | uniq
digests
executable
id
location
metadata
unknowns

So we can see that less information is captured in the files part of the SBOM when metadata is set to none.
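
The same comparison can be scripted without jq. The following Go sketch is one way to do it, assuming only what the output above shows: a top-level "files" array whose entries are JSON objects. The function name summarizeFiles is hypothetical.

```go
package main

import (
	"encoding/json"
	"sort"
)

// summarizeFiles decodes a Syft JSON document and returns the number of
// entries under "files" plus the sorted set of keys used by those entries.
// It mirrors `jq '.files | length'` and
// `jq -r '.files[] | keys | join("\n")' | sort | uniq`.
func summarizeFiles(doc []byte) (int, []string, error) {
	var sbom struct {
		// Values stay as raw JSON because only the key names matter here.
		Files []map[string]json.RawMessage `json:"files"`
	}
	if err := json.Unmarshal(doc, &sbom); err != nil {
		return 0, nil, err
	}

	seen := map[string]struct{}{}
	for _, f := range sbom.Files {
		for k := range f {
			seen[k] = struct{}{}
		}
	}

	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return len(sbom.Files), keys, nil
}
```

Reading each of the three SBOM files with os.ReadFile and passing the bytes to summarizeFiles should report the same counts and key sets as the jq commands.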

This brings us to the PR feedback @wagoodman had and the question I had:

When Syft finds a package that owns files, this information is encoded in the SBOM via relationships. (If the metadata cataloger is set to "none", these files lack metadata like digest and mode, but they are still present.) The feedback on the PR is: we need to have both the files and the relationships that point to them, or neither. The question I have is: are we willing to let the metadata cataloger selection interfere with other settings that rely on file relationships, like removing binaries that overlap by file ownership with OS packages?

I've added the needs-discussion label, since I think the right course of action here isn't obvious.

@wagoodman
Contributor

Based off of the discussion on the live stream about using the existing cataloger selection facilities in syft, I have a separate proposal for being able to augment how to turn off file catalogers: #3505

Essentially, this would allow syft myimage:latest --select-catalogers '-file' to turn off file cataloging entirely (the select-catalogers configuration option in YAML).

@willmurphyscode
Contributor

Hi @tomersein! Thanks for your patience on this issue. We did some digging, and there are a few places file objects are created besides the file metadata cataloger:

  1. evident-by relationships: https://github.com/anchore/syft/blob/main/internal/task/relationship_tasks.go#L72-L75. For example, when you scan an Alpine image, you'll see a file at /lib/apk/db/installed that many APK packages point to with an evident-by relationship. This is not currently configurable. If you wish to turn off file cataloging entirely, you'd need to make it configurable.
  2. Package file ownership at
    var relationships []artifact.Relationship
    for _, location := range locations {
        relationships = append(relationships, artifact.Relationship{
            From: p,
            To:   location.Coordinates,
            Type: artifact.ContainsRelationship,
        })
    }
    - Basically, certain packages own files, which causes file entries to be emitted for those files. This is configurable via SYFT_RELATIONSHIPS_PACKAGE_FILE_OWNERSHIP, but defaults to on.
  3. Unknowns, at
    for _, r := range s.Relationships {
        _, fromPkgOk := r.From.(pkg.Package)
        fromFile, fromFileOk := r.From.(file.Coordinates)
        _, toPkgOk := r.To.(pkg.Package)
        toFile, toFileOk := r.To.(file.Coordinates)
        if fromPkgOk && toFileOk {
            allPackageCoords.Add(toFile)
        } else if fromFileOk && toPkgOk {
            allPackageCoords.Add(fromFile)
        }
    }
    - There is some configuration controlling this.

So if your goal is to have no .files in the resulting SBOM, you could set SYFT_RELATIONSHIPS_PACKAGE_FILE_OWNERSHIP to false, but there would still be evident-by files and maybe unknown files. If you want to move forward with the linked PR, the way to do it is to add configs that disable evident-by relationships and unknown files completely. However, I think maybe your goal is just "Syft should be faster", in which case you've already turned off all of the extra I/O associated with file catalogers.

I actually want a way to disable this cataloger.

SYFT_FILE_METADATA_SELECTION=none does disable the file catalogers. It's just that the file catalogers are not the only source of files.

I am trying to optimize Syft's running time, and I noticed this behavior along the way.

I don't know that emitting these file relationships takes very much time. You can see which catalogers are taking time by passing -v to Syft; you'll see a bunch of lines like this:

[0001]  INFO task completed elapsed=36.167µs task=dotnet-portable-executable-cataloger
[0001]  INFO task completed elapsed=65.083µs task=python-installed-package-cataloger
[0001]  INFO task completed elapsed=1.716541ms task=go-module-binary-cataloger
[0001]  INFO task completed elapsed=65.709µs task=java-archive-cataloger

You can also set SYFT_DEV_PROFILE=mem or SYFT_DEV_PROFILE=cpu in the environment if you want to do lower level profiling than cataloger runtimes.
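
If there are many "task completed" lines, ranking them by duration makes the slow catalogers easier to spot. Below is a small Go sketch that does this, assuming the log lines keep the elapsed=... task=... shape shown above. The names taskTiming and slowestTasks are made up for this example and are not part of Syft.

```go
package main

import (
	"regexp"
	"sort"
	"strings"
	"time"
)

// taskTiming pairs a task (cataloger) name with its reported runtime.
type taskTiming struct {
	Task    string
	Elapsed time.Duration
}

// taskLine matches lines such as:
//   [0001]  INFO task completed elapsed=36.167µs task=java-archive-cataloger
var taskLine = regexp.MustCompile(`task completed elapsed=(\S+) task=(\S+)`)

// slowestTasks extracts every "task completed" line from the log text and
// returns the timings sorted from slowest to fastest. Lines that do not
// match, or whose duration cannot be parsed, are skipped.
func slowestTasks(log string) []taskTiming {
	var timings []taskTiming
	for _, line := range strings.Split(log, "\n") {
		m := taskLine.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		// time.ParseDuration understands units like "µs" and "ms".
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		timings = append(timings, taskTiming{Task: m[2], Elapsed: d})
	}
	sort.Slice(timings, func(i, j int) bool {
		return timings[i].Elapsed > timings[j].Elapsed
	})
	return timings
}
```

Applied to the four sample lines above, the first element would be go-module-binary-cataloger, since 1.716541ms is the largest duration in that excerpt.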

I hope this helps! Let me know how you'd like to proceed with this issue and the related PR.

@willmurphyscode
Contributor

I think the next step here is to get #3505 merged. I'm removing needs-discussion because we've already discussed enough to get a specific PR.

@wagoodman wagoodman moved this from In Review to Stalled in OSS Jan 9, 2025
Projects
Status: Stalled
4 participants