Incomplete mime_type data for several ContentTypes #816

Closed
Nicba1010 opened this issue Nov 22, 2024 · 5 comments
Labels
python Pull requests that update Python package

Comments

@Nicba1010

I do not know if this is by design, but some ContentTypes seem to be missing their associated mime_type values.

e.g. the MKV content type specifies its mime_type as None instead of video/x-matroska

@reyammer
Collaborator

Thanks for reporting! If a content type that is actually used by any of the models has any of its metadata set to None, that's not by design. But I just checked MKV, and it has the correct mime_type (see https://github.com/google/magika/blob/main/assets/content_types_kb.min.json). Where did you see its mime_type set to None?

Are you maybe checking an old version of the metadata?

@Nicba1010
Author

Well I'm using the magika Python package (version 0.5.1) on Python 3.12.

This is the code I used to print all the MIME Types.

from magika.content_types import ContentTypesManager


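def get_mime_type_list() -> list[str]:
    """Return the distinct, non-None MIME types known to ContentTypesManager."""
    return list({
        str(content_type.mime_type)
        # Only the values are needed; the dictionary keys (names) are unused.
        for content_type
        in ContentTypesManager().cts.values()
        if content_type.mime_type is not None
    })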

if __name__ == '__main__':
    for mime_type in get_mime_type_list():
        print(mime_type)

Also, a second question: I see that in 0.6.0rc3 the ContentTypesManager is gone, and I can't find an equivalent to test this with.

@reyammer
Collaborator

reyammer commented Dec 9, 2024

[Sorry for late reply, was on holidays.]

Thanks for clarifying. 0.5.1 is indeed the old version; in the new update we should have proper metadata for all the content types supported by any of the models.

And yes, the ContentTypesManager class has been refactored out (and it was never supposed to be used by external clients). The same information is now in a private field, _cts_infos (https://github.com/google/magika/blob/main/python/src/magika/magika.py#L98), stored in a better way (with enums and such), but again, we did not intend this to be used by external clients.

One thing to keep in mind: there is a distinction between the list of "content types a given model supports" and "the full list of content types we are aware of" (which is a superset of the former, and for which the metadata may not be ready for consumption).

What would be your use case for using it? If we were to add it, which kind of APIs would you like to have?

@Nicba1010
Author

All good, thanks for replying at all. I send the list to the frontend of my app because a user can type a MIME type into a field, and this is how I verify that it actually exists. It might be quite a niche use case though, and there are probably better solutions.
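
For illustration, the validation step described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not code from the app in question: the helper names (normalize_mime_type, is_known_mime_type) and the contents of KNOWN_MIME_TYPES are made up for the example, and in practice the set would be built from the content type metadata rather than hard-coded.

```python
def normalize_mime_type(value: str) -> str:
    """Drop any parameters (e.g. '; charset=utf-8'), trim whitespace, lowercase."""
    return value.split(";", 1)[0].strip().lower()


def is_known_mime_type(value: str, known: frozenset[str]) -> bool:
    """Return True if the normalized value is a member of the known set."""
    return normalize_mime_type(value) in known


# Illustrative values only (assumption): a real set would be derived from
# the content type metadata, skipping entries whose mime_type is None.
KNOWN_MIME_TYPES = frozenset({"video/x-matroska", "application/pdf", "text/plain"})

if __name__ == "__main__":
    for candidate in ("video/x-matroska", " Text/Plain; charset=utf-8", "video/mkv"):
        print(repr(candidate), is_known_mime_type(candidate, KNOWN_MIME_TYPES))
```

Normalizing before the lookup keeps the check tolerant of casing and trailing parameters typed by a user, while still rejecting values that are not in the set.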

@reyammer
Collaborator

I see, interesting. In general I believe the knowledge base of types may be useful for use cases like yours, so I see a good argument to expose it in some way or another. Maybe we could add some methods to the Magika class itself: one that would allow you to "get all content types" (returning a list of ContentTypeLabel, which is an enum), and another to query the metadata associated with each of them. Opened #826 to track this. Closing this one as I believe we clarified why things were missing; feel free to re-open and follow up if something is still not clear. Thanks!
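
To make the shape of that idea concrete, here is a purely illustrative sketch of the two kinds of methods mentioned above, together with the "model-supported" versus "all known" distinction. None of this is Magika's actual API: the class KnowledgeBase, its method names, the ContentTypeInfo record, and the enum members (including the made-up FOO type) are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContentTypeLabel(Enum):
    """Hypothetical stand-in for an enum of content type labels."""
    MKV = "mkv"
    PDF = "pdf"
    FOO = "foo"  # made-up type: known to the knowledge base, not to the model


@dataclass(frozen=True)
class ContentTypeInfo:
    """Hypothetical metadata record; mime_type may be None if not yet curated."""
    label: ContentTypeLabel
    mime_type: Optional[str]


class KnowledgeBase:
    """Hypothetical wrapper separating 'all known' from 'model-supported' types."""

    def __init__(self, infos: dict[ContentTypeLabel, ContentTypeInfo],
                 model_labels: set[ContentTypeLabel]) -> None:
        # Model-supported labels must be a subset of everything we know about.
        if not model_labels.issubset(infos):
            raise ValueError("model_labels contains labels missing from infos")
        self._infos = infos
        self._model_labels = model_labels

    def get_all_content_types(self) -> list[ContentTypeLabel]:
        """Every label in the knowledge base (the superset)."""
        return list(self._infos)

    def get_model_content_types(self) -> list[ContentTypeLabel]:
        """Only the labels the model can actually output."""
        return [label for label in self._infos if label in self._model_labels]

    def get_content_type_info(self, label: ContentTypeLabel) -> ContentTypeInfo:
        """Metadata associated with a single label."""
        return self._infos[label]


kb = KnowledgeBase(
    infos={
        ContentTypeLabel.MKV: ContentTypeInfo(ContentTypeLabel.MKV, "video/x-matroska"),
        ContentTypeLabel.PDF: ContentTypeInfo(ContentTypeLabel.PDF, "application/pdf"),
        ContentTypeLabel.FOO: ContentTypeInfo(ContentTypeLabel.FOO, None),
    },
    model_labels={ContentTypeLabel.MKV, ContentTypeLabel.PDF},
)
```

A caller interested only in reliable metadata would iterate over get_model_content_types() and look up each label with get_content_type_info(), which avoids the None values that can appear for types outside the model's supported set.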

@reyammer reyammer added the python Pull requests that update Python package label Dec 10, 2024