-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/REDBOX-138-tighten-the-definition-of-metadata #262
feature/REDBOX-138-tighten-the-definition-of-metadata #262
Conversation
@@ -67,43 +66,6 @@ def _create_embedding_function(self) -> SentenceTransformerEmbeddings: | |||
""" | |||
return SentenceTransformerEmbeddings() | |||
|
|||
def add_chunks_to_vector_store(self, chunks: list[Chunk]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isnt used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really smart PR, nice one! My only comments were mostly around using some Pydantic types to validate stuff like URLs. I've made one comment on URLs, but there's potentially more (like paths) in the commented-out fields. I leave implementation at your discretion as the underlying object is likely ultimately going to be the class you already type as.
@@ -24,13 +25,78 @@ class File(PersistableModel): | |||
bucket: str = Field(description="s3 bucket") | |||
|
|||
|
|||
class Link(BaseModel): | |||
text: Optional[str] | |||
url: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider typing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, but we cant add additional constraints on top of https://github.com/Unstructured-IO/unstructured/blob/df1f7bcd0e529c7af5e6848b72c65152c8665653/unstructured/documents/elements.py#L143 , i.e. if unstructured sends us back a dodgy url i think we just have to accept it
redbox/models/file.py
Outdated
# orig_elements: Optional[list[Element]] = None | ||
# page_name: Optional[str] = None | ||
page_number: Optional[int | list[int]] = None | ||
# parent_id: Optional[ UUID] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 774754c
link_texts: Optional[list[str]] = None | ||
link_urls: Optional[list[str]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not duplication given the Link
class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is. This is just how unstructured does it im afraid https://github.com/Unstructured-IO/unstructured/blob/df1f7bcd0e529c7af5e6848b72c65152c8665653/unstructured/documents/elements.py#L183 (if the unstructured class was a pydantic class we could just use that)
Context
As an Engineer I want the model definitions to be well defined so that we do not end up supporting an unlimited number of features
Changes proposed in this pull request
I have added a pydantic class to cover part of the Unstructured Element Metadata class, this could be extended in future
Guidance to review
Relevant links
https://technologyprogramme.atlassian.net/browse/REDBOX-138
Things to check