Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/REDBOX-138-tighten-the-definition-of-metadata #262

Merged

Conversation

gecBurton
Copy link
Collaborator

@gecBurton gecBurton commented Apr 23, 2024

Context

As an Engineer I want the model definitions to be well defined so that we do not end up supporting an unlimited number of features

Changes proposed in this pull request

I have added a pydantic class to cover part of the Unstructured Element Metadata class, this could be extended in future

Guidance to review

Relevant links

https://technologyprogramme.atlassian.net/browse/REDBOX-138

Things to check

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests

@gecBurton gecBurton changed the title feature/locked down model for Metadata feature/REDBOX-138-tighten-the-definition-of-metadata Apr 23, 2024
@gecBurton gecBurton marked this pull request as draft April 23, 2024 13:51
@@ -67,43 +66,6 @@ def _create_embedding_function(self) -> SentenceTransformerEmbeddings:
"""
return SentenceTransformerEmbeddings()

def add_chunks_to_vector_store(self, chunks: list[Chunk]) -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isnt used

@gecBurton gecBurton marked this pull request as ready for review April 23, 2024 14:44
@gecBurton gecBurton requested a review from lmwilkigov April 24, 2024 20:45
Copy link
Collaborator

@wpfl-dbt wpfl-dbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really smart PR, nice one! My only comments were mostly around using some Pydantic types to validate stuff like URLs. I've made one comment on URLs, but there's potentially more (like paths) in the commented-out fields. I leave implementation at your discretion as the underlying object is likely ultimately going to be the class you already type as.

@@ -24,13 +25,78 @@ class File(PersistableModel):
bucket: str = Field(description="s3 bucket")


class Link(BaseModel):
text: Optional[str]
url: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider typing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, but we cant add additional constraints on top of https://github.com/Unstructured-IO/unstructured/blob/df1f7bcd0e529c7af5e6848b72c65152c8665653/unstructured/documents/elements.py#L143 , i.e. if unstructured sends us back a dodgy url i think we just have to accept it

# orig_elements: Optional[list[Element]] = None
# page_name: Optional[str] = None
page_number: Optional[int | list[int]] = None
# parent_id: Optional[ UUID] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 774754c

Comment on lines +60 to +61
link_texts: Optional[list[str]] = None
link_urls: Optional[list[str]] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not duplication given the Link class?

Copy link
Collaborator Author

@gecBurton gecBurton Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is. This is just how unstructured does it im afraid https://github.com/Unstructured-IO/unstructured/blob/df1f7bcd0e529c7af5e6848b72c65152c8665653/unstructured/documents/elements.py#L183 (if the unstructured class was a pydantic class we could just use that)

@lmwilkigov lmwilkigov merged commit a7a4165 into main Apr 26, 2024
9 checks passed
@gecBurton gecBurton deleted the feature/REDBOX-138-tighten-the-definition-of-metadata branch May 3, 2024 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants