Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: increase the robustness of megaparse #121

Merged
merged 6 commits into from
Nov 13, 2024
Merged

feat: increase the robustness of megaparse #121

merged 6 commits into from
Nov 13, 2024

Conversation

jacopo-chevallard
Copy link
Contributor

  • passing base url and timeout explicitly, so that they can be configured from the module caling the sdk
  • improving the way we check for supported models, which are now defined in a separate module
  • using docker-compose instead of a Dockerfile directly

Dockerfile Show resolved Hide resolved
@@ -4,8 +4,8 @@


class MegaParseSDK:
def __init__(self, api_key: str):
self.client = MegaParseClient(api_key)
def __init__(self, api_key: str, base_url: str | None = None, timeout: int = 600):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout won't affect the http request. By default in production, the timeout is on the laod balancer and not the app. It is at 60 seconds. Adding this here gives a false sense of: you can increase the timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but this is the reason why the parsing fails, since it takes several minutes for a pdf documents with 30 pages...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and this is how I fixed it locally, but I understand that this is not a good solution for prod...)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @StanGirard 👍🏼 . We should have faster parsing always or send the parser a page by page of the pdf from the workers

Copy link
Collaborator

@chloedia chloedia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need docker compose in the repo?

def __init__(
self,
api_key: str | None = None,
base_url: str | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like those in env variables but as you want



class SupportedModel(Enum):
GPT_4O = ("gpt-4o", None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats Nice !

Additionnal typeing can be used here (either NamedTuples or typped dict ...)

@@ -4,8 +4,8 @@


class MegaParseSDK:
def __init__(self, api_key: str):
self.client = MegaParseClient(api_key)
def __init__(self, api_key: str, base_url: str | None = None, timeout: int = 600):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @StanGirard 👍🏼 . We should have faster parsing always or send the parser a page by page of the pdf from the workers

@AmineDiro AmineDiro merged commit d21d8bb into main Nov 13, 2024
1 check passed
chloedia pushed a commit that referenced this pull request Nov 15, 2024
* chore: aligining build commands with what used in quivr-core

* chore: adding uvloop dependency

* feat: improving the way we check for supported models

* feat: passing url and timeout explicitly

* chore: restoring launch command in Docker

* feat: improving dealing with MegaParse configuration
chloedia added a commit that referenced this pull request Nov 18, 2024
…#123)

* fix: fixing the wrong passing of arguments to the parse_file endpoint

* feat: increase the robustness of megaparse (#121)

* chore: aligining build commands with what used in quivr-core

* chore: adding uvloop dependency

* feat: improving the way we check for supported models

* feat: passing url and timeout explicitly

* chore: restoring launch command in Docker

* feat: improving dealing with MegaParse configuration

* fix: version requirements (#126)

* fix: version requirements

* fix: fix versions

* fix: uvicorn version

* fix: uvicorn version (#127)

* chore(main): release megaparse 0.0.43 (#125)

* fix: typing and test error

* fix: change url config

* add: a SupportedModels enum for api

* fix: change name UploadFileInput to UploadFileConfig

---------

Co-authored-by: Chloé Daems <[email protected]>
Co-authored-by: Stan Girard <[email protected]>
Co-authored-by: chloedia <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants