Organize mono-repo & extract ML core to separate PyPI-distributable library. #295

Open
stepan-anokhin opened this issue Feb 22, 2021 · 2 comments
Collaborator

stepan-anokhin commented Feb 22, 2021

Problem

Currently the repository contains multiple applications that share some logic but generally have different dependencies:

  • Deduplication app
  • REST API server
  • repo-admin CLI tool
  • just CLI tool

The API server and repo-admin require some of the dependencies from winnow, but not all of them. Some of the reusable parts are extracted into packages placed at the repository root (e.g. task_queue, db). There are also many files at the root that relate to the deduplication app but not to the rest of the applications.

Problems:

  • No standard way to share logic between different applications and manage dependencies.
  • No standard way to build distributable Python packages for different apps.
  • No clear repository structure. It will get worse as we add more and more complexity. New developers will spend a lot of time figuring out where things live in the repository and how to resolve dependencies.
  • We tend to place everything into the large winnow package. This prevents us from having small reusable libraries that could be used by other projects without depending on our entire infrastructure.
  • We tend to duplicate some logic. E.g. repo-admin needs to be tiny, PyPI-distributable and independent from winnow, but it needs some logic from just, which depends on winnow. As a result, some of the logic from just is duplicated in repo-admin.
  • As we don't isolate reusable pieces of logic from each other, it becomes hard to adapt our monolithic architecture to new use cases (like investigative journalism).

As a result, our monorepo is becoming disorganized, and the above problems will get worse as we add more complexity.

Goals

Improve monorepo organization so that:

  • There is a simple and well-documented way to share code between different applications without code duplication or unnecessary dependencies.
  • There is a straightforward way to build a PyPI package and a Docker container for each application or library if needed.
  • There is a straightforward and convenient way for newcomers to set up a development environment and start developing.
  • The most common ML features are extracted into a separate library so that they can be easily reused in different environments.

Possible solution:

We can consider the approach described in https://medium.com/opendoor-labs/our-python-monorepo-d34028f2b6fa
A working example can be found here: https://github.com/ya-mori/python-monorepo

The difficult part is that the ML code uses the conda dependency manager.

Collaborator Author

stepan-anokhin commented Feb 28, 2021

Findings

poetry's support for monorepositories is not complete yet, but it is being actively discussed at the moment (see the corresponding feature request python-poetry/poetry#936). It seems that poetry already supports some monorepo features, though: namely, it allows mixing versioned and editable local path dependencies (see the corresponding pypa/packaging.python.org#506 (comment)). I've tested this approach and it seems to work well: all projects/libs use editable installs from the current codebase, while build artifacts have versioned dependencies. So this is good news.
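
A rough illustration of the idea above, not poetry's own mechanism: during development every internal library is a local path dependency installed in editable mode, while a published artifact should declare a version constraint instead. The helper `release_dependencies` and the package names (`ml-core`, `task-queue`) are hypothetical; only the `path`/`develop` keys and the caret constraint follow poetry's dependency syntax.

```python
from typing import Dict, Union

# A dependency is either a version constraint ("^1.2.0") or a table such as
# {"path": "../libs/ml-core", "develop": True}, mirroring poetry's syntax.
DependencySpec = Union[str, Dict[str, object]]


def release_dependencies(
    dev_deps: Dict[str, DependencySpec],
    versions: Dict[str, str],
) -> Dict[str, DependencySpec]:
    """Swap local path dependencies for version constraints (hypothetical helper)."""
    release: Dict[str, DependencySpec] = {}
    for name, spec in dev_deps.items():
        if isinstance(spec, dict) and "path" in spec:
            # Internal library: depend on the version it is published under.
            release[name] = f"^{versions[name]}"
        else:
            # External dependency: keep the constraint unchanged.
            release[name] = spec
    return release


# Hypothetical development-time dependency table of one application.
dev = {
    "python": "^3.8",
    "ml-core": {"path": "../libs/ml-core", "develop": True},
    "task-queue": {"path": "../libs/task-queue", "develop": True},
}

print(release_dependencies(dev, {"ml-core": "0.3.1", "task-queue": "1.0.0"}))
```

The point of the sketch is only that one dependency list can be read two ways: editable paths for day-to-day work, version constraints for whatever gets uploaded to PyPI.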

Remaining Challenges

Investigate how to manage conda dependencies in ML-related packages. Some of the projects (e.g. server) share some logic with the dedup-app but don't need ML dependencies or conda at all, so they could rely only on poetry and Python's packaging standards. On the other hand, for ML-related projects (e.g. dedup-app) it is nice to have conda packages, as they come pre-compiled and all necessary .so libraries come with the conda installation out of the box. We need to figure out how to resolve this contradiction: either some of the poetry projects need to depend on conda projects, or some of the conda projects need to depend on poetry projects, or one of the dependency management systems should be dropped in favor of the other.

Some Related Links

Collaborator Author

stepan-anokhin commented Mar 31, 2021

Possible solution:

  • Use poetry for all projects except the dedup application itself (pipeline).
  • Use the conda-develop command to install non-conda packages when working with the dedup app.

Rationale:
We already do a similar thing by placing the db package at the repository root.
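
For context, conda develop works by recording a source directory in a conda.pth file inside the environment's site-packages, so packages in that directory become importable without being built or copied. The sketch below imitates that effect in throwaway directories; `link_source_dir` and the `db_demo` package are hypothetical names used only for illustration, and this is not the conda implementation itself.

```python
import importlib
import site
import tempfile
from pathlib import Path


def link_source_dir(site_dir: Path, source_dir: Path, pth_name: str = "conda.pth") -> Path:
    """Record source_dir in a .pth file inside site_dir (hypothetical helper)."""
    pth_file = site_dir / pth_name
    entries = pth_file.read_text().splitlines() if pth_file.exists() else []
    entry = str(source_dir.resolve())
    if entry not in entries:
        # Each line of a .pth file is a directory to add to sys.path.
        entries.append(entry)
        pth_file.write_text("\n".join(entries) + "\n")
    return pth_file


# Throwaway stand-ins for an environment's site-packages and the repository root.
root = Path(tempfile.mkdtemp())
fake_site = root / "site-packages"
repo = root / "repo"
fake_site.mkdir()
(repo / "db_demo").mkdir(parents=True)
(repo / "db_demo" / "__init__.py").write_text("NAME = 'db_demo'\n")

pth = link_source_dir(fake_site, repo)

# Ask Python to process the .pth files in the fake site-packages directory.
site.addsitedir(str(fake_site))
importlib.invalidate_caches()

db_demo = importlib.import_module("db_demo")
print(db_demo.NAME)
```

Because the package is linked rather than installed, edits to the source are picked up on the next import, which is the same property the editable poetry installs give the non-conda projects.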

Links:
