Currently the repository contains multiple applications that share some logic but in general have different dependencies:
Deduplication app
REST API server
repo-admin cli tool
just cli tool
The API server and repo-admin require some of the dependencies from winnow, but not all of them. Some of the reusable parts are extracted into packages placed at the repository root (e.g. task_queue, db). Also, there are a lot of files at the root that are related to the deduplication app but not to the rest of the applications.
Problems:
No standard way to share logic between different applications and manage dependencies
No standard way to build distributable Python packages for different apps.
No clear repository structure. It will get worse as we add more and more complexity. New developers will spend a lot of time figuring out where things live in the repository and how to resolve dependencies.
We tend to place everything into the large winnow package. This prevents us from having small reusable libraries that could be utilized by other projects without depending on our entire infrastructure.
We tend to duplicate some logic. E.g. repo-admin needs to be tiny, PyPI-distributable and independent from winnow, but it needs some logic from just, which depends on winnow. As a result, some of the logic from just is duplicated in repo-admin.
As we don't isolate reusable pieces of logic from each other, it becomes hard to adapt our monolith architecture for new use cases (like investigative journalism).
As a result, our monorepo gets disorganized, and as we add more complexity the above problems will get worse.
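To make the repo-admin example above concrete, here is a minimal Python sketch that models internal packages as a graph and computes what each application pulls in transitively. The edges are assumptions that loosely follow the description above, and the common library is purely hypothetical (it does not exist in the repository); the sketch only illustrates why depending on just drags in winnow.

```python
from typing import Dict, Set

# Hypothetical internal dependency graph: package -> direct internal dependencies.
# The edges are assumed for illustration and loosely follow the issue text.
CURRENT_LAYOUT: Dict[str, Set[str]] = {
    "winnow": set(),
    "just": {"winnow"},
    "repo-admin": {"just"},  # what reusing logic from just directly would imply
}

# Assumed alternative: the shared logic lives in a small, hypothetical
# "common" library that neither depends on winnow nor on just.
PROPOSED_LAYOUT: Dict[str, Set[str]] = {
    "winnow": set(),
    "common": set(),
    "just": {"winnow", "common"},
    "repo-admin": {"common"},
}


def transitive_deps(graph: Dict[str, Set[str]], package: str) -> Set[str]:
    """Return every internal package reachable from `package`."""
    seen: Set[str] = set()
    stack = list(graph.get(package, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, ()))
    return seen


if __name__ == "__main__":
    for label, layout in (("current", CURRENT_LAYOUT), ("proposed", PROPOSED_LAYOUT)):
        deps = sorted(transitive_deps(layout, "repo-admin"))
        print(f"{label}: repo-admin pulls in {deps}")
```

A check like this could run in CI to flag an application that starts depending on winnow by accident, though whether that is worth the effort depends on how the repository ends up being organized.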
Goals
Improve monorepo organization so that:
There is a simple and well-documented way to share code between different applications without code duplication or unnecessary dependencies.
There is a straightforward way to build a PyPI package and a Docker container for each application or library if needed.
There is a straightforward and convenient way for newcomers to set up a development environment and start developing.
The most common ML features are extracted into a separate library so that they can be easily reused in different environments.
Poetry's support for monorepositories is not complete yet, but it is being actively discussed (see the corresponding feature request python-poetry/poetry#936). Poetry does seem to support some monorepo features already: it allows mixing versioned and editable local path dependencies (see the corresponding pypa/packaging.python.org#506 (comment)). I've tested this approach and it seems to work well: all projects/libs use editable installs from the current codebase, while build artifacts have versioned dependencies. So this is good news.
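The two views of the same dependency can be pictured with a short Python sketch. It operates on plain dicts shaped like a parsed dependencies table from pyproject.toml and swaps local path dependencies for version constraints. This is not how poetry works internally, and the helper name, paths, version numbers and the click entry are all assumptions made up for illustration; only task_queue and db come from the repository.

```python
from typing import Dict, Union

# A dependency is either a version constraint string or a table such as
# {"path": "../db", "develop": True} (the shape poetry uses for path deps).
DependencySpec = Union[str, Dict[str, object]]


def release_dependencies(
    dependencies: Dict[str, DependencySpec],
    local_versions: Dict[str, str],
) -> Dict[str, DependencySpec]:
    """Return a copy where each local path dependency becomes a version constraint.

    Hypothetical helper: it only illustrates the "editable in development,
    versioned in build artifacts" idea; the input is left untouched.
    """
    released: Dict[str, DependencySpec] = {}
    for name, spec in dependencies.items():
        if isinstance(spec, dict) and "path" in spec:
            released[name] = f"^{local_versions[name]}"
        else:
            released[name] = spec
    return released


# Development view of a hypothetical app: local libraries are referenced by
# path and installed in editable mode. Paths and versions are assumed.
dev_view: Dict[str, DependencySpec] = {
    "python": "^3.8",
    "task_queue": {"path": "../task_queue", "develop": True},
    "db": {"path": "../db", "develop": True},
    "click": "^7.1",
}
local_versions = {"task_queue": "0.1.0", "db": "0.2.1"}

release_view = release_dependencies(dev_view, local_versions)

if __name__ == "__main__":
    for name, spec in release_view.items():
        print(f"{name} = {spec!r}")
```

Keeping the development view unchanged and deriving the release view from it means there is a single place where each dependency is declared, which is the property the tested poetry setup seems to provide.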
Remaining Challenges
Investigate how to manage conda dependencies in ML-related packages. Some of the projects (e.g. server) share some logic with the dedup-app but don't need ML dependencies or conda at all, so they could rely only on poetry and Python's standards. On the other hand, for ML-related projects (e.g. dedup-app) it is nice to have conda packages, as they come pre-compiled and all necessary .so libraries come with the conda installation out of the box. We need to figure out how to resolve this contradiction: either some of the poetry projects need to depend on conda projects, or some of the conda projects need to depend on poetry projects, or one of the dependency management systems should be dropped in favor of the other.
Some Related Links
PEP 518 and PEP 517 - related standards: PEP 518 introduces pyproject.toml, and PEP 517 defines the build-backend interface
Possible solution:
We can consider the approach described in https://medium.com/opendoor-labs/our-python-monorepo-d34028f2b6fa
A working example can be found here: https://github.com/ya-mori/python-monorepo
The difficult part is that the ML stuff uses the conda dependency manager.