Organize mono-repo & extract ML core to separate PyPI-distributable library. #295

stepan-anokhin · 2021-02-22T12:18:00Z

Problem

Currently the repository contains multiple applications that share some logic but generally have different dependencies:

Deduplication app
REST API server
repo-admin cli tool
just cli tool

The API server and repo-admin require some of the dependencies from winnow, but not all of them. Some of the reusable parts are extracted into packages placed at the repository root (e.g. task_queue, db). There are also a lot of files at the root that relate only to the deduplication app and not to the rest of the applications.

Problems:

No standard way to share logic between different applications and manage dependencies
No standard way to build distributable python packages for different apps.
No clear repository structure. It will get worse as we add more and more complexity. New developers will spend a lot of time figuring out where things live in the repository and how to resolve dependencies.
We tend to place everything into the large winnow package. This prevents us from having small reusable libraries that could be used by other projects without depending on our entire infrastructure.
We tend to duplicate some logic. E.g. repo-admin needs to be tiny, PyPI-distributable and independent from winnow, but it needs some logic from just, which depends on winnow. As a result, some of the logic from just is duplicated in repo-admin.
As we don't isolate reusable pieces of logic from each other, it becomes hard to adapt our monolithic architecture to new use cases (like investigative journalism).

As a result, our monorepo is getting disorganized, and the problems above will get worse as we add more complexity.

Goals

Improve monorepo organization so that:

There is a simple and well-documented way to share code between different applications without code duplication or unnecessary dependencies.
There is a straightforward way to build a PyPI package and a Docker container for each application or library if needed.
There is a straightforward and convenient way for newcomers to set up a development environment and start developing.
The most common ML features are extracted into a separate library so that they can be easily reused in different environments.

Possible solution:

We can consider the approach described in https://medium.com/opendoor-labs/our-python-monorepo-d34028f2b6fa
A working example can be found here: https://github.com/ya-mori/python-monorepo

The difficult part is that the ML code uses the conda dependency manager.


stepan-anokhin · 2021-02-28T15:35:01Z

Findings

poetry's support for monorepos is not complete yet, but it is being actively discussed at the moment (see the corresponding feature request python-poetry/poetry#936). Poetry does seem to support some monorepo features, though: it allows mixing versioned and editable local path dependencies (see the corresponding pypa/packaging.python.org#506 (comment)). I've tested this approach and it seems to work well: all projects/libs use editable installs from the current codebase, while build artifacts have versioned dependencies. So this is good news.
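
To make the editable-vs-versioned distinction concrete, here is a minimal Python sketch of the idea. It is not poetry's actual implementation: it only illustrates that local path dependencies can be used as-is during development and swapped for version constraints when the metadata of a build artifact is produced. The function name `pin_local_dependencies`, the flask entry and all version numbers are hypothetical.

```python
from copy import deepcopy


def pin_local_dependencies(dependencies, local_versions):
    """Return a copy of `dependencies` with path deps replaced by version constraints.

    dependencies: mapping of package name -> version string or poetry-like dict spec.
    local_versions: mapping of local package name -> its current version.
    """
    pinned = deepcopy(dependencies)
    for name, spec in dependencies.items():
        # Only local path dependencies are rewritten; everything else is kept as-is.
        if isinstance(spec, dict) and "path" in spec:
            if name not in local_versions:
                raise KeyError(f"no known version for local package {name!r}")
            pinned[name] = f"^{local_versions[name]}"
    return pinned


# Hypothetical development-time dependency table for the API server:
# shared packages are referenced by path and installed in editable mode.
dev_dependencies = {
    "python": "^3.8",
    "flask": "^1.1",
    "db": {"path": "../db", "develop": True},
    "task_queue": {"path": "../task_queue", "develop": True},
}

# Hypothetical current versions of the local packages.
local_versions = {"db": "0.1.0", "task_queue": "0.2.3"}

build_dependencies = pin_local_dependencies(dev_dependencies, local_versions)
print(build_dependencies)
```

The development table is left untouched, so the same source of truth can serve both the editable install and the versioned artifact.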

Remaining Challenges

Investigate how to manage conda dependencies in ML-related packages. Some of the projects (e.g. server) share some logic with the dedup-app but don't need ML dependencies or conda at all, so they could rely only on poetry and Python's standards. At the same time, for ML-related projects (e.g. dedup-app) it is nice to have conda packages, as they come pre-compiled and all the necessary .so libraries come with the conda installation out of the box. We need to figure out how to resolve this contradiction: either some of the poetry projects need to depend on conda projects, or some of the conda projects need to depend on poetry projects, or one of the dependency management systems should be dropped in favor of the other.

Some Related Links

PEP 518 and PEP 517 - related standards, introduce pyproject.toml
What the heck is pyproject.toml? - some discussion of pyproject.toml
poetry2conda - utility to convert Python's standard pyproject.toml to a conda environment.yaml

stepan-anokhin · 2021-03-31T14:20:15Z

Possible solution:

Use poetry for all projects except for the dedup application itself (pipeline)
Use the conda develop command to install non-conda packages when working with the dedup app.

Rationale:
We already do a similar thing when we place the db package at the repository root.
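
As far as I understand, conda develop works by adding the package's source directory to a conda.pth file in the environment's site-packages, so the package becomes importable without being built or copied. The sketch below reproduces that mechanism in a temporary directory using only the standard library. The package name `shared_lib` and the directory layout are hypothetical, and `site.addsitedir` stands in for what Python does with real site-packages at startup.

```python
import importlib
import site
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Stand-in for a non-conda package living somewhere in the monorepo.
    source_dir = root / "libs"
    package_dir = source_dir / "shared_lib"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("GREETING = 'hello from shared_lib'\n")

    # Stand-in for the environment's site-packages directory.
    site_dir = root / "site-packages"
    site_dir.mkdir()

    # A .pth file lists extra directories to put on sys.path, one per line.
    (site_dir / "conda.pth").write_text(f"{source_dir}\n")

    # Process the .pth files found in site_dir and extend sys.path accordingly.
    site.addsitedir(str(site_dir))
    importlib.invalidate_caches()

    # The package is now importable straight from its source directory.
    shared_lib = importlib.import_module("shared_lib")
    greeting = shared_lib.GREETING
    print(greeting)
```

Because the import resolves to the source tree itself, edits to the shared package are visible to the dedup app without reinstalling, which is the same effect we get today from keeping db at the repository root.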

Links:

stepan-anokhin self-assigned this Feb 22, 2021

johnhbenetech added the tech debt label Feb 23, 2021

johnhbenetech mentioned this issue Apr 1, 2021

Code structure and repository refactor #268

Closed
