Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure Html2Parquet with its own dpk_ namespace #809

Draft
wants to merge 20 commits into
base: dev
Choose a base branch
from

Conversation

touma-I
Copy link
Collaborator

@touma-I touma-I commented Nov 17, 2024

Why are these changes needed?

This is a first of a series of restructuring changes that are done to have each transform built as its own module (e.g. dpk_html2parquet) with a ray submodule (dpk_html2parquet.ray ).

  • Removed python and ray folders and keep Dockerfile.python and Dockerfile.ray
  • remove pyproject.toml
  • move python code under dpk_html2parquet
  • move ray code under dpk_html2parquet/ray
  • change import statement to include module name
  • replace recursive Makefile and use targets from .make.cicd.targets
  • adapt kfp_ray/Makefile and other make target

Related issue number (if any).

#774

Signed-off-by: Maroun Touma <[email protected]>
.make.defaults Show resolved Hide resolved
.make.defaults Show resolved Hide resolved
.make.defaults Show resolved Hide resolved
.make.defaults Show resolved Hide resolved
transforms/.make.cicd.targets Show resolved Hide resolved
transforms/.make.cicd.targets Show resolved Hide resolved
transforms/.make.cicd.targets Show resolved Hide resolved
build:: image

publish:
@if [ -e Dockerfile.python ]; then \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe split this into publish-python/ray/spark.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daw3rd When is publish being used. Do you know ?

Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This transform needs a README.md

.make.defaults Outdated Show resolved Hide resolved
@@ -236,3 +236,35 @@ def apply_input_params(self, args: Namespace) -> bool:
self.params = self.params | captured
logger.info(f"html2parquet parameters are : {self.params}")
return True


class Html2Parquet(Html2ParquetTransform):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class needs some documentation here and in a README.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BIG TIME :-). working on it. Thanks

transforms/language/html2parquet/Makefile Outdated Show resolved Hide resolved
Html2ParquetTransform,
Html2ParquetTransformConfiguration,
)
from dpk_html2parquet.transform import Html2ParquetTransform, Html2ParquetTransformConfiguration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to run pre-commit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daw3rd can you elaborate ? what is pre-commit ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file can form the basis of a new readme in ../

Signed-off-by: Maroun Touma <[email protected]>
Copy link
Member

@roytman roytman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are doing so massive refactoring, should we combine test and test-data into one dir, e.g.

  • test
    • data
      • input
      • expected

@roytman
Copy link
Member

roytman commented Nov 18, 2024

if each transformer builds its own module, should we add init.py files and create a unified namespace?
For example: instead
from dpk_html2parquet.transform import Html2ParquetTransformConfiguration
use
from dpk_html2parquet import Html2ParquetTransformConfiguration

the same for the ray runtime.

@touma-I
Copy link
Collaborator Author

touma-I commented Nov 18, 2024

If we are doing so massive refactoring, should we combine test and test-data into one dir, e.g.

  • test

    • data

      • input
      • expected

@roytman Why not leave it to the transform owner developer to decide if they want to nest the test-data under test. All we care about that we have a test folder for running the pytest. no ? where the developer puts their data is up to them. no ?

Signed-off-by: Maroun Touma <[email protected]>
@touma-I
Copy link
Collaborator Author

touma-I commented Nov 18, 2024

if each transformer builds its own module, should we add init.py files and create a unified namespace? For example: instead from dpk_html2parquet.transform import Html2ParquetTransformConfiguration use from dpk_html2parquet import Html2ParquetTransformConfiguration

the same for the ray runtime.

How did I miss that? Done. Thanks @roytman

Signed-off-by: Maroun Touma <[email protected]>
@touma-I touma-I marked this pull request as ready for review November 19, 2024 01:44
@touma-I touma-I marked this pull request as draft November 19, 2024 12:46
TRANSFORM_RAY_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).ray.transform
TRANSFORM_PYTHON_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).spark.transform
## Default setting for TRANSFORM_RUNTIME entry point:
# python -m dpk_html2parquet.ray.transform --help
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need the dpk prefix for each transformer? If you want to prevent possible name conflicts, what about dpk namespace? e.g. dpk.html2parquet.ray.transform

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roytman : @dolfim-ibm made a passionate argument for this. I am good either way. @dolfim-ibm @daw3rd Can yo guys weigh in ? thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my argument was about having something, i.e. either prefix or namespace. I think it is very important to have a library which doesn't provide lots of global names which might overlap with others.
it also brings a bit of "branding" and makes it clearer to users that all these transforms come from the same place, i.e. data-prep-kit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why not dpk.html2parquet.ray.transform ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a suggestion; I'm OK with any decision.

],
"source": [
"from dpk_html2parquet.transform_python import Html2ParquetRuntime\n",
"x=Html2ParquetRuntime(input_folder= \"input\", \n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"input/ai-alliance-index.html"

"source": [
"from dpk_html2parquet.transform_python import Html2ParquetRuntime\n",
"x=Html2ParquetRuntime(input_folder= \"input\", \n",
" output_folder= \"output\", \n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an output folder for this to work?

"from dpk_html2parquet.transform_python import Html2ParquetRuntime\n",
"x=Html2ParquetRuntime(input_folder= \"input\", \n",
" output_folder= \"output\", \n",
" data_files_to_use=['.html']).transform()"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know input file is just one html file but ['.zip', '.html'] in case user wants to test html zip in the notebook.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or add a comment something like: '.zip' in case your input file is zip of html files.

Copy link
Collaborator

@sungeunan-ibm sungeunan-ibm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed all the changes, and they all make sense to me. I left some comments. Thank you for the significant effort you put into restructuring the code! @touma-I

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants