Restructure Html2Parquet with its own dpk_ namespace #809
base: dev
Signed-off-by: Maroun Touma <[email protected]>
…o longer valid as it is based on folder name Signed-off-by: Maroun Touma <[email protected]>
build:: image

publish:
	@if [ -e Dockerfile.python ]; then \
maybe split this into publish-python/ray/spark.
@daw3rd When is publish being used? Do you know?
This transform needs a README.md
@@ -236,3 +236,35 @@ def apply_input_params(self, args: Namespace) -> bool:
        self.params = self.params | captured
        logger.info(f"html2parquet parameters are : {self.params}")
        return True


class Html2Parquet(Html2ParquetTransform):
This class needs some documentation here and in a README.
BIG TIME :-). Working on it. Thanks.
    Html2ParquetTransform,
    Html2ParquetTransformConfiguration,
)
from dpk_html2parquet.transform import Html2ParquetTransform, Html2ParquetTransformConfiguration
You need to run pre-commit.
@daw3rd Can you elaborate? What is pre-commit?
This file can form the basis of a new readme in ../
If we are doing such a massive refactoring, should we combine test and test-data into one dir, e.g.
- test
- data
- input
- expected
- data
If each transformer builds its own module, should we add __init__.py files and create a unified namespace? The same goes for the ray runtime.
@roytman Why not leave it to the transform owner/developer to decide whether they want to nest the test-data under test? All we care about is that we have a test folder for running pytest, no? Where the developer puts their data is up to them, no?
How did I miss that? Done. Thanks @roytman
TRANSFORM_RAY_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).ray.transform
TRANSFORM_PYTHON_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).spark.transform
## Default setting for TRANSFORM_RUNTIME entry point:
# python -m dpk_html2parquet.ray.transform --help
Do we really need the dpk prefix for each transformer? If you want to prevent possible name conflicts, what about a dpk namespace, e.g. dpk.html2parquet.ray.transform?
@roytman: @dolfim-ibm made a passionate argument for this. I am good either way. @dolfim-ibm @daw3rd Can you guys weigh in? Thanks.
My argument was about having something, i.e. either a prefix or a namespace. I think it is very important that a library does not expose lots of global names which might overlap with others.
It also brings a bit of "branding" and makes it clearer to users that all these transforms come from the same place, i.e. data-prep-kit.
So why not dpk.html2parquet.ray.transform?
It's just a suggestion; I'm OK with any decision.
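To make the two options in this thread concrete, here is a minimal sketch that builds both layouts in a temporary directory and imports each one. Only the dotted names dpk_html2parquet.ray.transform and dpk.html2parquet.ray.transform come from the discussion; the make_package helper and the NAME attribute are hypothetical and exist only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package(root: Path, dotted: str) -> None:
    """Create a chain of regular packages ending in a module (hypothetical helper)."""
    *packages, module = dotted.split(".")
    current = root
    for name in packages:
        current = current / name
        current.mkdir(exist_ok=True)
        (current / "__init__.py").touch()
    # The module only records its own dotted name so the import can be checked.
    (current / f"{module}.py").write_text(f"NAME = {dotted!r}\n")


root = Path(tempfile.mkdtemp())
make_package(root, "dpk_html2parquet.ray.transform")  # prefix style (this PR)
make_package(root, "dpk.html2parquet.ray.transform")  # namespace style (suggested)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

prefixed = importlib.import_module("dpk_html2parquet.ray.transform")
namespaced = importlib.import_module("dpk.html2parquet.ray.transform")
print(prefixed.NAME)
print(namespaced.NAME)
```

Both styles avoid a bare top-level html2parquet name. One practical difference: if every transform ships as its own distribution, a shared dpk top level would have to be a namespace package (no __init__.py in dpk/, per PEP 420), whereas the dpk_ prefix gives each transform an independent top-level package.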
],
"source": [
"from dpk_html2parquet.transform_python import Html2ParquetRuntime\n",
"x=Html2ParquetRuntime(input_folder= \"input\", \n",
"input/ai-alliance-index.html"
"source": [
"from dpk_html2parquet.transform_python import Html2ParquetRuntime\n",
"x=Html2ParquetRuntime(input_folder= \"input\", \n",
" output_folder= \"output\", \n",
Do we need an output folder for this to work?
"from dpk_html2parquet.transform_python import Html2ParquetRuntime\n",
"x=Html2ParquetRuntime(input_folder= \"input\", \n",
" output_folder= \"output\", \n",
" data_files_to_use=['.html']).transform()"
I know the input is just one HTML file, but consider ['.zip', '.html'] in case the user wants to test a zip of HTML files in the notebook.
Or add a comment, something like: '.zip' in case your input file is a zip of HTML files.
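As a rough illustration of the suggestion above, the sketch below filters a list of file names by extension, the way an extension list such as ['.zip', '.html'] passed as data_files_to_use would presumably be applied. The select_files helper and the sample file names (other than input/ai-alliance-index.html, which is mentioned earlier in this review) are hypothetical; this is not the library's implementation.

```python
from pathlib import PurePosixPath


def select_files(paths, extensions):
    """Keep only paths whose suffix is in `extensions` (hypothetical helper)."""
    wanted = {ext.lower() for ext in extensions}
    return [p for p in paths if PurePosixPath(p).suffix.lower() in wanted]


candidates = [
    "input/ai-alliance-index.html",
    "input/pages.zip",   # hypothetical zip of HTML files
    "input/notes.txt",   # hypothetical file that neither list should pick up
]

html_only = select_files(candidates, [".html"])
html_and_zip = select_files(candidates, [".zip", ".html"])
print(html_only)
print(html_and_zip)
```

With only '.html' in the list, a zipped input would be silently ignored, which is the case the reviewer wants the notebook to either cover or call out in a comment.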
I've reviewed all the changes, and they all make sense to me. I left some comments. Thank you for the significant effort you put into restructuring the code! @touma-I
Why are these changes needed?
This is the first in a series of restructuring changes intended to have each transform built as its own module (e.g. dpk_html2parquet) with a ray submodule (dpk_html2parquet.ray).
Related issue number (if any).
#774