WP-7630 #7
* allow search_prefix to be None * handle both list and string for key_properties and date_overrides * pylint
* Bump to v1.2.2 * Changelog
* TDL-13258:Added integration tests and resources to tap-s3-csv from tap-tester * Add context and triggers to circleci config * Run nosetests on the correct folder * Remove nose tests because there are no unit tests * Fix test properties * TDL-13258:Updated non_rectangular_files test case in types_and_data * Combine related tests into one Co-authored-by: Savan Chovatiya <[email protected]> Co-authored-by: Collin Simon <[email protected]>
* TDL-12589: Added the support of JSONL files * TDL-12589: Formated code * TDL-12589: test updated * TDL-12589: Updated config.yml to expect failures * TDL-12589: added stitch api tocken * TDL-12589: Updated config and conversion of datatype * TDL-12589: Updated priority of datatype like: list date-time dict integer number null - default in evenryone string - default in evenryone * TDL-12589: Updated as per priority * TDL-12589: removed pylint failures * TDL-12589: replaced * TDL-12589: Added warning message for list inside list * TDL-12589: Optimized code * TDL-12589: Removed white space * TDL-12589: Skipping row of JOSNL file if it is empty instaid of raising error. * TDL: Rmoved extra white space * TDL-12589: Updated test files * TDL-12589: Updated code as per review comments changes * TDL-12589: Added Unittests for the same * TDL-12589: Pylint error resolved * TDL-12589: Changed remove fields log from info to debug * TDL-12589: Updated conversion code to support + sign Co-authored-by: dbshah1212 <[email protected]>
symon-ai#30) * TDL-12464: Added support for handling the duplicate headers in the CSV file * Changed warning message * Updated unit tests according to the warning message * TDL-12464: Adding code to leverage duplicate headers support provided in simger-encoding library * TDL-12464: Removed the unwanted code and made compatible with master repo * TDL-12464: Upgraded singer-encodings library to fetch the latest version * TDL-12464: Changing the data type of 'sdc_extra' key in the event * TDL-12464: Updating test cases as per the code optimization * TDL-12464: Updating version of singer-encoding library * TDL-12464: Updating version of singer-python and backoff modules Co-authored-by: Karan Panchal (C) <[email protected]> Co-authored-by: harshpatel4_crest <[email protected]>
* TDL-12486: Added support of compressed files * TDL-12486: Updated singer encoding dependency * TDL-12486: Added more doc strings. * TDL-12486: Upgraded dependencies changed the logic of taking samples from zip * TDL-12486: Increase coverage to test compressed files * TDL-12486: Upgraded the singer-encoding version to 0.1.0 * TDL-12486: Removed trailing-whitespace * TDL-12486: Updated test case of S3AllFilesSupport * TDL-12486: Removed comman self.conn_id * TDL-12486: Changes reverted. * TDL-12486: Changed start date format * TDL-12486: Updated date format in test_All_supported_files. * TDL-12486: Change in logger messages Co-authored-by: dbshah1212 <[email protected]>
* TDL-12589: Changed sdc_extra log from debug to warn * TDL-12589: Changed message to sync with csv message * TDL-12589: Updated message Co-authored-by: dbshah1212 <[email protected]>
…ymon-ai#35) * Strictly enforce the ordering of type checking for integer vs number
* TDL-14068:fixed key-error exception * Added unit test cases and integration tests * Running one integration test for debugging * Debugging integration test case * Updated integration test * Updated integration test expected output * Updated config.yml for running all integration test again
* TLD-14038: Skipping the .gz which gzip using --no-name * TDL-14038: Added final count of total skipped files for discover mode and sync mode * tdl-14038: Updated warning message and added unit test for the same * TDL-14038: Removed global variable and added integration test * TDL-14038: Updated comments * TDL-14038: Added blank line * TDL-14038: Removed: trailing-whitespace * TDL-14038: Added comment of pylint disable * TDL-14038: Updated pylint comment * TDL-14038: Updated the test file class name * TDL-14038: Removed self file call and added global. * TDL: Remove warning message for 0 file skipped * TDL-14038: Removed trailing white space * TDL-14068: Fixed key error exception. * TDL-14038: Reverted another bug changes * TDL-14038: updated skipped_files_count * TDL-14038: Updated message, comments and counts * TDL-14038: Removed trailing-whitespace * TDL-14038: Updated unit test cases * TDL-14038: Updated sync file code. * Resolved: use-maxsplit-arg * Refactor how we handle nameless files * Fix comment placement * Mention tar as a problem too * Make pylint happy Co-authored-by: dbshah1212 <[email protected]> Co-authored-by: Andy Lu <[email protected]>
* Bump to v1.3.2, update changelog * Update changelog
…s found for sampling. (symon-ai#40) * Updated sampled schema when no samples found * Running one integration test for debugging * Debugging integration test * Debugging integration test * Updated integration test for catalog_with_empty_properties * Running all integration test again
* fix: Handled Unicode and JsonDecoder Error for wrong extention file. * fix: Updated sync code and test case * Fix: Handled StopIteration error for empty csv file. * fix: Added unit test of StopIteration code handling * fix: Resolved pylint errors * Fix: removed trailing white space * fix: disabled use-maxsplit-arg as we haven't change the code as part of this branch * fix: Removed exception and added Warning for empty Jsonl file. * fix: Handled pylint error * fix: Skipping records with empty json * fix: Added unit tests and integration tests for empty json jsonl file. * fix: Skipping Empty Josn whily syncing as well * Skipping empty lines of CSV in sampling and sync * fix: Upgraded latest version of singer-encoding. * fix: Added some test files * fix: Removed unused variable declaration * fix: Added UnicodeDecodeError and JSONDecodeError handling scenario in comment. * fix: Final touch * Update spell mistake * Corrected typo * Updated warning messages and empty jsonl file in skip count * fix: Put warning of skipping empty jsonl files. * fix: Updated comment Co-authored-by: dbshah1212 <[email protected]> Co-authored-by: savan-chovatiya <[email protected]> Co-authored-by: Kyle Allan <[email protected]>
* Bump to version 1.3.4 * Bump to version 1.3.4 * Bump to version 1.3.4 * Bump to version 1.3.4 * Bump to version 1.3.4 Co-authored-by: KrishnanG <[email protected]>
CHANGELOG.md
Outdated
## 1.3.5

- Reintroduce ability to assume role for external AWS account
- Add optional parameter `recursive_search` to table config. When set to false, it prevents searching for files in subfolders within the S3 bucket
For WP-7630
@@ -11,6 +11,7 @@
 LOGGER = singer.get_logger()

 REQUIRED_CONFIG_KEYS = ["bucket"]
+REQUIRED_CONFIG_KEYS_EXTERNAL_SOURCE = ["bucket", "account_id", "external_id", "role_name"]
For WP-7630
@@ -7,5 +7,6 @@
 Optional('search_prefix'): str,
 Optional('date_overrides'): [str],
 Optional('delimiter'): str,
-Optional('escape_char'): str
+Optional('escape_char'): str,
+Optional('recursive_search'): bool
For WP-7630
@@ -159,12 +379,13 @@ def get_input_files_for_table(config, table_spec, modified_since=None):
     matched_files_count = 0
     unmatched_files_count = 0
     max_files_before_log = 30000
-    for s3_object in list_files_in_bucket(bucket, table_spec.get('search_prefix')):
+    for s3_object in list_files_in_bucket(bucket, table_spec.get('search_prefix'), table_spec.get('recursive_search')):
For WP-7630
tap_s3_csv/s3.py
Outdated
if not recursive_search:
    if search_prefix is not None and not search_prefix.endswith('/'):
        search_prefix += '/'
    args['Delimiter'] = '/'  # This will limit results to the exact folder specified by the prefix, without going into subfolders
For WP-7630. At this time, the S3 connector is intended to import a single file at a time, but the tap attempts to get every file that matches a regex pattern. So if multiple files have the same name in a folder structure, they would all be retrieved. Setting the Delimiter arg limits the file search to the specified folder level, so subfolders are not searched. This prevents unintentional retrieval of files in subfolders that have the same name as the requested file.
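To make the effect of the Delimiter arg concrete, here is a minimal sketch of how the list request arguments could be built. The helper name `build_list_args` is hypothetical and is not the function in `tap_s3_csv/s3.py`; `Bucket`, `Prefix` and `Delimiter` are the standard S3 ListObjectsV2 parameters. The sketch assumes that an unset `recursive_search` (None) should keep the existing recursive behaviour, and that a blank prefix means the bucket root.

```python
def build_list_args(bucket, search_prefix=None, recursive_search=None):
    """Build keyword arguments for an S3 ListObjectsV2 request.

    Hypothetical helper for illustration only.
    """
    args = {'Bucket': bucket}

    # Assumption: only an explicit False disables recursion, so table
    # configs that omit the key keep searching subfolders as before.
    if recursive_search is False:
        # Normalise a non-empty prefix so it names a folder exactly.
        # A blank or missing prefix is left alone (bucket root).
        if search_prefix and not search_prefix.endswith('/'):
            search_prefix += '/'
        # With a delimiter, keys below the next '/' are grouped into
        # CommonPrefixes instead of being returned as objects.
        args['Delimiter'] = '/'

    if search_prefix:
        args['Prefix'] = search_prefix

    return args
```

The resulting dict could then be passed to a boto3 S3 client, for example `client.list_objects_v2(**build_list_args('my-bucket', 'exports', False))`, where the bucket and prefix are placeholders.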
questions
* Fix/config parsing (#21) * allow search_prefix to be None * handle both list and string for key_properties and date_overrides * pylint * Bump to v1.2.2 (#22) * Bump to v1.2.2 * Changelog * Check if search_prefix is present before popping (#23) * Bump to v1.2.3 (#24) * TDL-13258 move tests from tap-tester to tap-s3-csv (#29) * TDL-13258:Added integration tests and resources to tap-s3-csv from tap-tester * Add context and triggers to circleci config * Run nosetests on the correct folder * Remove nose tests because there are no unit tests * Fix test properties * TDL-13258:Updated non_rectangular_files test case in types_and_data * Combine related tests into one Co-authored-by: Savan Chovatiya <[email protected]> Co-authored-by: Collin Simon <[email protected]> * TDL-12589: Added the support of JSONL files (#31) * TDL-12589: Added the support of JSONL files * TDL-12589: Formated code * TDL-12589: test updated * TDL-12589: Updated config.yml to expect failures * TDL-12589: added stitch api tocken * TDL-12589: Updated config and conversion of datatype * TDL-12589: Updated priority of datatype like: list date-time dict integer number null - default in evenryone string - default in evenryone * TDL-12589: Updated as per priority * TDL-12589: removed pylint failures * TDL-12589: replaced * TDL-12589: Added warning message for list inside list * TDL-12589: Optimized code * TDL-12589: Removed white space * TDL-12589: Skipping row of JOSNL file if it is empty instaid of raising error. 
* TDL: Rmoved extra white space * TDL-12589: Updated test files * TDL-12589: Updated code as per review comments changes * TDL-12589: Added Unittests for the same * TDL-12589: Pylint error resolved * TDL-12589: Changed remove fields log from info to debug * TDL-12589: Updated conversion code to support + sign Co-authored-by: dbshah1212 <[email protected]> * TDL-12464: Added support for handling the duplicate headers in the CS… (#30) * TDL-12464: Added support for handling the duplicate headers in the CSV file * Changed warning message * Updated unit tests according to the warning message * TDL-12464: Adding code to leverage duplicate headers support provided in simger-encoding library * TDL-12464: Removed the unwanted code and made compatible with master repo * TDL-12464: Upgraded singer-encodings library to fetch the latest version * TDL-12464: Changing the data type of 'sdc_extra' key in the event * TDL-12464: Updating test cases as per the code optimization * TDL-12464: Updating version of singer-encoding library * TDL-12464: Updating version of singer-python and backoff modules Co-authored-by: Karan Panchal (C) <[email protected]> Co-authored-by: harshpatel4_crest <[email protected]> * TDL-12486: Added support of compressed files (#32) * TDL-12486: Added support of compressed files * TDL-12486: Updated singer encoding dependency * TDL-12486: Added more doc strings. * TDL-12486: Upgraded dependencies changed the logic of taking samples from zip * TDL-12486: Increase coverage to test compressed files * TDL-12486: Upgraded the singer-encoding version to 0.1.0 * TDL-12486: Removed trailing-whitespace * TDL-12486: Updated test case of S3AllFilesSupport * TDL-12486: Removed comman self.conn_id * TDL-12486: Changes reverted. * TDL-12486: Changed start date format * TDL-12486: Updated date format in test_All_supported_files. 
* TDL-12486: Change in logger messages Co-authored-by: dbshah1212 <[email protected]> * Tdl 12589 change sdc extra logs from debug to warn (#33) * TDL-12589: Changed sdc_extra log from debug to warn * TDL-12589: Changed message to sync with csv message * TDL-12589: Updated message Co-authored-by: dbshah1212 <[email protected]> * version bump to 1.3.0 (#34) * Strictly enforce the ordering of type checking for integer vs number (#35) * Strictly enforce the ordering of type checking for integer vs number * Bump to v1.3.1 (#36) * TDL-14068:fixed key-error exception (#38) * TDL-14068:fixed key-error exception * Added unit test cases and integration tests * Running one integration test for debugging * Debugging integration test case * Updated integration test * Updated integration test expected output * Updated config.yml for running all integration test again * Fix/tdl 14038 filename issue (#37) * TLD-14038: Skipping the .gz which gzip using --no-name * TDL-14038: Added final count of total skipped files for discover mode and sync mode * tdl-14038: Updated warning message and added unit test for the same * TDL-14038: Removed global variable and added integration test * TDL-14038: Updated comments * TDL-14038: Added blank line * TDL-14038: Removed: trailing-whitespace * TDL-14038: Added comment of pylint disable * TDL-14038: Updated pylint comment * TDL-14038: Updated the test file class name * TDL-14038: Removed self file call and added global. * TDL: Remove warning message for 0 file skipped * TDL-14038: Removed trailing white space * TDL-14068: Fixed key error exception. * TDL-14038: Reverted another bug changes * TDL-14038: updated skipped_files_count * TDL-14038: Updated message, comments and counts * TDL-14038: Removed trailing-whitespace * TDL-14038: Updated unit test cases * TDL-14038: Updated sync file code. 
* Resolved: use-maxsplit-arg * Refactor how we handle nameless files * Fix comment placement * Mention tar as a problem too * Make pylint happy Co-authored-by: dbshah1212 <[email protected]> Co-authored-by: Andy Lu <[email protected]> * Bump to v1.3.2, update changelog (#39) * Bump to v1.3.2, update changelog * Update changelog * bump singer-encodings 0.1.1 (#41) * bump 1.3.3 (#42) * TDL-14228: Generate catalog file with the properties key if no samples found for sampling. (#40) * Updated sampled schema when no samples found * Running one integration test for debugging * Debugging integration test * Debugging integration test * Updated integration test for catalog_with_empty_properties * Running all integration test again * Fix/wrong file extention error handling (#43) * fix: Handled Unicode and JsonDecoder Error for wrong extention file. * fix: Updated sync code and test case * Fix: Handled StopIteration error for empty csv file. * fix: Added unit test of StopIteration code handling * fix: Resolved pylint errors * Fix: removed trailing white space * fix: disabled use-maxsplit-arg as we haven't change the code as part of this branch * fix: Removed exception and added Warning for empty Jsonl file. * fix: Handled pylint error * fix: Skipping records with empty json * fix: Added unit tests and integration tests for empty json jsonl file. * fix: Skipping Empty Josn whily syncing as well * Skipping empty lines of CSV in sampling and sync * fix: Upgraded latest version of singer-encoding. * fix: Added some test files * fix: Removed unused variable declaration * fix: Added UnicodeDecodeError and JSONDecodeError handling scenario in comment. * fix: Final touch * Update spell mistake * Corrected typo * Updated warning messages and empty jsonl file in skip count * fix: Put warning of skipping empty jsonl files. 
* fix: Updated comment Co-authored-by: dbshah1212 <[email protected]> Co-authored-by: savan-chovatiya <[email protected]> Co-authored-by: Kyle Allan <[email protected]> * Bump to version 1.3.4 (#45) * Bump to version 1.3.4 * Bump to version 1.3.4 * Bump to version 1.3.4 * Bump to version 1.3.4 * Bump to version 1.3.4 Co-authored-by: KrishnanG <[email protected]> * WP-7630 Reintroduce role assumption capabilities * WP-7630 Specify config for external source * WP-7630 Test * WP-7630 Undo test * WP-7630 Resolve merge issues * WP-7630 Try with setup.py file * WP-7630 Modify setup.py * WP-7630 Add recursive_search parameter * WP-7630 Fix recursive_search * WP-7630 Use appropriate version number * WP-7630 Fix recursive_search with blank prefix * WP-7630 Update readme, changelog Co-authored-by: Nick McCoy <[email protected]> Co-authored-by: cosimon <[email protected]> Co-authored-by: savan-chovatiya <[email protected]> Co-authored-by: Savan Chovatiya <[email protected]> Co-authored-by: Collin Simon <[email protected]> Co-authored-by: dbshah1212 <[email protected]> Co-authored-by: dbshah1212 <[email protected]> Co-authored-by: karanpanchal-crest <[email protected]> Co-authored-by: Karan Panchal (C) <[email protected]> Co-authored-by: harshpatel4_crest <[email protected]> Co-authored-by: Leslie VanDeMark <[email protected]> Co-authored-by: Andy Lu <[email protected]> Co-authored-by: zachharris1 <[email protected]> Co-authored-by: savan-chovatiya <[email protected]> Co-authored-by: Kyle Allan <[email protected]> Co-authored-by: KrisPersonal <[email protected]> Co-authored-by: KrishnanG <[email protected]>
https://varicent.atlassian.net/browse/WP-7630
In addition to changes made specifically for this ticket, changes from singer-io/tap-s3-csv were also pulled in to capture updates. I have marked the changes I made with comments.
WP-7630 updates were mainly to reintroduce the ability to assume a role in an external AWS account, so we can pull files from accounts other than our own.
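As a rough illustration of how the external-source config keys relate to role assumption, the sketch below validates a config and builds the arguments for an STS AssumeRole call. The helper name `build_assume_role_args` and the default session name are assumptions made for this example, not code from this PR; the ARN layout and the `RoleArn`, `RoleSessionName` and `ExternalId` parameters are the standard AWS ones.

```python
REQUIRED_CONFIG_KEYS_EXTERNAL_SOURCE = ["bucket", "account_id", "external_id", "role_name"]


def build_assume_role_args(config, session_name="tap-s3-csv"):
    """Return keyword arguments for an STS AssumeRole request.

    Hypothetical helper for illustration only.
    """
    # Fail early if any key needed for an external account is absent or empty.
    missing = [key for key in REQUIRED_CONFIG_KEYS_EXTERNAL_SOURCE
               if not config.get(key)]
    if missing:
        raise ValueError("Missing required config keys: {}".format(", ".join(missing)))

    # Standard IAM role ARN layout: arn:aws:iam::<account_id>:role/<role_name>
    role_arn = "arn:aws:iam::{}:role/{}".format(config["account_id"],
                                                config["role_name"])

    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "ExternalId": config["external_id"],
    }
```

With boto3 available, these arguments could be used as `boto3.client('sts').assume_role(**build_assume_role_args(config))`, and the temporary credentials in the response would then back the S3 client for the external bucket.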