WP-7630 (#7)
* Fix/config parsing (#21)

* allow search_prefix to be None

* handle both list and string for key_properties and date_overrides

* pylint

* Bump to v1.2.2 (#22)

* Bump to v1.2.2

* Changelog

* Check if search_prefix is present before popping (#23)

* Bump to v1.2.3 (#24)

* TDL-13258 move tests from tap-tester to tap-s3-csv (#29)

* TDL-13258:Added integration tests and resources to tap-s3-csv from tap-tester

* Add context and triggers to circleci config

* Run nosetests on the correct folder

* Remove nose tests because there are no unit tests

* Fix test properties

* TDL-13258:Updated non_rectangular_files test case in types_and_data

* Combine related tests into one

Co-authored-by: Savan Chovatiya <[email protected]>
Co-authored-by: Collin Simon <[email protected]>

* TDL-12589: Added the support of JSONL files (#31)

* TDL-12589: Added the support of JSONL files

* TDL-12589: Formatted code

* TDL-12589: test updated

* TDL-12589: Updated config.yml to expect failures

* TDL-12589: Added Stitch API token

* TDL-12589: Updated config and conversion of datatype

* TDL-12589: Updated priority of datatype like:
list
date-time
dict
integer
number
null - default in everyone
string - default in everyone

* TDL-12589: Updated as per priority

* TDL-12589: removed pylint failures

* TDL-12589: replaced

* TDL-12589: Added warning message for list inside list

* TDL-12589: Optimized code

* TDL-12589: Removed white space

* TDL-12589: Skipping row of JSONL file if it is empty instead of raising error.

* TDL: Removed extra white space

* TDL-12589: Updated test files

* TDL-12589: Updated code as per review comments changes

* TDL-12589: Added Unittests for the same

* TDL-12589: Pylint error resolved

* TDL-12589: Changed remove fields log from info to debug

* TDL-12589: Updated conversion code to support + sign

Co-authored-by: dbshah1212 <[email protected]>

* TDL-12464: Added support for handling the duplicate headers in the CS… (#30)

* TDL-12464: Added support for handling the duplicate headers in the CSV file

* Changed warning message

* Updated unit tests according to the warning message

* TDL-12464: Adding code to leverage duplicate headers support provided in singer-encodings library

* TDL-12464: Removed the unwanted code and made compatible with master repo

* TDL-12464: Upgraded singer-encodings library to fetch the latest version

* TDL-12464: Changing the data type of 'sdc_extra' key in the event

* TDL-12464: Updating test cases as per the code optimization

* TDL-12464: Updating version of singer-encoding library

* TDL-12464: Updating version of singer-python and backoff modules

Co-authored-by: Karan Panchal (C) <[email protected]>
Co-authored-by: harshpatel4_crest <[email protected]>

* TDL-12486: Added support of compressed files (#32)

* TDL-12486: Added support of compressed files

* TDL-12486: Updated singer encoding dependency

* TDL-12486: Added more doc strings.

* TDL-12486: Upgraded dependencies changed the logic of taking samples from zip

* TDL-12486: Increase coverage to test compressed files

* TDL-12486: Upgraded the singer-encoding version to 0.1.0

* TDL-12486: Removed trailing-whitespace

* TDL-12486: Updated test case of S3AllFilesSupport

* TDL-12486: Removed common self.conn_id

* TDL-12486: Changes reverted.

* TDL-12486: Changed start date format

* TDL-12486: Updated date format in test_All_supported_files.

* TDL-12486: Change in logger messages

Co-authored-by: dbshah1212 <[email protected]>

* Tdl 12589 change sdc extra logs from debug to warn (#33)

* TDL-12589: Changed sdc_extra log from debug to warn

* TDL-12589: Changed message to sync with csv message

* TDL-12589: Updated message

Co-authored-by: dbshah1212 <[email protected]>

* version bump to 1.3.0 (#34)

* Strictly enforce the ordering of type checking for integer vs number (#35)

* Strictly enforce the ordering of type checking for integer vs number

* Bump to v1.3.1 (#36)

* TDL-14068:fixed key-error exception (#38)

* TDL-14068:fixed key-error exception

* Added unit test cases and integration tests

* Running one integration test for debugging

* Debugging integration test case

* Updated integration test

* Updated integration test expected output

* Updated config.yml for running all integration test again

* Fix/tdl 14038 filename issue (#37)

* TDL-14038: Skipping .gz files that were gzipped using --no-name

* TDL-14038: Added final count of total skipped files for discover mode and sync mode

* tdl-14038: Updated warning message and added unit test for the same

* TDL-14038: Removed global variable and added integration test

* TDL-14038: Updated comments

* TDL-14038: Added blank line

* TDL-14038: Removed: trailing-whitespace

* TDL-14038: Added comment of pylint disable

* TDL-14038: Updated pylint comment

* TDL-14038: Updated the test file class name

* TDL-14038: Removed self file call and added global.

* TDL: Remove warning message for 0 file skipped

* TDL-14038: Removed trailing white space

* TDL-14068: Fixed key error exception.

* TDL-14038: Reverted another bug changes

* TDL-14038: updated skipped_files_count

* TDL-14038: Updated message, comments and counts

* TDL-14038: Removed trailing-whitespace

* TDL-14038: Updated unit test cases

* TDL-14038: Updated sync file code.

* Resolved: use-maxsplit-arg

* Refactor how we handle nameless files

* Fix comment placement

* Mention tar as a problem too

* Make pylint happy

Co-authored-by: dbshah1212 <[email protected]>
Co-authored-by: Andy Lu <[email protected]>

* Bump to v1.3.2, update changelog (#39)

* Bump to v1.3.2, update changelog

* Update changelog

* bump singer-encodings 0.1.1 (#41)

* bump 1.3.3 (#42)

* TDL-14228: Generate catalog file with the properties key if no samples found for sampling. (#40)

* Updated sampled schema when no samples found

* Running one integration test for debugging

* Debugging integration test

* Debugging integration test

* Updated integration test for catalog_with_empty_properties

* Running all integration test again

* Fix/wrong file extension error handling (#43)

* fix: Handled Unicode and JsonDecoder Error for wrong extension file.

* fix: Updated sync code and test case

* Fix: Handled StopIteration error for empty csv file.

* fix: Added unit test of StopIteration code handling

* fix: Resolved pylint errors

* Fix: removed trailing white space

* fix: disabled use-maxsplit-arg as we haven't change the code as part of this branch

* fix: Removed exception and added Warning for empty Jsonl file.

* fix: Handled pylint error

* fix: Skipping records with empty json

* fix: Added unit tests and integration tests for empty json jsonl file.

* fix: Skipping empty JSON while syncing as well

* Skipping empty lines of CSV in sampling and sync

* fix: Upgraded latest version of singer-encoding.

* fix: Added some test files

* fix: Removed unused variable declaration

* fix: Added UnicodeDecodeError and JSONDecodeError handling scenario in comment.

* fix: Final touch

* Update spell mistake

* Corrected typo

* Updated warning messages and empty jsonl file in skip count

* fix: Put warning of skipping empty jsonl files.

* fix: Updated comment

Co-authored-by: dbshah1212 <[email protected]>
Co-authored-by: savan-chovatiya <[email protected]>
Co-authored-by: Kyle Allan <[email protected]>

* Bump to version 1.3.4 (#45)

* Bump to version 1.3.4

* Bump to version 1.3.4

* Bump to version 1.3.4

* Bump to version 1.3.4

* Bump to version 1.3.4

Co-authored-by: KrishnanG <[email protected]>

* WP-7630 Reintroduce role assumption capabilities

* WP-7630 Specify config for external source

* WP-7630 Test

* WP-7630 Undo test

* WP-7630 Resolve merge issues

* WP-7630 Try with setup.py file

* WP-7630 Modify setup.py

* WP-7630 Add recursive_search parameter

* WP-7630 Fix recursive_search

* WP-7630 Use appropriate version number

* WP-7630 Fix recursive_search with blank prefix

* WP-7630 Update readme, changelog

Co-authored-by: Nick McCoy <[email protected]>
Co-authored-by: cosimon <[email protected]>
Co-authored-by: savan-chovatiya <[email protected]>
Co-authored-by: Savan Chovatiya <[email protected]>
Co-authored-by: Collin Simon <[email protected]>
Co-authored-by: dbshah1212 <[email protected]>
Co-authored-by: dbshah1212 <[email protected]>
Co-authored-by: karanpanchal-crest <[email protected]>
Co-authored-by: Karan Panchal (C) <[email protected]>
Co-authored-by: harshpatel4_crest <[email protected]>
Co-authored-by: Leslie VanDeMark <[email protected]>
Co-authored-by: Andy Lu <[email protected]>
Co-authored-by: zachharris1 <[email protected]>
Co-authored-by: savan-chovatiya <[email protected]>
Co-authored-by: Kyle Allan <[email protected]>
Co-authored-by: KrisPersonal <[email protected]>
Co-authored-by: KrishnanG <[email protected]>
18 people authored Feb 28, 2022
1 parent 7adc422 commit ebacc13
Showing 116 changed files with 9,916 additions and 311 deletions.
31 changes: 25 additions & 6 deletions .circleci/config.yml
@@ -2,7 +2,7 @@ version: 2
jobs:
build:
docker:
- image: 218546966473.dkr.ecr.us-east-1.amazonaws.com/circle-ci:tap-tester
- image: 218546966473.dkr.ecr.us-east-1.amazonaws.com/circle-ci:tap-tester-v4
steps:
- checkout
- run:
@@ -12,26 +12,45 @@ jobs:
source /usr/local/share/virtualenvs/tap-s3-csv/bin/activate
pip install .
pip install pylint
pylint tap_s3_csv -d missing-docstring,invalid-name,line-too-long,too-many-locals,too-few-public-methods,fixme,stop-iteration-return,broad-except
pylint tap_s3_csv -d duplicate-code,consider-using-f-string,logging-format-interpolation,missing-docstring,invalid-name,line-too-long,too-many-locals,too-few-public-methods,fixme,stop-iteration-return,broad-except,bare-except,unused-variable,unnecessary-comprehension,no-member,deprecated-method,protected-access
- run:
name: 'Unit Tests'
command: |
source /usr/local/share/virtualenvs/tap-s3-csv/bin/activate
pip install nose
nosetests
nosetests tests/unittests/
- add_ssh_keys
- run:
name: 'Integration Tests'
command: |
aws configure set aws_access_key_id "$AWS_ACCESS_KEY_ID"
aws configure set aws_secret_access_key "$AWS_SECRET_ACCESS_KEY"
aws s3 cp s3://com-stitchdata-dev-deployment-assets/environments/tap-tester/sandbox dev_env.sh
aws s3 cp s3://com-stitchdata-dev-deployment-assets/environments/tap-tester/tap_tester_sandbox dev_env.sh
source dev_env.sh
source /usr/local/share/virtualenvs/tap-tester/bin/activate
run-a-test --tap=tap-s3-csv \
run-test --tap=tap-s3-csv \
--target=target-stitch \
--orchestrator=stitch-orchestrator \
[email protected] \
--password=$SANDBOX_PASSWORD \
--client-id=50 \
tap_tester.suites.s3_csv
--token=$STITCH_API_TOKEN \
tests
workflows:
version: 2
commit:
jobs:
- build:
context: circleci-user
build_daily:
triggers:
- schedule:
cron: "0 14 * * *"
filters:
branches:
only:
- master
jobs:
- build:
context: circleci-user
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,27 @@
# Changelog

## 1.3.0

- Reintroduce ability to assume role for external AWS account
- Add optional parameter `recursive_search` to table config. When set to false, it prevents the tap from searching for files in subfolders within the S3 bucket

- Merge in upstream changes, below:
- Bump singer-encodings to 0.1.2 [#21](https://github.com/singer-io/singer-encodings/pull/21)
- Bump singer-encodings to 0.1.1 [#41](https://github.com/singer-io/tap-s3-csv/pull/41)
- Skip files without a name [#37](https://github.com/singer-io/tap-s3-csv/pull/37)
- Fix an issue to allow the tap to run with a catalog without schemas [#38](https://github.com/singer-io/tap-s3-csv/pull/38)
- Fixed a bug that caused `integer` values to be discovered as `number` inconsistently across Python versions [#35](https://github.com/singer-io/tap-s3-csv/pull/35)
- Adds support for Compressed files [#32](https://github.com/singer-io/tap-s3-csv/pull/32)
- Adds support for JSONL files [#31](https://github.com/singer-io/tap-s3-csv/pull/31)
- Adds support for duplicated headers in CSV files [#30](https://github.com/singer-io/tap-s3-csv/pull/30)
- Adds testing [#29](https://github.com/singer-io/tap-s3-csv/pull/29)
- Updates `backoff`, `singer-encodings`, and `singer-python` dependencies
- Updates logging messages

## 1.2.3

- Fix issue relating to search_prefix config values

## 1.0.5

- Removed Singer-specific `_sdc_` columns
30 changes: 30 additions & 0 deletions README.md
@@ -72,6 +72,36 @@ The `table` field consists of one or more objects that describe how to find file

A sample configuration is available inside [config.sample.json](config.sample.json)

### Configuration when your source file exists in an external AWS account

```
{
    "bucket": "bucket-name",
    "account_id": "111222333444",
    "role_name": "role name in external AWS account giving your AWS account permission to access their S3 bucket",
    "external_id": "external id defined in role in external AWS account giving your AWS account permission to access their S3 bucket",
    "tables": "[
        {
            "search_prefix": "exports",
            "search_pattern": "my_table\/.*\.csv",
            "table_name": "my_table",
            "key_properties": "id",
            "date_overrides": "created_at",
            "delimiter": ",",
            "escape_char": "\",
            "recursive_search": false
        }
    ]"
}
```

- **account_id**: The AWS account id of the external AWS account you are trying to get the file from
- **role_name**: The name of the role set up in the external AWS account to provide you access to their S3 bucket
- **external_id**: The external_id defined in the role to help authorize your AWS account when connecting to the external AWS account
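- **recursive_search**: Optional boolean (true/false/undefined). When true or undefined, subfolders of `search_prefix` are searched as well; when false, only the `search_prefix` folder itself is searched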
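
As a rough illustration of how the three role settings fit together, here is a minimal sketch that turns `account_id`, `role_name` and `external_id` into an STS `assume_role` call with `boto3`. The helper names (`build_role_arn`, `assume_role_kwargs`, `s3_client_for_external_account`) and the session name are hypothetical and are not taken from the tap's source; the tap's actual implementation may differ.

```python
def build_role_arn(account_id, role_name):
    """Build the IAM role ARN for a role in the external AWS account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_role_kwargs(config, session_name="tap-s3-csv"):
    """Translate tap config keys into keyword arguments for STS assume_role.

    `session_name` is an arbitrary label chosen for this sketch.
    """
    kwargs = {
        "RoleArn": build_role_arn(config["account_id"], config["role_name"]),
        "RoleSessionName": session_name,
    }
    # ExternalId is only sent when the config provides one.
    if config.get("external_id"):
        kwargs["ExternalId"] = config["external_id"]
    return kwargs


def s3_client_for_external_account(config):
    """Assume the external role and return an S3 client using its temporary credentials."""
    import boto3  # imported lazily so the helpers above work without boto3 installed

    sts = boto3.client("sts")
    creds = sts.assume_role(**assume_role_kwargs(config))["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```

Only `s3_client_for_external_account` talks to AWS; the two helpers are pure functions, so the mapping from config to request parameters can be checked without credentials.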

A note about the `recursive_search` property: By default (with `recursive_search` undefined or set to true), the tap selects files in your S3 bucket whose file names match the `search_pattern` regex in the folder you specify with `search_prefix`, and in any subfolders within that folder. If multiple files in the folder structure match the `search_pattern`, the content of all of those files is combined. For discovery, this means all columns from all files are present in the catalog that gets produced. For import, it means all columns and all rows from all files are present in the resulting output (for files that don’t include columns that are present in other selected files, the corresponding cells for those rows are blank). This behaviour can be useful if you have multiple files with the same schema and want the tap to combine their rows. It can also lead to undesired results if multiple files within the same folder structure happen to match the same `search_pattern` but aren’t intended to be related. To limit the search to exactly the folder specified with `search_prefix`, set `recursive_search` to false.
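
The selection rule described above can be sketched as a small filter over S3 object keys. This is an illustrative approximation, not the tap's actual code: the function name `select_keys` is hypothetical, and it assumes the regex is applied to the full object key and that a blank prefix means the bucket root.

```python
import re


def select_keys(keys, search_prefix, search_pattern, recursive_search=True):
    """Return the keys under `search_prefix` that match `search_pattern`.

    When `recursive_search` is False, keys located in subfolders of the
    prefix are skipped. A missing or blank prefix means the bucket root.
    """
    prefix = search_prefix or ""
    # Treat the prefix as a folder so "exports" does not match "exports_old/".
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    matcher = re.compile(search_pattern)

    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        # A "/" in the remainder means the object sits in a subfolder.
        if not recursive_search and "/" in remainder:
            continue
        if matcher.search(key):
            selected.append(key)
    return selected


keys = ["exports/a.csv", "exports/2021/b.csv", "other/c.csv", "exports/readme.txt"]
recursive = select_keys(keys, "exports", r".*\.csv$")
flat = select_keys(keys, "exports", r".*\.csv$", recursive_search=False)
```

With the sample keys, the default call keeps both CSV files under `exports/`, while `recursive_search=False` keeps only `exports/a.csv`.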

---

Copyright &copy; 2018 Stitch
214 changes: 0 additions & 214 deletions poetry.lock

This file was deleted.

23 changes: 0 additions & 23 deletions pyproject.toml

This file was deleted.

28 changes: 28 additions & 0 deletions setup.py
@@ -0,0 +1,28 @@
#!/usr/bin/env python

from setuptools import setup

setup(name='tap-s3-csv',
      version='1.3.0',
      description='Singer.io tap for extracting CSV files from S3',
      author='Stitch',
      url='https://singer.io',
      classifiers=['Programming Language :: Python :: 3 :: Only'],
      py_modules=['tap_s3_csv'],
      install_requires=[
          'backoff==1.8.0',
          'boto3==1.17.0',
          'singer-encodings==0.1.2',
          'singer-python==5.12.1',
          'voluptuous==0.10.5'
      ],
      extras_require={
          'dev': [
              'ipdb==0.11'
          ]
      },
      entry_points='''
          [console_scripts]
          tap-s3-csv=tap_s3_csv:main
      ''',
      packages=['tap_s3_csv'])