ES Search Query Collect All Response #1631

Merged
merged 4 commits into from
Oct 29, 2024

Conversation

noah-paige
Contributor

Feature or Bugfix

  • Bugfix

Detail

  • For catalog_indexer_task, ensure we collect all hits from the query response when the with_deletes option is used
    • Increase the query size to 1000 results (the default is 10)
    • Add logic to keep querying until all hits are collected when there are more results than the query size limit (i.e. > 1000)
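
As a rough illustration of the two points above, here is a minimal sketch of collecting every hit page by page. It is not the code from this PR: the `search` callable, the `sort_field` parameter, the `uri` field, and the in-memory `make_fake_search` helper are all assumptions made for the example. It relies on the `search_after` request parameter and the per-hit `sort` values that the search engine returns when a sort is specified, and it assumes the sort field is unique per document.

```python
from typing import Any, Callable, Dict, List

QUERY_SIZE = 1000  # page size named in the PR description; the default is 10


def search_all(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    query: Dict[str, Any],
    sort_field: str,
    size: int = QUERY_SIZE,
) -> List[Dict[str, Any]]:
    """Collect every hit for `query`, fetching `size` results per request."""
    all_hits: List[Dict[str, Any]] = []
    body = {"query": query, "size": size, "sort": [{sort_field: "asc"}]}
    while True:
        page = search(body).get("hits", {}).get("hits", [])
        all_hits.extend(page)
        if len(page) < size:
            # A short (or empty) page means there is nothing left to fetch.
            return all_hits
        # Resume after the sort values of the last hit on this page.
        body = {**body, "search_after": page[-1]["sort"]}


def make_fake_search(docs: List[Dict[str, Any]]):
    """Hypothetical in-memory stand-in for a search client (illustration only)."""
    calls: List[Dict[str, Any]] = []

    def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
        calls.append(body)
        field = next(iter(body["sort"][0]))
        ordered = sorted(docs, key=lambda d: d["_source"][field])
        after = body.get("search_after")
        if after is not None:
            ordered = [d for d in ordered if d["_source"][field] > after[0]]
        page = ordered[: body["size"]]
        hits = [{**d, "sort": [d["_source"][field]]} for d in page]
        return {"hits": {"hits": hits}}

    return fake_search, calls


# 25 documents with a page size of 10 need three search calls.
docs = [{"_id": str(i), "_source": {"uri": f"uri-{i:03d}"}} for i in range(25)]
fake_search, calls = make_fake_search(docs)
hits = search_all(fake_search, {"match_all": {}}, sort_field="uri", size=10)
print(len(hits), len(calls))
```

Stopping on a short page avoids a separate count request, at the cost of one extra call when the total is an exact multiple of the page size.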

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use standard, proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@noah-paige
Contributor Author

noah-paige commented Oct 10, 2024

TESTING - Tested both locally and in AWS for the following:

  • Test startReindexTask successfully runs catalog_indexer_task when withDeletes=False
  • Test startReindexTask successfully runs catalog_indexer_task when withDeletes=True
  • Test startReindexTask successfully runs catalog_indexer_task when withDeletes=True and # of objects to delete is > 10 and QUERY_SIZE is 10 (i.e. multiple search calls required to collect all responses)
  • The UI button invokes startReindexCatalog as Admin with the withDeletes switch set to either true or false

@noah-paige noah-paige self-assigned this Oct 22, 2024
@noah-paige noah-paige marked this pull request as ready for review October 23, 2024 21:43
@dlpzx dlpzx self-requested a review October 25, 2024 07:01
Contributor

@dlpzx dlpzx left a comment

Looks good. I just want to confirm the new way we handle the search response.
response = {'hits': {'hits': [{'_id': 1, ...}, ...]}}
Before:

  • in the catalog indexer task, we extract docs.get('hits', {}).get('hits', []) from the search response
  • in the FE, we extract hits-hits from search, with the Catalog DataSearch props...

After:

  • in the catalog indexer task, we get the hits directly from search_all
  • in the FE, we still extract hits-hits from search; nothing changes there, so the Catalog view is not affected
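
The difference in shape can be sketched as follows. The document ids are made up, and `flat_hits` is only a stand-in for the kind of list a helper like search_all would hand back, not its actual return value.

```python
# Nested shape returned by a single search call (ids are illustrative).
response = {"hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}

# Before: the caller unwraps hits -> hits itself, defaulting to an empty list.
before_ids = [hit["_id"] for hit in response.get("hits", {}).get("hits", [])]

# After: the caller receives a flat list of hits and iterates over it directly.
flat_hits = [{"_id": "1"}, {"_id": "2"}]  # stand-in for a search_all result
after_ids = [hit["_id"] for hit in flat_hits]

# A response without the nested keys still yields an empty list, not an error.
empty_ids = [hit["_id"] for hit in {}.get("hits", {}).get("hits", [])]
```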

@noah-paige
Contributor Author

Correct! The FE component we use handles all the pagination for us automatically.

@noah-paige noah-paige merged commit 92b591f into main Oct 29, 2024
9 checks passed
dlpzx pushed a commit that referenced this pull request Nov 6, 2024
@dlpzx dlpzx mentioned this pull request Nov 6, 2024
dlpzx added a commit that referenced this pull request Nov 8, 2024
### Feature or Bugfix
- Security

### Detail
* get-parameter CloudfrontDistributionDomainName from us-east-1 (#1687)
* Added Token Validations (#1682)
* add warning to untrust data.all account when removing an environment (#1685)
* add custom domain support for apigw (#1679)
* Lambda Event Logs Handling (#1678)
* Upgrade Spark version to 3.3 (#1675) - a0c63a4
* ES Search Query Collect All Response (#1631)
* Extend Tenant Perms Coverage (#1630)
* Limit Response info dataset queries (#1665)
* Add Removal Policy Retain to Bucket Policy IaC (#1660)
* log API handler response only for LOG_LEVEL DEBUG. Set log level INFO for prod deployments (#1662)
* Add permission checks to markNotificationAsRead + deleteNotification (#1654)
* Added error view and unified utility to check tenant user (#1657)
* Userguide signout flow (#1629)

### Relates
- Security release

---------

Co-authored-by: Noah Paige <[email protected]>
Co-authored-by: Petros Kalos <[email protected]>
@dlpzx dlpzx deleted the fix/catalog-indexer-pagination branch November 22, 2024 11:21