Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenSearch and GraphQL multivalue facets #551

Merged
merged 1 commit into from
Aug 1, 2022

Conversation

JPrevost
Copy link
Member

@JPrevost JPrevost commented Aug 1, 2022

Why are these changes being introduced:

  • Our spec requires that many facets can be applied multiple times,
    which allows us to filter records to find just those that have
    contentTypes of both dataset and still image for example
  • Source aggregations remain unique as we apply them as OR instead of
    AND
  • literaryForm is the only aggregation that accepts a single value in
    the data model

Relevant ticket(s):

How does this address that need:

  • Refactors how facets are applied
  • If multiple values are allowed, calls the new logic multiple times
    (once per value)
  • If a single value is allowed, calls the logic once
  • Leaves source as custom logic separate from this refactor

Document any side effects to this change:

  • there was an inconsistency in the aggregation type for format. We
    aggregated under contentFormat but accepted queries under format.
    I have normalized both sides of that to use format.
  • Unit tests were created for all facet types, but only one integration
    test was created. contentType was chosen as that was the value the
    timdex-ui needs to support for RDI. Unit testing all of the fields and
    integration testing just one felt okay as it avoids a lot of
    additional semi-fragile tests (see below) while ensuring we are
    constructing the expected query structure.
  • a new semi-fragile test was added. Even taking care to not look for
    specific values in the data the new test is semi-fragile because we at
    least need data that can be filtered twice to test for the expected
    data change. As we stabilize the timdex pipelines, we should create a
    specific test set of data rather than using what is available. This
    would allow us to extract records from each source, while also pulling
    some with specific attributes we want to test... including some
    potentially constructed records to test values we want to support in
    code but can't find appropriate examples. This test set should be
    stored in a way that the new data loader can easily load it locally,
    but we should also consider having a specific test alias loaded
    with this data in dev1 or stage to allow us to access it from PR
    builds when appropriate to demonstrate specific problems/solutions to
    stakeholders without having to manipulate the "real" aliases/indexes.
  • OpenSearch#filter was updated to accept a params argument rather than reading from an instance variable to allow for easier unit testing

Developer

  • All new ENV is documented in README
  • All new ENV has been added to Heroku Pipeline, Staging and Prod
  • ANDI or Wave has been run in accordance to
    our guide and
    all issues introduced by these changes have been resolved or opened as new
    issues (link to those issues in the Pull Request details above)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Requires database migrations?

NO

Includes new or updated dependencies?

NO

Why are these changes being introduced:

* Our spec requires that many facets can be applied multiple times,
  which allows us to filter records to find just those that have
  `contentTypes` of _both_ `dataset` and `still image` for example
* Source aggregations remain unique as we apply them as `OR` instead of
  `AND`
* `literaryForm` is the only aggregation that accepts a single value in
  the data model

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/RDI-247

How does this address that need:

* Refactors how facets are applied
* If multiple values are allowed, calls the new logic multiple times
  (once per value)
* If a single value is allowed, calls the logic once
* Leaves `source` as custom logic separate from this refactor

Document any side effects to this change:

- there was an inconsistency in the aggregation type for `format`. We
  aggregated under `contentFormat` but accepted queries under `format`.
  I have normalized both sides of that to use `format`.
- Unit tests were created for all facet types, but only one integration
  test was created. `contentType` was chosen as that was the value the
  timdex-ui needs to support for RDI. Unit testing all of the fields and
  integration testing just one felt okay as it avoids a lot of
  additional semi-fragile tests (see below) while ensuring we are
  constructing the expected query structure.
- a new semi-fragile test was added. Even taking care to not look for
  specific values in the data the new test is semi-fragile because we at
  least need data that can be filtered twice to test for the expected
  data change. As we stabilize the timdex pipelines, we should create a
  specific test set of data rather than using what is available. This
  would allow us to extract records from each source, while also pulling
  some with specific attributes we want to test... including some
  potentially constructed records to test values we want to support in
  code but can't find appropriate examples. This test set should be
  stored in a way that the new data loader can easily load it locally,
  but we should also consider having a specific `test` alias loaded
  with this data in dev1 or stage to allow us to access it from PR
  builds when appropriate to demonstrate specific problems/solutions to
  stakeholders without having to manipulate the "real" aliases/indexes.
- `OpenSearch#filter` was updated to accept a `params` argument rather than reading from an instance variable to allow for easier unit testing
@mitlib mitlib temporarily deployed to timdex-pr-551 August 1, 2022 13:39 Inactive
@JPrevost JPrevost requested a review from jazairi August 1, 2022 14:08
@JPrevost
Copy link
Member Author

JPrevost commented Aug 1, 2022

This example query might be useful to see the change in the PR build.

{
  search(contentTypeFacet: ["dataset", "still image"], index: "rdi*") {
    hits
    aggregations {
      contentType {
        docCount
        key
      }
    }
  }
}

@jazairi jazairi self-assigned this Aug 1, 2022
Copy link
Contributor

@jazairi jazairi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on noting the fragile test. I'm open to the test alias idea as a future improvement, but I think that documenting it is fine for now. I'll be curious to see how often that comes up in the near term; it may not be as fragile as we think.

Thanks for this! :shipit:

@JPrevost JPrevost merged commit 647292a into main Aug 1, 2022
@JPrevost JPrevost deleted the rdi-247-support-for-multivalue-facets branch August 1, 2022 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants