Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for querying specific indexes #540

Merged
merged 1 commit into from
Jul 15, 2022

Conversation

JPrevost
Copy link
Member

@JPrevost JPrevost commented Jul 13, 2022

UPDATE 2: Helen has ingested data from the 3 sources into dev opensearch with rdi*-prefixes and I've configured this PR to point at that opensearch. The older RDI indexes remain on that server promoted to timdex-prod and are the counts may not match exactly as they were ingested from potentially different data sets. Example query for all 3 RDI sources:

{
  search(searchterm: "data", index: "rdi*") {
    hits
    aggregations {
      source {
        key
        docCount
      }
    }
  }
}

UPDATE: much simpler way to check this, use the note below to confirm searching multiple indexes with rdi* if you'd like, but just checking the ability to search specific indexes can be done in the PR build with the following GraphQL queries:

Searching a specific index:

{
  search(searchterm: "data", index: "dspace*") {
    hits
    aggregations {
      source {
        key
        docCount
      }
    }
  }
}

No index provided defaults to timdex-prod sources [default is defined in ENV as timdex-prod (currently all 4 indexes are promoted in OpenSearch):

{
  search(searchterm: "data") {
    hits
    aggregations {
      source {
        key
        docCount
      }
    }
  }
}

You can search a specific alias (yes, this is the same as the default but it shows we can provide it explicitly)

{
  search(searchterm: "data", index: "timdex-prod") {
    hits
    aggregations {
      source {
        key
        docCount
      }
    }
  }
}

NOTE: the indexes in the opensearch version this app is connected to have not been updated to meet the expectations of this code. To test locally, the following set of commands (modified to point to json files you have downloaded) might be useful:

  •  docker run --name opensearch --rm -p 9200:9200 -e http.port=9200 \                               
     -e discovery.type=single-node -e plugins.security.disabled=true \
     -e http.max_content_length=10MB -e http.cors.enabled=true -e http.cors.allow-origin=\* \
     -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization \
     -e http.cors.allow-credentials=true opensearchproject/opensearch:1.3.0
    

- `mario ingest -s dspace --new --auto fixtures/dspace-sample-timdex-json-2022-06-21.json` to get a non-rdi source. This command auto promotes and thus this index is available in `timdex-prod`
- `mario ingest -s rdizenodo --new fixtures/zenodo-full-timdex-json-2022-06-21.json` (no auto promote, uses `rdi` prefix)
- `mario ingest -s rdijpal --new fixtures/jpal-full-timdex-json-2022-06-21.json`(no auto promote, uses `rdi` prefix)
- (you can add as many sources locally as you want, but ensure at least one is `rdi*` as the source and is not promoted (don't include the `auto` flag) and one is not `rdi*` and _is_ auto promoted (by including the `auto` flag)

At that point, I'd recommend looking at some of the new tests for syntax examples along with the playground.

Why are these changes being introduced:

  • We need to be able to provide a UI that only queries a specific set
    of data, so we need to enable that in the API

Relevant ticket(s):

How does this address that need:

  • Adds index to both Search and Retrieve GraphQL endpoints and
    documents them as being best left out
  • Updates Opensearch and Retrieve models to use any index value provided
    and default to the value provided in ENV if none is provided

Document any side effects to this change:

  • The Retrieve model had previously been the same for v1 and v2 but this
    change required a v2 feature flag for Retrieve to ensure the current
    behavior continued to work

Developer

  • All new ENV is documented in README
  • All new ENV has been added to Heroku Pipeline, Staging and Prod
  • ANDI or Wave has been run in accordance to
    our guide and
    all issues introduced by these changes have been resolved or opened as new
    issues (link to those issues in the Pull Request details above)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Requires database migrations?

NO

Includes new or updated dependencies?

NO

Why are these changes being introduced:

* We need to be able to provide a UI that only queries a specific set
  of data, so we need to enable that in the API

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/RDI-208

How does this address that need:

* Adds `index` to both Search and Retrieve GraphQL endpoints and
  documents them as being best left out
* Updates Opensearch and Retrieve models to use any index value provided
  and default to the value provided in ENV if none is provided

Document any side effects to this change:

* The Retrieve model had previously been the same for v1 and v2 but this
  change required a v2 feature flag for Retrieve to ensure the current
  behavior continued to work
@mitlib mitlib temporarily deployed to timdex-pr-540 July 13, 2022 20:44 Inactive
Copy link
Contributor

@hakbailey hakbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave the code review part to Matt, but the query functionality makes sense to me and I've (re)confirmed that it works with the dev OS data.

Copy link
Member

@matt-bernhardt matt-bernhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a specific blocking comment - there are some questions where I'm trying to understand what the limits of the ticketed change are, and I think this change duplicates an existing test (but not in a way that I think it should be removed).

Nothing is blocking - merge away - but I do want to ask whether we should have a test suite for the retrieve model? I see it getting covered by the controller tests, but we also have a specific model test for opensearch but not for retrieve.


if client.instance_of?(OpenSearch::Client)
raise OpenSearch::Transport::Transport::Errors::NotFound if record['hits']['total']['value'].zero?
else
raise Elasticsearch::Transport::Transport::Errors::NotFound if record['hits']['total'].zero?
elsif record['hits']['total'].zero?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the point of the PR, but I want to note that I like this a lot better than the former syntax. Good catch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a rubocop fix that I included because I touched the file with this changes, but I agree!

body: build_query(from))
end

def default_index
ENV.fetch('ELASTICSEARCH_INDEX', nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be something we want to update when we fully convert to opensearch and remove the older elasticsearch implementation - but I got really confused for a moment wondering why this doesn't also make changes to search.rb when it also deals with the same env. I get it now, we're using the same name in both underlying systems so it makes sense to have one env - but it tripped me up for a few minutes.

I'm guessing that this will eventually become OPENSEARCH_INDEX or even just SEARCH_INDEX or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, the ELASTICSEARCH_INDEX is a legacy name that should go away.

assert_equal(expected_sources, actual_sources)
end
end

test 'graphqlv2 can retrieve a record from a default index' do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't object to this test, but wonder how it differs from the graphqlv2 retrieve test on line 291? Other than asking for a different record, I'm not understanding the distinction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree, but I think the difference is intent. In this case, the distinction is the name of the test declares what we intend to test... the actual test logic is the same. I'm waffly on how I feel about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - I don't think we should remove this test to be clear. Reading through the file, I see this test as paired with the adjacent one that changes the index value, while the earlier test is removed enough that it isn't clear that we're testing both conditions without a higher cognitive load. While this may be a small amount of over-testing, I don't think that's particularly an issue. The suite runs pretty quickly, so there's no downside IMO.

@@ -153,7 +153,7 @@ def setup
}' }
assert_equal(200, response.status)
json = JSON.parse(response.body)
assert json['data']['search']['records'].first['contributors'].any? { |c| c.has_value? 'Moon, Intae' }
assert json['data']['search']['records'].first['contributors'].any? { |c| c.value? 'Moon, Intae' }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that changes like this are about rubocop flagging .has_value as not preferred, rather than some aspect of the index value? I don't object to the change, just wanting to make sure I'm not missing some ramfication to the ticketed change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was just rubocop fixes. I ran it on any file I touched and it fixed a few things out of scope of my changes but it seemed fine to include them. I should have made a note about that though in side effects to clarify the things that were not really part of this work that slid in.

@JPrevost
Copy link
Member Author

@matt-bernhardt I think there are a few tweaks to the testing approach we want to discuss as a team soon. Adam had also noted our over reliance on "controller" tests recently. I think we'll take a good look at our practices soon and make some adjustments.

@JPrevost JPrevost merged commit c3b9dd2 into main Jul 15, 2022
@JPrevost JPrevost deleted the rdi-208-support-multiple-indexes branch July 15, 2022 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants