Free text search results cannot be ordered by relevance #1152
Comments
On first inspection I thought the second option was more scalable long term, as we may not always have the 300-result limit and we wouldn’t want to harm client performance too much by sorting everything client side as in the first option. But it sounds like the filtering wouldn’t work correctly since all 300 are needed anyway. As a result, the third option is the only one that would retain the lazy loading, and sounds like a better combination of both? Personally, I think it would be ideal if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?
Thanks for your thoughts @joelvdavies. Sorting by score as part of a DB call (in the way we do in the table filters by Finally it's worth mentioning that (while our current implementation doesn't allow it) you can also get more information back from Lucene than just the ID. In principle, if you indexed all the relevant data into Lucene, you could avoid doing the subsequent DB call entirely. That would be a substantial change, but it is an option, and relying on Lucene for all sorting (not just score sorting) would avoid the current situation where I'm only sorting the top 300 results by score in the front end, rather than the top 300 most recent, alphabetically first, etc.
this fixes the ability to select/deselect items
…o display facet counts - currently only added the instrument fix to DatafileSearchTable
Won't run on CI yet Also, fixed a few minor bugs I found
Would need to re-source .bashrc to pick it up from PATH - this is easier
Made it more efficient by bundling it all together in the current tab change
…truments @ dataset/dataset level
if you had multiple entity types available for search, card view pagination could fail if they had different numbers of results, since the "hidden" tab would reset the page to 1 when it detected the page was greater than its max page
also some other minor styling stuff
Description:
When performing a free text search, Lucene generates a score associated with each entity ID. For example:
[ { id: 2, score: 0.9 }, { id: 3, score: 0.7 }, { id: 1, score: 0.5 } ]
These IDs are returned, in descending order by score, by:
datagateway/packages/datagateway-common/src/api/lucene.tsx
Lines 63 to 86 in be99284
We then use these to perform the DB query to get the entities (e.g. /investigations?where={id: {in: [2, 3, 1]}}):
datagateway/packages/datagateway-search/src/table/investigationSearchTable.component.tsx
Lines 75 to 108 in be99284
However, the order of the returned entities does not reflect this order, and cannot be obtained directly from the DB as it does not have any concept of the Lucene score.
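To make the mismatch concrete, here is a minimal sketch of restoring the Lucene order on the client once the entities have come back from the DB. The `LuceneResult` and `Entity` shapes and the `sortByLuceneOrder` helper are hypothetical names for illustration, not the existing datagateway code.

```typescript
// Hypothetical shapes; the real datagateway types may differ.
interface LuceneResult {
  id: number;
  score: number;
}
interface Entity {
  id: number;
}

// Reorder DB results so they follow the Lucene response (descending score).
function sortByLuceneOrder<T extends Entity>(
  entities: T[],
  luceneResults: LuceneResult[]
): T[] {
  // Record each ID's position in the Lucene response.
  const rank: Record<number, number> = {};
  luceneResults.forEach((r, i) => {
    rank[r.id] = i;
  });
  // Entities that are missing from the Lucene response sort last.
  const position = (e: T): number =>
    rank[e.id] !== undefined ? rank[e.id] : luceneResults.length;
  // Copy before sorting so the caller's array is left untouched.
  return entities.slice().sort((a, b) => position(a) - position(b));
}

const luceneResults: LuceneResult[] = [
  { id: 2, score: 0.9 },
  { id: 3, score: 0.7 },
  { id: 1, score: 0.5 },
];
// The DB returns the same entities, but ordered by ID.
const dbEntities = [
  { id: 1, title: "a" },
  { id: 2, title: "b" },
  { id: 3, title: "c" },
];
const ordered = sortByLuceneOrder(dbEntities, luceneResults);
```

This only gives the right answer if every entity for the search is already loaded, which is where the lazy loading discussed next gets in the way.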
Other sorting is done server side as part of the query, for example the default behaviour of sorting by ID. This has advantages, mainly lazy loading: you don't have to fetch all 300 results, only as many as need to be displayed.
Sorting by score would have to be done client side using the list of IDs from Lucene. However, we currently lazily load or paginate the results, so only the first 50 by the sorting criteria are loaded in the table. There is no guarantee these are the best results, so even sorting them client side by score would not necessarily give the best results first.
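A small illustration of this point, using a page size of 2 in place of 50 and made-up IDs and scores. Sorting the lazily loaded page by score only reorders whatever the default ID sort happened to load; it does not surface the best matches.

```typescript
interface LuceneResult {
  id: number;
  score: number;
}

// Made-up Lucene response, in descending score order.
const luceneResults: LuceneResult[] = [
  { id: 9, score: 0.9 },
  { id: 4, score: 0.7 },
  { id: 1, score: 0.5 },
  { id: 2, score: 0.3 },
];
const pageSize = 2; // stands in for the 50 rows loaded lazily

// Look up a score by ID.
const scoreOf: Record<number, number> = {};
luceneResults.forEach((r) => {
  scoreOf[r.id] = r.score;
});

// First page as loaded today: default sort by ID, ascending.
const firstPageById = luceneResults
  .map((r) => r.id)
  .sort((a, b) => a - b)
  .slice(0, pageSize);

// Sorting only that loaded page by score on the client.
const pageSortedByScore = firstPageById
  .slice()
  .sort((a, b) => scoreOf[b] - scoreOf[a]);

// What the user would actually want to see first.
const trueTopByScore = luceneResults.slice(0, pageSize).map((r) => r.id);
```

Here `pageSortedByScore` contains IDs 1 and 2, while the two best matches are IDs 9 and 4.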
Possible changes:
If sorting by score was implemented, how would this interact with the table filters/sort? For sorting by date, say, we would no longer want only the best 50 IDs by score, but all 300 for that DB query (which wouldn't be available in the 2nd approach). In both of the latter two approaches, we would have to switch from manually sending the IDs in batches based on score to sending all of them whenever a table filter was applied. However, it's worth noting that even now, sorting by date won't give the most recent results matching the search query; it will give the most recent of the 300 results that best matched the search query. So this is perhaps already not what a user would expect. While not currently implemented, Lucene and its derivatives support sorting by non-score fields (i.e. in principle we could return the most recent results matching the query by relying on Lucene for both searching and sorting).
Filtering might also be problematic: if I only query on the top 50 results by score, with a filter which removes 49 of those results, I'm going to need to send another query straight away to get more results. Currently, as we send all 300 IDs as part of the query, you get 50 results that definitely match the filter without the need for subsequent queries. Having said this, if the user wanted a more accurate result, they could get it using the free text search itself, provided the relevant fields are indexed in Lucene.
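The trade-off can be sketched as follows. This is an illustration only: `fetchPageByScore` and the `QueryBatch` callback are hypothetical, and the callback is a synchronous in-memory stand-in for what would really be an asynchronous DB request with the table filter applied.

```typescript
interface Entity {
  id: number;
  instrument: string;
}

// Hypothetical stand-in for the DB call: returns the entities whose IDs are
// in `ids` AND that pass the table filter. A real call would be async.
type QueryBatch = (ids: number[]) => Entity[];

// Send IDs in score-ordered batches until a full page of matches is found.
function fetchPageByScore(
  idsByScore: number[],
  queryBatch: QueryBatch,
  pageSize: number,
  batchSize: number
): { rows: Entity[]; queries: number } {
  const rows: Entity[] = [];
  let queries = 0;
  for (
    let start = 0;
    start < idsByScore.length && rows.length < pageSize;
    start += batchSize
  ) {
    const batch = idsByScore.slice(start, start + batchSize);
    queries += 1;
    // The DB does not know the score, so restore score order within the batch.
    const matched = queryBatch(batch).sort(
      (a, b) => batch.indexOf(a.id) - batch.indexOf(b.id)
    );
    rows.push(...matched);
  }
  return { rows: rows.slice(0, pageSize), queries };
}

// Made-up data: odd IDs belong to instrument "A", even IDs to "B".
const table: Entity[] = [1, 2, 3, 4, 5, 6].map((id) => ({
  id,
  instrument: id % 2 === 1 ? "A" : "B",
}));
const queryInstrumentA: QueryBatch = (ids) =>
  table.filter((e) => ids.indexOf(e.id) !== -1 && e.instrument === "A");

const idsByScore = [6, 5, 4, 3, 2, 1]; // descending Lucene score
const batched = fetchPageByScore(idsByScore, queryInstrumentA, 2, 2);
const allAtOnce = fetchPageByScore(idsByScore, queryInstrumentA, 2, 6);
```

With this data the batched call needs two queries to fill a page of two rows, while sending every ID at once needs one. The more selective the filter, the more follow-up queries the batched approach needs.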
Further discussion is welcome.
Acceptance criteria: