Free text search results cannot be ordered by relevance #1152

Open
2 tasks
patrick-austin opened this issue Mar 7, 2022 · 2 comments
Labels
datagateway-search Issues relating to the search plugin enhancement New feature or request requires discussion

Comments

@patrick-austin
Contributor

Description:
When performing a free text search, Lucene generates a score associated with each entity ID. For example: [ { id: 2, score: 0.9 }, {id: 3, score: 0.7}, { id: 1, score: 0.5 } ]. These ids are returned, in descending order by score, by:

export const fetchLuceneData = async (
  datasearchType: DatasearchType,
  params: LuceneSearchParams,
  settings: {
    icatUrl: string;
  }
): Promise<number[]> => {
  // Query params.
  const queryParams = {
    sessionId: readSciGatewayToken().sessionId,
    query: urlParamsBuilder(datasearchType, params),
    // Default maximum count is 300.
    maxCount: params.maxCount ? params.maxCount : 300,
  };
  return axios
    .get(`${settings.icatUrl}/lucene/data`, {
      params: queryParams,
    })
    .then((response) => {
      // Flatten the result into an array of IDs (the score is discarded here).
      return response.data.map((result: { id: number }) => result.id);
    });
};

We then use these to perform the DB query to get the entities (e.g. /investigations?where={id: {in: [2, 3, 1]}}):
const { data: totalDataCount } = useInvestigationCount([
  {
    filterType: 'where',
    filterValue: JSON.stringify({
      id: { in: luceneData || [] },
    }),
  },
]);
const { fetchNextPage, data } = useInvestigationsInfinite([
  {
    filterType: 'where',
    filterValue: JSON.stringify({
      id: { in: luceneData || [] },
    }),
  },
  {
    filterType: 'include',
    filterValue: JSON.stringify({
      investigationInstruments: 'instrument',
    }),
  },
]);
const { data: allIds } = useIds(
  'investigation',
  [
    {
      filterType: 'where',
      filterValue: JSON.stringify({
        id: { in: luceneData || [] },
      }),
    },
  ],
  selectAllSetting
);

However, the order of the returned entities does not reflect this order, and cannot be obtained directly from the DB as it does not have any concept of the Lucene score.

Other sorting is done serverside as part of the query, for example the default behaviour of sorting by ID. This has advantages, mainly lazy loading: you don't have to fetch all 300 results, only as many as need to be displayed.

Sorting by score would have to be done clientside using the list of IDs from Lucene. However, we currently lazily load or paginate the results, so only the first 50 by the sorting criteria are loaded in the table. There is no guarantee that these are the best results, so even sorting clientside by score on these would not necessarily give the best results first.
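As a rough illustration of the clientside sort itself, here is a minimal sketch, assuming the entities have already been fetched and that the ID list is still in the order Lucene returned it. The HasId interface and the orderByLuceneRank name are made up for this example and are not part of DataGateway:

```typescript
// Hypothetical entity shape: only `id` is assumed to exist.
interface HasId {
  id: number;
}

// Reorder entities so they follow the position of their ID in the Lucene list.
// Entities whose ID is not in the list are placed last.
function orderByLuceneRank<T extends HasId>(
  entities: T[],
  luceneIds: number[]
): T[] {
  const rank = new Map<number, number>();
  luceneIds.forEach((id, index) => {
    // Keep the first (best) position if an ID appears more than once.
    if (!rank.has(id)) rank.set(id, index);
  });
  const rankOf = (entity: T): number =>
    rank.get(entity.id) ?? Number.MAX_SAFE_INTEGER;
  // Copy before sorting so the caller's array is not mutated.
  return [...entities].sort((a, b) => rankOf(a) - rankOf(b));
}

// Example using the scores from the description: Lucene order is 2, 3, 1,
// while the DB returns the entities ordered by ID.
const luceneIds = [2, 3, 1];
const fromDb = [{ id: 1 }, { id: 2 }, { id: 3 }];
const ordered = orderByLuceneRank(fromDb, luceneIds);
console.log(ordered.map((entity) => entity.id)); // IDs in Lucene order: 2, 3, 1
```

This only orders whatever has been loaded, so on its own it does not address the problem above of the loaded 50 not being the best 50.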

Possible changes:

  • Get all 300 entities at once rather than lazily loading them, ensuring we will always be able to sort and show the most relevant results of that 300 first. This would lose the benefits of lazy loading.
  • Enable skip and limit parameters for Lucene through ICAT server. Currently, we can only set a limit (default 300), however Lucene itself supports skip and limit behaviour, so it would be possible to search for IDs in batches of 50/25 for the table's lazy loading, then sort these clientside. This would allow us to load smaller batches, but would require changes throughout the stack.
  • Still get all 300 IDs from Lucene, and "manually" paginate our requests to the DB. Instead of querying for entities in the list of 300 ids, just use the first 50 in the list for the first query, then the next 25 for subsequent queries.
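The third option could be sketched roughly as below. This assumes the batch sizes mentioned above (50, then 25) and reuses the 'where' filter shape from the existing hooks; batchIds and whereFilterForBatch are hypothetical names used only for illustration:

```typescript
// Split the score-ordered ID list into one initial batch followed by
// smaller batches, preserving the Lucene order.
function batchIds(
  ids: number[],
  firstSize = 50,
  nextSize = 25
): number[][] {
  if (firstSize <= 0 || nextSize <= 0) {
    throw new RangeError('Batch sizes must be positive');
  }
  const batches: number[][] = [];
  let start = 0;
  let size = firstSize;
  while (start < ids.length) {
    batches.push(ids.slice(start, start + size));
    start += size;
    size = nextSize;
  }
  return batches;
}

// Build the same kind of 'where' filter the hooks already receive,
// but restricted to a single batch of IDs.
function whereFilterForBatch(batch: number[]): {
  filterType: string;
  filterValue: string;
} {
  return {
    filterType: 'where',
    filterValue: JSON.stringify({ id: { in: batch } }),
  };
}

// Example: 300 IDs become one batch of 50 followed by ten batches of 25.
const allLuceneIds = Array.from({ length: 300 }, (_, i) => i + 1);
const idBatches = batchIds(allLuceneIds);
console.log(idBatches.length); // 11
console.log(whereFilterForBatch(idBatches[0].slice(0, 3)).filterValue);
```

Each DB response would still need reordering clientside to match the order of IDs within its batch, since the DB does not preserve the order of the in list.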

If sorting by score was implemented, how would this interact with the table filters/sort? For sorting by date say, we now wouldn't want only the best 50 ids by score, but all 300 for that DB query (which wouldn't be available in the 2nd approach). In both the latter two approaches, we would have to switch from manually sending the IDs in batches based on score to sending all of them whenever a table filter was applied. However it's worth noting that even now, sorting by date won't give the most recent results matching the search query, it will give the most recent of the 300 results that best matched the search query. So already this is perhaps not what a user would expect. While not currently implemented, Lucene and its derivatives support sorting by non-score fields (i.e. we could achieve the former behaviour in principle by relying on Lucene for both searching and sorting).

Filtering might also be problematic, as if I only query on the top 50 results by score, with a filter which removes 49 of those results, I'm going to need to send another query to get more results straight away. Currently, as we send all 300 ids as part of the query you get 50 results that definitely match the filter without need for subsequent queries. Having said this, if the user wanted a more accurate result, they could do this using the free text search itself provided the relevant fields are indexed in Lucene.
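To make the extra round trips concrete, here is a sketch of what filling a page from score-ordered batches might look like when a table filter is active. The fetchMatching callback is an assumption: it stands in for a DB query that applies the filter to one batch of IDs, and is not an existing DataGateway function:

```typescript
// Keep requesting batches until the page is full or the IDs run out.
// `fetchMatching` is a hypothetical callback wrapping the filtered DB query.
async function fillPage<T>(
  batches: number[][],
  fetchMatching: (ids: number[]) => Promise<T[]>,
  pageSize: number,
  startBatch = 0
): Promise<{ rows: T[]; nextBatch: number }> {
  const rows: T[] = [];
  let next = startBatch;
  while (rows.length < pageSize && next < batches.length) {
    // One DB request per batch: a restrictive filter means more requests.
    const matches = await fetchMatching(batches[next]);
    rows.push(...matches);
    next += 1;
  }
  // `rows` may slightly exceed pageSize; `nextBatch` says where to resume.
  return { rows, nextBatch: next };
}
```

In the worst case described above (a filter removing 49 of the top 50), this loop would issue several requests before the first page is filled, which is the cost compared to sending all 300 IDs in one query.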

Further discussion is welcome.

Acceptance criteria:

  • Possible to order search results by their score (relevance to search term)
  • Lazy loading behaviour (mostly) unaffected
@patrick-austin patrick-austin added enhancement New feature or request datagateway-search Issues relating to the search plugin requires discussion labels Mar 7, 2022
@joelvdavies
Contributor

On first inspection I thought the second option was more scalable long term, as we may not always have the 300-result limit and we wouldn’t want to harm the client performance too much by sorting everything client side as in the first option. But it sounds like the filtering wouldn’t work correctly since all 300 are needed anyway. As a result, the third option is the only one that would retain the lazy loading and sounds like a better combination of both? Ideally, I think it would be good if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?

@patrick-austin
Contributor Author

Ideally, I think it would be good if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?

Thanks for your thoughts @joelvdavies. Sorting by score as part of a DB call (in the way we do in the table filters by sort={"title"%3A"asc"} or whatever) isn't possible as the DB has no concept of the score. As I mention in the rambling towards the end, you can do sorting and filtering by fields with Lucene itself in principle. For example, I could search for (extremely relevant neutron scattering data) +id:2400*, returning entities that "match" at least one term from the first phrase whilst also telling Lucene to "filter" and only allow results that have an ID beginning 2400. By default this would return results sorted by score, but Lucene can sort by another field such as date (however, this latter sorting is not exposed by our current icat.server/icat.lucene implementation). In this sense Lucene can do sorting and filtering in the backend. However, as long as we need to do a subsequent DB query for the entities, you would still need to sort the entities clientside to match the list of IDs Lucene gave you (which are already ordered and filtered).

Finally, it's worth mentioning that (while our current implementation doesn't allow it) you can also get more information back from Lucene in addition to just the ID. In principle, if you indexed all the relevant data into Lucene, you could avoid doing the subsequent DB call entirely. That would be a substantial change, but it would be an option, and relying on Lucene for all sorting (not just the score sorting) would mean you don't have this current situation where I'm only sorting the top 300 results by score in the front end, rather than the top 300 most recent, alphabetically first etc.

louise-davies added a commit that referenced this issue Jun 29, 2023
louise-davies added a commit that referenced this issue Oct 3, 2023
this fixes the ability to select/deselect items
louise-davies added a commit that referenced this issue Oct 3, 2023
louise-davies added a commit that referenced this issue Oct 5, 2023
…o display facet counts

- currently only added the instrument fix to DatafileSearchTable
louise-davies added a commit that referenced this issue Oct 5, 2023
Won't run on CI yet
Also, fixed a few minor bugs I found
louise-davies added a commit that referenced this issue Oct 5, 2023
louise-davies added a commit that referenced this issue Oct 5, 2023
louise-davies added a commit that referenced this issue Oct 5, 2023
Would need to re-source .bashrc to pick it up from PATH - this is easier
louise-davies added a commit that referenced this issue Oct 11, 2023
louise-davies added a commit that referenced this issue Oct 17, 2023
Made it more efficient by bundling it all together in the current tab change
louise-davies added a commit that referenced this issue Oct 17, 2023
louise-davies added a commit that referenced this issue Feb 23, 2024
if you had multiple entity types available for search,
card view pagination could fail if they had different amounts of results
Since the "hidden" tab would reset the page to 1 when it detected the page was greater than it's max page
louise-davies added a commit that referenced this issue Feb 23, 2024
louise-davies added a commit that referenced this issue Feb 29, 2024
louise-davies added a commit that referenced this issue May 17, 2024