Free text search results cannot be ordered by relevance #1152

patrick-austin · 2022-03-07T18:51:28Z

Description:
When performing a free text search, Lucene generates a score associated with each entity ID. For example: [ { id: 2, score: 0.9 }, {id: 3, score: 0.7}, { id: 1, score: 0.5 } ]. These ids are returned, in descending order by score, by:

datagateway/packages/datagateway-common/src/api/lucene.tsx

Lines 63 to 86 in be99284

    
    export const fetchLuceneData = async (
      datasearchType: DatasearchType,
      params: LuceneSearchParams,
      settings: {
        icatUrl: string;
      }
    ): Promise<number[]> => {
      // Query params.
      const queryParams = {
        sessionId: readSciGatewayToken().sessionId,
        query: urlParamsBuilder(datasearchType, params),
        // Default maximum count is 300.
        maxCount: params.maxCount ? params.maxCount : 300,
      };
      return axios
        .get(`${settings.icatUrl}/lucene/data`, {
          params: queryParams,
        })
        .then((response) => {
          // flatten result into array
          return response.data.map((result: { id: number }) => result.id);
        });
    };

We then use these to perform the DB query to get the entities (e.g. /investigations?where={id: {in: [2, 3, 1]}}):

datagateway/packages/datagateway-search/src/table/investigationSearchTable.component.tsx

Lines 75 to 108 in be99284

    
    const { data: totalDataCount } = useInvestigationCount([
      {
        filterType: 'where',
        filterValue: JSON.stringify({
          id: { in: luceneData || [] },
        }),
      },
    ]);
    const { fetchNextPage, data } = useInvestigationsInfinite([
      {
        filterType: 'where',
        filterValue: JSON.stringify({
          id: { in: luceneData || [] },
        }),
      },
      {
        filterType: 'include',
        filterValue: JSON.stringify({
          investigationInstruments: 'instrument',
        }),
      },
    ]);
    const { data: allIds } = useIds(
      'investigation',
      [
        {
          filterType: 'where',
          filterValue: JSON.stringify({
            id: { in: luceneData || [] },
          }),
        },
      ],
      selectAllSetting
    );

However, the order of the returned entities does not reflect this order, and cannot be obtained directly from the DB as it does not have any concept of the Lucene score.

Other sorting is done serverside as part of the query, for example the default behaviour of sorting by ID. The main advantage of this is lazy loading: you don't have to fetch all 300 results, only as many as need to be displayed.

Sorting by score would have to be done clientside using the list of IDs from Lucene. However, we currently lazily load or paginate the results, so only the first 50 by the DB sorting criteria are loaded in the table. There is no guarantee that these 50 are the best results by score, so even sorting them clientside by score would not necessarily show the best results first.
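As a rough sketch of what the clientside sort could look like (the `Entity` type and `sortByLuceneOrder` helper are hypothetical names, not existing code in datagateway), entities returned by the DB can be reordered using each ID's position in the Lucene list:

```typescript
interface Entity {
  id: number;
}

// Reorder entities fetched from the DB so that they follow the order of the
// Lucene ID list, which is already sorted by descending score.
function sortByLuceneOrder<T extends Entity>(
  entities: T[],
  luceneIds: number[]
): T[] {
  // Map each ID to its position (rank) in the Lucene results.
  const rank = new Map<number, number>();
  luceneIds.forEach((id, index) => rank.set(id, index));
  // Entities that are missing from the Lucene list sort last.
  const rankOf = (e: T): number => rank.get(e.id) ?? Number.MAX_SAFE_INTEGER;
  // Copy before sorting so the caller's array is not mutated.
  return [...entities].sort((a, b) => rankOf(a) - rankOf(b));
}

const luceneIds = [2, 3, 1]; // order from the example scores above
const fromDb = [{ id: 1 }, { id: 2 }, { id: 3 }]; // default DB order (by ID)
const sorted = sortByLuceneOrder(fromDb, luceneIds);
```

This only gives the best results first if every entity in the Lucene list has been fetched, which is exactly the conflict with lazy loading described above.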

Possible changes:

1. Get all 300 entities at once rather than lazily loading them, ensuring we will always be able to sort and show the most relevant results of that 300 first. This would lose the benefits of lazy loading.
2. Enable skip and limit parameters for Lucene through ICAT server. Currently, we can only set a limit (default 300), however Lucene itself supports skip and limit behaviour, so it would be possible to search for IDs in batches of 50/25 for the table's lazy loading, then sort these clientside. This would allow us to load smaller batches, but would require changes throughout the stack.
3. Still get all 300 IDs from Lucene, and "manually" paginate our requests to the DB. Instead of querying for entities in the list of 300 IDs, just use the first 50 in the list for the first query, then the next 25 for subsequent queries.
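To make the "manual" pagination approach concrete, here is a minimal sketch. The helper names `getIdBatch` and `buildWhereFilter` are hypothetical, and the batch sizes of 50 and 25 are just the numbers mentioned above:

```typescript
// Return the slice of Lucene IDs to request for a given page (0-based).
// Page 0 gets the first `firstBatchSize` IDs; later pages get `nextBatchSize` each.
function getIdBatch(
  luceneIds: number[],
  page: number,
  firstBatchSize = 50,
  nextBatchSize = 25
): number[] {
  if (page <= 0) return luceneIds.slice(0, firstBatchSize);
  const start = firstBatchSize + (page - 1) * nextBatchSize;
  return luceneIds.slice(start, start + nextBatchSize);
}

// Build a `where` filter in the same shape the search tables currently send.
function buildWhereFilter(ids: number[]): {
  filterType: string;
  filterValue: string;
} {
  return {
    filterType: 'where',
    filterValue: JSON.stringify({ id: { in: ids } }),
  };
}

// Stand-in for the (up to) 300 IDs returned by Lucene, best score first.
const allLuceneIds = Array.from({ length: 300 }, (_, i) => i + 1);
const firstBatch = getIdBatch(allLuceneIds, 0);
const secondBatch = getIdBatch(allLuceneIds, 1);
const firstFilter = buildWhereFilter(firstBatch);
```

Each batch would still come back from the DB in its own order, so the entities within a batch would need reordering clientside to match the Lucene list.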

If sorting by score were implemented, how would it interact with the table filters and sort? When sorting by date, say, we would no longer want only the best 50 IDs by score, but all 300 for that DB query (which wouldn't be available in the second approach). In both of the latter two approaches, we would have to switch from manually sending the IDs in batches based on score to sending all of them whenever a table filter or sort was applied. However, it's worth noting that even now, sorting by date won't give the most recent results matching the search query; it will give the most recent of the 300 results that best matched the search query. So this is already perhaps not what a user would expect. While not currently implemented, Lucene and its derivatives support sorting by non-score fields (i.e. we could achieve the former behaviour in principle by relying on Lucene for both searching and sorting).
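The switch between the two modes could be sketched along these lines. `TableState` and `selectIdsForQuery` are hypothetical names used for illustration, and the real sort and filter state in the search tables may be shaped differently:

```typescript
// Hypothetical, simplified view of the table's sort and filter state.
interface TableState {
  sort: Record<string, 'asc' | 'desc'>;
  filters: Record<string, unknown>;
}

// Decide which Lucene IDs to send to the DB for the next query.
function selectIdsForQuery(
  luceneIds: number[],
  state: TableState,
  batchStart: number,
  batchSize: number
): number[] {
  const userControlsOrder =
    Object.keys(state.sort).length > 0 ||
    Object.keys(state.filters).length > 0;
  // With a table sort or filter applied, the DB needs the full ID list so it
  // can sort/filter serverside and lazy load as it does today.
  if (userControlsOrder) return luceneIds;
  // Otherwise, page through the IDs in score order.
  return luceneIds.slice(batchStart, batchStart + batchSize);
}

const ids = [2, 3, 1, 7, 5];
const byScore = selectIdsForQuery(ids, { sort: {}, filters: {} }, 0, 2);
const byDate = selectIdsForQuery(
  ids,
  { sort: { startDate: 'desc' }, filters: {} },
  0,
  2
);
```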

Filtering might also be problematic: if I only query on the top 50 results by score, with a filter that removes 49 of those results, I'm going to need to send another query straight away to get more results. Currently, as we send all 300 IDs as part of the query, you get 50 results that definitely match the filter without the need for subsequent queries. Having said this, if the user wanted a more accurate result, they could get it using the free text search itself, provided the relevant fields are indexed in Lucene.

Further discussion is welcome.

Acceptance criteria:

Possible to order search results by their score (relevance to search term)
Lazy loading behaviour (mostly) unaffected


joelvdavies · 2022-03-08T10:36:31Z

On first inspection I thought the second option was more scalable long term, as we may not always have the 300-result limit and we wouldn't want to harm client performance too much by sorting everything clientside as in the first option. But it sounds like the filtering wouldn't work correctly since all 300 are needed anyway. As a result, the third option is the only one that would retain the lazy loading, and sounds like a better combination of both? Ideally, I think it would be good if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?

patrick-austin · 2022-03-08T11:25:44Z

> Ideally, I think it would be good if the search, filtering, and sorting all occurred on the backend for simplicity and performance, but it sounds as though this is not possible with Lucene?

Thanks for your thoughts @joelvdavies. Sorting by score as part of a DB call (in the way we do in the table filters by sort={"title"%3A"asc"} or whatever) isn't possible as the DB has no concept of the score. As I mention in the rambling towards the end, you can do sorting and filtering by fields with Lucene itself in principle. For example, I could search for (extremely relevant neutron scattering data) +id:2400*, returning entities that "match" at least one term from the first phrase whilst also telling Lucene to "filter" and only allow results that have an ID beginning 2400. By default this would return results sorted by score, but Lucene can sort by another field such as date (however, this latter sorting is not exposed by our current icat.server/icat.lucene implementation). In this sense Lucene can do sorting and filtering in the backend. However, as long as we need to do a subsequent DB query for the entities, you would still need to sort the entities clientside to match the list of IDs Lucene gave you (which are already ordered and filtered).
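For illustration, a query string like the one above could be assembled as follows. This is only a sketch: `buildFilteredQuery` is a hypothetical helper, it does not escape Lucene special characters in the user's text, and it assumes the filtered fields are indexed in a way that supports prefix matching:

```typescript
// Combine free text with required prefix clauses using Lucene's classic query
// syntax: `+` marks a clause as required and a trailing `*` matches a prefix.
function buildFilteredQuery(
  searchText: string,
  requiredPrefixes: Record<string, string>
): string {
  const clauses = Object.entries(requiredPrefixes).map(
    ([field, prefix]) => `+${field}:${prefix}*`
  );
  // Group the free text terms so the required clauses apply alongside them.
  return [`(${searchText.trim()})`, ...clauses].join(' ');
}

const exampleQuery = buildFilteredQuery(
  'extremely relevant neutron scattering data',
  { id: '2400' }
);
```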

Finally, it's worth mentioning that (while our current implementation doesn't allow it) you can also get more information back from Lucene in addition to just the ID. In principle, if you indexed all the relevant data into Lucene, you could avoid doing the subsequent DB call entirely. That would be a substantial change, but it would be an option, and relying on Lucene for all sorting (not just the score sorting) would mean you don't have this current situation where I'm only sorting the top 300 results by score in the front end, rather than the top 300 most recent, alphabetically first etc.

If you had multiple entity types available for search, card view pagination could fail if they had different numbers of results, since the "hidden" tab would reset the page to 1 when it detected the page was greater than its max page.

patrick-austin added the enhancement (New feature or request), datagateway-search (Issues relating to the search plugin), and requires discussion labels Mar 7, 2022

patrick-austin mentioned this issue Mar 23, 2022

Enable field sorting icatproject/icat.lucene#25

Open

patrick-austin added a commit that referenced this issue Jul 7, 2022

Demo implementation for Diamond #1152

860e70a

patrick-austin added a commit that referenced this issue Jul 25, 2022

Filters on datafileSearchTable for ISIS demo #1152

d02a122

louise-davies added a commit that referenced this issue Jun 28, 2023

#1152 - fix search unit tests & fix bug in how search tabs were synced

6c52934

louise-davies added a commit that referenced this issue Jun 29, 2023

#1152 - fix tests failing on CI because of timezone issues

28c2262

louise-davies added a commit that referenced this issue Jun 29, 2023

#1152 - improve test coverage

03165f5

louise-davies added a commit that referenced this issue Sep 13, 2023

#1152 - fix cypress warnings in new e2e tests

1547842

patrick-austin added a commit that referenced this issue Sep 28, 2023

Fix datafileSearch investigationfacilitycycle and new facet requests #…

0193e68

…1152

louise-davies added a commit that referenced this issue Oct 3, 2023

#1152 - fix handling of string ID from lucene

c0298fd

this fixes the ability to select/deselect items

louise-davies added a commit that referenced this issue Oct 3, 2023

Revert "#1152 - fix handling of string ID from lucene"

5618130

This reverts commit c0298fd.

louise-davies added a commit that referenced this issue Oct 5, 2023

#1152 - allow for facet counts to be undefined so we can choose not t…

cdc01f2

…o display facet counts - currently only added the instrument fix to DatafileSearchTable

louise-davies added a commit that referenced this issue Oct 5, 2023

#1152 - fix unit tests & update datafileSearch to be a proper e2e test

187c354

Won't run on CI yet Also, fixed a few minor bugs I found

louise-davies added a commit that referenced this issue Oct 5, 2023

#1152 - update CI to run on new snapshot versions & add correct permi…

dea54e1

…ssions stuff to ICAT

louise-davies added a commit that referenced this issue Oct 5, 2023

#1152 - fix broken lucene api test

4be9515

louise-davies added a commit that referenced this issue Oct 5, 2023

#1152 - fix capitalisation in public steps script

dee3742

louise-davies added a commit that referenced this issue Oct 5, 2023

#1152 - find icatadmin script by path

84d7d1a

Would need to re-source .bashrc to pick it up from PATH - this is easier

louise-davies added a commit that referenced this issue Oct 6, 2023

#1152 - use ansible branch that supports new server properties

947fe4f

louise-davies added a commit that referenced this issue Oct 10, 2023

#1152 - refactor investigationSearch e2e tests to use real API

2d460b9

louise-davies added a commit that referenced this issue Oct 11, 2023

#1152 - fix more e2e tests

6d50f49

louise-davies added a commit that referenced this issue Oct 12, 2023

#1152 - update datasetSearch tests and improve datafile & investigati…

7cab324

…on tests

louise-davies added a commit that referenced this issue Oct 17, 2023

#1152 - fix saving filter params when switching tabs

4df9c64

Made it more efficient by bundling it all together in the current tab change

louise-davies added a commit that referenced this issue Oct 17, 2023

#1152 - improve facet count look & don't display facet counts for ins…

83a3d64

…truments @ dataset/dataset level

louise-davies added a commit that referenced this issue Oct 17, 2023

#1152 - add e2e tests for saved filters & no instrument facet count

98ec307

louise-davies added a commit that referenced this issue Feb 21, 2024

#1152 - move @types/lodash.isequal to dependencies to fix docker build

0ed2ff7

louise-davies added a commit that referenced this issue Feb 21, 2024

#1152 - move @types/lodash.isequal to dependencies to fix docker build

9e8621c

louise-davies added a commit that referenced this issue Feb 23, 2024

#1152 - improve search load more rows e2e test

5b45788

louise-davies added a commit that referenced this issue Feb 23, 2024

#1152 - remove unused functionality from count endpoints

161e8de

louise-davies added a commit that referenced this issue Feb 23, 2024

#1152 - fix typo in e2e test

d8d2ed3

louise-davies added a commit that referenced this issue Feb 29, 2024

#1152 - fix anon search

efd2d9b

louise-davies added a commit that referenced this issue Feb 29, 2024

#1152 - fix search e2e tests broken by restrict change

cebbadd

louise-davies added a commit that referenced this issue Feb 29, 2024

#1152 - Guard against empty arrays for investigationfacilitycycles

ac458fb

louise-davies added a commit that referenced this issue May 15, 2024

#1152 - fix colour of facet accordian headers in dark mode

06539eb

louise-davies added a commit that referenced this issue May 17, 2024

#1152 - fix overflow issues when filter chips were present

1971ea4

also some other minor styling stuff
