Fix issue of storing too many docs during IR-eval.: Maintain topk with heaps #1715

Merged: 5 commits merged into UKPLab:master on Oct 13, 2022

Conversation

@kwang2049 (Member) commented Oct 5, 2022

In the current version, InformationRetrievalEvaluator and util.semantic_search accumulate all the docs from every top-k retrieved chunk in the query results. In InformationRetrievalEvaluator:

for name, score_function in self.score_functions.items():
    pair_scores = score_function(query_embeddings, sub_corpus_embeddings)

    # Get top-k values
    pair_scores_top_k_values, pair_scores_top_k_idx = torch.topk(pair_scores, min(max_k, len(pair_scores[0])), dim=1, largest=True, sorted=False)
    pair_scores_top_k_values = pair_scores_top_k_values.cpu().tolist()
    pair_scores_top_k_idx = pair_scores_top_k_idx.cpu().tolist()

    for query_itr in range(len(query_embeddings)):
        for sub_corpus_id, score in zip(pair_scores_top_k_idx[query_itr], pair_scores_top_k_values[query_itr]):
            corpus_id = self.corpus_ids[corpus_start_idx+sub_corpus_id]
            queries_result_list[name][query_itr].append({'corpus_id': corpus_id, 'score': score})

and in util.semantic_search:

for query_itr in range(len(cos_scores)):
    for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
        corpus_id = corpus_start_idx + sub_corpus_id
        query_id = query_start_idx + query_itr
        queries_result_list[query_id].append({'corpus_id': corpus_id, 'score': score})

In other words, queries_result_list will hold #chunks * top-K docs for each query instead of just top-K (for MS MARCO with corpus_chunk_size = 50000, that is 8.8M / 50K = 176 chunks, i.e. 176,000 docs per query at top-K = 1000). This causes a severe memory burden for a large corpus, e.g. 60GB+ when evaluating on MS MARCO. One can simulate how much RAM this costs:

import random
import psutil

def entry():
	return {'corpus_id': random.randint(0, 100000), 'score': random.random()}

msmarco_docs = 8800000
corpus_chunk_size = 50000
top_k = 1000
queries = 100
# queries = 7000

print("Available RAM (GB), before:", psutil.virtual_memory().available / 1024 / 1024 / 1024)

# Accumulate #chunks * top_k result dicts per query, as the current code does
query_results = [entry() for _ in range(msmarco_docs // corpus_chunk_size * top_k * queries)]

print("Available RAM (GB), after:", psutil.virtual_memory().available / 1024 / 1024 / 1024)

# Available RAM (GB), before: 16.977794647216797
# Available RAM (GB), after: 11.836997985839844

This PR fixes the issue by efficiently maintaining exactly top-K docs per query with heaps throughout the retrieval process. A test has been run to make sure the new code yields the same final results / scores:
https://colab.research.google.com/drive/1ibA6hjfXKsl97L1wA_FlT1HnVFhnRw0L?usp=sharing
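
As a rough sketch of the idea (simplified variable names and a hypothetical helper, not the exact PR diff): each query keeps a min-heap of at most top-K (score, corpus_id) pairs, so a new candidate can only displace the current weakest entry when its score is higher.

import heapq

def push_topk(heap, k, score, corpus_id):
    # Keep at most k (score, corpus_id) pairs; the smallest score sits at heap[0].
    if len(heap) < k:
        heapq.heappush(heap, (score, corpus_id))
    else:
        # Push the new pair and pop the smallest of the k+1 candidates,
        # so the heap only changes when score beats the current minimum.
        heapq.heappushpop(heap, (score, corpus_id))

# Inside the per-chunk loop (sketch):
# for sub_corpus_id, score in zip(top_k_idx[query_itr], top_k_values[query_itr]):
#     push_topk(queries_result_list[query_itr], top_k, score, corpus_start_idx + sub_corpus_id)

This way each query's result list never grows beyond top-K entries, regardless of how many corpus chunks are processed.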

@kwang2049 changed the title from "Fix the issue of storing too many docs during IR-evaluation: Maintain topk with heaps" to "Fix issue of storing too many docs during IR-eval.: Maintain topk with heaps" on Oct 5, 2022
@kwang2049 requested a review from nreimers on October 6, 2022 09:27
@kwang2049 (Member, Author) commented:

For the same reason mentioned in beir-cellar/beir#117 (comment), I have updated the PR to use heapq.heappushpop instead of heapq.heapreplace. The Colab notebook in the PR description has also been updated. (0cb2720)
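
For context, a toy illustration of the difference (not code from this PR): heapq.heapreplace always evicts the current minimum before pushing the new item, even when the new item scores lower than everything already in the heap, which would corrupt a top-k heap; heapq.heappushpop pushes first and then pops the smallest, so a weaker candidate never displaces a stronger one.

import heapq

heap = [(0.5, 'a'), (0.7, 'b'), (0.9, 'c')]
heapq.heapify(heap)

# heapq.heapreplace(heap, (0.1, 'x')) would evict (0.5, 'a') and keep the worse (0.1, 'x').

# heappushpop pops the new pair right back out, because 0.1 is below the current minimum:
heapq.heappushpop(heap, (0.1, 'x'))
print(sorted(heap))  # [(0.5, 'a'), (0.7, 'b'), (0.9, 'c')]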

@nreimers merged commit 71b9c43 into UKPLab:master on Oct 13, 2022