Fix issue of storing too many docs during IR-eval.: Maintain topk with heaps #1715

Merged: 5 commits merged into UKPLab:master on Oct 13, 2022

Conversation

@kwang2049 (Member) commented Oct 5, 2022

In the current version, InformationRetrievalEvaluator and util.semantic_search accumulate all the docs from every top-k retrieved chunk in the query results. In InformationRetrievalEvaluator:

for name, score_function in self.score_functions.items():
    pair_scores = score_function(query_embeddings, sub_corpus_embeddings)

    # Get top-k values
    pair_scores_top_k_values, pair_scores_top_k_idx = torch.topk(pair_scores, min(max_k, len(pair_scores[0])), dim=1, largest=True, sorted=False)
    pair_scores_top_k_values = pair_scores_top_k_values.cpu().tolist()
    pair_scores_top_k_idx = pair_scores_top_k_idx.cpu().tolist()

    for query_itr in range(len(query_embeddings)):
        for sub_corpus_id, score in zip(pair_scores_top_k_idx[query_itr], pair_scores_top_k_values[query_itr]):
            corpus_id = self.corpus_ids[corpus_start_idx+sub_corpus_id]
            queries_result_list[name][query_itr].append({'corpus_id': corpus_id, 'score': score})

and in util.semantic_search:

for query_itr in range(len(cos_scores)):
    for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
        corpus_id = corpus_start_idx + sub_corpus_id
        query_id = query_start_idx + query_itr
        queries_result_list[query_id].append({'corpus_id': corpus_id, 'score': score})

In other words, queries_result_list will hold #chunks * top-K docs for each query instead of just top-K (for MS MARCO with corpus_chunk_size = 50000, that is 8.8M / 50K = 176 chunks, i.e. 176,000 docs per query at top-K = 1000). This causes a severe memory burden for a large corpus, e.g. 60GB+ when evaluating on MS MARCO. One can simulate how much RAM this costs:

import random
import psutil

def entry():
	return {'corpus_id': random.randint(0, 100000), 'score': random.random()}

msmarco_docs = 8800000
corpus_chunk_size = 50000
top_k = 1000
queries = 100
# queries = 7000

print("Available RAM (GB), before:", psutil.virtual_memory().available / 1024 / 1024 / 1024)

# Accumulate #chunks * top_k result dicts per query, as the current code does
query_results = [entry() for _ in range(msmarco_docs // corpus_chunk_size * top_k * queries)]

print("Available RAM (GB), after:", psutil.virtual_memory().available / 1024 / 1024 / 1024)

# Available RAM (GB), before: 16.977794647216797
# Available RAM (GB), after: 11.836997985839844

This PR fixes the issue by efficiently maintaining exactly top-K docs per query with heaps throughout the retrieval process. A test has been run to make sure the new code yields the same final results / scores:
https://colab.research.google.com/drive/1ibA6hjfXKsl97L1wA_FlT1HnVFhnRw0L?usp=sharing
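
As a rough sketch of the idea (simplified variable names and a hypothetical helper, not the exact PR diff): each query keeps a min-heap of at most top-K (score, corpus_id) pairs, so a new candidate can only displace the current weakest entry when its score is higher.

import heapq

def push_topk(heap, k, score, corpus_id):
    # Keep at most k (score, corpus_id) pairs; the smallest score sits at heap[0].
    if len(heap) < k:
        heapq.heappush(heap, (score, corpus_id))
    else:
        # Push the new pair and pop the smallest of the k+1 candidates,
        # so the heap only changes when score beats the current minimum.
        heapq.heappushpop(heap, (score, corpus_id))

# Inside the per-chunk loop (sketch):
# for sub_corpus_id, score in zip(top_k_idx[query_itr], top_k_values[query_itr]):
#     push_topk(queries_result_list[query_itr], top_k, score, corpus_start_idx + sub_corpus_id)

This way each query's result list never grows beyond top-K entries, regardless of how many corpus chunks are processed.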

@kwang2049 changed the title from "Fix the issue of storing too many docs during IR-evaluation: Maintain topk with heaps" to "Fix issue of storing too many docs during IR-eval.: Maintain topk with heaps" on Oct 5, 2022
@kwang2049 requested a review from nreimers on October 6, 2022 09:27
@kwang2049 (Member, Author) commented:

For the same reason mentioned in beir-cellar/beir#117 (comment), I have updated the PR to use heapq.heappushpop instead of heapq.heapreplace. The Colab notebook in the PR description has also been updated. (0cb2720)
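
For context, a toy illustration of the difference (not code from this PR): heapq.heapreplace always evicts the current minimum before pushing the new item, even when the new item scores lower than everything already in the heap, which would corrupt a top-k heap; heapq.heappushpop pushes first and then pops the smallest, so a weaker candidate never displaces a stronger one.

import heapq

heap = [(0.5, 'a'), (0.7, 'b'), (0.9, 'c')]
heapq.heapify(heap)

# heapq.heapreplace(heap, (0.1, 'x')) would evict (0.5, 'a') and keep the worse (0.1, 'x').

# heappushpop pops the new pair right back out, because 0.1 is below the current minimum:
heapq.heappushpop(heap, (0.1, 'x'))
print(sorted(heap))  # [(0.5, 'a'), (0.7, 'b'), (0.9, 'c')]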

@nreimers merged commit 71b9c43 into UKPLab:master on Oct 13, 2022