Operating System
macOS
Version Information
(modeleval) (base) MengMacBook-M3MaxPro:model_evaluation wanmeng$ pip install azureml-metrics[generative-ai]
Requirement already satisfied: azureml-metrics[generative-ai] in /Users/wanmeng/miniconda3/envs/modeleval/lib/python3.11/site-packages (0.0.57)
Steps to reproduce
# Azure OpenAI configuration for the gpt-4o deployment (key redacted)
OPENAI_API_BASE = "https://gpt4o-eliz-westus3.openai.azure.com/"
OPENAI_API_TYPE = "azure"
OPENAI_API_KEY = "<redacted>"
OPENAI_API_VERSION = "2024-02-01"  # assumed value; the script below uses this variable but the original report never defines it
deployment_id = "gpt-4o"
from azureml.metrics import compute_metrics, constants
from pprint import pprint
import os
y_test = [["4", "2 + 2 = 4"], ["Agra", "Agra, India"]]

y_pred = [
    [
        {"role": "user", "content": "What is the value of 2 + 2?"},
        {
            "role": "assistant",
            "content": "2 + 2 = 4",
            "context": {
                "citations": [
                    {
                        "id": "math_document1.md",
                        "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4",
                    }
                ]
            },
        },
    ],
    [
        {"role": "user", "content": "Where is Taj Mahal located?"},
        {
            "role": "assistant",
            "content": "Taj Mahal is located in Agra, India",
            "context": {
                "citations": [
                    {
                        "id": "taj_mahal_document1.md",
                        "content": "Taj Mahal is located in Agra, India and is one of the seven wonders of the world.",
                    }
                ]
            },
        },
    ],
]
openai_params = {
    "api_version": OPENAI_API_VERSION,
    "api_base": OPENAI_API_BASE,
    "api_type": OPENAI_API_TYPE,
    "api_key": OPENAI_API_KEY,
    "deployment_id": deployment_id,
}
metrics_config = {
    "openai_params": openai_params,
    "score_version": "v1",
    "use_chat_completion_api": True,
    # RAG-based metrics to compute
    "metrics": ["gpt_relevance", "gpt_groundedness", "gpt_retrieval_score"],
}

# The same metrics can also be computed by setting task_type to
# constants.Tasks.RAG_EVALUATION (see the sketch after this script).
result = compute_metrics(
    task_type=constants.Tasks.CHAT_COMPLETION,
    y_test=y_test,
    y_pred=y_pred,
    **metrics_config,
)
pprint(result)
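As a possible workaround until this is fixed, the comment above notes that the same metrics are also available through the RAG_EVALUATION task type. A minimal sketch, assuming RAG_EVALUATION accepts the same y_test/y_pred payload and openai_params as the CHAT_COMPLETION call above (I have not verified the expected input shape for that task type):

# Workaround sketch: request the same RAG metrics via RAG_EVALUATION.
# Assumption: the conversation-style y_pred above is accepted unchanged.
rag_result = compute_metrics(
    task_type=constants.Tasks.RAG_EVALUATION,
    y_test=y_test,
    y_pred=y_pred,
    **metrics_config,
)
pprint(rag_result["metrics"])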
Expected behavior
'metrics': {'mean_gpt_groundedness': 5.0,
            'mean_gpt_relevance': 5.0,
            'mean_gpt_retrieval_score': 5.0}}
Actual behavior
'metrics': {'mean_gpt_groundedness': 5.0,
            'mean_gpt_relevance': 5.0,
            'mean_gpt_retrieval_score': nan}}
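To narrow down which conversation produces the nan, it may help to inspect the per-instance scores rather than only the means. A minimal sketch, assuming the result dict also carries per-row values under an "artifacts" key (the exact key layout is an assumption on my part, not confirmed against the azureml-metrics docs):

import math

# Assumption: per-instance scores live under result["artifacts"];
# adjust the key if azureml-metrics reports them elsewhere.
for name, values in result.get("artifacts", {}).items():
    if not isinstance(values, list):
        continue
    bad = [i for i, v in enumerate(values)
           if isinstance(v, float) and math.isnan(v)]
    if bad:
        print(f"{name}: nan at instance index(es) {bad}")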
Additional information
A customer will be using these metrics next week, so please help fix this problem. Thanks a lot!