Using different models in evaluating model-graded eval and in generating the completion #1393

LoryPack · 2023-11-03T11:47:34Z

Describe the feature or improvement you're requesting

In general, the evaluation model and the model being evaluated don't have to be the same, though we will assume that they are here for ease of explanation.

However, I can't find any explanation of how to do this. Is this currently implemented?

Additional context

No response

LRudL · 2023-11-27T11:14:41Z

I recently struggled to get this to work too so I can share what I found.

This is currently implemented in the GitHub version of this repo, but not in the version on PyPI that you get by installing the library through a package manager: that release is many months out of date and has gpt-3.5-turbo hard-coded as the grader.

Lines 29-32 in evals/elsuite/modelgraded/classify.py show you how this feature is implemented: the last completion_fn given is treated as the evaluation function.

Completion functions in turn can be specified in a comma-separated string. The logic for this is at evals/cli/oaieval.py lines 142-145.

Concretely, a string like "gpt-4,gpt-3.5-turbo" seems to work for me to get gpt-4 to be the completer and gpt-3.5-turbo the one grading the responses.

However, be warned that there seems to be a slight bug where modelgraded eval execution can hang for a long time in a way that other evals don't (this seems unrelated to rate limits).

LoryPack · 2023-11-27T12:45:51Z

I opened a PR last week (#1418) that addresses this issue, but forgot to mention it here.

LRudL · 2023-11-27T15:50:32Z

Regarding #1418: A new PR is not necessary for setting the evaluating model (though the feature really should be documented), since the full relevant lines are:

        # treat last completion_fn as eval_completion_fn
        self.eval_completion_fn = self.completion_fns[-1]
        if len(self.completion_fns) > 1:
            self.completion_fns = self.completion_fns[:-1]

If you pass several completion functions (as a comma-separated list) into completion_fns, then the last one will be treated as the evaluating model.
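
A minimal sketch of how that rule plays out on plain model names. It assumes the comma-separated string is simply split on commas; the function name `assign_roles` is hypothetical and not part of the evals repo.

```python
from typing import List, Tuple


def assign_roles(completer_spec: str) -> Tuple[List[str], str]:
    """Mimic the rule quoted above on plain model names.

    Hypothetical helper for illustration only; not part of the evals repo.
    Assumption: the spec string is split on commas with no other processing.
    """
    completion_fns = completer_spec.split(",")
    # treat last completion_fn as eval_completion_fn
    eval_completion_fn = completion_fns[-1]
    # only drop the last entry from the completers when more than one was given
    if len(completion_fns) > 1:
        completion_fns = completion_fns[:-1]
    return completion_fns, eval_completion_fn


print(assign_roles("gpt-4,gpt-3.5-turbo"))  # (['gpt-4'], 'gpt-3.5-turbo')
print(assign_roles("gpt-4"))                # (['gpt-4'], 'gpt-4')
```

With two or more names, the last one is removed from the completers and only grades. With a single name, that model both completes and grades.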

LoryPack · 2023-11-27T16:26:47Z

But wouldn't the task be run on the passed completion functions if doing so?

LRudL · 2023-11-27T17:50:56Z

If you want to run the eval with modelA, and run the grading with modelB, then you can pass in the string "modelA,modelB" as the name of the completer.
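
As a rough sketch, the command line could be assembled like this. It assumes the GitHub version of the repo (where the last completion function acts as the grader) and the usual `oaieval <completion_fn> <eval>` argument order. The helper `build_oaieval_command` and the eval name `my-modelgraded-eval` are hypothetical placeholders.

```python
import shlex
from typing import List


def build_oaieval_command(completer: str, grader: str, eval_name: str) -> List[str]:
    """Build an oaieval invocation with separate completer and grader models.

    Hypothetical helper for illustration; `eval_name` stands in for whichever
    model-graded eval you want to run.
    """
    for name in (completer, grader):
        # a comma inside a name would be read as a separator between models
        if "," in name:
            raise ValueError(f"model name must not contain a comma: {name!r}")
    # the grader goes last, since the last completion function is used for grading
    return ["oaieval", f"{completer},{grader}", eval_name]


cmd = build_oaieval_command("modelA", "modelB", "my-modelgraded-eval")
print(shlex.join(cmd))  # oaieval modelA,modelB my-modelgraded-eval
```

The resulting list can be passed to `subprocess.run` if you want to launch the eval from a script.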

sahilrajput03 · 2024-10-18T20:23:39Z

Can anyone please help me with this? #1564
