[BFCL] Leaderboard Update, 11/17/2024 #748

Merged
merged 6 commits into from
Nov 19, 2024

Conversation

@HuanzhiMao HuanzhiMao added the BFCL-Website BFCL Leaderboard Website label Nov 9, 2024
@HuanzhiMao HuanzhiMao marked this pull request as ready for review November 14, 2024 02:32
@CharlieJCJ CharlieJCJ self-requested a review November 14, 2024 05:13
Collaborator

@Fanjia-Yan Fanjia-Yan left a comment

LGTM

@CharlieJCJ
Collaborator

CharlieJCJ commented Nov 14, 2024

![changes_heatmap](https://github.com/user-attachments/assets/ac3f1994-b6c0-4e4c-ae09-823a667562fc)
![multi_turn_acc_table_heatmap](https://github.com/user-attachments/assets/8b644f07-5c4e-4b14-9320-ee26ea507414)

DIFF of 11_9 and 10_21 versions.

Need to double-check the models whose scores changed a lot.
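
For anyone repeating this check locally, here is a minimal sketch of how large score changes between two leaderboard versions could be flagged. It is not the script used to produce the heatmaps above; the function name `score_changes`, the model names, the scores, and the 5-point threshold are all hypothetical.

```python
def score_changes(old, new, threshold=5.0):
    """Return {model: delta} for models present in both versions whose
    score moved by at least `threshold` points, largest change first."""
    flagged = {}
    # Only compare models that appear in both versions; newly added
    # models have no previous score to diff against.
    for model in old.keys() & new.keys():
        delta = round(new[model] - old[model], 2)
        if abs(delta) >= threshold:
            flagged[model] = delta
    # Sort by absolute change so the biggest movers come first.
    return dict(sorted(flagged.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Made-up scores for illustration only (not real leaderboard data).
scores_10_21 = {"model-a": 55.0, "model-b": 40.0, "model-c": 62.5}
scores_11_9 = {"model-a": 56.0, "model-b": 48.5, "model-c": 51.0, "model-d": 30.0}

flagged = score_changes(scores_10_21, scores_11_9)
for model, delta in flagged.items():
    print(f"{model}: {delta:+.2f}")
```

Models that exist in only one version are skipped here; whether they should be reviewed separately is a judgment call.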

@HuanzhiMao HuanzhiMao added the DO NOT MERGE Not ready to be merged label Nov 15, 2024
@CharlieJCJ
Collaborator

TODO @CharlieJCJ: After #761 result generation, publish another DIFF graph.

@HuanzhiMao HuanzhiMao removed the DO NOT MERGE Not ready to be merged label Nov 17, 2024
@CharlieJCJ
Collaborator

CharlieJCJ commented Nov 17, 2024

Updated heatmaps after human review and the addition of #760 and #761.

![changes_heatmap](https://github.com/user-attachments/assets/636f019b-2955-4e5b-936f-27a35d945fcc)
![multi_turn_acc_table_heatmap](https://github.com/user-attachments/assets/c2e4a391-727f-47e7-a8db-228f1b766d74)

cc @HuanzhiMao @Fanjia-Yan @ShishirPatil

@CharlieJCJ
Collaborator

CharlieJCJ commented Nov 17, 2024

Also including the non-live and live statistics here, for more visibility into how the Gemini models' scores changed due to #760 and #764.

![non-live_ast_acc_table_heatmap](https://github.com/user-attachments/assets/e2c9214b-72ee-45d1-9136-fc6c08f46663)
![non-live_exec_acc_table_heatmap](https://github.com/user-attachments/assets/ef6660b0-4913-4745-997a-bfffc95b145f)
![live_acc_table_heatmap](https://github.com/user-attachments/assets/7bd4f645-9271-411e-9bc1-6e4641bdcb6a)

@CharlieJCJ
Collaborator

And @HuanzhiMao, can you update the date for the PR, since more recent PRs are now included?

@HuanzhiMao HuanzhiMao changed the title [BFCL] Leaderboard Update, 11/09/2024 [BFCL] Leaderboard Update, 11/17/2024 Nov 17, 2024
@CharlieJCJ
Collaborator

CharlieJCJ commented Nov 19, 2024

Collaborator

@CharlieJCJ CharlieJCJ left a comment

LGTM

Labels
BFCL-Website BFCL Leaderboard Website
3 participants