Skip to content

Latest commit

 

History

History
81 lines (61 loc) · 3.41 KB

README.md

File metadata and controls

81 lines (61 loc) · 3.41 KB

repobench logo repobench logo

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

ICLR 2024


This directory hosts the datasets for subsequet versions of RepoBench. We are committed to updating RepoBench regularly, with updates scheduled every 3 months.

🌇 Overview

  • Our primary focus is on next-line prediction tasks to aid in code auto-completion. If your research requires retrieval data, please don't hesitate to reach out to us for collaboration.

  • Our datasets will be hosted on 🤗 HuggingFace, making them easily accessible for everyone.

  • Each data point within our datasets is categorized based on the prompt length (number of tokens), which is determined by OpenAI's GPT-4 tokenizer using tiktoken. Here's a detailed table illustrating the levels we've defined:

    Level Prompt Length (Number of Tokens)
    2k 640 - 1,600
    4k 1,600 - 3,600
    8k 3,600 - 7,200
    12k 7,200 - 10,800
    16k 10,800 - 14,400
    24k 14,400 - 21,600
    32k 21,600 - 28,800
    64k 28,800 - 57,600
    128k 57,600 - 100,000
  • We hereby provide the official implementation for constructing prompts here. Please note that the methods provided are not necessarily the optimal way of construction. Reordering, retrieval argumentation, or employing different cropping/construction techniques could potentially lead to varying degrees of improvement. Ensure that the evaluations are conducted fairly.

📚 Versions

RepoBench v1.1

RepoBench v1.1 includes data collected from GitHub between October 6, 2023, and December 31, 2023. To mitigate the data leakage and memorization issues, we conducted a deduplication process on the Stack v2 (coming soon) based on the file content.

You can access RepoBench v1.1 at the following links:

Or, you can load the data directly from the HuggingFace Hub using the following code:

from datasets import load_dataset

# Load the Python dataset
python_dataset = load_dataset("tianyang/repobench_python_v1.1")

# Load the Java dataset
java_dataset = load_dataset("tianyang/repobench_java_v1.1")

RepoBench v1.2

Cooming soon...

📝 Citation

If you use RepoBench in your research, please cite the following paper:

@misc{liu2023repobench,
      title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems}, 
      author={Tianyang Liu and Canwen Xu and Julian McAuley},
      year={2024},
      url={https://arxiv.org/abs/2306.03091},
      booktitle={International Conference on Learning Representations}
}