This repository contains a Jupyter Notebook, which you can see live at https://eff.org/ai/metrics. It collects problems and metrics / datasets from the artificial intelligence and machine learning research literature, and tracks progress on them. You can use it to see how things are progressing in specific subfields or in AI/ML as a whole, as a place to report new results you've obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source to build on for data science projects.
At EFF we're also interested in collecting this data to understand the likely implications of AI, but to begin with we're focused on gathering it.
Original authors: Peter Eckersley and Yomna Nasser at EFF
With contributions from: Gennie Gebhart and Owain Evans
Inspired by and merging data from:
- Rodrigo Benenson's "Who is the Best at X / Are we there yet?" collating machine vision datasets & progress
- Jack Clark and Miles Brundage's collection of AI progress measurements
- Sarah Constantin's Performance Trends in AI
- Katja Grace's Algorithmic Progress in Six Domains
- The Swedish Computer Chess Association's History of Computer Chess performance
- Qi Wu et al.'s Visual Question Answering: A survey of Methods and Datasets
- Eric Yuan's Comparison of Machine Reading Comprehension Datasets
This notebook is an open source, community effort. You can help by adding new metrics, data and problems to it! If you're feeling ambitious you can also improve its semantics or build new analyses into it. Here are some high-level tips on how to do that.
If you've already worked a lot with git and IPython/Jupyter Notebooks, here's a quick list of things you'll need to do:
- Install Jupyter Notebook and git.
  - On an Ubuntu or Debian system, you can do:
    sudo apt-get install git
    sudo apt-get install ipython-notebook || sudo apt-get install jupyter-notebook || sudo apt-get install python-notebook
  - Make sure you have IPython Notebook version 3 or higher. If your OS doesn't provide it, you might need to enable backports, or use pip to install it.
- Install this notebook's Python dependencies:
  - On Ubuntu or Debian, do:
    sudo apt-get install python-{cssselect,lxml,matplotlib{,-venn},numpy,requests,seaborn}
  - On other systems, use your native OS packages, or use pip:
    pip install cssselect lxml matplotlib{,-venn} numpy requests seaborn
- Fork our repo on GitHub: https://github.com/AI-metrics/AI-metrics#fork-destination-box
- Clone the repo on your machine, and cd into the directory it's using.
- Configure your copy of git to use IPython Notebook merge filters to prevent conflicts when multiple people edit the Notebook simultaneously. You can do that with these two commands in the cloned repo:
  git config --file .gitconfig filter.clean_ipynb.clean $PWD/ipynb_drop_output
  git config --file .gitconfig filter.clean_ipynb.smudge cat
- Run Jupyter Notebook in the project directory (the command may be ipython notebook, jupyter notebook, jupyter-notebook, or python notebook depending on your system), then go to localhost:8888 and edit the Notebook to your heart's content.
- Save and commit your work (git commit -a -m "DESCRIPTION OF WHAT YOU CHANGED").
- Push it to your remote repo.
- Send us a pull request!
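To give a feel for what a git "clean" filter for notebooks does in the configuration step above, here is a minimal sketch in Python. It is not the repository's ipynb_drop_output script, which is not reproduced here. It assumes the notebook uses the nbformat 4 layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"), and the function names drop_outputs and clean_notebook_text are made up for illustration.

```python
import json


def drop_outputs(notebook):
    """Clear outputs and execution counts from every code cell (nbformat 4 assumed)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clean_notebook_text(text):
    """Take notebook JSON as a string and return it with outputs stripped."""
    cleaned = drop_outputs(json.loads(text))
    # Stable key order and indentation keep diffs between commits small.
    return json.dumps(cleaned, indent=1, sort_keys=True) + "\n"


# A tiny made-up notebook, used only to demonstrate the function.
sample = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "code", "source": "1 + 1",
         "outputs": [{"output_type": "execute_result"}], "execution_count": 7},
        {"cell_type": "markdown", "source": "Some notes"},
    ],
}
result = json.loads(clean_notebook_text(json.dumps(sample)))
print(result["cells"][0]["outputs"], result["cells"][0]["execution_count"])
```

Because cell outputs never reach the committed file, two people who ran the Notebook with different results only conflict where they actually edited source cells. The smudge side is set to cat, so files are checked out unchanged.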
Microsoft Azure has an IPython / Jupyter service that will let you run and modify notebooks from their servers. You can clone this Notebook and work with it via their service: https://notebooks.azure.com/EFForg/libraries/ai-progress. Unfortunately there are a few issues with running the notebook on Azure:
- arXiv seems to block requests from Azure's IP addresses, so it's impossible to automatically extract information about papers when running the Notebook there
- The Azure Notebooks service seems to transform Unicode characters in strange ways, creating extra work merging changes from that source
- Each .measure() call is a data point of a specific algorithm on a specific metric/dataset. Thus one paper will often produce multiple measurements on multiple metrics. It's most important to enter results that were at or near the frontier of best performance on the date they were published. This isn't a strict requirement, though: it's also nice to have a sense of the performance of the field, or of algorithms that are otherwise notable even if they aren't the frontier for a specific problem.
- When multiple revisions of a paper (typically on arXiv) have the same results on some metric, use the date of the first version (the CBTest results in this paper are an example)
- When subsequent revisions of a paper improve on the original results (example), use the date and scores of the first results, or if each revision is interesting / on the frontier of best performance, include each paper
- We didn't check this carefully for our first ~100 measurement data points :(. In order to denote when we've checked which revision of an arXiv preprint first published a result, cite the specific version (https://arxiv.org/abs/1606.01549v3 rather than https://arxiv.org/abs/1606.01549); that way we can see which previous entries should be double-checked for this form of inaccuracy.
- Where possible, use a clear short name or acronym for each algorithm. The full paper name can go in the papername field (and is auto-populated for some papers). When matplotlib 2.1 ships we may be able to get nice rollovers with metadata like this. Or perhaps we can switch to D3 to get that type of interactivity.
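The arXiv versioning convention in the notes above can be checked mechanically. The following is a small sketch, not part of the Notebook: the helper names arxiv_version and unversioned_arxiv_urls are hypothetical, and the pattern only assumes abstract URLs of the form shown in the example (https://arxiv.org/abs/ followed by an identifier and an optional vN suffix).

```python
import re

# An arXiv abstract URL with an optional trailing version such as "v3".
ARXIV_ABS = re.compile(r"^https?://arxiv\.org/abs/(?P<id>\S+?)(?P<version>v\d+)?$")


def arxiv_version(url):
    """Return the version suffix cited in an arXiv URL (e.g. 'v3'), or None."""
    match = ARXIV_ABS.match(url)
    return match.group("version") if match else None


def unversioned_arxiv_urls(urls):
    """Return the arXiv abstract URLs that do not cite a specific version."""
    return [u for u in urls if ARXIV_ABS.match(u) and arxiv_version(u) is None]


urls = [
    "https://arxiv.org/abs/1606.01549v3",
    "https://arxiv.org/abs/1606.01549",
]
print(unversioned_arxiv_urls(urls))
```

Entries flagged this way are the ones whose first-publication date may still need double-checking against the paper's revision history.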
- If you know of ML datasets/metrics that aren't included yet, add them
- If there are papers with interesting results for metrics that aren't included, add them
- If you know of important problems that humans can solve, and machine learning systems may or may not yet be able to, and they're missing from our taxonomy, you can propose them
- Look at our GitHub issue list, perhaps starting with those tagged as good volunteer tasks.
- You can also add missing conferences / journals to the venue-to-date mapping table (unhide the source code and search for conference_dates).
Q: What's the point of this project? How does it tie in with the EFF's mission?
Given that machine learning tools and AI techniques are increasingly part of our everyday lives, it is critical that journalists, policy makers, and technology users understand the state of the field. When improperly designed or deployed, machine learning methods can violate privacy, threaten safety, and perpetuate inequality and injustice. Stakeholders must be able to anticipate such risks and policy questions before they arise, rather than playing catch-up with the technology. To this end, it’s part of the responsibility of researchers, engineers, and developers in the field to help make information about their life-changing research widely available and understandable.
Q: Why haven't you included dataset X?
There are a tiny number of us and this is a large task! If you'd like to add more data, please send us a pull request.
Q: Do you track other things besides how well-solved particular tasks are? For instance, the speed and efficiency of training?
No, but we'd love to. If you are motivated to help organize that data, please dive in and improve the notebook!
Q: Have you thought about how to visualise this data and make it more accessible?
We've considered a variety of things, but decided that the IPython notebook was ultimately the most accessible for now. We're very open to suggestions about visualizations and accessibility, so feel free to reach out if you have any ideas!
Also, if you'd like to build visualizations on top of this project, all the data we use is available in the easily digestible JSON format in progress.json. If you do so, let us know and we'll try to link to it.
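As a starting point for working with progress.json, here is a hedged sketch that loads the file and reports its top-level shape. The schema is not described in this document, so the code makes no assumptions about field names. The helper describe_json is hypothetical, and the file path assumes you are running from the root of a cloned copy of the repo.

```python
import json
import os


def describe_json(path):
    """Load a JSON file and return (top-level type name, size, first few keys)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Only dicts have keys; lists and scalars report an empty key list.
    keys = sorted(data)[:5] if isinstance(data, dict) else []
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size, keys


# Only runs when the file is present, e.g. in a cloned copy of the repo.
if os.path.exists("progress.json"):
    print(describe_json("progress.json"))
```

Inspecting the structure this way first is a cheap way to decide how to feed the data into a plotting library.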
Q: Is this an EFF project?
Yes, but we'd like it to grow to be a self-sustaining community effort supported by a coalition of organizations. EFF did the initial work of making the Notebook, but we built on several excellent datasets collected by many other people, and had a number of productive collaborative discussions, most especially with people at OpenAI and the Future of Humanity Institute, in preparing the document. We will strive to keep the authorship section of the Notebook accurate as others continue to contribute.
Q: When will artificial general intelligence happen?
We don't know, and this project is not meant to answer that question. Instead, we’re interested in compiling data to guide evidence-based conversations about the state of the art in various corners of AI and machine learning research.