1.3 billion commit messages extracted from GHTorrent dumps.
The dataset consists of 3 files:
commits.bin - commit hashes, 24GB.
repos.txt.xz - GitHub repository names, 5GB.
messages.txt.xz - commit messages, 17GB.
The precise number of commits is 1,288,456,749. There can be duplicate commits. Please contribute a deduplicated dataset if you can.
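If you attempt that, a minimal deduplication sketch over the hashes alone could look like the code below. This is an illustration only, not a tool from this repository: it relies on the 20-bytes-per-hash layout of commits.bin described in the format section below, and it keeps every unique hash in memory, which at the full dataset size would need tens of gigabytes of RAM (a real run would want an external sort instead).

def first_occurrence_indices(path="commits.bin"):
    # Hypothetical helper, not part of this repository.
    # Yields the index of the first occurrence of every commit hash.
    seen = set()
    index = 0
    with open(path, "rb") as f:
        while True:
            h = f.read(20)  # one 20-byte SHA-1 hash per record
            if len(h) < 20:
                break
            if h not in seen:
                seen.add(h)
                yield index
            index += 1

for i in first_occurrence_indices():
    print(i)

The emitted indices can then be used to filter the two text files, since all three files share the same order.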
commits.bin - continuous binary stream, 20 bytes per commit hash. The hashes are random by definition, so it makes no sense to compress this file.
repos.txt.xz - strings separated by \0 (the NULL character), xz-compressed. The order matches commits.bin. There is a trailing \0.
messages.txt.xz - strings separated by \0, xz-compressed. The order matches commits.bin. There is a trailing \0.
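As a sanity check, 1,288,456,749 commits × 20 bytes = 25,769,134,980 bytes, which is roughly 24 GiB and matches the stated size of commits.bin.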
Python:
import lzma
from custom_newline import CustomNewlineReader

with open("commits.bin", "rb") as commf:
    with CustomNewlineReader(lzma.open("repos.txt.xz"), b"\0") as reposf:
        with CustomNewlineReader(lzma.open("messages.txt.xz"), b"\0") as msgf:
            for msg, repo in zip(msgf, reposf):
                # the three files are aligned: the i-th hash, repository
                # name and message all describe the same commit
                commit = commf.read(20).hex()
                print(commit, repo.decode(), msg.decode())
custom_newline.py is included in this repository.
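The included file is the reference; purely as an illustration of the idea, a reader that iterates over \0-separated records of a binary stream could be sketched as follows (the class name and buffering strategy here are assumptions, not the repository's implementation):

class NulDelimitedReader:
    # A hypothetical stand-in for CustomNewlineReader; the real
    # custom_newline.py in this repository may differ.
    def __init__(self, fileobj, sep=b"\0", chunk_size=1 << 20):
        self.fileobj = fileobj
        self.sep = sep
        self.chunk_size = chunk_size

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.fileobj.close()

    def __iter__(self):
        buf = b""
        while True:
            chunk = self.fileobj.read(self.chunk_size)
            if not chunk:
                break
            buf += chunk
            records = buf.split(self.sep)
            buf = records.pop()  # incomplete tail, completed by the next chunk
            yield from records
        if buf:  # normally empty, because the files end with a trailing \0
            yield buf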
The source data are the GHTorrent MongoDB dumps published before 2019-03-18. The command to generate the dataset was:
(
for dd in 2019-03-17 2019-03-16 ... 2015-12-01; do
    wget -O - http://ghtorrent-downloads.ewi.tudelft.nl/mongo-daily/mongo-dump-$dd.tar.gz |
        tar -xzO dump/github/commits.bson
done
for dd in 2015-12-01 2015-10-03 2015-08-03; do
    wget -O - http://ghtorrent-downloads.ewi.tudelft.nl/mongo-full/commits-dump.$dd.tar.gz |
        tar -xzO dump/github/commits.bson
done
wget -O - http://ghtorrent-downloads.ewi.tudelft.nl/mongo-full/commits-1-dump.2015-08-04.tar.gz |
    tar -xzO dump/github/commits.bson
) | python3 parse.py
2019-03-17 2019-03-16 ... 2015-12-01 are the dump dates from ghtorrent.org/downloads.html.
parse.py is included in this repository.
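The included parse.py is the reference for how the BSON stream was converted; purely as an illustration, a converter with the same input and output shape could be sketched as below. It assumes the bson module shipped with pymongo and that GHTorrent commit documents mirror the GitHub API commit schema (sha, commit.message, and an api.github.com URL naming the repository); both of these are assumptions, not a description of the actual parse.py.

import lzma
import sys

import bson  # shipped with the pymongo package


def main():
    # Field names assume the GitHub API commit schema mirrored by
    # GHTorrent; the real parse.py may differ.
    with open("commits.bin", "wb") as commf, \
            lzma.open("repos.txt.xz", "wb") as reposf, \
            lzma.open("messages.txt.xz", "wb") as msgf:
        for doc in bson.decode_file_iter(sys.stdin.buffer):
            sha = doc.get("sha")
            message = doc.get("commit", {}).get("message")
            url = doc.get("url", "")
            if not sha or message is None:
                continue
            # url looks like https://api.github.com/repos/{owner}/{repo}/commits/{sha}
            repo = "/".join(url.split("/")[4:6])
            commf.write(bytes.fromhex(sha))
            reposf.write(repo.encode() + b"\0")
            msgf.write(message.encode() + b"\0")


if __name__ == "__main__":
    main()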