A tool set for fast and efficient git scanning to capture data with focus on large repos

discoveryjs/scan-git

@discoveryjs/scan-git

@discoveryjs/scan-git is a powerful Node.js library designed for reading and analyzing Git repositories directly from the filesystem. It provides a rich set of APIs that allow you to access Git objects, references, commits, trees, and more without the need for Git command-line tools or external dependencies.

Whether you're building tools for repository analysis, visualization, or automation, @discoveryjs/scan-git provides a robust and efficient API to meet your Git interaction needs.

Key Features:

  • Direct Repository Access: Interact with Git repositories by reading data directly from the .git directory.
  • Comprehensive Git Object Support: Work with both loose and packed objects, including support for large pack files over 2GB.
  • Advanced Git Features: Handle complex repository structures with support for cruft packs and on-disk reverse indexes.
  • Efficient Data Retrieval: Efficiently fetch commit histories, branches, tags, and files, even for large repositories.
  • Flexible APIs: Compute diffs between commits, read specific Git objects, and parse commits, trees, and annotated tags.

Usage

npm install @discoveryjs/scan-git

API


Git reader

import { createGitReader } from '@discoveryjs/scan-git';

const reader = await createGitReader('path/to/.git');
const commits = await reader.log({ ref: 'my-branch', depth: 10 });

console.log(commits);

await reader.dispose();

createGitReader(gitdir, options?)

Creates an instance of the Git reader, which provides access to most of the library's functionality:

  • gitdir: string
    The path to the Git repository. This can either be a directory containing a .git folder or a direct path to a .git folder (even if it has a non-standard name).
  • options (optional):
    • maxConcurrency: number (default: 50)
      Limits the number of concurrent file system operations
    • cruftPacks: 'include' | 'exclude' | 'only' | boolean (default: 'include')
      Defines how cruft packs are processed:
      • 'include' or true – Process all packs
      • 'exclude' or false – Exclude cruft packs from processing
      • 'only' – Process only cruft packs
import { createGitReader } from '@discoveryjs/scan-git';

const reader = await createGitReader('path/to/.git');
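
The three cruftPacks modes can be pictured as a filter over the pack files found in objects/pack. The sketch below is not the library's code: normalizeMode and selectPacks are hypothetical helpers, and it assumes a cruft pack can be recognized by a .mtimes file sitting next to its .pack file, which is how Git marks cruft packs on disk.

```javascript
// Hypothetical helper: map the option value to one of the three modes.
function normalizeMode(cruftPacks = 'include') {
    if (cruftPacks === true) return 'include';
    if (cruftPacks === false) return 'exclude';
    return cruftPacks;
}

// Hypothetical helper: choose pack files from a directory listing.
// Assumption: a pack is a cruft pack when a sibling `.mtimes` file exists.
function selectPacks(filenames, cruftPacks) {
    const mode = normalizeMode(cruftPacks);
    const names = new Set(filenames);

    return filenames
        .filter((name) => name.endsWith('.pack'))
        .filter((name) => {
            const isCruft = names.has(name.replace(/\.pack$/, '.mtimes'));

            if (mode === 'only') return isCruft;
            if (mode === 'exclude') return !isCruft;
            return true; // 'include'
        });
}

const listing = [
    'pack-aaa.pack', 'pack-aaa.idx',
    'pack-bbb.pack', 'pack-bbb.idx', 'pack-bbb.mtimes'
];

console.log(selectPacks(listing, 'exclude')); // [ 'pack-aaa.pack' ]
```
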

reader.dispose()

Cleans up resources used by the reader instance, such as file handles or caches. This method should be called when the reader instance is no longer needed to ensure proper resource management and avoid memory leaks.

const reader = await createGitReader('path/to/.git');

// do something with reader

// Dispose of the repository instance when done
await reader.dispose();

Note: After calling dispose(), attempting to use the reader instance (e.g., calling methods like log() or readCommit()) will likely result in errors or undefined behavior.

Note: Always ensure dispose() is called in applications or scripts that manage multiple repositories or long-running processes to prevent resource exhaustion.


Reference methods

Common parameters:

  • ref: string – a reference to an object in repository
  • withOid: boolean – a flag to include resolved oid for a reference
reader.defaultBranch()

Returns the default branch name of a repository:

const defaultBranch = await reader.defaultBranch();
// 'main'

The algorithm to identify a default branch name:

  • if there is only one branch, it is the default
  • otherwise, look for specific branch names, in this order:
    • upstream/HEAD
    • origin/HEAD
    • main
    • master
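
The selection order above can be sketched as a small pure function. This is an illustration under stated assumptions, not the library's implementation: pickDefaultBranch is a hypothetical helper, remoteHeads is an assumed map from a symbolic remote ref to the branch it points at, and returning null when nothing matches is a guess.

```javascript
// Hypothetical helper illustrating the documented lookup order.
// `branches`    - local branch names, e.g. ['main', 'dev']
// `remoteHeads` - assumed map such as { 'origin/HEAD': 'main' }
function pickDefaultBranch(branches, remoteHeads = {}) {
    // A single branch is the default by definition.
    if (branches.length === 1) {
        return branches[0];
    }

    // Remote HEAD pointers come first, upstream before origin.
    for (const ref of ['upstream/HEAD', 'origin/HEAD']) {
        if (remoteHeads[ref] !== undefined) {
            return remoteHeads[ref];
        }
    }

    // Then the conventional branch names.
    for (const name of ['main', 'master']) {
        if (branches.includes(name)) {
            return name;
        }
    }

    return null; // assumption: no candidate matched
}

console.log(pickDefaultBranch(['dev', 'master', 'main'])); // main
console.log(pickDefaultBranch(['dev', 'master'], { 'origin/HEAD': 'trunk' })); // trunk
```
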
reader.currentBranch()

Returns the current branch name along with its commit oid. If the repository is in a detached HEAD state, name will be null.

const currentBranch = await reader.currentBranch();
// { name: 'main', oid: '8bb6e23769902199e39ab70f2441841712cbdd62' }

const detachedHead = await reader.currentBranch();
// { name: null, oid: '8bb6e23769902199e39ab70f2441841712cbdd62' }
reader.isRefExists(ref)

Checks if a ref exists.

const isValidRef = reader.isRefExists('main');
// true
reader.expandRef(ref)

Expands a ref into its full form, e.g. 'main' -> 'refs/heads/main'. Returns null if the ref doesn't exist. Symbolic ref names ('HEAD', 'FETCH_HEAD', 'CHERRY_PICK_HEAD', 'MERGE_HEAD' and 'ORIG_HEAD') are returned unchanged.

const fullPath = reader.expandRef('heads/main');
// 'refs/heads/main'
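
To make the expansion concrete, here is a minimal sketch that tries candidate paths in the order Git itself documents for short ref names. Whether the library uses exactly this order is an assumption; expandRefSketch and the existingRefs set are hypothetical stand-ins for a real ref lookup.

```javascript
const SYMBOLIC_REFS = new Set([
    'HEAD', 'FETCH_HEAD', 'CHERRY_PICK_HEAD', 'MERGE_HEAD', 'ORIG_HEAD'
]);

// Hypothetical helper: `existingRefs` is a Set of full ref paths.
function expandRefSketch(ref, existingRefs) {
    // Symbolic names are returned unchanged.
    if (SYMBOLIC_REFS.has(ref)) {
        return ref;
    }

    // Candidate order follows Git's documented short-name resolution rules.
    const candidates = [
        ref,
        `refs/${ref}`,
        `refs/tags/${ref}`,
        `refs/heads/${ref}`,
        `refs/remotes/${ref}`,
        `refs/remotes/${ref}/HEAD`
    ];

    return candidates.find((candidate) => existingRefs.has(candidate)) ?? null;
}

const refs = new Set(['refs/heads/main', 'refs/remotes/origin/HEAD']);

console.log(expandRefSketch('main', refs));       // refs/heads/main
console.log(expandRefSketch('heads/main', refs)); // refs/heads/main
console.log(expandRefSketch('missing', refs));    // null
```
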
reader.resolveRef(ref)

Resolves ref into an oid if it exists, otherwise throws an exception. If ref is already an oid, it is returned as is. If ref is not a full path, it is expanded first.

const oid = await reader.resolveRef('main');
// '8bb6e23769902199e39ab70f2441841712cbdd62'
reader.describeRef(ref)

Returns an info object for provided ref.

const headInfo = await reader.describeRef('HEAD');
// {
//   path: 'HEAD',
//   name: 'HEAD',
//   symbolic: true,
//   ref: 'refs/heads/test',
//   oid: '2dbee47a8d4f8d39e1168fad951b703ee05614d6'
// }

const mainInfo = await reader.describeRef('main');
// {
//   path: 'refs/heads/main',
//   name: 'main',
//   symbolic: false,
//   scope: 'refs/heads',
//   namespace: 'refs',
//   category: 'heads',
//   remote: null,
//   ref: null,
//   oid: '7b84f676f2fbea2a3c6d83924fa63059c7bdfbe2'
// }

const originHeadInfo = await reader.describeRef('origin/HEAD');
// {
//   path: 'refs/remotes/origin/HEAD',
//   name: 'HEAD',
//   symbolic: false,
//   scope: 'refs/remotes',
//   namespace: 'refs',
//   category: 'remotes',
//   remote: 'origin',
//   ref: 'refs/remotes/origin/main',
//   oid: '7b84f676f2fbea2a3c6d83924fa63059c7bdfbe2'
// }
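
The scope, namespace, category and remote fields in the outputs above look as if they are derived from the expanded ref path. A minimal sketch of such a split, assuming paths always have the refs/<category>/... shape; splitRefPath is a hypothetical helper and not part of the library:

```javascript
// Hypothetical helper: derive descriptive fields from a full ref path.
function splitRefPath(path) {
    const [namespace, category, ...rest] = path.split('/');
    // For remote refs the first remaining segment is the remote name.
    const remote = category === 'remotes' ? rest.shift() : null;

    return {
        path,
        name: rest.join('/'),
        scope: `${namespace}/${category}`,
        namespace,
        category,
        remote
    };
}

console.log(splitRefPath('refs/remotes/origin/HEAD'));
// {
//   path: 'refs/remotes/origin/HEAD',
//   name: 'HEAD',
//   scope: 'refs/remotes',
//   namespace: 'refs',
//   category: 'remotes',
//   remote: 'origin'
// }
```
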
reader.isOid(value)

Checks if a value is a valid oid.

reader.isOid('7b84f676f2fbea2a3c6d83924fa63059c7bdfbe2'); // true
reader.isOid('main'); // false
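
For intuition, a check like this can be as simple as a regular expression. The sketch assumes SHA-1 object IDs written as 40 lowercase hexadecimal characters; isOidSketch is a hypothetical name, and the library's actual rules (for example, regarding uppercase input) may differ.

```javascript
// Assumption: an oid is a SHA-1 digest written as 40 lowercase hex characters.
function isOidSketch(value) {
    return typeof value === 'string' && /^[0-9a-f]{40}$/.test(value);
}

console.log(isOidSketch('7b84f676f2fbea2a3c6d83924fa63059c7bdfbe2')); // true
console.log(isOidSketch('main')); // false
```
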
reader.listRemotes()

Get a list of remote names.

const remotes = reader.listRemotes();
// [
//   'origin'
// ]
reader.listRemoteBranches(remote, withOid?)

Get a list of branches for a remote.

const originBranches = await reader.listRemoteBranches('origin');
// [
//   'HEAD',
//   'main'
// ]

const originBranchesWithOid = await reader.listRemoteBranches('origin', true);
// [
//   { name: 'HEAD', oid: '7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e' },
//   { name: 'main', oid: '56ea7a808e35df13e76fee92725a65a373a9835c' }
// ]
reader.listBranches(withOid?)

Get a list of local branches.

const localBranches = await reader.listBranches();
// [
//   'HEAD',
//   'main'
// ]

const localBranchesWithOid = await reader.listBranches(true);
// [
//   { name: 'HEAD', oid: '7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e' },
//   { name: 'main', oid: '56ea7a808e35df13e76fee92725a65a373a9835c' }
// ]
reader.listTags(withOid?)

Get a list of tags.

const tags = await reader.listTags();
// [
//   'v1.0.0',
//   'some-feature'
// ]

const tagsWithOid = await reader.listTags(true);
// [
//   { name: 'v1.0.0', oid: '7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e' },
//   { name: 'some-feature', oid: '56ea7a808e35df13e76fee92725a65a373a9835c' }
// ]

Trees (file lists) methods

reader.treeOidFromRef(ref)

Resolves a Git reference (e.g., branch name, tag, commit, or SHA-1 hash) to the object ID (OID) of the corresponding tree.

  • ref: string – The reference, SHA-1 hash, or object ID to resolve.

Behavior:

  • If the reference points to an annotated tag, the method resolves the tag to its underlying object
  • If the reference resolves to a commit, the method retrieves the tree associated with the commit
  • If the reference resolves directly to a tree, the tree OID is returned
  • Throws an error if the resolved object is not a tree, commit, or tag
const treeOid = await reader.treeOidFromRef('HEAD');
// 'a1b2c3d4e5f6...'

// Error handling
try {
  const invalidTreeOid = await reader.treeOidFromRef('nonexistent-ref');
} catch (error) {
  console.error(error.message); // "Object 'nonexistent-ref' must be a 'tree' but ..."
}
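
The behavior listed above amounts to "peeling" an object until a tree is reached. Here is a minimal sketch over an in-memory object map; treeOidFromObject and the map layout are hypothetical, and the error texts only imitate the one shown above.

```javascript
// Hypothetical helper: `objects` is a Map of oid -> parsed object,
// standing in for a real object store.
function treeOidFromObject(oid, objects) {
    let current = oid;

    for (;;) {
        const obj = objects.get(current);

        if (obj === undefined) {
            throw new Error(`Object '${current}' not found`);
        }

        switch (obj.type) {
            case 'tag':
                current = obj.object; // follow the annotated tag to its target
                break;
            case 'commit':
                return obj.tree; // a commit points at its root tree
            case 'tree':
                return current; // already a tree
            default:
                throw new Error(`Object '${oid}' must be a 'tree' but got '${obj.type}'`);
        }
    }
}

const objects = new Map([
    ['tag1', { type: 'tag', object: 'commit1' }],
    ['commit1', { type: 'commit', tree: 'tree1' }],
    ['tree1', { type: 'tree' }],
    ['blob1', { type: 'blob' }]
]);

console.log(treeOidFromObject('tag1', objects)); // tree1
```
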
reader.listFiles(ref, filesWithHash)

List all files in the repository at the specified commit reference.

  • ref: string (default: 'HEAD') – commit reference
  • filesWithHash: boolean (default: false) – specify to return blob's hashes
const headFiles = await reader.listFiles(); // the same as reader.listFiles('HEAD')
// [ 'file.ext', 'path/to/file.ext', ... ]

const headFilesWithHashes = await reader.listFiles('HEAD', true);
// [ { path: 'file.ext', hash: 'f2e492a3049...' }, ... ]
reader.getPathEntry(path, ref)

Retrieve a tree entry (file or directory) by its path at the specified commit reference.

  • path: string - the path to the file or directory
  • ref: string (default: 'HEAD') - commit reference
const entry = await reader.getPathEntry('path/to/file.txt');
// { isTree: false, path: 'path/to/file.txt', hash: 'a1b2c3d4e5f6...' }
reader.getPathsEntries(paths, ref)

Retrieve a list of tree entries (files or directories) by their paths at the specified commit reference.

  • paths: string[] - an array of paths to files or directories
  • ref: string (default: 'HEAD') - commit reference
const entries = await reader.getPathsEntries([
  'path/to/file1.txt',
  'path/to/dir1',
  'path/to/file2.txt'
]);
// [
//   { isTree: false, path: 'path/to/file1.txt', hash: 'a1b2c3d4e5f6...' },
//   { isTree: true, path: 'path/to/dir1', hash: 'b1c2d3e4f5g6...' },
//   { isTree: false, path: 'path/to/file2.txt', hash: 'c1d2e3f4g5h6...' }
// ]
reader.deltaFiles(nextRef, prevRef)

Compute the file delta (changes) between two commit references, including added, modified, and removed files.

  • nextRef: string (default: 'HEAD') - commit reference for the "next" state
  • prevRef: string (optional) - commit reference for the "previous" state
const fileDelta = await reader.deltaFiles('HEAD', 'branch-name');
// {
//   add: [ { path: 'path/to/new/file.txt', hash: 'a1b2c3d4e5f6...' }, ... ],
//   modify: [ { path: 'path/to/modified/file.txt', hash: 'f1e2d3c4b5a6...', prevHash: 'a1b2c3d4e5f6...' }, ... ],
//   remove: [ { path: 'path/to/removed/file.txt', hash: 'a1b2c3d4e5f6...' }, ... ]
// }
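
The shape of the result can be reproduced by comparing two flat file lists of { path, hash } entries, as in the sketch below. This only illustrates the add/modify/remove classification; diffFileLists is a hypothetical helper, and a real implementation may well compare trees directly so that unchanged subtrees are skipped.

```javascript
// Hypothetical helper: classify changes between two lists of { path, hash }.
function diffFileLists(nextFiles, prevFiles) {
    const prev = new Map(prevFiles.map((file) => [file.path, file.hash]));
    const delta = { add: [], modify: [], remove: [] };

    for (const { path, hash } of nextFiles) {
        if (!prev.has(path)) {
            delta.add.push({ path, hash });
            continue;
        }

        const prevHash = prev.get(path);

        if (prevHash !== hash) {
            delta.modify.push({ path, hash, prevHash });
        }

        prev.delete(path); // seen in both states
    }

    // Whatever is left exists only in the previous state.
    for (const [path, hash] of prev) {
        delta.remove.push({ path, hash });
    }

    return delta;
}

const delta = diffFileLists(
    [{ path: 'a.txt', hash: '1' }, { path: 'b.txt', hash: '2x' }],
    [{ path: 'b.txt', hash: '2' }, { path: 'c.txt', hash: '3' }]
);

console.log(delta.add);    // [ { path: 'a.txt', hash: '1' } ]
console.log(delta.modify); // [ { path: 'b.txt', hash: '2x', prevHash: '2' } ]
console.log(delta.remove); // [ { path: 'c.txt', hash: '3' } ]
```
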

Commit methods

reader.commitOidFromRef(ref)

Resolves a Git reference (e.g., branch name, tag, or SHA-1 hash) to the object ID (OID) of the corresponding commit.

  • ref: string – The reference, SHA-1 hash, or object ID to resolve.

Behavior:

  • If the reference points to an annotated tag, the method resolves the tag to its underlying commit.
  • Throws an error if the reference does not resolve to a valid commit.
const commitOid = await reader.commitOidFromRef('HEAD');
// '7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e'

// Error handling
try {
  const invalidCommitOid = await reader.commitOidFromRef('nonexistent-ref');
} catch (error) {
  console.error(error.message); // "Object 'nonexistent-ref' must be a 'commit' but ..."
}
reader.readCommit(ref)

Reads and resolves a commit object identified by a reference (e.g., branch name, tag, or SHA-1 hash).

  • ref: string – The reference, SHA-1 hash, or object ID of the commit.
const commit = await reader.readCommit('HEAD');
// {
//     oid: '7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e',
//     tree: '20596d5c9e037844ae2b707a4a1cb45c72e70e7f',
//     parent: ['8bb6e23769902199e39ab70f2441841712cbdd62'],
//     author: { name: 'John Doe', email: 'john.doe@example.com', timestamp: 1680390225, timezone: '+0200' },
//     committer: { name: 'Jane Doe', email: 'jane.doe@example.com', timestamp: 1680392225, timezone: '+0200' },
//     message: 'Initial commit',
//     gpgsig: '-----BEGIN PGP SIGNATURE-----...'
// }
reader.log(options)

Returns a list of commits in topological order, starting from the specified reference.

  • options: An object with the following properties:
    • ref: string (default: 'HEAD') – The reference, SHA-1 hash, or object ID to start from.
    • depth: number (default: 50) – Limits the number of commits to retrieve. Pass Infinity to retrieve all reachable commits.
const commits = await reader.log({ ref: 'my-branch', depth: 10 });
// [
//     { oid: 'a1b2c3d4...', tree: '...', parent: [...], author: {...}, committer: {...}, message: '...' },
//     { oid: 'b2c3d4e5...', tree: '...', parent: [...], author: {...}, committer: {...}, message: '...' },
//     ...
// ]

To retrieve all commits reachable from a ref, set the depth option to Infinity.

const allCommits = await reader.log({ ref: 'my-branch', depth: Infinity });
console.log(allCommits.length); // All reachable commits

Misc methods

reader.readObjectHeaderByHash(hash)

Reads and returns the header of a Git object by its hash.

  • hash: Buffer – The SHA-1 hash of the object
const hash = Buffer.from('8bb6e23769902199e39ab70f2441841712cbdd62', 'hex');
const header = await reader.readObjectHeaderByHash(hash);
// { type: 'commit', length: 123 }
reader.readObjectByHash(hash, cache?)

Reads and returns the complete content of a Git object by its hash.

  • hash: Buffer – The SHA-1 hash of the object
  • cache: boolean (optional) – Whether to use reader's caching (default: true)
const hash = Buffer.from('8bb6e23769902199e39ab70f2441841712cbdd62', 'hex');
const object = await reader.readObjectByHash(hash);
// { type: 'blob', object: <Buffer ...> }
reader.readObjectHeaderByOid(oid)

Reads and returns the header of a Git object by its OID (Object ID).

  • oid: string – The Object ID of the Git object
const header = await reader.readObjectHeaderByOid('8bb6e23769902199e39ab70f2441841712cbdd62');
// { type: 'tree', length: 45 }
reader.readObjectByOid(oid, cache?)

Reads and returns the complete content of a Git object by its OID.

  • oid: string – The Object ID of the Git object.
  • cache: boolean (optional) – Whether to use reader's caching (default: true).
const object = await reader.readObjectByOid('8bb6e23769902199e39ab70f2441841712cbdd62');
// { type: 'tree', object: <Buffer ...> }
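
Judging by the signatures above, the ByHash and ByOid variants differ in how the object is identified: a hash is the raw 20-byte SHA-1 digest in a Buffer, while an oid is the same digest as a 40-character hex string. Converting between the two needs only Node's built-in Buffer API:

```javascript
const oid = '8bb6e23769902199e39ab70f2441841712cbdd62';

// oid (hex string) -> hash (raw bytes)
const hash = Buffer.from(oid, 'hex');
console.log(hash.length); // 20

// hash (raw bytes) -> oid (hex string)
const roundTrip = hash.toString('hex');
console.log(roundTrip === oid); // true
```
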
reader.stat()

Retrieves repository statistics, including refs, objects, and files.

const stats = await reader.stat();
/*
{
    size: 163937,
    refs: {
        remotes: [
            { remote: "origin", branches: ["HEAD", "main", ...] },
            ...
        ],
        branches: ["main", "foo", "bar", ...],
        tags: ["tag1", "tag2", ...]
    },
    objects: {
        count: 322,
        size: 145569,
        unpackedSize: 446973,
        unpackedRestoredSize: 755430,
        types: [
            { type: "tree", count: 23, size: 7537, unpackedSize: 8929, unpackedRestoredSize: 0 },
            ...
        ],
        loose: {
            objects: { count: 19, size: 15407, unpackedSize: 40312, unpackedRestoredSize: 0, types: [...] },
            files: [
                {
                    path: "objects/20/596d5c9e037844ae2b707a4a1cb45c72e70e7f",
                    size: 536,
                    object: { oid: "20596d5c9e037844ae2b707a4a1cb45c72e70e7f", type: "tree", length: 606 }
                },
                ...
            ]
        },
        packed: {
            objects: { ... },
            files: [
                {
                    path: "objects/pack/pack-43bc2b9ae5b7a56ab22e849c6c1dfaa00ba72ab1.pack",
                    size: 130194,
                    objects: { ... },
                    index: {
                        path: "objects/pack/pack-43bc2b9ae5b7a56ab22e849c6c1dfaa00ba72ab1.idx",
                        size: 9556,
                        namesBytes: 6060,
                        offsetsBytes: 1212,
                        largeOffsetsBytes: 0
                    },
                    reverseIndex: {
                        path: "objects/pack/pack-43bc2b9ae5b7a56ab22e849c6c1dfaa00ba72ab1.rev",
                        size: 1264
                    }
                },
                ...
            ]
        }
    },
    files: [
        { path: 'config', size: 123 },
        { path: 'objects/pack/pack-a1b2c3d4.pack', size: 456789 },
        { path: 'refs/heads/main', size: 45 }
    ]
}
*/

Utils

isGitDir(dir)

Checks whether the specified directory is a valid Git directory. Returns true if the directory contains the necessary files and subdirectories to be a valid Git directory (e.g. objects, refs, HEAD, and config), false otherwise.

  • dir: string – The path to the directory to check.
import { isGitDir } from '@discoveryjs/scan-git';

const isValidGitDir = await isGitDir('/path/to/repo/.git');
console.log(isValidGitDir); // true or false

resolveGitDir(dir)

Resolves the path to the Git directory for the specified input directory.

  • dir: string – The path to the directory to resolve.

Behavior:

  • If the input directory contains a .git subdirectory, the method resolves to its path
  • If no .git subdirectory is found, it resolves the input directory itself, assuming it's already the .git directory
  • Throws an error if the input path doesn't exist or isn't a directory
import { resolveGitDir } from '@discoveryjs/scan-git';

try {
  const gitDir = await resolveGitDir('/path/to/repo');
  console.log(gitDir); // '/path/to/repo/.git' or '/path/to/repo'
} catch (error) {
  console.error(error.message);
}

parseContributor(input)

Parses a string representation of a Git contributor into a structured object.

  • input: string – A contributor string in the format Name <email> timestamp timezone
import { parseContributor } from '@discoveryjs/scan-git';

const contributor = parseContributor('John Doe <john.doe@example.com> 1680390225 +0200');
// {
//     name: 'John Doe',
//     email: 'john.doe@example.com',
//     timestamp: 1680390225,
//     timezone: '+0200'
// }
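
To show what the parsing involves, here is a minimal regex-based sketch of the Name <email> timestamp timezone format. It is not the library's implementation: parseContributorSketch is a hypothetical name, and throwing on malformed input is an assumption.

```javascript
// Hypothetical re-implementation for illustration only.
function parseContributorSketch(input) {
    const match = input.match(/^(.*) <(.*)> (\d+) ([+-]\d{4})$/);

    if (match === null) {
        // Assumption: malformed input is reported with an exception.
        throw new Error(`Unexpected contributor format: ${input}`);
    }

    const [, name, email, timestamp, timezone] = match;

    return { name, email, timestamp: Number(timestamp), timezone };
}

console.log(parseContributorSketch('Jane Roe <jane@example.com> 1680390225 -0330'));
// {
//   name: 'Jane Roe',
//   email: 'jane@example.com',
//   timestamp: 1680390225,
//   timezone: '-0330'
// }
```
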

parseTimezone(offset)

Parses a Git timezone offset string into a numeric offset in minutes.

  • offset: string – A timezone string in the format +hhmm or -hhmm.
import { parseTimezone } from '@discoveryjs/scan-git';

const timezoneOffset = parseTimezone('+0200');
console.log(timezoneOffset); // 120
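
The conversion itself is simple arithmetic: hours times 60 plus minutes, with the sign applied. A sketch that matches the '+0200' -> 120 example above; timezoneToMinutes is a hypothetical name, and how the library treats malformed input is not covered here.

```javascript
// Hypothetical helper: convert '+hhmm' / '-hhmm' into minutes.
function timezoneToMinutes(offset) {
    const match = /^([+-])(\d{2})(\d{2})$/.exec(offset);

    if (match === null) {
        throw new Error(`Unexpected timezone format: ${offset}`);
    }

    const [, sign, hours, minutes] = match;
    const total = Number(hours) * 60 + Number(minutes);

    return sign === '-' ? -total : total;
}

console.log(timezoneToMinutes('+0200')); // 120
console.log(timezoneToMinutes('-0330')); // -210
```
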

parseAnnotatedTag(object)

Parses a buffer representing an annotated Git tag into a structured object.

  • object: Buffer – The tag object buffer.
import { parseAnnotatedTag } from '@discoveryjs/scan-git';

const tagObject = await reader.readObjectByOid('7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e');
const tag = parseAnnotatedTag(tagObject.object);
// {
//     tag: 'v1.0.0',
//     type: 'tag',
//     object: 'a1b2c3d4e5f6g7h8i9j0',
//     tagger: { name: 'John Doe', email: 'john.doe@example.com', timestamp: 1680390225, timezone: '+0200' },
//     message: 'Initial release',
//     gpgsig: '-----BEGIN PGP SIGNATURE-----...'
// }

parseCommit(object)

Parses a buffer representing a Git commit into a structured object.

  • object: Buffer – The commit object buffer.
import { parseCommit } from '@discoveryjs/scan-git';

const commitObject = await reader.readObjectByOid('7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e');
const commit = parseCommit(commitObject.object);
// {
//     tree: 'a1b2c3d4e5f6g7h8i9j0',
//     parent: ['b2c3d4e5f6g7h8i9j0k1'],
//     author: { name: 'John Doe', email: 'john.doe@example.com', timestamp: 1680390225, timezone: '+0200' },
//     committer: { name: 'John Doe', email: 'john.doe@example.com', timestamp: 1680390225, timezone: '+0200' },
//     message: 'Fix a critical bug',
//     gpgsig: '-----BEGIN PGP SIGNATURE-----...'
// }
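
A raw commit object is a block of "key value" header lines, a blank line, and then the message; multi-line header values such as gpgsig continue on lines that start with a space. The sketch below splits a buffer along those rules. It is a simplified illustration, not the library's parser: parseCommitSketch is a hypothetical name, and author and committer are left as raw strings instead of being parsed into objects.

```javascript
// Hypothetical, simplified parser for a raw commit object.
function parseCommitSketch(buffer) {
    const text = buffer.toString('utf8');
    const split = text.indexOf('\n\n'); // headers end at the first blank line
    const headerText = split === -1 ? text : text.slice(0, split);
    const commit = {
        tree: null,
        parent: [],
        message: split === -1 ? '' : text.slice(split + 2)
    };
    let lastKey = null;

    for (const line of headerText.split('\n')) {
        // Continuation of a multi-line header value (e.g. gpgsig)
        if (line.startsWith(' ') && lastKey !== null) {
            commit[lastKey] += '\n' + line.slice(1);
            continue;
        }

        const space = line.indexOf(' ');
        const key = line.slice(0, space);
        const value = line.slice(space + 1);

        if (key === 'parent') {
            commit.parent.push(value); // a merge commit has several parents
            lastKey = null;
        } else {
            commit[key] = value;
            lastKey = key;
        }
    }

    return commit;
}

const rawCommit = Buffer.from(
    'tree aaa\nparent bbb\nparent ccc\n' +
    'author A <a@example.com> 1 +0000\n' +
    'committer A <a@example.com> 1 +0000\n\nHello\n'
);

console.log(parseCommitSketch(rawCommit).parent); // [ 'bbb', 'ccc' ]
```
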

parseTree(buffer)

Parses a buffer representing a Git tree object into a structured array of entries.

  • buffer: Buffer – The tree object buffer.
import { parseTree } from '@discoveryjs/scan-git';

const treeObject = await reader.readObjectByOid('7c2a62cdbc2ef28afaaed3b6f3aef9b581e5aa8e');
const tree = parseTree(treeObject.object);
// [
//     { isTree: true, path: 'src', hash: <Buffer ...> },
//     { isTree: false, path: 'README.md', hash: <Buffer ...> }
// ]
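
For reference, a Git tree object is a sequence of entries of the form <mode> <name>\0<20-byte hash>, where directories use mode 40000. The sketch below walks that layout. It assumes SHA-1 (20-byte) hashes, and parseTreeSketch is a hypothetical re-implementation, not the library's code.

```javascript
// Hypothetical parser for the raw tree layout: "<mode> <name>\0<20-byte hash>"
function parseTreeSketch(buffer) {
    const entries = [];
    let offset = 0;

    while (offset < buffer.length) {
        const space = buffer.indexOf(0x20, offset); // end of mode
        const nul = buffer.indexOf(0x00, space);    // end of name
        const mode = buffer.toString('latin1', offset, space);
        const path = buffer.toString('utf8', space + 1, nul);
        const hash = buffer.subarray(nul + 1, nul + 21); // assumption: SHA-1

        entries.push({ isTree: mode === '40000', path, hash });
        offset = nul + 21;
    }

    return entries;
}

// Build a tiny tree buffer by hand to try the parser out.
const rawTree = Buffer.concat([
    Buffer.from('40000 src\0'), Buffer.alloc(20, 1),
    Buffer.from('100644 README.md\0'), Buffer.alloc(20, 2)
]);

console.log(parseTreeSketch(rawTree).map((entry) => entry.path)); // [ 'src', 'README.md' ]
```
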

Features and comparison

scan-git | isomorphic-git | Feature
✅ | ✅ | loose refs
✅ | ✅ | packed refs
🚫 | ✅ | index file. Speeds up fetching the file list for HEAD
✅ | ✅ | loose objects
✅ | ✅ | packed objects (*.pack + *.idx files)
✅ | 🚫 | 2GB+ packs support. Version 2 pack-*.idx files support packs larger than 4 GiB by adding an optional table of 8-byte offset entries for large offsets
✅ | 🚫 | On-disk reverse indexes (*.rev files). A reverse index speeds up operations such as seeking an object by offset or scanning objects in pack order
🚫 | 🚫 | multi-pack-index (MIDX). Stores a list of objects and their offsets into multiple packfiles, can provide O(log N) lookup time for any number of packfiles
🚫 | 🚫 | multi-pack-index reverse indexes (RIDX). Similar to the pack-based reverse index
✅ | 🚫 | Cruft packs. A cruft pack eliminates the need for storing unreachable objects in a loose state by including the per-object mtimes in a separate file alongside a single pack containing all loose objects
🚫 | 🚫 | Pack and multi-pack bitmaps. Bitmaps store reachability information about the set of objects in a packfile, or a multi-pack index
🚫 (TBD) | 🚫 | commit-graph. A binary file format that creates a structured representation of Git's commit history and optimizes some operations
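
To illustrate the large-offset mechanism mentioned for 2GB+ packs: in a version 2 pack index each object has a 4-byte offset entry, and when its most significant bit is set, the remaining 31 bits are an index into the table of 8-byte offsets. A minimal sketch over raw buffers follows; resolvePackOffset and its parameters are hypothetical, and this is not the library's reader.

```javascript
// Hypothetical helper.
// `offsets`      - the 4-byte offset table of a v2 .idx file
// `largeOffsets` - the optional 8-byte offset table
function resolvePackOffset(index, offsets, largeOffsets) {
    const value = offsets.readUInt32BE(index * 4);

    // MSB clear: the value is the pack offset itself.
    if ((value & 0x80000000) === 0) {
        return value;
    }

    // MSB set: the low 31 bits index into the 8-byte table.
    const largeIndex = value & 0x7fffffff;

    // Number() is exact only up to 2^53 - 1, which is ample for pack sizes.
    return Number(largeOffsets.readBigUInt64BE(largeIndex * 8));
}

const offsets = Buffer.alloc(8);
offsets.writeUInt32BE(1234, 0);       // object 0: small offset
offsets.writeUInt32BE(0x80000000, 4); // object 1: entry 0 of the large table

const largeOffsets = Buffer.alloc(8);
largeOffsets.writeBigUInt64BE(5000000000n, 0);

console.log(resolvePackOffset(0, offsets, largeOffsets)); // 1234
console.log(resolvePackOffset(1, offsets, largeOffsets)); // 5000000000
```
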

License

MIT
