Some contributors appear several times under a different name #47

Closed
vhoulbreque opened this issue May 5, 2018 · 17 comments · Fixed by #85

Comments

@vhoulbreque

I tried this project on https://github.com/vinzeebreak/ironcar

What I did:

git-of-theseus-analyze ironcar
git-of-theseus-stack-plot authors.json

And I get this:

[Image: stack_plot]

But several of these authors are the same person (and each appears under only one name in GitHub's list of commits):

  • Houlbreque, Vincent Houlbrèque, Vinzeebreak and vinzeebreak
  • Hugo Masclet, Hugoo, Masclet Hugo

Shouldn't they appear under the same name?

@vhoulbreque changed the title from "Contributors appear several times" to "Some contributors appear several times under a different name" on May 6, 2018
@andilar

andilar commented Nov 29, 2018

I have the same issue. I tried adding a .mailmap file, but it made no difference.

@erikbern
Owner

weird, i thought .mailmap would do the trick

feel free to investigate

@andilar

andilar commented Nov 29, 2018

OK, thanks. What I found is that if the .mailmap has just one entry, it is recognized. Also, git shortlog -sne gives the correct output with my full .mailmap.
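A similar check can be scripted. The following is a minimal sketch, not part of git-of-theseus: it relies on git's own `%aN`/`%aE` format placeholders (which apply .mailmap) versus `%an`/`%ae` (which do not), and the helper names `count_identities` and `author_identities` are made up for illustration.

```python
import subprocess
from collections import Counter


def count_identities(log_output):
    """Count commits per "Name <email>" line, skipping blank lines."""
    return Counter(line for line in log_output.splitlines() if line.strip())


def author_identities(repo_path, respect_mailmap=True):
    """Count commits per author identity as reported by `git log`."""
    # %aN / %aE apply .mailmap; %an / %ae give the raw values stored in each commit.
    fmt = "%aN <%aE>" if respect_mailmap else "%an <%ae>"
    result = subprocess.run(
        ["git", "log", f"--format={fmt}", "HEAD"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return count_identities(result.stdout)
```

Comparing `author_identities(path, respect_mailmap=False)` with `author_identities(path, respect_mailmap=True)` shows which identities git itself collapses, which helps separate a broken .mailmap from a tool that ignores it.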

@erikbern
Owner

weird, maybe gitpython doesn't parse .mailmap?

@tveon

tveon commented Jun 6, 2019

No, they don't: gitpython-developers/GitPython#764
But they also propose a solution...

@erikbern
Owner

erikbern commented Jun 6, 2019

feel free to commit a fix for this!

@martinib77

Does this problem persist? Is there any solution?
I didn't understand whether the .mailmap must be added to the git repo or can be used at the plot generation step.

@erikbern
Owner

erikbern commented Jan 1, 2020

Pretty sure the problem still exists, so feel free to try to fix it!

@dht

dht commented Mar 26, 2022

Workaround:
Use this JavaScript script to fix the authors.json file:

fix-authors.js

const fs = require("fs");
const authors = JSON.parse(fs.readFileSync("./authors.json"));

const labels = authors.labels;

const output = {
  ...authors,
};

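// Maps each label in authors.json to the canonical name it should be merged into.
const mailMap = {
  // Houlbreque is listed in the original report as the same person as Vinzeebreak.
  Houlbreque: "Vincent Houlbr",
  "Hugo Masclet": "Hugo Masclet",
  Hugoo: "Hugo Masclet",
  "Masclet Hugo": "Hugo Masclet",
  "Vincent Houlbr\u00e8que": "Vincent Houlbr",
  Vinzeebreak: "Vincent Houlbr",
  adizout: "adizout",
  mathrb: "mathrb",
  srdadian: "srdadian",
  vinzeebreak: "Vincent Houlbr",
};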

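const memo = {};
let memoIndex = 0;

// For each original label, the row index of its canonical author in the merged output.
const map = labels.map((name) => {
  // Labels missing from mailMap keep their own name instead of collapsing into "undefined".
  const toName = mailMap[name] ?? name;

  // Test for key presence, not truthiness: the first author gets index 0, which is falsy,
  // so a truthiness check would hand that author a new index on every later alias.
  if (!(toName in memo)) {
    memo[toName] = memoIndex++;
  }
  return memo[toName];
});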

output.y = output.y.reduce((output, item, index) => {
  const toMap = map[index];

  item.forEach((value, i2) => {
    output[toMap] = output[toMap] || [];
    output[toMap][i2] = output[toMap][i2] || 0;
    output[toMap][i2] += value;
  });

  return output;
}, []);

output.labels = Object.keys(memo);

fs.writeFileSync("./authors.out.json", JSON.stringify(output, null, 4));

Then you can plot with:

git-of-theseus-stack-plot authors.out.json --out stack.authors.png

@thehale

thehale commented Jul 8, 2022

I tried @dht's script, but ended up with some authors getting mixed up.

I wrote a comparable script in Python that could probably be converted into a PR without too much effort. (I just ran out of time to figure out how to integrate file paths with the CLI and the complexities of the analyze function.)

"""
Aggregates contribution data from the `authors.json` file generated
by the `git-of-theseus` tool using an `authors_map.json` file.

The `authors_map.json` file must have the following format:
{
    "authorA": ["aliasA", "aliasA2", ...],
    "authorB": ["aliasB", "aliasB2", ...],
}
"""
import json


def read_authors_map(path):
    with open(path, "r") as f:
        authors_map = json.load(f)
    return authors_map


def read_authors_json(path):
    with open(path, "r") as aj:
        authors_json = json.load(aj)
    return authors_json


def parse_raw_contributions(authors_json):
    """
    The `authors.json` has the following format
    {
        "y": [
            [<line_count1>, <line_count2>, ...],
            [<line_count1>, <line_count2>, ...],
            ...
        ],
        "ts": ["date1", "date2", ...],
        "labels": ["aliasA", "aliasB", ...]
    }

    Each author's line count over time is stored separately
    from the author list. The association is made by index.

    This function parses the `authors.json` into the following
    format:
    {
        "aliasA": [<line_count1>, <line_count2>, ...],
        "aliasB": [<line_count1>, <line_count2>, ...],
        ...
    }
    """
    raw_contributions = {}
    for idx, alias in enumerate(authors_json["labels"]):
        raw_contributions[alias] = authors_json["y"][idx]
    return raw_contributions


def aggregate_contributions(authors_map, raw_contributions):
    """
    Aggregates the contribution data from each `alias` in the
    `raw_contributions` based on the `authors_map`.

    Returns a dictionary of the following format:
    {
        "authorA": [<line_count1>, <line_count2>, ...],
        "authorB": [<line_count1>, <line_count2>, ...],
    }
    where the values of each `author` are the sum of the contribution
    data for each author's corresponding aliases in the `authors_map`.

    For example, if the author `authorA` has aliases `aliasA` and `aliasA2`,
    and the `raw_contributions` data looks like this:
    {
        "aliasA": [10, 20],
        "aliasA2": [5, 20],
    }
    then the aggregated contribution data will look like this:
    {
        "authorA": [15, 40],
    }
    """
    contributions = {}
    for author, aliases in authors_map.items():
        alias_contributions = [
            raw_contributions[a] for a in aliases if a in raw_contributions
        ]
        if len(alias_contributions) > 0:
            contributions[author] = [
                sum(ac[idx] for ac in alias_contributions)
                for idx in range(len(alias_contributions[0]))
            ]

    return contributions


def format_new_authors_json(authors_map, authors_json, contributions):
    """
    Formats the `contributions` data into the `authors.json` format.
    """
    return {
        "y": [
            contributions[author]
            for author in authors_map.keys()
            if author in contributions
        ],
        "ts": authors_json["ts"],
        "labels": [author for author in authors_map.keys() if author in contributions],
    }


def write_authors_json(path, authors_json):
    with open(path, "w") as f:
        json.dump(authors_json, f)


if __name__ == "__main__":
    authors_map = read_authors_map("authors_map.json")
    authors_json = read_authors_json("authors.json")
    raw_contributions = parse_raw_contributions(authors_json)
    contributions = aggregate_contributions(authors_map, raw_contributions)
    new_authors_json = format_new_authors_json(authors_map, authors_json, contributions)
    write_authors_json("authors.out.json", new_authors_json)
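As written, the script above emits only authors that appear as keys in `authors_map.json`; any label in `authors.json` that no entry lists as an alias is left out of the result. The following is a minimal sketch of a variant that passes unmapped labels through unchanged. The function name `merge_aliases` is an assumption made for illustration and is not part of git-of-theseus.

```python
def merge_aliases(authors_json, authors_map):
    """Sum per-alias series into per-author series, keeping unmapped labels."""
    # Invert {"author": ["alias", ...]} into {"alias": "author"}.
    alias_to_author = {
        alias: author
        for author, aliases in authors_map.items()
        for alias in aliases
    }

    merged = {}  # dicts keep insertion order, so authors stay in first-seen order
    for label, series in zip(authors_json["labels"], authors_json["y"]):
        # Labels that are nobody's alias are kept under their own name.
        author = alias_to_author.get(label, label)
        if author in merged:
            merged[author] = [a + b for a, b in zip(merged[author], series)]
        else:
            merged[author] = list(series)

    # Copy every other key (such as "ts") through untouched.
    return {
        **authors_json,
        "y": list(merged.values()),
        "labels": list(merged.keys()),
    }
```

The result has the same shape as `authors.json`, so it can be written out with `json.dump` and plotted the same way as `authors.out.json`.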

@erikbern
Owner

erikbern commented Jul 8, 2022

I think a mailmap file might resolve it, but I'm not sure

@Whathecode

@erikbern I tried:

  • adding a .mailmap file
  • checking it in (not certain this would be a requirement)
  • re-running git-of-theseus-analyze (not certain this would be a requirement)

But, the created graphs still don't disambiguate between authors using what is specified in .mailmap. I.e., it doesn't seem to work.

@thehale

thehale commented Jul 14, 2022

@Whathecode It doesn't look like git-of-theseus currently considers a .mailmap when computing author statistics. I understood erikbern's comment to mean that he would prefer a solution based on parsing a .mailmap over my proposed solution, which uses a custom JSON format.

@erikbern
Owner

erikbern commented Jul 14, 2022

I thought .mailmap would maybe work through the git library that git-of-theseus uses

I guess not? Would be nice to support .mailmap files!

Thanks for checking @Whathecode – really appreciate it!
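For anyone picking this up, here is a rough sketch of what .mailmap handling involves. It is not the solution proposed in the GitPython issue and not an existing git-of-theseus or GitPython API; `parse_mailmap` and `resolve` are hypothetical helpers. It covers the four documented line forms (`Proper Name <commit@email>`, `<proper@email> <commit@email>`, `Proper Name <proper@email> <commit@email>`, and `Proper Name <proper@email> Commit Name <commit@email>`) and does not try to reproduce git's behavior for duplicate or conflicting entries.

```python
import re

# One "Name <email>" pair; the name part may be empty.
_PAIR = re.compile(r"\s*([^<>]*?)\s*<([^<>]*)>")


def parse_mailmap(text):
    """Return (proper_name, proper_email, commit_name, commit_email) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        pairs = _PAIR.findall(line)
        if len(pairs) == 1:
            # "Proper Name <commit@email>": only the name is replaced.
            name, email = pairs[0]
            entries.append((name or None, None, None, email.lower()))
        elif len(pairs) == 2:
            (pname, pemail), (cname, cemail) = pairs
            entries.append(
                (pname or None, pemail or None, cname or None, cemail.lower())
            )
    return entries


def resolve(entries, name, email):
    """Map a commit's (name, email) to its canonical identity."""
    email_key = email.lower()  # emails are compared case-insensitively
    fallback = None
    for pname, pemail, cname, cemail in entries:
        if cemail != email_key:
            continue
        if cname is not None:
            # Name-and-email entries are the most specific, so they win outright.
            if cname.lower() == name.lower():
                return (pname or name, pemail or email)
        elif fallback is None:
            fallback = (pname or name, pemail or email)
    return fallback or (name, email)
```

An integration would call `resolve` on each blamed line's author name and email before counting, so that every alias is accumulated under one label.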

@owenlamont
Contributor

owenlamont commented Feb 8, 2023

I also just ran into this. The .mailmap issue is still unresolved at GitPython, and apparently that repo is now in maintenance mode and no longer actively maintained.

Not sure if that means the dependency will ultimately need to be swapped out, although I have no idea how big that job would be or what alternatives exist.

@thehale

thehale commented Feb 8, 2023

@owenlamont The maintainer of GitPython actively responds to PRs, including PRs for new features (I had one merged in a few months ago). If someone contributed .mailmap support to GitPython I'm reasonably confident it would be accepted.

@owenlamont
Contributor

Good to know, cheers. I kind of got mixed messages from the README as to how much it was still supported. I'll try to have a look at what is involved.
