Distributed Version Control

Aakash Goplani edited this page Jan 7, 2018 · 9 revisions

Topics Covered

Introduction

Hello, and welcome to How Git Works, module 4. We're only missing one last layer in our description of the Git onion, but it's a really important one: distribution. So far we pretended that there is only one computer in the world, the computer that you're running Git on. Now let's see what happens if you use Git the way it's used in practice, to share projects across multiple computers.

A World of Peers

  • Imagine that you have a Git repository on a computer somewhere. It's this orange box here. And you also want the same repository somewhere else, probably on a different machine, so you want to have it here. I made it green.
    image 1

  • Now, the machine that hosts the green repository must be able to connect to the machine that hosts the orange repository, so you might have some technical setup to do here. You have to run a Git daemon process on the orange repo so that the green repo can connect to it and so on and so forth.

  • Let's say the orange repository holds our project on GitHub, and the green repository, i.e. our local machine, needs a copy of it. So, I want to get a copy of the project on this computer, inside this empty directory. That's the job of the git clone command. It takes the address of a Git repository, which I can copy and paste from GitHub, and now I have the project.

$ git clone https://github.com/aakash14goplani/FullStack.wiki.git
  • All the files are here. But I didn't just get the files. I got the entire .git directory as well and all the files it contains.

  • Here is what git clone did. It created a directory for the cookbook, and it copied the .git directory from the GitHub project to this directory. It didn't literally copy each and every file. For example, by default git clone only creates one local branch, the master branch; the other branches on the remote are recorded as remote branches. If I want to work on those other branches locally, I need to give specific commands to do so. The important part is that Git did copy over the objects in the object database.

  • After copying this stuff, Git checked out the master branch to rebuild these files in the working area. Remember, the working area in Git is not very important. You can always rebuild it on the fly from the content of the .git directory. And since the .git directory contains the entire repository, now we have a copy of the project and its history on this computer.

  • Now that we have two clones of the repo, one on GitHub and one on this computer, both clones are equally good.

  • Git is not like Subversion or other traditional revision control systems, which need a centralized server that everyone else talks to. Instead, both computers now contain the whole project and its history. We could have as many of these clones as we want synchronizing with each other.

  • Of course, you can still decide that one specific clone is the most important one. For example, if you had multiple developers working on the same software project, then you would probably decide that the repo on GitHub is the reference repo, the one that you build the releases from, and everybody must synchronize with that one. That's why I drew you the GitHub right on top.

  • You can still synchronize the developers' repos directly with each other, but even then you probably want to appoint a well-known reference copy that everybody synchronizes with. However, in Git that's not a technical issue. It's a social issue; it's a convention. From a technical standpoint, all of these clones are peers.
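
A minimal sketch of what a clone copies, assuming Git is installed. Everything runs in a throwaway directory, and a local path stands in for the GitHub URL; the names orange, green and menu.txt are made up for the illustration.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# "orange": the repository we want a copy of.
git init -q orange
git -C orange symbolic-ref HEAD refs/heads/master
echo "apple pie" > orange/menu.txt
git -C orange add menu.txt
git -C orange commit -q -m "First commit"

# "green": a clone. A local path stands in for the GitHub address.
git clone -q orange green

# Both repositories now point at the same commit, so they share the same history.
orange_head=$(git -C orange rev-parse HEAD)
green_head=$(git -C green rev-parse HEAD)
echo "orange: $orange_head"
echo "green:  $green_head"

# The clone has its own .git directory, and the working area was rebuilt from it.
ls -A green
```

The two printed hashes should be identical, which is what makes the two clones equally good copies of the project.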

Local and Remote

  • Now we have the same project in two separate repos, orange and green. We're working on green, so it would be useful if green could remember the address of orange because we decided that orange is an important copy and we want to stay synchronized with it.

  • Indeed, when we issued the git clone command, Git added a few lines to the configuration of our repository. It's here in the config file.

$ cat .git/config

[core]
	repositoryformatversion = 0
	filemode = false
	bare = false
	logallrefupdates = true
	symlinks = false
	ignorecase = true
[remote "origin"]
	url = https://github.com/aakash14goplani/FullStack.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
	remote = origin
	merge = refs/heads/master
  • Each Git repository, such as this one, can remember information about other copies of the same repository. Each other copy is called a remote. You can define as many remotes as you want, but when you clone a project Git immediately defines a default remote and gives it a conventional name, origin.

  • Here is the configuration of origin, and it points to the URL that we cloned the project from. The default configuration says that we have one master branch that maps over the master branch of the remote. You can tweak this configuration to change the policies that you use to synchronize with remotes, but the default is pretty obvious.

  • So, now Git remembers which other repo or repos we want to synchronize with, but to synchronize Git also needs to know the current state of origin: which branches are there on the remote, which commits those branches are currently pointing at, and so on.

  • And in fact, Git does store that information. If we ask it for branches, then it will just show the local branches. We only have master now.

$ git branch
* master

But if you list the branches with the --all switch, then you see all the references, including the ones on the remote, the remote branches and the current position of HEAD.

$ git branch --all
* master
  remotes/origin/lisa
  remotes/origin/nogood
  remotes/origin/spaghetti
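
The remote configuration and the branch listings can also be queried with commands instead of reading the config file. This is a small sketch under the same assumptions as before: a throwaway directory, a local path instead of a GitHub URL, and made-up repository names.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# A source repository with two branches, then a clone of it.
git init -q orange
git -C orange symbolic-ref HEAD refs/heads/master
git -C orange commit -q --allow-empty -m "First commit"
git -C orange branch lisa
git clone -q orange green
cd green

# The address that git clone recorded for the default remote.
origin_url=$(git config --get remote.origin.url)
git remote -v

# Local branches live under refs/heads, remote branches under refs/remotes.
local_refs=$(git for-each-ref --format='%(refname)' refs/heads)
remote_refs=$(git for-each-ref --format='%(refname)' refs/remotes)
echo "local:"
echo "$local_refs"
echo "remote:"
echo "$remote_refs"
```

Only master shows up as a local branch, while lisa is visible only under refs/remotes/origin.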
  • Git tracks remote branches exactly like it tracks local branches, by writing those branches as references in the refs folder. If you look inside that folder, you will see an origin folder in here that contains the references to branches, tags, and the current HEAD pointer of origin.

  • Git will automatically update this information when we connect to a remote.

  • There is one wrinkle here. If you look inside this folder, you might find that some of the branches are missing. In this case, I can only see the remote's HEAD here and not the branches. That's because of a low-level optimization in Git. To avoid maintaining one small file for each branch, Git sometimes compacts some of them into a single file called packed-refs.

  • There is no simple command to unpack this file, so you will have to take my word for it that the branches that are not in the refs directory must be in this file. This can happen for both local and remote branches. But in both cases, whether the branches are still individual files or packaged together in packed-refs, they're still conceptually the same thing.

  • All branches, local or remote, are still references to a commit, and Git tracks all of that. Since we cannot peek inside the files for some of these branches because they've been packed, let's use this plumbing command, git show-ref, to see which commits they're pointing at.

$ git show-ref master
b5cacb8c0bd86e1f166f29f0a9c8c82f6cca9064 refs/heads/master
  • git show-ref master lists all of the branches that have master in their names, which means the local master branch and the remote master branch.

  • So, bottom line, you know that a local branch in Git is just a reference to a commit. Well, a remote branch is exactly the same thing. Whenever you synchronize with a remote, Git updates remote branches. Let's see how that synchronization happens in practice.
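
To watch the packing happen on demand, here is a sketch that uses git pack-refs in a scratch repository. It assumes the default files-based ref storage; the repository and branch names are made up. Git normally packs refs on its own schedule, so forcing it here is only for illustration.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "First commit"
git branch lisa

# Loose refs: one small file per branch.
ls .git/refs/heads

# Pack every ref into the single packed-refs file.
git pack-refs --all

# The per-branch files are gone; the branches are now lines in packed-refs.
cat .git/packed-refs

# show-ref resolves a branch the same way whether it is packed or not.
master_ref=$(git show-ref master)
echo "$master_ref"
```

The packed-refs file is plain text, one hash and reference name per line, so the branches are still just references to commits.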

The Joy of Pushing

  • A Git object is just a sequence of bytes identified by a SHA1. I also insisted a lot that SHA1s are unique; unique in the universe. Finally, this is the point in our training where we can see how that uniqueness is truly useful.

  • Look back at our two repositories. When we cloned, we copied the objects from the orange repo to the green repo. Now imagine that we added a few new objects to the green repo, for example a new commit and the associated blobs and trees.

  • Synchronization is mostly about getting the same objects on all the clones. But now it's very easy to synchronize because each object is immutable and has a unique SHA1, so Git will never get confused. It can just copy the missing objects from one repo to the other.
    image 2

  • Git also has to keep the branches synchronized on the various clones, and that's where things get a bit tricky. Let's see how this works. I will make a change to this repo by editing one file and commit it.

  • So now we have a few new objects in the database, a new blob to represent the file I changed, a new tree that represents the updated project root folder that is pointing to that blob, and this new commit here.

  • The local master branch is pointing at the new commit while the master branch on origin is still pointing at the previous commit. Of course, nobody changed that branch yet, and origin doesn't even have this commit, and neither does it have the other new database objects.

  • So, let's send both the new objects and the updated branch to origin.

$ git push

Now our new objects have been pushed to the remote, and the branches on origin moved to point at the latest commit. We can easily check that because Git updated our remote branches to align with the current state of origin.
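
A sketch of the same push against a stand-in for origin, assuming Git is installed. A bare repository in a temporary directory plays the role of GitHub, all names are hypothetical, and the push names the remote and branch explicitly so that it does not depend on local push settings.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a bare repository that plays the role of origin, then clone it.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "First commit"
git clone -q --bare seed origin.git
git clone -q origin.git green
cd green

# A local change: new blob, new tree, new commit.
echo "apple pie" > menu.txt
git add menu.txt
git commit -q -m "Add menu"

# Before the push, our record of origin's master is still the previous commit.
before=$(git rev-parse origin/master)
git push -q origin master
after=$(git rev-parse origin/master)

# After the push, origin itself and our origin/master both point at the new commit.
local_head=$(git rev-parse HEAD)
remote_master=$(git -C ../origin.git rev-parse master)
echo "origin/master before: $before"
echo "origin/master after:  $after"
echo "master on origin:     $remote_master"
```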

The Chore of Pulling

  • Now what happens when there are other repos pushing to origin, so that the state of origin might change at any time? Now we cannot just write changes to the remote. We also must read the changes from the remote.

  • Things get a bit more complicated here, so I will use a diagram instead of a demo. Imagine that we have a remote repo that looks like this. It's a single commit. I will use different colors for the commits, and I will not draw trees and blobs. I will skip them because they would make this diagram too busy.

  • When we clone this repo, we get the same objects on our local client, and here are the branches. Now let's say that we add one commit and we push. If there are no changes to the remote's master branch, then things are easy. Git copies our new commit and the associated objects to the remote, and then updates the remote's master branch to point at the new commit, and it also reflects the change in the branches on origin by updating the origin/master branch on the local repo.

  • This is what we did when we pushed our changes a few minutes ago. Now let's do it again. This is the initial situation. We add the commit, and we prepare to push, but this time imagine that something has changed on the remote as well. Someone pushed another commit to the remote. Now we can't just push. We have a conflict here.

  • We have two different histories that need to be reconciled. In this case, we basically have two options. One option, which I would not recommend except in very special cases, is to force the push. We can do that with a command-line switch on the git push command, git push -f, where -f stands for force. This means that we're forcing the remote to take our new objects and change its history to match our local history. So, we're probably losing data on origin. Here we're losing the commit that someone else pushed. No branch is pointing at that commit any more, so it will be garbage collected eventually. We're also creating a very confusing situation for all other people synchronizing with the same remote, because now their local history will conflict with the history on origin. So, forcing a push is probably not a good idea.

  • Let's do it again properly. This is the situation we had before the push. What we want to do in general is we want to fix the conflict on our own machine before we push. To do that, we need first to fetch the data from the remote. There is a command to do that called git fetch. We get the new objects from the remote, and we also update the current position of the remote branches, as usual. Now that we have the new commit and the related objects, we can merge our local changes with the remote history.

  • So, we did a fetch. Now we do a merge. Of course during the merge we might have to fix merging conflicts and the like, but the important point here is that we are not rewriting history. Merges never do that. Instead, they just add the new objects.

  • So, once we do the merge, our history is the history from the remote plus some more stuff, and we can push that new stuff to the remote without rewriting the remote's history.
    image 3

  • This is what you do most of the time. You fetch the changes from the remote, you merge them into your own repo, and then you push the result. This sequence of a git fetch followed by a git merge is so common that there is one single command that does both. It's called, you guessed it, git pull, a fetch followed by a merge.
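
The diagram can be replayed as a script. This is a sketch under assumptions: a bare repository in a temporary directory stands in for origin, and two clones named ours and theirs (made-up names) play the two developers. It shows the rejected push, then the fetch, merge, push sequence.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "First commit"
git clone -q --bare seed origin.git
git clone -q origin.git ours
git clone -q origin.git theirs

# Someone else pushes a commit to origin first.
echo "tea" > theirs/drinks.txt
git -C theirs add drinks.txt
git -C theirs commit -q -m "Add drinks"
git -C theirs push -q origin master
their_commit=$(git -C theirs rev-parse HEAD)

# We commit on top of the old history, so a plain push is rejected.
cd ours
echo "apple pie" > menu.txt
git add menu.txt
git commit -q -m "Add menu"
if git push -q origin master 2>/dev/null; then
  push_rejected=no
else
  push_rejected=yes
fi
echo "first push rejected: $push_rejected"

# Fetch the new objects, merge them locally, then push the combined history.
git fetch -q origin
git merge -q --no-edit origin/master
git push -q origin master

our_head=$(git rev-parse HEAD)
remote_master=$(git -C ../origin.git rev-parse master)
```

Replacing the fetch and merge lines with a single git pull origin master would typically give the same result, since pull is a fetch followed by a merge by default.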

Rebase Revisited

  • There is one more important thing to say about this process of pushing and pulling, and it has to do with rebasing. There are a few cases where rebases do not work very well. Let's see why.

  • Say that we have this repo freshly cloned with two branches that are both tracking branches on origin. We're working on the lisa branch, and we decide to roll the changes from master into lisa. You know that we can do this with either a merge or a rebase, so let's try the rebase this time. Git copies over the lisa commit so that its parent is now the latest commit on master, and there we are.

  • However, remember that this new yellow commit that we have here is not the same commit as the previous yellow commit. Instead, it's a copy, a different database object. I marked it with an exclamation point to tell it apart from the original commit. The original commit will actually be garbage collected at some point.

  • So, now we have a conflict again. We can't just push because we have different histories on our local repo and on origin. This particular conflict, however, doesn't seem like much. We can fix it easily, for example by doing a forced push or a pull followed by a push. In any case, we can work around this, and then we have the same stuff on origin that we have on local.

  • However, things break down when we introduce another user. Our friend, Annie, is also working on the same cookbook repository, and she still has the original yellow commit, the one without the exclamation mark, in her repository. Not only that, she also kept working on the lisa branch. She added a commit there, so now Annie has a pretty nasty conflict to sort out the next time she synchronizes with origin.

  • She needs to understand what happened first, and then to solve the conflicts even though she didn't cause the conflicts herself. There are good chances that even after solving the conflicts she will end up with a confusing history that includes both yellow commits even though they look exactly the same.
    image 4

  • So, this is the bottom line when it comes to rebasing. As a general rule, never rebase stuff that has been shared with some other repository. It's okay to rebase commits that you haven't shared yet, but remember that it's easy to rebase shared commits by mistake, and then you can expect some trouble.
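
The claim that a rebased commit is a copy can be checked in a scratch repository. A small sketch, with made-up file names and a local-only repository: it records the hash of the lisa commit before and after the rebase.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
echo "apple pie" > menu.txt
git add menu.txt
git commit -q -m "First commit"

# One commit on lisa.
git checkout -q -b lisa
echo "lasagna" > lisa.txt
git add lisa.txt
git commit -q -m "Add lisa's recipe"
old_lisa=$(git rev-parse lisa)

# Meanwhile, one more commit on master.
git checkout -q master
echo "tea" > drinks.txt
git add drinks.txt
git commit -q -m "Add drinks"

# Rebase lisa on top of master.
git checkout -q lisa
git rebase -q master
new_lisa=$(git rev-parse lisa)

# Same message, different parent, therefore a different object and hash.
echo "before rebase: $old_lisa"
echo "after rebase:  $new_lisa"
old_subject=$(git log -1 --format=%s "$old_lisa")
new_subject=$(git log -1 --format=%s "$new_lisa")
new_parent=$(git rev-parse lisa^)
master_head=$(git rev-parse master)
```

Anyone who already fetched the old hash still has the original commit, which is why rebasing shared commits leads to the trouble described above.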

Getting Social

  • Imagine that there is this project on GitHub that we want to contribute to. It belongs to a user named ProjectA. We could simply clone this project, but then our changes would be stuck on our local machine because we don't have write access to ProjectA's repository, so we cannot push to it.

  • What we can do from the GitHub web interface is to create our own copy of the project on GitHub. This is called a fork. A fork is kind of like a clone, but it's a remote clone. We are cloning the project from someone else's GitHub account to our own GitHub account.

  • So now we have a new project in the cloud, and we can clone that one on our local machine. When we do that, Git creates a remote in our local repo named origin. Origin points at our own GitHub project, not the original project, of course.

  • Actually, from Git's point of view there is no connection at all between our project and the original project that we forked from. GitHub does know that the two projects are connected, but Git doesn't. So if we want to track changes to the original project, then we need to add another remote pointing at it. This is not something that Git does automatically. We have to do it ourselves. A common convention is to call this remote upstream.

  • Now we have our local project with multiple remotes. We can work on it, and we can synchronize all our local changes with origin. If we commit local changes, we can just push those changes to origin. If there are changes on upstream, we can pull them into our local project, solve any conflicts, and then push them to origin.

  • One thing that we still cannot do, however, is to push changes to upstream. For example, we might like to contribute our orange commit to the original project, but we still do not have write access to upstream, so GitHub gives us an alternative. We can send a message to the maintainers of upstream and ask them to pull our changes. It's a pull request.

  • Once again, pull requests are not a Git feature. They're not even a version control feature, strictly speaking. In a way, they are a social network feature. You're just sending a message to people. If those people like your changes, then GitHub makes it easy for them to do a remote pull and get your changes from origin.

image 5
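
The Git side of this workflow can be sketched with two bare repositories standing in for the original project and for our fork. This is only an illustration under assumptions: on GitHub the fork is created from the web interface, not with git clone, and all the names and paths here are hypothetical. The pull request itself is a GitHub feature, so there is no Git command for it in the sketch.

```shell
set -eu
# Throwaway sandbox; every path and name below is hypothetical.
work=$(mktemp -d)
cd "$work"
export HOME="$work"   # ignore personal Git settings for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# upstream.git: the original project. fork.git: our own copy of it.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "First commit"
git clone -q --bare seed upstream.git
git clone -q --bare upstream.git fork.git

# Cloning our fork defines origin; upstream has to be added by hand.
git clone -q fork.git local
cd local
git remote add upstream "$work/upstream.git"
remotes=$(git remote)
echo "$remotes"

# A new commit lands on the original project.
git -C "$work/seed" commit -q --allow-empty -m "Upstream change"
git -C "$work/seed" push -q "$work/upstream.git" master

# Bring the change into our local repo, then publish it to our fork.
git fetch -q upstream
git merge -q --ff-only upstream/master
git push -q origin master

fork_master=$(git -C "$work/fork.git" rev-parse master)
upstream_master=$(git -C "$work/upstream.git" rev-parse master)
```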

The Whole Onion

  • We started right from the core of Git, a simple map of hashes to objects.

  • Then we looked into those objects, and we got to the point where we could see Git as a stupid content tracker that tracks changes to your files and directories.

  • From there we moved on to the revision control features of Git. We talked about branches and merges and rebases.

  • And finally, we looked at the distribution-related features of Git that are probably the main reason why you use Git in the first place.

  • And there it is, the whole onion.

image 6
