Introduction to GitHub

Aakash Goplani edited this page Jan 7, 2018 · 14 revisions

Topics Covered

Introduction

  • Imagine that Git is layered like an onion. We won't try to understand the whole onion at once; eating it whole would be too ambitious. Instead, we will peel off the layers one by one until we reach Git's conceptual core.

  • Git is a distributed revision control system.

  • Let's make it easier by peeling off one layer. Let's remove distribution. Imagine that Git is not distributed at all: there is only one computer in the world, and the repository lives on that computer. Git then becomes just a revision control system, no distribution. However, a revision control system is still a complex beast. It includes things such as history, branches, and merges, and these features make things more complicated, so let's simplify again.

  • Let's peel off one more layer. What happens if you forget about branches, history, and the like? You can call what is left a content tracker, because that's all it does: it tracks content, that is, files and directories. If you look at Git's documentation, you will see that this is actually Git's definition of itself: Git, the stupid content tracker. Seen as a content tracker, Git is easier to understand.

  • Let's take this one step further and forget even about tracking files. Forget about the notion of a commit or versioning. Let's look at the very core of the onion, the basic idea behind it. At its core, Git is just a map, a simple structure that maps keys to values. And this structure is persistent: it's stored on your disk.

Types of Git Commands

  1. Plumbing commands: The low-level commands, such as cat-file, hash-object, and a few more, are called plumbing commands. These are the basic building blocks that the porcelain commands are built upon. You might never need the plumbing commands unless you're doing some advanced Git scripting or the like.

  2. Porcelain commands: The high-level, more user-friendly commands are called porcelain commands, e.g. add, commit, pull, and push.

Note:

  • Understanding all these commands can be hard, and some of them are confusing. However, here is a key point: the secret to Git is not about knowing the commands, either porcelain or plumbing. The secret is knowing the conceptual model behind Git.
  • If you want to use Git safely, unleash all of its power, and stay out of trouble, don't look at the commands. Look at the model instead. Once you do, the complexity of the Git commands fades away. Suddenly Git looks simple, even elegant, and you don't get stuck anymore.
  • So if you really want to become a Git master, understand the model first. You will then understand the commands much more deeply.

SHA1

  • Git, at its core, is just a map. That means that it's a table with keys and values.

  • The values are just sequences of bytes, for example the content of a text file or even a binary file. Any sequence of bytes can be a value.

  • You can give a value to Git, and it will calculate a key for you, a hash. Git calculates hashes with a SHA1 algorithm.

  • Every piece of content has its own SHA1. For example, take the string Aakash Goplani. If you ask Git to generate a SHA1 for it, you will always get exactly the same hash; there is only one hash for a given piece of content. A SHA1 is 20 bytes long, so written in hexadecimal it is a sequence of 40 hex digits. This is the key Git uses to store the content in the map.

  • We can also calculate the SHA1 on the command line using a low-level plumbing command, git hash-object:

C:\Users\AakashGoplani>echo "Aakash Goplani" | git hash-object --stdin
4b35045ce6e4713079261b2c1ead538091d37a86
  • And here is the result. This is the SHA1 for the string Aakash Goplani. If you change anything in the content, a single letter, for example,
C:\Users\AakashGoplani>echo "aakash goplani" | git hash-object --stdin
319d1f20eb31670585f1dba0fdc39c27698d4856

then you get a completely different SHA1. Every object in a Git repository has a SHA1.

  • If you put the string Aakash Goplani in a file and store this file in Git, then the SHA1 we just generated will identify the content of that file. Directories also have their own SHA1, as do commits and so on.
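
To make the idea of "content in, key out" concrete, here is a minimal Python sketch of what git hash-object computes for a blob. It assumes Git's blob format, where the hashed bytes are a small header ("blob", a space, the size in bytes, and a zero byte) followed by the content. The function name git_blob_hash is made up for this example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA1 that Git would assign to a blob with this content."""
    # Git hashes a small header plus the content, not the content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # A Unix-style echo appends a newline, so the content ends with "\n".
    print(git_blob_hash(b"Aakash Goplani\n"))
    # Changing a single letter gives a completely different hash.
    print(git_blob_hash(b"aakash goplani\n"))
```

The hash depends on the exact bytes, so a different line ending or an extra quotation mark coming from the shell is enough to produce a different key.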

Storing Things

  • So we have seen that Git is a map where the keys are SHA1s and the values are pieces of content. But Git is not just a map, it's a persistent map: its content is saved on disk.

  • Where does persistence come from? Let's go back to the git hash-object command. If I want the Aakash Goplani content to be persisted, I can add the -w argument to this command. -w stands for write.

$ echo "Aakash Goplani" | git hash-object --stdin -w
e094bd04bb2caf22011ca0d3036ff93ba93edddd
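  • Apparently nothing changed. But -w needs a repository to write into, so this directory must first be initialized with git init. If you look at the hidden files and directories (on this computer I do that with ls -a), you can see a hidden subdirectory called .git. This is where your Git repository lives, so now Git has a place to save stuff. Note that the hash is not the one we got earlier for the same string. The earlier commands ran in the Windows command prompt, whose echo most likely passed different bytes to Git (such as the quotation marks), and different content means a different SHA1. Let's peek inside the .git directory.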

  • There are a few files and folders here, but for now just look at one directory, objects. This is called the object database. It's the place where Git saves all its objects, like the string Aakash Goplani we just saved. Let's peek inside. Ignore the info and pack subdirectories; for now they're not important.

  • Instead, look at the subdirectory named e0. These are the first two hexadecimal digits of the SHA1 of the content we just saved. Inside e0 there is a file whose name is the remaining digits of the SHA1. Git uses this scheme to spread content over multiple directories; it's just a trick to avoid piling up all the content in a single huge, cluttered directory. Our original string, Aakash Goplani, is inside this file. This is what Git calls a blob of data. A blob is a generic piece of content.

  • However, the original string has been mangled a bit inside the file. Git added a small header and compressed the content to save space, so we can't just open the file and read it. We can use another low-level plumbing command to look at the content, git cat-file. It takes an option and the SHA1 of an object. With -t (type), Git tells us what kind of object this is: a blob. With -s (size), it prints the size of the content in bytes.

$ git cat-file -t e094bd04bb2caf22011ca0d3036ff93ba93edddd
blob

$ git cat-file -s e094bd04bb2caf22011ca0d3036ff93ba93edddd
15

And if we run it again with -p for pretty printing, then Git unzips the object, strips the header, and prints out the actual content of the blob.

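$ git cat-file -p e094bd04bb2caf22011ca0d3036ff93ba93edddd
Aakash Goplani

cat-file also accepts a revision expression instead of a raw hash. For example, once the repository has commits, git cat-file -p master^{tree} prints the root tree of the latest commit on master.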

And here it is, the string Aakash Goplani. So far we have seen that Git can take any piece of content, generate a key for it (a SHA1), and persist the content in the repository as a blob. That is the persistent map, the very basis of the Git model. Let's build on this and move on to the next layer of the onion.
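
The storage scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes a loose object to be the header plus the content, compressed with zlib, and saved under a two-digit subdirectory. The function names write_loose_object and read_loose_object are made up for this example, and it writes to a temporary directory rather than to a real .git/objects folder.

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store content the way a loose object is laid out and return its SHA1."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    raw = header + content
    sha1 = hashlib.sha1(raw).hexdigest()
    # The first two hex digits name the subdirectory, the other 38 the file.
    folder = os.path.join(objects_dir, sha1[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha1[2:]), "wb") as f:
        f.write(zlib.compress(raw))
    return sha1


def read_loose_object(objects_dir: str, sha1: str) -> Tuple[str, int, bytes]:
    """Return (type, size, content), roughly what cat-file -t, -s and -p show."""
    with open(os.path.join(objects_dir, sha1[:2], sha1[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    return obj_type, int(size), content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:
        key = write_loose_object(objects_dir, "blob", b"Aakash Goplani\n")
        print(key, read_loose_object(objects_dir, key))
```

Reading the object back is the reverse of writing it: decompress, split off the header at the first zero byte, and what remains is the original content.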

First Commit

  • We have seen that Git is a persistent map, but you probably don't see it as a map. You see it as something more than that, something that tracks your files and your directories, a content tracker. Let's see what that means.

  • We need an example project, so I built a very simple one, cookbook.

D:\GitProjects\FullStack>tree /f /a
cookbook:.
|   menu.txt
\---recipes
        apple_pie.txt
        README.txt
  1. In the root of the project there is a file named menu.txt. This is supposed to be a menu, a list of all the recipes in the cookbook. Right now it only contains a single recipe, Apple Pie.

  2. Then we have the recipes directory. It contains a README that tells you to add one separate file for each recipe here.

  3. And indeed we have one file here with the recipe of the apple pie. This file is supposed to contain the entire recipe. For now it's just a placeholder actually, and it contains the string Apple Pie.

  4. So, we have three files, one in the root, and two in the recipes folder. It's a very simple project, but that's what we want for now. We want to understand how Git stores these files and folders, so it's better if we start simple.

  • Let's make this a Git project with git init. There, now we have a .git directory here.
start .git

And because it's a brand new project, the object database in the objects folder is empty apart from the info and pack subdirectories. We can ignore those as usual.

  • Now that we have a project, let's create our first commit. Let's use the git status command to see the files and folders in the project root. Both menu.txt and the recipes directory are shown in red because they are untracked. This means Git doesn't yet know what to do with them.

  • To commit a file, I have to put it in the so-called staging area first. It's like a launch pad: whatever is in the staging area will get into the next commit. We can add files to the staging area with the git add command. Let's add menu.txt and then the recipes folder and all of its content. Now the files are green, which means that they have been staged.

  • Let's commit them. I will use the -m argument to git commit so that I can give the commit message right on the command line.

  • There. Now the staging area is clean, and we can use another popular command, git log, to look at the list of existing commits. There is only one, and its SHA1 starts with these digits.

$ git log
commit 101955bd89d0bc988b8e88f33b54986927090b46 (HEAD -> master)
Author: aakash14goplani <[email protected]>
Date:   Tue Dec 26 22:27:28 2017 +0530

    First Commit!
  • If you look in the .git directory under objects, you will see that we have a bunch of subdirectories in here now. One of these is named with the first two digits of the commit's SHA1, and the file inside it is named with the remaining digits, so this file must be the commit.
10
|---1955bd89d0bc988b8e88f33b54986927090b46

A commit is compressed just like a blob, but by now we know how to peek inside compressed files. We can use git cat-file for that. I will run git cat-file on the commit's SHA1 with -p so that it prints the content of the commit. And here it is.

$ git cat-file -p 101955bd89d0bc988b8e88f33b54986927090b46
tree be4d5bfce489a2591e7fed5c672f9e52cd695a43
author aakash14goplani <[email protected]> 1514307448 +0530
committer aakash14goplani <[email protected]> 1514307448 +0530

First Commit!

So, what's a commit? It's a simple and very short piece of text, nothing else. It's truly as simple as this. Git generates this text, and then it stores it pretty much the same way it stores a blob: it adds a small header to the text to say this is a commit, generates its SHA1, compresses the text, and stores the result in a file in the object database.

The commit text contains all the metadata about the commit: the name of the author and of the committer (both are myself), the date of the commit, and the message. It also contains something more, the SHA1 of a tree. What tree? Well, just like a blob is the content of a file stored in Git, a tree is a directory stored in Git. The commit is pointing at the root directory of the project. That's what this tree is, the root of the project.
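
As a rough illustration of "a commit is just a short piece of text", the following Python sketch assembles that text and hashes it like any other object. The helper names build_commit_body and git_object_hash are made up for this example, and the author identity is a placeholder, so the resulting hash will not match the one in the log above.

```python
import hashlib


def build_commit_body(tree, author, timestamp, tz, message, parents=()):
    """Assemble the text of a commit object (what cat-file -p prints)."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]
    # In this sketch the author and the committer are the same person.
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def git_object_hash(obj_type, body):
    """Hash a header ("<type> <size>" plus a zero byte) followed by the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    body = build_commit_body(
        tree="be4d5bfce489a2591e7fed5c672f9e52cd695a43",
        author="Example Author <author@example.com>",  # placeholder identity
        timestamp=1514307448,
        tz="+0530",
        message="First Commit!",
    )
    print(body.decode("utf-8"))
    print(git_object_hash("commit", body))
```

Because the author, the date, and the message are all part of the hashed text, changing any of them changes the commit's SHA1.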

  • If you look in the object database, you will see a directory named with the first two digits of the tree's hash, and inside it is the tree, a file named with the remaining digits of the hash, as usual.
be
|---4d5bfce489a2591e7fed5c672f9e52cd695a43

Just like a commit, a tree is a piece of content that is generated by Git, hashed, and then stored in the object database. So what's inside this tree? What does it look like? Let's cat-file it.

$ git cat-file -p be4d5bfce489a2591e7fed5c672f9e52cd695a43
100644 blob 23991897e13e47ed0adb91a0082c31c82fe0cbe5    menu.txt
040000 tree 3ee76fde69b730530f1682f1f51789e89cf30500    recipes

Just like a commit, a tree is a tiny piece of content. That's all it is. It contains a list of the contents of the directory, actually a list of SHA1s. In this case we have a blob and another tree, together with their names. The blob is the menu.txt file that's in the root, and the tree is the recipes directory that's also in the root. There is also some additional data for the files and directories, the access permissions, but otherwise that is it. That's all it takes for Git to store a directory.
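
Here is a hedged Python sketch of how such a tree could be assembled and hashed. It assumes Git's stored tree format: for each entry the mode, a space, the name, a zero byte, and then the SHA1 as 20 raw bytes, with entries sorted by name and directories stored with mode 40000 (cat-file -p prints it as 040000). The helper names are made up for this example.

```python
import hashlib


def build_tree_body(entries):
    """entries: (mode, name, sha1_hex) tuples; returns the raw tree content."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, treating directories as if they ended in "/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, sha1_hex in sorted(entries, key=sort_key):
        # Each entry is "<mode> <name>", a zero byte, then the 20 raw SHA1 bytes.
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(sha1_hex)
    return body


def git_object_hash(obj_type, body):
    """Hash a header ("<type> <size>" plus a zero byte) followed by the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    recipes = git_object_hash("tree", build_tree_body([
        ("100644", "README.txt", "361af858632ee7d8d8f9c4022ccaf61fc8d4799c"),
        ("100644", "apple_pie.txt", "23991897e13e47ed0adb91a0082c31c82fe0cbe5"),
    ]))
    root = git_object_hash("tree", build_tree_body([
        ("100644", "menu.txt", "23991897e13e47ed0adb91a0082c31c82fe0cbe5"),
        ("40000", "recipes", recipes),
    ]))
    print("recipes:", recipes)
    print("root:   ", root)
```

If the sketch follows the real format closely enough, the two hashes it prints should be the ones reported for the recipes tree and the root tree in this walkthrough. Note that the file names live here, in the tree, and not in the blobs.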

  • I will use cat-file -p as usual, pass it the SHA1 of the blob, and there it is, the string Apple Pie.
$ git cat-file -p 23991897e13e47ed0adb91a0082c31c82fe0cbe5
Apple Pie

That's what's inside menu.txt.

  • So to recap, the commit points to a tree, the root, and this tree points to a blob, menu.txt, and another tree, recipes. And the blob is just a piece of content, the string Apple Pie.

  • Now let's finish the job. Let's look at this other tree and see what's in there. Let's use cat-file again to peek inside the recipes tree.

$ git cat-file -p 3ee76fde69b730530f1682f1f51789e89cf30500
100644 blob 361af858632ee7d8d8f9c4022ccaf61fc8d4799c    README.txt
100644 blob 23991897e13e47ed0adb91a0082c31c82fe0cbe5    apple_pie.txt

And there you are, two blobs. One of these blobs is the README file. I will cat-file it.

$ git cat-file -p 361af858632ee7d8d8f9c4022ccaf61fc8d4799c
Put your recipes in this directory, one recipe per file.

And there it is, the content of the README.

$ git cat-file -p 23991897e13e47ed0adb91a0082c31c82fe0cbe5
Apple Pie

The other blob looks familiar, because it has the same SHA1 as the menu.txt blob. That's because these two files have exactly the same content, so Git will NOT create two separate objects for them. It will just reuse the existing object that is already in the database. So to be picky, a blob is not really a file. A blob is just the content of a file.

  • The file name and the file permissions are not stored in the blob. They are stored in the tree that points to the blob.

  • In the meantime, let's look at the object database again. The recipes tree is pointing at the blob with the content of the README file, and it's also pointing at the blob with the content of apple_pie.txt, which is the same content as the menu.txt file, so it's actually the same blob. And there you are, the whole object database, all of it.

image 1

  • One small note about this. If you build this exact same project and give Git the exact same commands that I gave, you will get exactly the same SHA1s for all the trees and all the blobs. However, the SHA1 of the commit will be different, because you have different data in your commit: a different author and a different commit date. The important thing to understand is that there is no magic behind SHA1. If you have the same content I do, you get the same hashes. A commit is also just a piece of content, and your commit has different content than mine, so you get a different hash. It's as simple as that.
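
The point that a blob is content rather than a file can be shown with a tiny in-memory sketch in Python. The two dictionaries stand in for the blobs (key to content) and for what the trees record (path to key). The names blob_key and store_files are made up for this example, and the file contents are assumed to end with a single newline.

```python
import hashlib


def blob_key(content):
    """SHA1 of a blob: a "blob <size>" header, a zero byte, then the content."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return hashlib.sha1(raw).hexdigest()


def store_files(files):
    """files maps a path to its bytes; returns (objects, names)."""
    objects = {}  # SHA1 -> content: what the blobs hold
    names = {}    # path -> SHA1: what the trees hold
    for path, content in files.items():
        key = blob_key(content)
        objects[key] = content  # identical content lands on the same key
        names[path] = key
    return objects, names


if __name__ == "__main__":
    objects, names = store_files({
        "menu.txt": b"Apple Pie\n",
        "recipes/README.txt": b"Put your recipes in this directory, one recipe per file.\n",
        "recipes/apple_pie.txt": b"Apple Pie\n",
    })
    print(len(names), "paths,", len(objects), "blobs")
    print(names["menu.txt"] == names["recipes/apple_pie.txt"])
```

Three paths end up sharing two blobs, because the key is computed from the content alone.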

Versioning

  • First, let's change a file. I will edit the menu.txt file. I will add the name of another recipe to it, Cheesecake. Let's save the file. And now git status tells us that the file has changed, so let's stage it with git add and create a new commit.
$ git status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   menu.txt

no changes added to commit (use "git add" and/or "git commit -a")

$ git add .
warning: LF will be replaced by CRLF in menu.txt.
The file will have its original line endings in your working directory.
  • After committing, our working area is clean again, and if we look at the log, we can see both commits.
$ git log
commit 22a1bdfec6dc4dc061256fce79cbb6a5c3b13908 (HEAD -> master)
Author: aakash14goplani <[email protected]>
Date:   Wed Dec 27 01:42:58 2017 +0530

    File updated

commit 101955bd89d0bc988b8e88f33b54986927090b46
Author: aakash14goplani <[email protected]>
Date:   Tue Dec 26 22:27:28 2017 +0530

    First Commit!
  • Let's use the now familiar cat-file to peek inside this second commit. There, this commit has something more than the first one.
$ git cat-file -p 22a1bdfec6dc4dc061256fce79cbb6a5c3b13908
tree 3448ae5948efcbabc5ad725bc124d78294562e8e
parent 101955bd89d0bc988b8e88f33b54986927090b46
author aakash14goplani <[email protected]> 1514319178 +0530
committer aakash14goplani <[email protected]> 1514319178 +0530

File updated

It has a parent. The parent is the first commit of course. Commits are linked. That makes sense. Most commits have a parent. The very first commit is an exception. So, the commits are linked like this. Also, if you look at the hash of the tree that this second commit is pointing at, you will see that this is a brand new tree. It's not the same tree that the first commit was pointing at. It's like a different root.
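
Since a commit is plain text, finding its tree and its parents is simple string handling. The Python sketch below parses the kind of text that cat-file -p printed above. The function name parse_commit is made up for this example, and the e-mail address in the sample is a placeholder.

```python
def parse_commit(text):
    """Split commit text into its tree, its list of parents and its message."""
    headers, _, message = text.partition("\n\n")
    info = {"tree": None, "parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            info["tree"] = value
        elif key == "parent":
            # The first commit has no parent line; a merge has more than one.
            info["parents"].append(value)
    return info


SECOND_COMMIT = """\
tree 3448ae5948efcbabc5ad725bc124d78294562e8e
parent 101955bd89d0bc988b8e88f33b54986927090b46
author aakash14goplani <author@example.com> 1514319178 +0530
committer aakash14goplani <author@example.com> 1514319178 +0530

File updated
"""

if __name__ == "__main__":
    print(parse_commit(SECOND_COMMIT))
```

Following the parent values from commit to commit until one has no parent is, in essence, how the history shown by git log can be reconstructed.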

  • Let's look at the content of this tree.
$ git cat-file -p 3448ae5948efcbabc5ad725bc124d78294562e8e
100644 blob c058fb19592494dcc29ed0ea712cbbf92fa192e4    menu.txt
040000 tree 3ee76fde69b730530f1682f1f51789e89cf30500    recipes

So, now we can see that the tree contains another tree, the recipes folder, and the blob menu.txt. Now, menu.txt is a brand new blob because this file has changed. So, if we cat-file it, we can see that it has the new content of the file, all of it, including both Apple Pie and Cheesecake.

$ git cat-file -p c058fb19592494dcc29ed0ea712cbbf92fa192e4
Apple Pie
Cheesecake
  • However, the tree that lists the content of the recipes directory is the same object that we have had in the database since the first commit. The contents of this directory haven't changed, so there is no reason to create a new object. Git can just use the object that was already in the database.

  • So here is the structure of the object database after our second commit.

image 2

  • The new commit is pointing to a new root tree, which is pointing to a new blob and to the same recipes tree as the first commit. Now it's clear why the root tree must be new: the menu.txt blob has changed, so the content of the root tree must be different because it's pointing to a different blob. As usual, if you change anything in a piece of content, you get a whole new object with a whole new SHA1.

  • The recipes tree, however, hasn't changed because nothing inside the directory changed, so Git can reuse the same object. That's one of the reasons why Git is so efficient: it doesn't store things more than once. We changed a single file, so Git stored a new blob, plus a new tree and a new commit because they ultimately point at that new file. Trees and commits are really small, so that's still extremely efficient.

  • If you count the objects in this diagram, there are two commits plus six trees and blobs, eight objects in total. This is the current number of objects in the object database. Let's double-check it. The database itself is getting a bit crowded, so instead of counting the files let's use one of those seldom-used plumbing commands, git count-objects.

$ git count-objects
8 objects, 79 kilobytes

And there you are, eight objects and they take a very small amount of disk space.
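
The same count can be reproduced with a small in-memory simulation in Python. This is only a sketch of the model, not of Git's real storage: the store is a plain dictionary, the author identity is a placeholder, the file contents are assumed to end with a newline, and all function names are made up for this example.

```python
import hashlib


def put(store, obj_type, body):
    """Add an object to the store (a dict) and return its SHA1 key."""
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    key = hashlib.sha1(raw).hexdigest()
    store[key] = raw  # writing the same object twice changes nothing
    return key


def put_tree(store, entries):
    """entries: (mode, name, key) tuples, listed in name order for this sketch."""
    body = b"".join(
        f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(key)
        for mode, name, key in entries
    )
    return put(store, "tree", body)


def put_commit(store, tree, message, timestamp, parent=None):
    who = f"Example Author <author@example.com> {timestamp} +0530"  # placeholder
    lines = ["tree " + tree]
    if parent:
        lines.append("parent " + parent)
    lines += ["author " + who, "committer " + who, "", message, ""]
    return put(store, "commit", "\n".join(lines).encode("utf-8"))


def snapshot(store, menu_text):
    """Store the whole cookbook project and return the key of its root tree."""
    readme = put(store, "blob", b"Put your recipes in this directory, one recipe per file.\n")
    apple_pie = put(store, "blob", b"Apple Pie\n")
    recipes = put_tree(store, [("100644", "README.txt", readme),
                               ("100644", "apple_pie.txt", apple_pie)])
    menu = put(store, "blob", menu_text)
    return put_tree(store, [("100644", "menu.txt", menu), ("40000", "recipes", recipes)])


def build_history():
    """Replay the two commits of this walkthrough and return the object store."""
    store = {}
    first = put_commit(store, snapshot(store, b"Apple Pie\n"), "First Commit!", 1514307448)
    put_commit(store, snapshot(store, b"Apple Pie\nCheesecake\n"), "File updated",
               1514319178, parent=first)
    return store


if __name__ == "__main__":
    print(len(build_history()), "objects")
```

Even though the second snapshot stores every file again, only the changed blob, the new root tree, and the new commit create new keys, so the store ends up with eight objects.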

  • Speaking of efficiency, you might be surprised that Git stores a new blob every time you change a file. What if I have a huge file and I only change a single line? Will Git store an entire new blob in this case and duplicate the rest of the file? Well, not really. Git also does another layer of optimizations to save more space. For example, as you keep working and adding content to the repository, Git might decide to store only the differences between the two files or even compress multiple objects in the same physical file. By the way, that kind of stuff is the reason for those mysterious info and pack directories in the database.

  • It's good enough to think of each commit, blob, or tree as a separate file that is hashed and stored in the database. At the conceptual level, this is how Git actually works. The extra layer of optimizations is probably not interesting to you unless you're working on the Git source code. Just know this: when it comes to being efficient, you can assume that Git does the right thing.

Annotated Tags

  • A tag is like a label for the current state of the project. There are actually two types of tags in Git: regular (lightweight) tags and annotated tags. Annotated tags are objects in Git.

  • Annotated tags are the ones that come with a message. To create an annotated tag, you could use the git tag command with the -a argument, and you need a name for the tag, and you also need some kind of message here.

$ git tag -a mytag -m "myTag"

We have an annotated tag. It's similar to creating a commit, and in fact an annotated tag is also an object in Git's object database, like a commit. Let's use cat-file to peek inside it. In the case of tags, cat-file can take either the tag's hash or the tag's name. I don't know the hash right now, so I will use the name of the tag. And here is the tag.

$ git cat-file -p mytag
object eff82d1e671ffac429f504ccf69a9927ffa139cb
type commit
tag mytag
tagger aakash14goplani <[email protected]> 1514321131 +0530

myTag

It contains metadata such as the tag's name, the tagger, the date, and the message, and most importantly the object that the tag is pointing to. In this case, it's a commit. So, that's what a tag is: a simple label attached to an object.

image 3

  • In the Git object database you have blobs, arbitrary content, trees, the equivalent of directories, commits, and annotated tags. There is nothing else in the database, just these four types of objects.
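
An annotated tag follows the same pattern as the other object types: a short piece of text, hashed together with a small header. The Python sketch below rebuilds the tag text shown above. The helper names are made up for this example and the tagger identity is a placeholder, so the hash it prints is not the real hash of mytag.

```python
import hashlib


def build_tag_body(target, target_type, name, tagger, timestamp, tz, message):
    """Assemble the text of an annotated tag object."""
    lines = [
        "object " + target,     # the SHA1 the tag points to
        "type " + target_type,  # the kind of object behind that SHA1
        "tag " + name,
        f"tagger {tagger} {timestamp} {tz}",
        "",
        message,
        "",
    ]
    return "\n".join(lines).encode("utf-8")


def git_object_hash(obj_type, body):
    """Hash a header ("<type> <size>" plus a zero byte) followed by the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    body = build_tag_body(
        target="eff82d1e671ffac429f504ccf69a9927ffa139cb",
        target_type="commit",
        name="mytag",
        tagger="Example Author <author@example.com>",  # placeholder identity
        timestamp=1514321131,
        tz="+0530",
        message="myTag",
    )
    print(body.decode("utf-8"))
    print(git_object_hash("tag", body))
```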

Conclusion

  • Look at the whole model from an abstract point of view.

image 2

  • What do we have here? Well, we have a structure where some things contain data, blobs, and then there are other things called trees that contain blobs and other trees, so the entire structure is recursive.

  • And the names of the blobs and trees are not in the objects themselves. Instead, they are stored in the containing tree. So, you can have the same object, say the same blob or the same tree, pointed at by different trees with different names. Does this structure remind you of anything? Well, to me it looks an awful lot like a file system. Just like in a file system, you have content (files, or blobs) and nested containers (directories, or trees in this case), and you can have links. The same file or directory can be reached from different places with different names. It's like links in Linux or shortcuts in Windows. In fact, you might argue that that's what Git is: a high-level file system built on top of your native file system.

  • It's a versioned file system, of course, because it also has commits, which add versioning. And that's what we mean when we say that Git is a content tracker.

  • So we have seen that Git is a persistent map at its core, and layered on top of that is a stupid content tracker that looks a lot like a versioned file system.
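
The file system analogy can be made concrete with a short recursive walk in Python. The dictionary below is a hand-written stand-in for the object database, holding only the two trees of the second commit with the hashes printed earlier. The names COOKBOOK_TREES and list_files are made up for this example.

```python
# Each tree maps to its entries: (object type, SHA1, name), as listed by cat-file -p.
COOKBOOK_TREES = {
    "3448ae5948efcbabc5ad725bc124d78294562e8e": [
        ("blob", "c058fb19592494dcc29ed0ea712cbbf92fa192e4", "menu.txt"),
        ("tree", "3ee76fde69b730530f1682f1f51789e89cf30500", "recipes"),
    ],
    "3ee76fde69b730530f1682f1f51789e89cf30500": [
        ("blob", "361af858632ee7d8d8f9c4022ccaf61fc8d4799c", "README.txt"),
        ("blob", "23991897e13e47ed0adb91a0082c31c82fe0cbe5", "apple_pie.txt"),
    ],
}


def list_files(trees, tree_key, prefix=""):
    """Yield (path, blob_key) for every blob reachable from the given tree."""
    for obj_type, key, name in trees[tree_key]:
        if obj_type == "tree":
            # A tree inside a tree is a subdirectory: descend into it.
            yield from list_files(trees, key, prefix + name + "/")
        else:
            yield prefix + name, key


if __name__ == "__main__":
    root = "3448ae5948efcbabc5ad725bc124d78294562e8e"
    for path, key in list_files(COOKBOOK_TREES, root):
        print(key, path)
```

Starting from the root tree of a commit, this walk recovers every path in that version of the project, which is what makes the structure behave like a versioned file system.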
