Is Git a Blockchain?

Probhakar
9 min read · Mar 2, 2022

Understanding Git from basic building blocks

The other day I was learning about blockchain and how its blocks are linked to each other. If you know linked lists, it is much the same idea: every block knows the address of its previous block. Let's look at a diagram:

blockchain

Here you can see that every block has a previous hash that uniquely identifies the previous block. Needless to say, there are many more things in a block, like timestamps. But the point is, if I give you block 3, you can backtrack through all the other blocks by asking: "what is the previous hash?". Block 0 has no previous block since it is the genesis block (in simple terms, the blockchain started from it).

The nonce is the part that changes, and its value should be such that sha256(all other data + nonce) has some predefined number of leading 0s. In other words, the blockchain community defines how many leading zeros the 64-character sha256 hash must have. Miners around the world compete to find that golden hash with the required leading zeros by tweaking the knobs, for example the nonce here; in practice there are a lot more knobs 😅.
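
To make the nonce idea concrete, here is a minimal Python sketch of that search. It is only a toy: the block contents, the function name `mine`, and the difficulty of 3 leading zeros are assumptions made up for illustration, and real blockchains hash a binary block header rather than a plain string.

```python
import hashlib

def mine(data: str, difficulty: int):
    """Try nonces 0, 1, 2, ... until sha256(data + nonce) starts with `difficulty` zeros."""
    target_prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{data}{nonce}".encode()).hexdigest()
        if digest.startswith(target_prefix):
            return nonce, digest
        nonce += 1

# Hypothetical block contents, invented for this example.
nonce, digest = mine("block 3 | previous hash: abc123 | some data", difficulty=3)
print(nonce, digest)
```

Each extra leading zero makes the search roughly 16 times longer on average, which is why the number of zeros works as a difficulty setting.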

Alright, let's not spend too much time on that. While I was learning blockchain, I connected it with how Git organizes everything, which closely resembles the concept above. Obviously, there is no race to find a golden hash, but the idea remains the same.

NOTE: Git Bash is used for Linux-style commands.

I want to write this article from a bottom-up perspective. For the time being, let us forget about Git. Create a folder on the Desktop and cd into it. This is our current working directory. Make the folder and files as shown below:

$ echo -n "hello" > file1.txt
$ mkdir folder
$ cd folder/
$ echo -n "hello" > file1.txt
$ echo -n "war is not a solution!" > file2.txt
$ cd ..

NOTE: echo with the -n switch is important; otherwise it will add a line break to the file, which will give a different hash.

folder structure

Git has 3 constructs —

  1. Blob
  2. Tree
  3. Commit

Whenever there is a file ending with .txt, .exe, .py, or whatever, Git sees it as a blob. Directories are like trees. And commits are snapshots of a tree at a particular point in time.

I know this may seem a little overwhelming for beginners, so let me put it in a simple way. Let's look at the above directory (the working directory).

How many files do we have?

— 3 (file1.txt, file1.txt, file2.txt): these are blobs

blob

Now, a tree is something that gives structure to the blobs.

A tree can have blobs or other trees as children. A blob is always a leaf node.

tree

If I give you only the blobs, can you rebuild the given file & folder structure? No, right? You need to know that the current directory contains one file1.txt and one folder named folder, which in turn contains 2 files named file1.txt and file2.txt.

Now, a commit is like a snapshot of this tree at a particular point in time. Suppose you created these files & folders on 1st March 2022. Then on 2nd March 2022, you changed the content of folder/file2.txt to "I love war" because you fell in love with the Kalashnikov. On 3rd March 2022, you added a new file named file3.txt. If you commit at the end of every day, you will have 3 snapshots of the tree structure:

commit

Now that we understand the constructs, let's present them as blocks for clarity:

block diagram of git building blocks

Does the commit block resemble the block from the blockchain? If you are a little familiar with Git, you will have a eureka moment. All of Git is nothing but a blockchain of commits. Each commit knows about its previous commit and preserves the tree structure at that particular point in time. That is why you can travel back in time and ask Git what the file content was yesterday.
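
As a rough illustration of "each commit knows its previous commit", here is a toy model in Python. It is not how Git actually stores commits: `ToyCommit`, `commit`, `history`, the in-memory `store`, and the content of file3.txt are all invented for this sketch, and the hash is computed over a simple text representation. It replays the three daily snapshots described earlier and then walks back through them.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ToyCommit:
    snapshot: Dict[str, str]   # path -> file content at commit time
    message: str
    parent: Optional[str]      # hash of the previous commit, None for the first one

    def digest(self) -> str:
        # Toy hash over a text form of the commit (real Git uses its own format).
        payload = repr((sorted(self.snapshot.items()), self.message, self.parent))
        return hashlib.sha1(payload.encode()).hexdigest()

store: Dict[str, ToyCommit] = {}

def commit(snapshot: Dict[str, str], message: str, parent: Optional[str]) -> str:
    c = ToyCommit(dict(snapshot), message, parent)
    store[c.digest()] = c
    return c.digest()

def history(head: Optional[str]) -> List[str]:
    """Follow parent links from `head` back to the first commit."""
    messages = []
    while head is not None:
        messages.append(store[head].message)
        head = store[head].parent
    return messages

files = {"file1.txt": "hello", "folder/file1.txt": "hello",
         "folder/file2.txt": "war is not a solution!"}
day1 = commit(files, "1st March", None)
files["folder/file2.txt"] = "I love war"
day2 = commit(files, "2nd March", day1)
files["file3.txt"] = "new file"   # placeholder content
day3 = commit(files, "3rd March", day2)

print(history(day3))                              # ['3rd March', '2nd March', '1st March']
print(store[day1].snapshot["folder/file2.txt"])   # war is not a solution!
```

Because the parent hash is part of what gets hashed, changing an old snapshot would change every hash after it, which is the blockchain-like property.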

Git uses SHA-1 for hashing. Let's see what the hash of file1.txt will be.

$ git hash-object file1.txt
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0

So, the SHA-1 hash for the file is b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0.

If you want to do that quickly in Python, you can try:

>>> import hashlib
>>> hashlib.sha1("blob 5\0hello".encode()).hexdigest()
'b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0'

So the format is sha1("blob " + filesize + "\0" + data)
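
The one-liner above hard-codes the size 5. A small helper that works for any content could look like the sketch below; the function name `git_blob_hash` is my own, and note that the size is the length in bytes, not in characters.

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """SHA-1 of 'blob <size in bytes>\\0' followed by the raw content."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_hash(b"hello"))     # b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
print(git_blob_hash(b"hello\n"))   # a different hash: the trailing newline counts
```

This is also why the -n switch of echo matters: one extra newline byte changes both the size in the header and the content, so the hash is completely different.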

Now, the tree has its own way of calculating its SHA-1 hash. For example, it is something like:

sha1(sha1(file1.txt) + metadata + sha1_of_tree2)
where
sha1_of_tree2 = sha1(sha1(file1.txt) + sha1(file2.txt) + metadata)
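
For the curious, the "metadata" in that pseudo-formula is mostly the file mode and the file name. The sketch below shows my understanding of the tree format: a "tree <size>\0" header followed by one "<mode> <name>\0<20 raw hash bytes>" entry per child, sorted by name. The function name `git_tree_hash` is made up, and the two blob hashes are the ones git ls-files prints later in this article. If the sketch follows Git's format faithfully, the two results should match the tree hashes that git cat-file shows further down (201b9d… and 6f1af1…); treat that as something to check, not a promise.

```python
import hashlib

def git_tree_hash(entries) -> str:
    """entries: list of (mode, name, hex_sha1) tuples for ONE directory level."""
    def sort_key(entry):
        mode, name, _ = entry
        raw_name = name.encode()
        # Git compares directory names as if they ended with "/".
        return raw_name + b"/" if mode == "40000" else raw_name

    body = b""
    for mode, name, hex_sha in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_sha)
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

blob_hello = "b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0"   # "hello"
blob_war = "50f68d270c06b6425df579be2dc0fb35cc497a6b"     # "war is not a solution!"

# Inner tree first (folder/), then the root tree that refers to it.
folder_tree = git_tree_hash([("100644", "file1.txt", blob_hello),
                             ("100644", "file2.txt", blob_war)])
root_tree = git_tree_hash([("100644", "file1.txt", blob_hello),
                           ("40000", "folder", folder_tree)])
print(folder_tree)
print(root_tree)
```

Notice that the root tree hash depends on the hash of the inner tree, so a change in any file bubbles up to every tree above it.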

I think now you can appreciate how smart Git is. If a file is duplicated, Git does not maintain 2 different copies inside its repository. By virtue of the hashing algorithm, the same content always gives the same hash, so sha1("hello") will be the same whoever computes it, whenever, and wherever.
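
Here is a tiny sketch of that deduplication, using a plain dictionary as a stand-in for Git's object database. The class `ToyObjectStore` is hypothetical; it only shows why identical content ends up stored once.

```python
import hashlib

class ToyObjectStore:
    """A dictionary keyed by content hash, standing in for .git/objects."""

    def __init__(self):
        self.objects = {}   # hash -> content

    def add(self, data: bytes) -> str:
        header = b"blob " + str(len(data)).encode() + b"\0"
        key = hashlib.sha1(header + data).hexdigest()
        self.objects[key] = data   # same content -> same key -> same slot
        return key

store = ToyObjectStore()
a = store.add(b"hello")                    # file1.txt
b = store.add(b"hello")                    # folder/file1.txt
c = store.add(b"war is not a solution!")   # folder/file2.txt
print(a == b, len(store.objects))          # True 2
```

Three files went in, but only two objects are stored, because the file names live in the trees and not in the blobs.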

Let’s do some Git-ing

Now, let's start with some Git commands. I hope you are in the top-level working directory (the one that contains file1.txt and folder).

$ git init

It will initialize a repository. You can see a folder named .git where Git stores all its information. So if you want to remove the repository, deleting this folder is enough to remove any trace of Git tracking.

In the .git folder, there are many folders —

.git/

objects is the folder where Git stores its constructs: blobs, trees, and commits.

If Git is a blockchain of commits, then HEAD tells us where we currently are. Because we can always traverse back if we know where we are, right?

Right now .git/objects/ contains no objects. Let's do:

$ git add .

This is the command where we tell Git to add all the files to the index/staging area. Git calculates the sha1 hash.

.git/objects/

Does it resemble something? See the hashes in the above diagram: b6fc4c62… & 50f68…. Git uses the first 2 characters of each hash as the folder name and the remaining 38 characters as the file name inside that folder. So if the hashes were 50f68d… & 50b6ff…, both objects would simply be stored as separate files inside the same 50 folder.
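
To see what sits inside those folders, here is a sketch that writes and reads a "loose" object the way I understand Git does it: the bytes "<type> <size>\0<content>" are zlib-compressed and saved under <first 2 hash characters>/<remaining 38 characters>. The helper names are my own, a temporary directory stands in for .git/objects, and packed objects (which Git also uses) are not covered.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(objects_dir: str, obj_type: str, data: bytes) -> str:
    raw = obj_type.encode() + b" " + str(len(data)).encode() + b"\0" + data
    sha = hashlib.sha1(raw).hexdigest()
    folder = os.path.join(objects_dir, sha[:2])        # first 2 characters
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha[2:]), "wb") as f:   # remaining 38
        f.write(zlib.compress(raw))
    return sha

def read_loose_object(objects_dir: str, sha: str):
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, data = raw.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(data)   # the header records the content length
    return obj_type, data

with tempfile.TemporaryDirectory() as objects_dir:   # stand-in for .git/objects
    sha = write_loose_object(objects_dir, "blob", b"hello")
    print(sha[:2], sha[2:])                       # folder name, file name
    print(read_loose_object(objects_dir, sha))    # ('blob', b'hello')
```

This is roughly what git cat-file does for you: find the file from the hash, decompress it, and read the type and content.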

You can check the type of files —

$ git cat-file -t b6fc
blob
$ git cat-file -t 50f6
blob

cat-file -t tells the type of the object. You don't have to give the full SHA-1 hash; a prefix of it is okay as long as it is unique.
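
The "unique prefix" rule can be pictured with the sketch below. The function `resolve_prefix` is hypothetical and much simpler than Git's real lookup; the minimum of 4 characters reflects my understanding of Git's own lower limit for abbreviated hashes.

```python
def resolve_prefix(prefix: str, known_hashes):
    """Return the single hash starting with `prefix`, or raise if none or several match."""
    if len(prefix) < 4:
        raise ValueError("prefix too short, give at least 4 characters")
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"prefix {prefix!r} matches {len(matches)} objects")
    return matches[0]

# The two blob hashes from this article.
hashes = [
    "b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0",
    "50f68d270c06b6425df579be2dc0fb35cc497a6b",
]
print(resolve_prefix("50f6", hashes))   # 50f68d270c06b6425df579be2dc0fb35cc497a6b
```

In a large repository more objects share the same leading characters, so you may need to type a longer prefix before it becomes unique.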

You can also check the content of the blobs —

$ git cat-file -p 50f6
war is not a solution!

The -p switch tells Git to pretty-print the content. To check the files in the staging/index area, you can run:

$ git ls-files --stage
100644 b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0 0       file1.txt
100644 b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0 0       folder/file1.txt
100644 50f68d270c06b6425df579be2dc0fb35cc497a6b 0       folder/file2.txt

Now, let's commit —

$ git commit -m "first commit"
[master (root-commit) b950c3d] first commit
 3 files changed, 3 insertions(+)
 create mode 100644 file1.txt
 create mode 100644 folder/file1.txt
 create mode 100644 folder/file2.txt

Now look at the folders in .git/objects/:

.git/objects/

git log shows the details of the commit. You can traverse from the commit like a blockchain:

$ git log --oneline
b950c3d (HEAD -> master) first commit
------------------------------------------------------
$ git cat-file -p b950
tree 6f1af1031a73bdee5fcc50fbd7377f26a2b51295
author Probhakar Sarkar <probhakar.95@gmail.com> 1646239902 +0530
committer Probhakar Sarkar <probhakar.95@gmail.com> 1646239902 +0530

first commit
------------------------------------------------------
$ git cat-file -p 6f1a
100644 blob b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0    file1.txt
040000 tree 201b9dffed762d63e105c0ae70419ec6ce127465    folder
------------------------------------------------------
$ git cat-file -p b6fc
hello

Except for the commit SHA-1 hash, your hashes (tree & blob) will be the same as mine if you kept the file names and content the same. The commit SHA-1 hash also covers the author name and timestamp, which will be different for any Tom or Dick reading this article at some point in the future 😅
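
A small sketch shows why the commit hash differs from person to person. It assumes the plain commit layout visible in the cat-file output (tree, optional parent, author, committer, blank line, message) with a "commit <size>\0" header, and it uses the same person as author and committer. The function name `git_commit_hash` and the two authors are invented for illustration.

```python
import hashlib

def git_commit_hash(tree, author, timestamp, message, parent=None):
    """Hash a simple, unsigned commit whose author and committer are the same."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines.append(f"author {author} {timestamp}")
    lines.append(f"committer {author} {timestamp}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

tree = "6f1af1031a73bdee5fcc50fbd7377f26a2b51295"   # root tree hash from this article

# Two hypothetical readers committing the very same tree.
toms = git_commit_hash(tree, "Tom <tom@example.com>", "1646239902 +0530", "first commit")
dicks = git_commit_hash(tree, "Dick <dick@example.com>", "1700000000 +0000", "first commit")
print(toms != dicks)   # True
```

The optional parent line is what chains commits together: the hash of the previous commit is part of the content that gets hashed for the next one.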

Now let's make some changes and commit:

$ echo -n "world" >> file1.txt
$ cat file1.txt
helloworld

Now, since we have modified the content of the file, Git will report it:

git status

Now let’s add those changes and commit them as well.

$ git add . && git commit -m "second commit"
[master 9596c85] second commit
 1 file changed, 1 insertion(+), 1 deletion(-)

Now if you run git log:

git log --graph

So it has effectively become —

commit

Now if you keep adding commits, you are just moving forward while keeping track of the last commit block. How cool is that?

$ echo -n "new file" > file3.txt
$ git add . && git commit -m "third commit"
[master 4e66e90] third commit
 1 file changed, 1 insertion(+)
 create mode 100644 file3.txt

Let's create a new branch:

$ git branch new-branch
$ git checkout new-branch
Switched to branch 'new-branch'
$ echo -n "more hello world" >> file1.txt
$ git add . && git commit -m "4th commit on new branch"
[new-branch e25a92a] 4th commit on new branch
 1 file changed, 1 insertion(+), 1 deletion(-)

git log --oneline --graph

Now, let's go to the master branch and do some changes there —

$ git branch
  master
* new-branch
---------------------------------------------
$ git checkout master
Switched to branch 'master'
$ git branch
* master
  new-branch

Now we are on the master branch.

$ touch file3.txt
$ git add . && git commit -m "5th commit"
On branch master
nothing to commit, working tree clean

(file3.txt already exists on master, so touch changes no content and there is nothing new to commit.)

Commit blockchain

Now you might be wondering: what is a branch? If you go to .git\refs\heads, you will see as many files as there are branches. If you open them in WordPad/Notepad, you will see the SHA-1 hash of the latest commit on that branch. So a branch is nothing but an alias for a commit hash. HEAD just points to the branch (and therefore the commit) you are currently working on!
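
You can mimic that lookup in a few lines of Python. The sketch below builds a tiny fake .git layout in a temporary folder so it runs without a real repository; `resolve_head` is my own helper name, the commit hash is a placeholder, and it deliberately ignores the case where Git has moved the branch files into a packed-refs file.

```python
import os
import tempfile

def resolve_head(git_dir: str) -> str:
    """Follow HEAD to the commit hash it currently stands for."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        # HEAD names a branch file such as refs/heads/master.
        ref_path = os.path.join(git_dir, *head[len("ref: "):].split("/"))
        with open(ref_path) as f:
            return f.read().strip()
    return head   # detached HEAD: the file holds a commit hash directly

with tempfile.TemporaryDirectory() as git_dir:   # stand-in for a real .git folder
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("0123456789abcdef0123456789abcdef01234567\n")   # placeholder hash
    print(resolve_head(git_dir))   # 0123456789abcdef0123456789abcdef01234567
```

So creating a branch costs almost nothing: it is one small text file holding one hash.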

After all these messy operations, if you open .git\objects\ you will see many new folders. It all seems messy, but spare a moment to think about how deeply they are interlinked with each other. Just with git cat-file -t <sha1 hash> and git cat-file -p <sha1 hash> you can recreate the whole DAG. How cool is that?

😂
