Lab 9 - Version Control and Backups

Overview

In this lab, we’ll be talking about using version control, in particular Git, for keeping track of changes to your files and backing up your files to other machines. We’ll be setting up Gitolite, an open-source Git server that allows you to host Git repositories on other machines, and set access to Git repositories hosted on the Gitolite server.

A Git Primer

Version control tools are designed to keep track of changes to files, and allow you to rewind to some last saved state. Git, as one of the most popular version control tools in use today, supports these features, but also allows for more advanced workflows such as splitting a codebase into “branches” for experimentation. In order to complete this lab, you’ll need to know a few basic git commands and some terminology.

Getting started in git: Initializing/cloning a repository

A “repository” is just a fancy way of saying “a folder where git is set up.” In order to start working with git, you’ll need to either create a repository from scratch, or “clone” (download and copy) a repository from somewhere else.

A folder can be made into a git repository by running git init, telling git that files under the folder can start being tracked by git. When a repository is initialized, git creates a hidden .git folder within the root folder to track the files’ histories.

bzh@tsunami ~$ mkdir example-project
bzh@tsunami ~$ cd example-project/
bzh@tsunami ~/example-project$ git init
Initialized empty Git repository in /home/b/bz/bzh/example-project/.git/
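
To see the hidden folder for yourself, here is a minimal sketch that initializes a repository inside a throwaway temporary directory. The project name demo is only a placeholder and is not part of the lab.

```shell
set -e
work=$(mktemp -d)          # throwaway directory so nothing in your home is touched
git init -q "$work/demo"   # "demo" is a placeholder project name
ls -a "$work/demo"         # the listing includes the hidden .git folder
git -C "$work/demo" rev-parse --is-inside-work-tree   # prints "true" inside a repository
```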

A git repository can also be cloned from online by running git clone [repository url], which will create a copy of the repository on your machine, and set the address you cloned the repo from to be the “origin” remote (more on this later).

bzh@tsunami /tmp$ git clone https://github.com/ocf/ocfweb.git
Cloning into 'ocfweb'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 13063 (delta 1), reused 0 (delta 0), pack-reused 13053
Receiving objects: 100% (13063/13063), 3.75 MiB | 0 bytes/s, done.
Resolving deltas: 100% (9159/9159), done.

bzh@tsunami /tmp$ cd ocfweb

bzh@tsunami /tmp/ocfweb$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working tree clean

bzh@tsunami /tmp/ocfweb$ git remote -v
origin	https://github.com/ocf/ocfweb.git (fetch)
origin	https://github.com/ocf/ocfweb.git (push)
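
Cloning does not require a network URL. As a rough sketch, the script below clones from a filesystem path instead, which makes it easy to see how the “origin” remote gets set. The names upstream and copy, and the example identity, are assumptions made for this illustration.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/upstream"
cd "$work/upstream"
git config user.name "Example User"        # repo-local identity so the commit succeeds
git config user.email "user@example.com"
echo "hello" > README
git add README
git commit -q -m "Add README"

cd "$work"
git clone -q "$work/upstream" copy         # clone by path instead of by URL
git -C copy remote -v                      # "origin" points back at the upstream path
origin_url=$(git -C copy remote get-url origin)
```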

Adding files to git

Now that you have a git repository on your local machine, you’ll want to use it to start tracking file history, yeah? However, files do not have their histories recorded by git by default. In order to tell git to start recording changes made to a file, run git add [file] within a repository.

In git parlance, a file added through git add is called a “tracked file”, and all other files in a repository are “untracked files”. Git doesn’t track every file in a repository by default because codebases tend to accumulate a lot of cruft: temporary files, build files, caches, and secrets that would take up a lot of space if tracked by git.

Basically: only add the files you care about.

bzh@tsunami ~/example-project$ ls
file1  file2  file3  subdir/

bzh@tsunami ~/example-project$ ls subdir/
cruft.pyc  script.py

Add a single file:

bzh@tsunami ~/example-project$ git add file1

bzh@tsunami ~/example-project$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   file1

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	file2
	file3
	subdir/

Add multiple files at once:

bzh@tsunami ~/example-project$ git add file2 file3

bzh@tsunami ~/example-project$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   file1
	new file:   file2
	new file:   file3

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	subdir/

Add a directory (be careful when using this, since it will add everything within the directory*, which can include cruft files):

* Except for things listed in .gitignore.

bzh@tsunami ~/example-project$ git add subdir

bzh@tsunami ~/example-project$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   file1
	new file:   file2
	new file:   file3
	new file:   subdir/cruft.pyc
	new file:   subdir/script.py
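
One way to keep cruft like cruft.pyc out of a directory add is a .gitignore file. The following is a small sketch, assuming a throwaway repository and a single ignore pattern (*.pyc) chosen for this example.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
mkdir subdir
touch subdir/script.py subdir/cruft.pyc
echo "*.pyc" > .gitignore        # ignore compiled Python files anywhere in the repo
git add subdir .gitignore        # adding the directory now skips cruft.pyc
tracked=$(git ls-files)          # files git currently has in its index
echo "$tracked"
```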

Committing

Git groups a set of changes into a commit. Tracked files in a git repository do not have their contents saved in git until git commit is run. You can think of git commit as creating a snapshot of all the changes you’ve added with git add since the last commit.

In almost every case, you’ll want to use the -m "<message>" flag when making a commit, describing the changes you made in that commit.

Committing often is very important. The basic unit of history in git is the commit, so you can’t roll back changes that were made in between commits, and if you delete something that hasn’t been committed in a long time, you’re out of luck.

bzh@tsunami ~/example-project$ git status
On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   file1
	new file:   file2
	new file:   file3
	new file:   subdir/cruft.pyc
	new file:   subdir/script.py

bzh@tsunami ~/example-project$ git commit -m "Hello world."
[master (root-commit) f61de77] Hello world.
 5 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file1
 create mode 100644 file2
 create mode 100644 file3
 create mode 100644 subdir/cruft.pyc
 create mode 100644 subdir/script.py
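
To make the “snapshot” idea concrete, here is a hedged sketch that makes two commits and then throws away an uncommitted edit by restoring the file from the last commit. The file name notes.txt and the example identity are placeholders.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "version 2" > notes.txt
git add notes.txt
git commit -q -m "Update notes"
git log --oneline                  # two commits, newest first

echo "oops" > notes.txt            # an uncommitted change we decide we don't want
git checkout -- notes.txt          # restore the file as of the last commit
restored=$(cat notes.txt)
commit_count=$(git rev-list --count HEAD)
```

Only committed states can be recovered this way; the “oops” edit itself is gone for good.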

Adding files (again): Staging

If you’ve added and committed some files, and then modified them after the commit, even though the files are tracked, git will not automatically add them to the next commit. You’ll need to “stage” them by running git add again.

bzh@tsunami ~/example-project$ echo "hello" > file1

bzh@tsunami ~/example-project$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   file1

no changes added to commit (use "git add" and/or "git commit -a")

bzh@tsunami ~/example-project$ git add file1

bzh@tsunami ~/example-project$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   file1
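
The difference between staged and unstaged changes can also be inspected with git diff. This is a sketch under the assumption of a fresh throwaway repository; git diff shows unstaged changes, while git diff --staged shows what the next commit would contain.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"
touch file1
git add file1
git commit -q -m "Add empty file1"

echo "hello" > file1
unstaged_before=$(git diff --name-only)          # file1 is modified but not staged
git add file1
unstaged_after=$(git diff --name-only)           # nothing left unstaged
staged_after=$(git diff --staged --name-only)    # file1 is now staged
echo "before add: $unstaged_before / after add, staged: $staged_after"
```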

Branching

Imagine that you want to develop a new feature, feature1, in example-project. However, the feature would require a lot of time and change many files in example-project, and you don’t want to break the project while you’re working on the feature. What can you do? You could add the feature in a single giant commit, but that runs directly counter to the philosophy of “commit early, commit often.”

To solve this, git has the concept of branching. You can think of a branch as a series of commits across time. The main branch, and the one usually reserved for stable production code, is master.

You can create a new branch using git branch [branch name]; in our example, we would run:

bzh@tsunami ~/example-project$ git branch feature1

In order to start adding commits to the branch, you have to check out the branch. This will move our “view” of the files from the parent branch to the new branch. More technically, git checkout [branch name] shifts a marker called HEAD from the tip of the current branch to the tip of the new branch.

bzh@tsunami ~/example-project$ git status
On branch master
nothing to commit, working tree clean

bzh@tsunami ~/example-project$ git checkout feature1
Switched to branch 'feature1'

bzh@tsunami ~/example-project$ git status
On branch feature1
nothing to commit, working tree clean

You can combine the “create new branch, then checkout” commands by running git checkout -b [branch name].
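
A short sketch of the combined form, run in a throwaway repository (the example identity and file name are placeholders):

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"
touch file1
git add file1
git commit -q -m "Initial commit"

git checkout -q -b feature1                  # create feature1 and switch to it in one step
current=$(git rev-parse --abbrev-ref HEAD)   # name of the branch HEAD points at
git branch                                   # lists both branches; the current one is marked with *
```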

Now that you’ve checked out the new branch, you can make commits to it, and those commits won’t show up in the parent branch, and vice versa.

bzh@tsunami ~/example-project$ ls
file1  file2  file3  subdir/

bzh@tsunami ~/example-project$ git rm file1
rm 'file1'

bzh@tsunami ~/example-project$ git commit -m "removed file1."
[feature1 e80901d] removed file1.
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 file1

bzh@tsunami ~/example-project$ ls
file2  file3  subdir/

bzh@tsunami ~/example-project$ git checkout master
Switched to branch 'master'

bzh@tsunami ~/example-project$ ls
file1  file2  file3  subdir/

Merging

Imagine you’ve completed feature1, and you want to use it in the master branch. You can merge a branch into its parent using the command git merge [branch you want to merge into your current branch]. This will integrate the tip of the branch into its parent’s tip, so changes from the branch show up in its parent.

bzh@tsunami ~/example-project$ git status
On branch master
nothing to commit, working tree clean

bzh@tsunami ~/example-project$ ls
file1  file2  file3  subdir/

bzh@tsunami ~/example-project$ git merge feature1
Updating f61de77..e80901d
Fast-forward
 file1 | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 file1

bzh@tsunami ~/example-project$ ls
file2  file3  subdir/
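
The merge above was a fast-forward because master had no new commits of its own. As a sketch of the other case, the script below makes a commit on each branch before merging, so git has to create a merge commit with two parents. The file names and identity are assumptions for this example, and the starting branch name is read from git rather than assumed to be master.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"
touch base.txt
git add base.txt
git commit -q -m "Initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)   # master or main, depending on your git

git checkout -q -b feature1
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

git checkout -q "$main_branch"
echo "main" > main.txt
git add main.txt
git commit -q -m "Add main change"

git merge -q --no-edit feature1                  # histories diverged: creates a merge commit
git log --oneline --graph                        # the graph shows the two lines joining
second_parent=$(git rev-parse --verify -q "HEAD^2")
```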

Merge conflicts: the worst part of git

If you’ve edited the same file in both branches, when you try to merge them, you’ll run into merge conflicts. git will refuse to merge the branches together until you’ve gone to the offending files and fixed the conflicts. This is beyond the scope of this primer, but here is a good guide on it.
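
For a taste of what a conflict looks like, here is a minimal sketch that deliberately edits the same line of file1 on two branches. The resolution shown (overwriting the file with new contents) is just one possible choice; in practice you would edit the file by hand and keep whichever parts you want.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"
echo "original" > file1
git add file1
git commit -q -m "Initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)

git checkout -q -b feature1
echo "from feature1" > file1
git commit -q -am "Edit file1 on feature1"

git checkout -q "$main_branch"
echo "from main" > file1
git commit -q -am "Edit file1 on main"

git merge feature1 || true            # the merge stops and reports a conflict in file1
cat file1                             # contains <<<<<<<, =======, >>>>>>> markers
markers=$(grep -c '^<<<<<<<' file1)

echo "resolved by hand" > file1       # replace the markers with the contents you want
git add file1                         # tell git the conflict is resolved
git commit -q -m "Merge feature1, resolving conflict in file1"
```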

Rebasing: an alternative to merging

Instead of merging, you can rebase a branch on top of another one using git rebase [branch you want to rebase your current branch onto]. This will take all the commits you’ve made to the current branch that the other branch doesn’t have, and replay them on top of the tip of the other branch.

To use a tree analogy, if merging is like a branch growing back into the trunk, rebasing is like cutting off the branch, and grafting it onto the top of the trunk.

Many people like rebasing a lot better than merging, since rebasing will produce a nicer history. Both have their own benefits and drawbacks, as presented nicely here.

bzh@tsunami ~/example-project$ git checkout master
Switched to branch 'master'

bzh@tsunami ~/example-project$ ls
file1  file2  file3  subdir/

bzh@tsunami ~/example-project$ git rebase feature1 
 file1 | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 file1
First, rewinding head to replay your work on top of it...
Fast-forwarded master to feature1.

bzh@tsunami ~/example-project$ ls
file2  file3  subdir/
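
In the example above the rebase amounted to a fast-forward. The following sketch shows the more interesting case, where both branches have new commits and feature1 is replayed on top of the main branch, leaving a straight-line history with no merge commit. File names and identity are placeholders for this illustration.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"
touch base.txt
git add base.txt
git commit -q -m "Initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)

git checkout -q -b feature1
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

git checkout -q "$main_branch"
echo "main" > main.txt
git add main.txt
git commit -q -m "Add main change"

git checkout -q feature1
git rebase -q "$main_branch"          # replay feature1's commit on top of the main branch's tip
git log --oneline --graph             # a single straight line of three commits
merges=$(git rev-list --merges --count HEAD)
base=$(git merge-base feature1 "$main_branch")
```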

A small aside: `git status`

As you’ve seen, the git status command is very useful. It shows which branch you’re on and which files are staged, modified but not yet staged, or untracked.

Pushing and Pulling

After you’ve made some commits, you’ll want to share them with other people, or upload them to a production server. In the other direction, you’ll want to periodically download changes made by other people to the repository, to keep up to date.

In order to do this, git has you set up remotes. A remote is, literally, another git repository that you can upload to/download from. To see which remotes you have, and what URLs they point to, you can run git remote -v. Adding a remote is done by running git remote add [remote name] [remote url]. If you’ve cloned a repository from elsewhere, you probably won’t need to do this, since the clone URL is automatically set as the “origin” remote (which is the naming convention for the primary remote of a project).

In order to upload changes on a local repository to a remote, run git push [remote name] [branch name].

In order to download changes from a remote to a local repository, run git pull [remote name] [branch name].

When you push and pull to a remote, you may end up with merge conflicts that you’ll need to fix (yay).

bzh@tsunami ~/example-project$ git remote -v

bzh@tsunami ~/example-project$ git remote add origin git@github.com:redplanks/example-project.git

bzh@tsunami ~/example-project$ git remote -v
origin	git@github.com:redplanks/example-project.git (fetch)
origin	git@github.com:redplanks/example-project.git (push)

bzh@tsunami ~/example-project$ git push origin master
Counting objects: 4, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 317 bytes | 0 bytes/s, done.
Total 4 (delta 0), reused 0 (delta 0)
remote: 
remote: Create a pull request for 'master' on GitHub by visiting:
remote:      https://github.com/redplanks/example-project/pull/new/master
remote: 
To github.com:redplanks/example-project.git
 * [new branch]      master -> master

bzh@tsunami ~/example-project$ git pull origin master
From github.com:redplanks/example-project
 * branch            master     -> FETCH_HEAD
Already up-to-date.
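
You can practice pushing and pulling without any server. In this sketch, a bare repository on the local filesystem stands in for the remote, and two working copies (named alice and bob purely for illustration) exchange a commit through it.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"        # a bare repository plays the role of the server

git init -q "$work/alice"
cd "$work/alice"
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo "hello" > file1
git add file1
git commit -q -m "Add file1"
branch=$(git rev-parse --abbrev-ref HEAD)
git remote add origin "$work/remote.git"
git push -q origin "$branch"                 # upload alice's commit

git clone -q "$work/remote.git" "$work/bob"  # bob gets "origin" set automatically
cd "$work/bob"
git config user.name "Bob Example"
git config user.email "bob@example.com"
echo "from bob" > file2
git add file2
git commit -q -m "Add file2"
git push -q origin "$branch"                 # upload bob's commit

cd "$work/alice"
git pull -q origin "$branch"                 # download bob's commit into alice's copy
ls
```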

Summary: the git workflow

make a new repo
||
\/
pull from remote -> add files
/\                      ||
||                      \/
push changes <- commit files

branch and merge/rebase to taste

Gitolite

Generating an SSH key

Gitolite identifies its users through their SSH public key. Since you’ll need to interact with Gitolite from your OCF account on tsunami, in order to complete the lab, you’ll first have to generate an SSH keypair on tsunami.

On tsunami.ocf.berkeley.edu, create a directory called .ssh in your home directory if it doesn’t already exist. Within .ssh, run ssh-keygen and hit enter until it finishes generating a public-private keypair. You’ll need to copy the contents of .ssh/id_rsa.pub later when installing Gitolite.
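
If you would like to see what ssh-keygen produces before touching your real ~/.ssh, here is a non-interactive sketch that writes a keypair into a temporary directory. The -N "" flag sets an empty passphrase (the equivalent of hitting enter at the prompts), and -t rsa is passed explicitly so the files are named id_rsa and id_rsa.pub as in this lab, since some OpenSSH versions may default to a different key type.

```shell
set -e
keydir=$(mktemp -d)                              # stand-in for ~/.ssh
ssh-keygen -q -t rsa -N "" -f "$keydir/id_rsa"   # writes id_rsa (private) and id_rsa.pub (public)
cat "$keydir/id_rsa.pub"                         # the public half: safe to share
ssh-keygen -l -f "$keydir/id_rsa.pub"            # print the key's fingerprint
```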

Installing Gitolite

Log into your DigitalOcean VM and install Gitolite using apt. The package name in the apt system is gitolite3. When you are prompted for the administrator’s SSH key, paste in the contents of id_rsa.pub (not id_rsa!) from the key you just generated on tsunami. If you mess up when configuring the admin SSH key or accidentally paste in the contents of id_rsa, you can run sudo apt purge gitolite3 to remove the package and all of its config files, and try installing it again.

After you’ve finished installing Gitolite, verify that it is running by running the command ssh gitolite3@[your VM's address] info from tsunami. It should produce an output similar to:

hello admin, this is gitolite3@test running gitolite3 3.6.6-1 (Debian) on git 2.11.0

 R W	gitolite-admin
 R W	testing

Configuring Gitolite

Gitolite configuration, unusually, is done by cloning the gitolite-admin git repository hosted on the Gitolite server, modifying it, and sending the updated repository back by pushing.

In order to clone the configuration repository to tsunami, run git clone gitolite3@[username].decal.xcf.sh:gitolite-admin from tsunami. If all goes well, a folder called gitolite-admin will appear in your current directory.

Adding a New User

As the administrator of the Gitolite server, you automatically get an account on the Gitolite server (Gitolite accounts are entirely separate from Linux user accounts). For the purposes of this lab’s checkoff, we’ll have you add a new user to the Gitolite server, which will be done by adding a new file in the keydir/ subdirectory that contains the SSH public key of the new Gitolite user.

Navigate to the directory keydir under the gitolite-admin folder. Create a new file named decal-checkoff.pub using the editor of your choice, and paste in this SSH public key:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDcNRJUAALGBkFfjxC3Vf/qBbeCdPeHHXXKIcTLYghhHNKK1Ob5MQObs6dDGRmur+x6SlDCTokTK+xK/xRoHMwZNKFVJC8fneRjztjy3iy2wap5PzaVw36AvuC+j/N2eInWSf6INQnKvea5/kOlKYMajpI0HOsSxwRvFbMtU0wMX2jhyTalsC7qJkkZuWITBeSTG3iFCMCZ1K3RAk7oUF3dbv36ePAu5lDoq+tyxrH9WJLQeCNkZLn/56m6fY1FMdPLJNWgyE+7QtvhJB/hvrdt9/QiRx1hWHFCJNgYagfAw3FwKKL7859JRy4fLKL4OCjvC/F7cfFi29G96whlYVUx test@test

After the key has been added, you’ll need to add, commit, and push the file to the Gitolite server through Git. Assuming you are in keydir/, run git add decal-checkoff.pub to track the file under Git, git commit with a message describing the changes made in the commit, and finally git push origin to push the changes up to the Gitolite server.

Adding a new Git repository

Now that you’ve added a new Gitolite user, you’ll need to create a new Git repository that they can access. In Gitolite, this is done by modifying the conf/gitolite.conf file within the gitolite-admin repository.

Edit gitolite.conf to create a new repository with name decal-checkoff with RW+ access given to both admin (you) and decal-checkoff. Hopefully, you can figure out what format Gitolite expects by looking at the entries for the admin and testing repositories. The only gotcha is that multiple users have to be separated by a space and nothing else.
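
As a rough illustration of the general shape of an entry, here is a fragment for a made-up repository. The repository name my-notes and the users alice and bob are hypothetical and are not part of this lab; Gitolite user names correspond to the key file names in keydir/ (alice.pub defines the user alice).

```
repo my-notes
    RW+     =   alice bob
```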

Once you have done that, follow the same Git workflow as with decal-checkoff.pub to add, commit, and push the changes to gitolite.conf to the Gitolite server.

If you’ve done everything correctly, running ssh gitolite3@[username].decal.xcf.sh info should display something like:

hello admin, this is gitolite3@test running gitolite3 3.6.6-1 (Debian) on git 2.11.0

 R W	decal-checkoff
 R W	gitolite-admin
 R W	testing

However, please make sure that you have added the decal-checkoff user and given it permissions for the decal-checkoff repository, as we will be using the ssh info command to check you off for the lab, and the command only works if the decal-checkoff user has been added properly.

Backups

We’ll be briefly practicing using rsync, a powerful file transfer tool often used in backup scenarios. It is difficult to overstate the degree to which rsync and the rsync delta-transfer algorithm are used in the real world.

On a superficial level, rsync works similarly to scp - it copies files from src to dst, optionally over the network, over SSH. What makes rsync in particular so powerful is its ability to keep two directories synchronized while transferring little beyond the absolute difference in files between the two directories. It does this by calculating the differences between blocks in a file, and only transferring the difference (the ‘delta’). By default, it assumes that a file is unchanged if its size and last-modified time are the same on both sides. For example, if one line in a 1-GB text file is changed, rsync won’t transfer the entire file again, but will send only the blocks that changed.

The basic syntax of rsync is rsync [options] [[user]@host:]source [[user]@host:]dest - you can transfer files either within the local filesystem, or between the local system and a network host. The source and destination cannot both be remote.

rsync has dozens of options, but the most common are the following:

-r --recursive: recursively transfer all files and subdirectories in the source
-v --verbose: verbosely print out file names and transfer speeds
-P --partial --progress: keep partially transferred files if the transfer is interrupted, show progress of file transfers
-a --archive: combination option, implies -r, and preservation of most file metadata properties
--append (no short form): append data to the end of files on the destination that are smaller than the corresponding file on the source (useful for log files, dangerous if used on files that do not only grow by appending data); -u is the unrelated --update option, which skips files that are newer on the destination
-z --compress: compress files before transfer (Only useful for rsyncing over the network)

Suppose you keep all your homework in a directory “school”, in your home directory:

user@machine [~/school] $ ls
cs198-8
cs61a
cs262
some_r1a
ds8

and you want to back everything up to your student VM. A command like:

$ rsync -rvPaz ~/school user@user.decal.xcf.sh:school-backup/

would result in a copy of your school directory being made inside the school-backup directory on your student VM.

Play around with rsync yourself. We’ve provided some test files here. (hint: use wget). After extracting the archive, (tar xzvf rsync-example.tar.gz), you’ll see two directories with over 100MB of files in them.

What happens if you try rsyncing dir1 to dir2 using rsync -racv dir1/ dir2/? (hint: it should finish almost instantaneously). Why might this be the case?
What rsync commands would you use to keep both directories in sync, from updating the source to dest and dest to source?
What happens if you truncate one of the files in the dest directory (e.g. dir2, by truncate --size=5M file2.txt) and then try rsync from source to dest again?

Conclusion

Now that you know how to set up a Gitolite server and add users and repositories to it, you can use your VM (before we delete it at the end of the semester) as a git remote to back up your programming projects, fanfic collections, etc., free from the prying eyes of our GitHub overlords.

Checkoff

The checkoff form is here. Please double-check that you’ve completed the Gitolite section successfully, as we will be running ssh gitolite3@[username].decal.xcf.sh info to check you off on the Gitolite section. If you’d like to double-check that you added the decal-checkoff user properly, you can add another new user with a different SSH key (maybe from your laptop or an OCF desktop), and run the ssh info command from that machine.