Lab 9 - Version Control and Backups
Overview
In this lab, we’ll be talking about using version control, in particular Git, for keeping track of changes to your files and backing up your files to other machines.
We’ll be setting up Gitolite, an open-source Git server that allows you to host Git repositories on other machines,
and set access to Git repositories hosted on the Gitolite server.
A Git Primer
Version control tools are designed to keep track of changes to files, and allow you to rewind to some last saved state.
Git, as one of the most popular version control tools in use today, supports these features, but also allows for more advanced workflows such as splitting a filebase into “branches” for experimentation.
In order to complete this lab, you’ll need to know a few basic git commands and some terminology.
Getting started in git: Initializing/cloning a repository
A “repository” is just a fancy way of saying “a folder where git is set up.”
In order to start working with git, you’ll need to either create a repository from scratch,
or “clone” (download and copy) a repository from somewhere else.
A folder can be made into a git repository by running git init
, telling git that files under the folder can start being tracked by git. When a repository is initialized, git creates a hidden .git
folder within the root folder to track the files’ histories.
bzh@tsunami ~$ mkdir example-project
bzh@tsunami ~$ cd example-project/
bzh@tsunami ~/example-project$ git init
Initialized empty Git repository in /home/b/bz/bzh/example-project/.git/
A git repository can also be cloned from online by running git clone [repository url]
, which will create a copy of the repository
on your machine, and set the address you cloned the repo from to be the “origin” remote (more on this later).
bzh@tsunami /tmp$ git clone https://github.com/ocf/ocfweb.git
Cloning into 'ocfweb'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 13063 (delta 1), reused 0 (delta 0), pack-reused 13053
Receiving objects: 100% (13063/13063), 3.75 MiB | 0 bytes/s, done.
Resolving deltas: 100% (9159/9159), done.
bzh@tsunami /tmp$ cd ocfweb
bzh@tsunami /tmp/ocfweb$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working tree clean
bzh@tsunami /tmp/ocfweb$ git remote -v
origin https://github.com/ocf/ocfweb.git (fetch)
origin https://github.com/ocf/ocfweb.git (push)
Adding files to git
Now that you have a git repository on your local machine, you’ll want
to use it start tracking file history, yeah?
However, files do not have their histories recorded by git by default. In order to tell git to start recording changes made to a file, run git add [file]
within a repository.
In git parlance, a file added through git add
is called a “tracked file”,
and all other files in a repository are “untracked files”.
Git doesn’t track every file in a repository by default because codebases tend to accumulate a lot of cruft
–temporary files, build files, caches, and secrets that would take up
a lot of space if tracked by git.
Basically: only add the files you care about.
bzh@tsunami ~/example-project$ ls
file1 file2 file3 subdir/
bzh@tsunami ~/example-project$ ls subdir/
cruft.pyc script.py
Add a single file:
bzh@tsunami ~/example-project$ git add file1
bzh@tsunami ~/example-project$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: file1
Untracked files:
(use "git add <file>..." to include in what will be committed)
file2
file3
subdir/
Add multiple files at once:
bzh@tsunami ~/example-project$ git add file2 file3
bzh@tsunami ~/example-project$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: file1
new file: file2
new file: file3
Untracked files:
(use "git add <file>..." to include in what will be committed)
subdir/
Add a directory (be careful when using this, since it will add everything within the directory*, which can include cruft files).
*except for things added to .gitignore.
bzh@tsunami ~/example-project$ git add subdir
bzh@tsunami ~/example-project$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: file1
new file: file2
new file: file3
new file: subdir/cruft.pyc
new file: subdir/script.py
Committing
git groups a set of changes into a commit.
Tracked files in a git repository do not have their contents saved in git until git commit
is run.
You can think of git commit
as creating a snapshot of all modifications made to tracked files since the last commit.
In almost every case, you’ll want to use the -m "<message>"
flag when making a commit, describing the changes you made in that commit.
Committing often is very important.
The basic unit of history in git is the commit, so you can’t roll back changes that were made in between commits,
and if you delete something that hasn’t been committed in a long time, you’re out of luck.
bzh@tsunami ~/example-project$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: file1
new file: file2
new file: file3
new file: subdir/cruft.pyc
new file: subdir/script.py
bzh@tsunami ~/example-project$ git commit -m "Hello world."
[master (root-commit) f61de77] Hello world.
5 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 file1
create mode 100644 file2
create mode 100644 file3
create mode 100644 subdir/cruft.pyc
create mode 100644 subdir/script.py
Adding files (again): Staging
If you’ve added and committed some files, and then modified them after the commit, even though the files are tracked, git will not automatically add them to the next commit. You’ll need to “stage” them by running git add
again.
bzh@tsunami ~/example-project> echo "hello" $ file1
bzh@tsunami ~/example-project$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: file1
no changes added to commit (use "git add" and/or "git commit -a")
bzh@tsunami ~/example-project$ git add file1
bzh@tsunami ~/example-project$ git status
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
modified: file1
Branching
Imagine that you want to develop a new feature, feature1
, in example_project
.
However, the feature would require a lot of time and change many files in example_project
,
and you don’t want to break the project while you’re working on the feature. What can you do?
You could add the feature in a single giant commit, but that runs directly counter to the philosophy of “commit early, commit often.”
To solve this, git has the concept of branching. You can think of a branch as a series of commits across time.
The main branch, and the one usually reserved for stable production code is master
.
You can create a new branch using git branch [branch name]
; in our example, we would run:
bzh@tsunami ~/example-project$ git branch feature1
In order to start adding commits to the branch, you have to checkout the branch. This will move our “view” of the files from the parent branch to the new branch.
More technically, git checkout [branch name]
shifts a marker called HEAD
from the tip of the current branch to the tip of the new branch.
bzh@tsunami ~/example-project$ git status
On branch master
nothing to commit, working tree clean
bzh@tsunami ~/example-project$ git checkout feature1
Switched to branch 'feature1'
bzh@tsunami ~/example-project$ git status
On branch feature1
nothing to commit, working tree clean
You can combine the “create new branch, then checkout” commands by running git checkout -b [branch name]
.
Now that you’ve checked out the new branch, you can make commits to it, and those commits won’t show up in the parent branch, and vice versa.
bzh@tsunami ~/example-project$ ls
file1 file2 file3 subdir/
bzh@tsunami ~/example-project$ git rm file1
rm 'file1'
bzh@tsunami ~/example-project$ git commit -m "removed file1."
[feature1 e80901d] removed file1.
1 file changed, 0 insertions(+), 0 deletions(-)
delete mode 100644 file1
bzh@tsunami ~/example-project$ ls
file2 file3 subdir/
bzh@tsunami ~/example-project$ git checkout master
Switched to branch 'master'
bzh@tsunami ~/example-project$ ls
file1 file2 file3 subdir/
Merging
Imagine you’ve completed feature1
, and you want to use it in the master
branch.
You can merge a branch into its parent using the command git merge [branch you want to merge into your current branch]
.
This will integrate the tip of the branch into its parent’s tip, so changes from the branch show up in its parent.
bzh@tsunami ~/example-project$ git status
On branch master
nothing to commit, working tree clean
bzh@tsunami ~/example-project$ ls
file1 file2 file3 subdir/
bzh@tsunami ~/example-project$ git merge feature1
Updating f61de77..e80901d
Fast-forward
file1 | 0
1 file changed, 0 insertions(+), 0 deletions(-)
delete mode 100644 file1
bzh@tsunami ~/example-project$ ls
file2 file3 subdir/
Merge conflicts: the worst part of git
If you’ve edited the same file in both branches, when you try to merge them,
you’ll run into merge conflicts.
git will refuse to merge the branches together until you’ve gone to the offending files and fixed the conflicts.
This is beyond the scope of this primer, but here is a good guide on it.
Rebasing: an alternative to merging
Instead of merging, you can rebase a branch on top of another one. using git rebase [branch you want to revase on top of your current branch]
.
This will take all the commits you’ve made to the source branch, and apply them to the tip of the destination branch.
To use a tree analogy, if merging is like a branch growing back into the trunk, rebasing is like cutting off the branch, and grafting it onto the top of the trunk.
Many people like rebasing a lot better than merging, since rebasing will produce a nicer history.
Both have their own benefits and drawbacks, as presented nicely here.
bzh@tsunami ~/example-project$ git checkout master
Switched to branch 'master'
bzh@tsunami ~/example-project$ ls
file1 file2 file3 subdir/
bzh@tsunami ~/example-project$ git rebase feature1
file1 | 0
1 file changed, 0 insertions(+), 0 deletions(-)
delete mode 100644 file1
First, rewinding head to replay your work on top of it...
Fast-forwarded master to feature1.
bzh@tsunami ~/example-project$ ls
file2 file3 subdir/
A small aside: git status
As you’ve seen, the git status
command is very useful.
It displays which branch you’re on, which commit you’re viewing, tracked, untracked, staged, and unstaged files.
Pushing and Pulling
After you’ve made some commits, you’ll want to share them with other people,
or upload them to a production server.
In the other direction, you’ll want to periodically download changes made by other people to the repository, to keep up to date.
In order to do this, git has you to set remotes. A remote is, literally, another git repository that you can upload to/download from.
To see which remotes you have, and what URLs they point to, you can run git remote -v
.
Adding a remote is done by running git remote add [remote name] [remote url]
.
If you’ve cloned a repository from elsewhere, you probably won’t need to do this, since the clone URL is automatically set as the “origin” remote
(which is naming convention for the primary remote of a project).
In order to upload changes on a local repository to a remote, run git push [remote name] [branch name]
.
In order to download changes from a remote to a local repository, run git pull [remote name] [branch name]
.
When you push and pull to a remote, you may end up with merge conflicts that you’ll need to fix (yay).
bzh@tsunami ~/example-project$ git remote -v
bzh@tsunami ~/example-project$ git remote add origin git@github.com:redplanks/example-project.git
bzh@tsunami ~/example-project$ git remote -v
origin git@github.com:redplanks/example-project.git (fetch)
origin git@github.com:redplanks/example-project.git (push)
bzh@tsunami ~/example-project$ git push origin master
Counting objects: 4, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 317 bytes | 0 bytes/s, done.
Total 4 (delta 0), reused 0 (delta 0)
remote:
remote: Create a pull request for 'master' on GitHub by visiting:
remote: https://github.com/redplanks/example-project/pull/new/master
remote:
To github.com:redplanks/example-project.git
* [new branch] master -> master
bzh@tsunami ~/example-project$ git pull origin master
From github.com:redplanks/example-project
* branch master -> FETCH_HEAD
Already up-to-date.
Summary: the git workflow
make a new repo
||
\/
pull from remote -> add files
/\ ||
|| \/
push changes <- commit files
branch and merge/rebase to taste
Gitolite
Generating an SSH key
Gitolite identifies its users through their SSH public key. Since you’ll need to interact with Gitolite from your OCF account on tsunami, in
order to complete the lab, you’ll first have to generate an SSH keypair on tsunami.
On tsunami.ocf.berkeley.edu
, create a directory called .ssh
in your home directory if it doesn’t already exist.
Within .ssh
, run ssh-keygen
and hit enter until
it finishes generating a public-private keypair. You’ll need to copy the contents of .ssh/id_rsa.pub
later when installing Gitolite.
Installing Gitolite
Log into your DigitalOcean VM and install Gitolite using apt
. The package name in the apt system is gitolite3
.
When you are prompted for the administrator’s SSH key, paste in the contents of id_rsa.pub (not id_rsa
!) from the key you just
generated on tsunami
.
If you mess up when configuring the admin SSH key or accidentally paste in the contents of id_rsa
, you can run sudo apt purge gitolite3
to remove the package and all of its config files, and try installing it again.
After you’ve finished installing Gitolite, verify that it is running by running the command ssh gitolite3@[your VM's address] info
from
tsunami
. It should produce an output similar to:
hello admin, this is gitolite3@test running gitolite3 3.6.6-1 (Debian) on git 2.11.0
R W gitolite-admin
R W testing
Configuring Gitolite
Gitolite configuration, unusually, is done by cloning the gitolite-admin
git repository hosted on the Gitolite server, modifying it,
and sending the updated repository back by pushing.
In order to clone the configuration repository to tsunami, run git clone gitolite3@[username].decal.xcf.sh:gitolite-admin
from tsunami.
If all goes well, a folder called gitolite-admin
will appear in your current directory.
Adding a New User
As the administrator of the Gitolite server, you automatically get an account on the Gitolite server (Gitolite accounts are entirely separate
from Linux user accounts). For the purposes of this lab’s checkoff, we’ll have you add a new user to the Gitolite server, which will be done
by adding a new file in the keydir/
subdirectory that contains the SSH public key of the new Gitolite user.
Navigate to the directory keydir
under the gitolite-admin
folder. Create a new file named decal-checkoff.pub
using the editor of your choice, and paste in this SSH public key:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDcNRJUAALGBkFfjxC3Vf/qBbeCdPeHHXXKIcTLYghhHNKK1Ob5MQObs6dDGRmur+x6SlDCTokTK+xK/xRoHMwZNKFVJC8fneRjztjy3iy2wap5PzaVw36AvuC+j/N2eInWSf6INQnKvea5/kOlKYMajpI0HOsSxwRvFbMtU0wMX2jhyTalsC7qJkkZuWITBeSTG3iFCMCZ1K3RAk7oUF3dbv36ePAu5lDoq+tyxrH9WJLQeCNkZLn/56m6fY1FMdPLJNWgyE+7QtvhJB/hvrdt9/QiRx1hWHFCJNgYagfAw3FwKKL7859JRy4fLKL4OCjvC/F7cfFi29G96whlYVUx test@test
After the key has been added, you’ll need to add, commit, and push the file to the Gitolite server through Git.
Assuming you are in keydir/
, run git add decal-checkoff.pub
to track the file under Git, git commit
with a message describing the changes made in the commit,
and finally git push origin
to push the changes up to the Gitolite server.
Adding a new Git repository
Now that you’ve added a new Gitolite user, you’ll need to create a new Git repository that they can access.
In Gitolite, this is done by modifying the conf/gitolite.conf
file within the gitolite-admin
repository.
Edit gitolite.conf
to create a new repository with name decal-checkoff
with RW+
access given to both admin
(you) and decal-checkoff
.
Hopefully, you can figure out what format Gitolite expects by looking at the entires for the admin
and testing
repositories.
The only gotcha is that multiple users have to separated by a space and nothing else.
Once you have done that, follow the same Git workflow as with decal-checkoff.pub
to add, commit, and push the changes to gitolite.conf
to the Gitolite server.
If you’ve done everything correctly, running ssh gitolite3@[username].decal.xcf.sh info
should display something like:
hello admin, this is gitolite3@test running gitolite3 3.6.6-1 (Debian) on git 2.11.0
R W decal-checkoff
R W gitolite-admin
R W testing
However, please make sure that you have added the decal-checkoff
user and given it permissions for the decal-checkoff
repository,
as we will be using the ssh info
command to check you off for the lab, and the command only works if the decal-checkoff
user has been
added properly.
Backups
We’ll be briefly practicing using rsync
, a powerful file transfer tool often used in backup scenarios. It is difficult to
overstate the degree to which rsync
and the rsync delta-transfer algorithm are used in the real world.
On a superficial level, rsync
works similarly to scp
- it copies files from src
to dst
, optionally over the network,
over SSH. What makes rsync
in particular so powerful is its ability to keep two directories synchronized
while transferring little beyond the absolute difference in files between the two directories. It does this by calculating
the differences between blocks in a file, and only transferring the difference (the ‘delta’). By default, it assumes that a file is unchanged if its last modified time is not newer.
For example, if one line in a 1-GB text file is changed, rsync won’t transfer the entire file again, but will send the one line that changed.
The basic syntax of rsync
is rsync [options] [[user]@host:]source [[user]@host:]dest
- you can either transfer files
in the local filesystem, between the local system and a network host, or between network hosts.
rsync
has dozens of options, but the most common are the following:
-r
--recursive
: recursively transfer all files and subdirectories in the source
-v
--verbose
: verbosely print out file names and transfer speeds
-P
--partial --progress
: keep partially transferred files if the transfer is interrupted, show progress of file transfers
-a
--archive
: combination option, implies -r
, and preservation of most file metadata properties
-u
--append
: append data to the end of files on the destination that are smaller than the corresponding
file on the source (useful for log files, dangerous if used on files that do not only grow by appending data)
-z
--compress
: compress files before transfer (Only useful for rsync
ing over the network)
Suppose you keep all your homework in a directory “school”, in your home directory:
user@machine [~/school] $ ls
cs198-8
cs61a
cs262
some_r1a
ds8
and you want to back everything to your student VM. A command like:
$ rsync -rvPaz ~/school user@user.decal.xcf.sh:school-backup/
would result in a copy of your school
directory being made inside the school-backup
directory on your student VM.
Play around with rsync
yourself. We’ve provided some test files here. (hint: use wget
).
After extracting the archive, (tar xzvf rsync-example.tar.gz
), you’ll see two directories with over 100MB of files in them.
-
What happens if you try rsync
ing dir1
to dir2
using rsync -racv dir1/ dir2/
? (hint: it should finish almost instantaneously). Why might this be the case?
-
What rsync
commands would you use to keep both directories in sync, from updating the source to dest and dest to source?
-
What happens if you truncate one of the files in the dest directory (e.g. dir2
, by truncate --size=5M file2.txt
) and then try rsync
from source to dest again?
Conclusion
Now that you know how to set up a Gitolite server and add users and repositories to it, you can now use your VM (before we delete it at the end of the semester)
as a git remote to back up your programming projects, fanfic collections, etc., free from the prying eyes of our Github overlords.
Checkoff
The checkoff form is here. Please double-check that you’ve completed the Gitolite section successfully, as we will be running ssh gitolite3@[username].decal.xcf.sh info
to
check you off on the Gitolite section. If you’d like to double-check that you added the decal-checkoff user properly, you can add another new user with a different SSH key (maybe from your laptop or an OCF desktop),
and run the ssh info
command from that machine.