Version control is the collective memory of the project. It's what lets you move files from one machine to another without clobbering stuff you just spent three hours writing and without worrying about whether you forgot something important. It also lets you undo your mistakes: if you spend an hour or two going down the wrong path, and want to get back to where you were, version control lets you do it reliably with a single command. And if all that wasn't enough, version control keeps track of who did what so that you know who to turn to with questions.
Dozens of version control systems exist. CVS was the workhorse of the open source world for many years; it was replaced by Subversion, which fixed many of its predecessor's flaws while introducing a few minor ones of its own. Both of these were centralized systems:
One repository stored the definitive copy of the project's files.
Nobody ever edited the contents of the main repository directly. Instead, everyone worked in a local copy.
In order to share files with other people (or to create a backup, which is really just a way to share files with your future self) people would push the contents of their copy to the main repository. To get other people's work, they would pull changes from the main repository and combine them with their own work.
Centralized version control systems have largely been replaced by decentralized ones, and in particular by Git. In theory, Git doesn't need a main repository: developers can merge the contents of any repository into any other. In practice, almost every project designates one repository as the master copy so that people know where to look to find the current state of the project.
Unfortunately, Git has a needlessly complicated interface. [PerezDeRosso2013] found that even experienced users have a mental model of how it works that contradicts its actual operation in important ways, and each of those contradictions produces a steady stream of "what the hell?" moments. Other distributed version control systems like Mercurial are proof that this complexity and pain are unnecessary. The fact that most people don't immediately realize that the random Git manual page generator is a hoax says a lot as well.
So why do people keep using Git? The answer these days is, "Because it's the tax they have to pay in order to use GitHub." At the time of writing, GitHub has over 40 million users and hosts over 28 million public repositories, including those for many well-known open source projects. It is easily the most popular software portal in existence, and offers all of the tools a small software team needs. Other portals exist, such as Bitbucket and GitLab, but GitHub's share of the educational market is even larger than its share among professional developers. If you're using anything in class, you're almost certainly using it, and it's probably helping you become a better programmer [Hsing2019].
Why can't we fix it?
If Git's interface is a problem, why can't we build a new one? [PerezDeRosso2016] tried, but as they report, the gravity of the existing interface is simply too powerful: as soon as people run into a problem and start searching online for solutions, they're thrown back into the world of original Git.
This chapter won't try to teach you Git from scratch: GitHub's guides and the Atlassian Git tutorial do an excellent job of that, as does the Carpentries lesson on Git. Instead, we will review the basics that we hope you have learned previously, then look at how to use Git and GitHub to collaborate in Chapter 8. We will show the commands as if you were running them in the Unix shell, but we recommend that you use a graphical interface like GitKraken, SourceTree, or the one that comes with your IDE (Chapter 17). These are layered on top of the commands we are going to discuss, so they (should) all work the same way.
When I am working on a solo project or in a small team, seven commands account for roughly 85% of my Git activity. Adding two more commands to set things up produces a toolkit that uses Git as a file backup system.
The first step is to make sure that Git knows who we are by telling it our name and email address:
$ git config --global user.name "Peggy Springsteen" $ git config --global user.email "firstname.lastname@example.org"
Breaking this down:
Git commands are written as
git verb, where
verb is a sub-command telling Git exactly what we want
--global option is used to specify a value for the command in the same
way that parameters are used to pass values to functions. In this case,
we're telling Git that we want to configure things globally (i.e., for every
project we have on this computer). We can also configure things
repository-by-repository if we want to have a different name or email
address for different projects.
user.name is the thing we want to configure. There are a lot of these;
git config --list displays them all.
Finally, we specify the values we want
have. We wrap these in quotes because they contain spaces and the
symbol, which might otherwise confuse the shell.
Now that Git knows who we are, let's set up a
project. If we are starting from scratch, we create a directory, go into
it, and run
git init. This may or may not print out some messages depending on
what version of Git you have and how much of its output your GUI shows (if
you're using one). Either way, this command creates a sub-directory called
.git inside your project directory. That special sub-directory is what makes
something a project: it stores the data Git uses to keep track of what files you
have and how they've changed.
Don't edit the files in your
.gitdirectory yourself—it will have the same unfortunate effect as editing a spreadsheet or an image as if it was a text file. If you'd like to know more about what they're for and how Git uses them, please see [Chacon2014] or [Cook2019].
If your instructor or one of your teammates has already created a project, you
git init. Instead, you will use
git clone followed by the
project's URL to get a local copy called a clone.
For example, if you want a clone of this book, you can do
$ git clone https://github.com/gvwilson/buildtogether.tech.git
This will create a directory with the same name as the project, create a
sub-directory inside it, and download the project's history so that you can
Regardless of how you create your repository, you can use
git log to look at
its history. If I run this command right now for
this book, I get:
$ git log
commit d4351c4f093f60d03f303737b66b28ebb6b9ed45 Author: Greg Wilson <email@example.com> Date: Fri Feb 19 09:48:37 2021 -0500 Writing the first section of the chapter on version control. commit 80d38a8cbf650431fe4719ec768bd890e00c7431 Author: Greg Wilson <firstname.lastname@example.org> Date: Thu Feb 18 11:21:00 2021 -0500 Adding more citations to the description of team formation. commit 6e30bd5e5af2c3491f25f014c03d5e9ff5c79879 Author: Greg Wilson <email@example.com> Date: Wed Feb 17 20:48:04 2021 -0500 Moving the discussion of code review into its own chapter. ...
Each entry has:
A line labeled
commit followed by a large hexadecimal number. This number
is a unique label for the state of the
project's files at that point in time. if we want to refer to a particular
version of our project, we can use this or its first half-dozen digits (such
6e30bd) so long as the latter is unambiguous.
Two lines showing who saved the state of the project and when. The name and
email address in the
Author field are the ones we set up with
config; the timestamp is accurate to the second, and includes time zone
information like the
-0500 showing that I'm in Eastern time so that anyone
in the world can tell exactly when these files were saved.
A short comment called a commit message that tells us what this change is all about. We will take a look in Chapter 8 at how to write a good commit message; for now, just remember that if you and your teammates have made a hundred changes to the project over the course of ten or twelve weeks, you're going to want something more informative than "Fixed stuff."
All right: what are commits and where do they come from? A commit is a snapshot of the project at a particular moment in time; we create them using a command like:
$ git commit -m "Made the status bar display the user name"
commit is the verb and the
-m (short for "message") option is followed
by the comment we want to save in the log.
The most popular question on Stack Overflow
If you use Git on the command line and you don't provide a message using the
-moption, it will launch an editor so that you can write a longer message. This is a good thing, except that the default editor on most Unix systems is something called Vim, whose interface is no more intuitive than Git's. (In fact, one of the most frequently-viewed questions on Stack Overflow is "How do I exit the Vim editor?". Unsurprisingly, another frequently-viewed question on Stack Overflow is "How do I make Git use the editor of my choice for my commits?" One of the many reasons you should interact with Git through a GUI is to avoid this issue.
Before we run
git commit, though, we need to tell Git which files we want to
save in the commit, which we do using
$ git add version-control.md _data/glossary.yml
One way to think about
this is that
git add puts things in a box to be shipped, while
commit actually sends the package. Git requires us to do this in two steps
because we might change our mind about what we want to store: for example, we
git add a file, then realize we need to make a few more edits,
it again, and then
git commit. Alternatively, we might add a bunch of files,
then realize that some of them (like editor backup files or temporary files
created by the compiler) shouldn't be saved, so we take them out before
Teach us to care and not to care
You can tell Git to ignore certain kinds of files by putting their names, or patterns that match multiple names, in a file called
.gitignore. For example, the
.gitignorefile for this project includes:
*.pyc *~ .DS_Store .jekyll-cache .jekyll-metadata .sass-cache __pycache__ _site
Be careful not to put files containing passwords or API keys for web services into version control: even if the repository is private now, it might be public in future, or the team might grow to include someone who shouldn't have access (Chapter 13).
We can keep track of which changes haven't yet been added and which ones have
git status. If I run this command
right now in this book's project I get:
$ git status
On branch main Your branch is up to date with 'origin/main'. Changes to be committed: (use "git restore --staged <file>..." to unstage) modified: _includes/links.md modified: bibliography.md modified: version-control.md Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) modified: version-control.md
The first paragraph titled "Changes to be committed" tells me which files I have
asked Git to save using
git add. The second paragraph, "Changes not staged for
commit", shows that I have modified
version-control.md (this chapter) since I
last asked Git to save a snapshot. Both paragraphs tell me that I can use
restore with or without the
--staged option to put a file back the way it was
if I decide I don't want to save the changes I've made.
I can use
git restore to recover an
old version of a file from any previous commit. Being able to do this was the
original motivation for version control systems, and is still one of the main
reasons people use them. For example, if I want to get the version of this file
from two days ago, I can use
git log to find the commit ID
$ git restore --source 2be70844 version-control.md
I can also count backward from where I am now. The most recent commit has the
special symbolic name
HEAD; the expression
means "the one before it", while
HEAD~2 goes back two commits and so
on. Regardless of how I specify what I want, restoring an old version doesn't
erase any of the ones I have saved since then: the project's history stays
Finally, I should make sure there's a second physical copy of my work so that if my drive fails or my laptop is stolen I don't lose everything I've done. If I created the repository by cloning something on GitHub, then Git will automatically have created a bookmark called a remote that points at the original repository. I can get a list of remotes like this:
$ git remote -v
origin https://github.com/gvwilson/buildtogether.tech.git (fetch) origin https://github.com/gvwilson/buildtogether.tech.git (push)
-v option (short for "verbose") tells Git to print more than just the
remote's name. The remote itself is called
origin, and Git lists two URLs for
it because in theory you can download (or "fetch") from one and upload (or
"push") to another. (I have been using Git for sixteen years, and have never
once needed this feature.)
One of the differences between a version control system like Git and a file backup tool like Dropbox is that Git doesn't automatically synchronize local changes to the remote repository. If I want to save everything I've done locally on GitHub, I have to push them explicitly:
$ git push origin main
The verb is
push; the word
origin identifies where I want to send things,
and the word
main identifies the branch I'm on. We'll discuss branches in the
next section, but for now, you can run
git branch to see which ones you have
and which one you're working in.
The counterpart of
git push is
git pull. It downloads changes from the
remote repository and merges them into your local copy:
$ git pull origin main
Pushing and pulling changes allows you and your teammates to synchronize your work. They're also very useful operations if you're working on your own and using two or more computers (such as a personal laptop and your school's servers).
Clean and build
Many instructors require learners to submit work by committing it to a Git repository. One way to check that what works for you will work for whoever is grading it is to clone a fresh copy of the project in a temporary directory and make sure that everything builds and runs there. Doing that will tell you if you or one of your teammates has forgotten to commit a key file. In an advanced course, you might be asked to do this automatically every time someone commits changes; we'll explore this in Chapter 17.
So far we have only used a sequential timeline with Git: each change builds on the one before, and only on the one before. However, there are times when we want to work on several things at once. To do this, we can use branches to work on separate tasks in parallel. Each branch is like a parallel timeline: changes made to one branch have no effect on other branches unless and until we explicitly merge them.
We can see what branches exist in a repository like this:
$ git branch
When we initialize a repository, Git automatically creates a branch called
master; most people now rename this to
main by running:
$ git branch -m main
immediately after running
git init. The
branch is usually considered the "official" version of the repository,
i.e., the version of the project that should be graded or published for other
people to use. The asterisk
* indicates that it is currently active, i.e.,
that all changes we make will take place in this branch by default.
To create a new branch called
homework3 we run:
$ git branch homework3
The name of the branch should indicate what it's for, just like the names of
files and variables. We can check that the branch exists by running
$ git branch
* main homework3
Our branch is there, but the
* shows that we are still in the
To switch to our new branch we use the
$ git checkout homework3 $ git branch
main * homework3
We haven't made any changes since switching to the
homework3 branch, so at
homework3 are at the same point in the repository's
history. Commands like
git log therefore show that the files and
history haven't changed.
Where branches are saved
Git saves every version of every file in the repository's
.gitdirectory. When we switch from one branch to another, it copies files out of there and rearranges directories to restore that state of the world.
Why go to all this trouble? Because it allows us to work on several things at
once without stepping on our own toes, just as putting variables inside objects
and classes allows us to ignore the details of this when we're working on
that. For example, if we are close to finishing homework 3 but want to get an
early start on homework 4, we can create a new branch from
homework4 and start setting things up in there.
When we are done, we can merge the state of one branch back into another. Merging doesn't change the source branch, but once it's done, all of the changes made there are in the destination branch.
To see what the differences are between two branches, we use
git diff with those branches' names:
$ git diff homework3..main
More generally, we can use
git diff to compare any two states of the
repository, including old versions with current ones:
$ git diff HEAD~3..HEAD
diff --git a/bin/html2tex.py b/bin/html2tex.py index 4c756f4..10efe1c 100755 --- a/bin/html2tex.py +++ b/bin/html2tex.py @@ -35,10 +35,13 @@ def html2tex(options): '''Main driver.''' update_numbering(options.numbering) config = utils.read_yaml(options.config) - filenames = get_filenames(options.site, config) + entries = get_filenames(options.site, config) accum =  - for f in filenames: - convert_file(f, accum) + for [kind, filename] in entries: + if kind == 'entry': + convert_file(filename, accum) + elif kind == 'appendix': + accum.append('\n\\appendix\n') result = ''.join(accum) display(options, config, result)
The output marks additions with a
+ and deletions with a
-. A line that has
changed is shown as a deletion followed by an addition, and the lines marked
@@ show where in the file the change occurred.
See the difference
You have to be a bit of a masochist to read diffs like this; it's a lot easier using a GUI like DiffMerge. You can use other tools to view diffs between files that aren't plain text, but only if such tools exist. They don't for many common file formats: for example, there isn't an easy way to see the differences between two version of an SVG diagram or between two spreadsheets. If you are looking for projects to work on that people will actually use, these would be good ones.
Once we're sure we actually want to merge changes, we do so like this:
$ git merge homework3 main
Git automatically creates a new commit to represent the merge. If we now run
git diff main..homework3, Git doesn't print anything because there aren't any
differences to show.
After we merge the changes from
main there is no need to keep
homework3 branch, so we can delete it:
$ git branch -d homework3
Deleted branch homework3 (was 1577404).
main went smoothly, but if we are going to use
branches, we must learn how to merge conflicts. These occur when a line has been changed
in different ways in two branches or when a file has been deleted in one branch
but edited in the other.
If the file
README.md has been changed in both
git diff will show the conflict:
$ git diff homework4..main
When we try to merge
main, Git doesn't know which of these
changes to keep:
$ git merge homework4 main
After we run this command, Git has put both sets of changes into
but has marked which came from where:
$ cat README.md
The lines from
<<<<<<< HEAD to
======= are what was in
main, while the
lines from there to
>>>>>>> docs show what was in
homework4. If there were
several conflicting regions in the same file, Git would mark each one this way.
Once again, you have to hate yourself a little to view these conflicts as raw
text files; even legacy text editors like Emacs will highlight them,
and Git GUIs will help you view and edit these regions.
We have to decide what to do next: keep the
main changes, keep those from
homework4, edit this part of the file to combine them, or write something new.
Whatever we do, we must remove the
<<< markers. Once we are
done, we can add the file and commit the change like we would any other edit:
$ git add README.md $ git commit -m "Merging README additions"
Our branch's history will now show a single sequence of commits with the
changes on top of the earlier
$ git log --oneline -n 4
If we want to see what happened, we can add the
--graph option to
$ git log --oneline --graph -n 4
At this point we can delete the
homework branch or switch back to it and do
some more work. Each time we switch to it, we merge changes from
homework4, do our editing (while switching back to
main or other
branches as needed to work on the code), and then merge from
main once the documentation is updated.
One way to make the history of a repository easier to read is to squash several consecutive commits into one. This is called rebasing, and can be done using:
$ git rebase -i START
STARTidentifies the commit before the ones you want to start merging (i.e., the last one not to modify). Rebasing can go wrong in a lot of confusing ways, particularly if you have merged changes from another branch into the one you're squashing, so we recommend that you avoid it for schoolwork.
Branches can be confused, but this workflow will help you keep track of what you are doing:
git checkout main to make sure you are in the
git checkout -b name-of-feature to create a new branch. Always create a
branch when making changes, since you never know what else might come up.
The branch name should be as descriptive as a variable name or filename
Make your changes. If something occurs to you along the way—for example,
if we are writing a new function and realize that the documentation for some
other function should be updated—don't do that work in this branch.
Instead, commit our changes, switch back to
main, and create a new branch
for the other work.
When the new feature is complete, use
git merge to get any changes you
main after creating
name-of-feature and resolve any
conflicts. This is an important step: you want to test that everything
works while you are in your feature branch, not in
Finally, switch back to
git merge name-of-feature main to merge
your changes into
main. You should not have any conflicts, and all of
your tests should pass.
Most developers use this branch-per-feature workflow, but what exactly is a "feature"? These rules make sense for small projects:
Anything cosmetic that is only one or two lines long can be done in
and committed right away. "Cosmetic" means changes to comments or
documentation: nothing that affects how code runs, not even a simple
A pure addition that doesn't change anything else is a feature and goes into a branch. For example, if you are adding a feature to the user interfaces, that should be done on its own branch because it might take several tries to get right, and you might interrupt yourself to fix things you discover along the way.
Every bug fix is also done in a separate branch
The hardest thing about using a branch-per-feature workflow is sticking to it for small changes. As the first point in the list above suggests, most people are pragmatic about this on small projects; on large ones, where dozens of people might be committing, even the smallest and most innocuous change needs to be in its own branch so that it can be reviewed (which we discuss below).