GIT: Revision Control in the Sciences

Work in Progress: For lab meeting on August 20th, 2013 and General communication about GIT in the Sciences
by

Brandon King

on 20 August 2013

Transcript of GIT: Revision Control in the Sciences

GIT: Revision Control in the Sciences
Brandon W. King
Knight Lab
UC Berkeley

August 20th, 2013
http://bit.ly/git-prezi

What is revision control?
(a.k.a. version control)

Reversibility of File Changes
Concurrency
Annotation
Why?
References
Where's the Real Bottleneck in Scientific Computing? Greg Wilson, Univ. of Toronto (http://www.americanscientist.org/libraries/documents/200512511610_307.pdf)
Why use version control? (http://www.catb.org/~esr/writings/version-control/version-control.html#why_vcs)
Reproducibility
"Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method."
~Wikipedia: Reproducibility
Science Research Code
What version of your code was used to generate your results?
Are there bugs in any of the versions of your code? What version did they get introduced in?
Is your code published so your results can be reproduced?
Definition of version: every time a file is saved (Ctrl-S), that is a new version.
Are you sure you know which version you used?
Most software engineers struggle with being able to answer this question... They have tools to answer it though. Do you?
A one character change can cause a bug.
Article: "Publish your computer code: it is good enough"
13 October 2010 | Nature 467, 753 (2010) | doi:10.1038/467753a
Blog post: Data reproducibility
http://blog.shazino.com/articles/science/data-reproducibility/
Revision Control: Instant Backup*
* When the master repository is hosted on another machine or service, such as GitHub.
Nature Article: "Announcement: Reducing our irreproducibility"
24 April 2013, http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852
Point:
"If you aren't using revision control with any and all code you write in a science lab, you are ignoring a fundamental part of the scientific method." -Brandon King
Okay, okay... But...
But, the good news: it's easier than you think and there are benefits...
Sharing code is easy*
More on this later...
Collaboration is Easier
Revision Control is Easier
Image Source: http://codicesoftware.blogspot.com/2010/11/version-control-timeline.html
SCCS
1970s
http://www.flourish.org/blog/?p=397
40+ Years of solving the revision control problem.
Maybe it is worth seeing how these tools can help you?
Maybe you want the productivity gains that come with it?
Now let's move on to the myths and benefits...
Myths and Benefits
Myth 1: Only teams need revision control
The instant backup copy alone is worth using git on EVERY piece of code or text-based document you write.
The annotation feature is like a lab notebook. It makes it easy to see why you did what you did and when. And it's quicker than writing in a notebook. Simple.
Myth 2: I don't want people to edit my code.
Getting someone to actually edit your code [in a lab setting] is surprisingly hard.
If someone does, you can revert to your version (it is version control after all).
GitHub makes it easy to fork code (if you want control) and resynchronize if you want changes (Google: 'github fork' and 'github pull request').
Little side note
The policy I made in the lab was that every person in the lab has read-write access to every repository.

No code was harmed due to this policy.

In general, people are just as afraid to edit your code as you are of them editing it. You can accept, reject or revert contributions. It is what revision control was designed to deal with.
Myth 3: I don't have time to learn a new technology.
In my experience, it takes someone about 2 hours of practice to get a handle on the minimal features required to get started with git (i.e. to be able to start using it on a project).
The first time you need to go back and look at the annotation history to answer a question, or figure out when a bug was introduced, it will make up for that 2-hour investment. Many of the other benefits can make up for the investment as well...
Code Writing
GIT
1970s
2010s
Suggest a Myth
Have another myth? Email it to kingb @ cal.berkeley.edu.
Benefits
Revision History (Annotation)
Click
Diff: Comparison of versions
D3.js Network Graph (i.e. Designed for Collaboration)
Who wrote what, when, and why.
Commit
User
Timestamps & Commit Comment
Next
History
Myth #4: My code isn't good enough
You are writing code to do science, not as a commercial software product.
It's okay that it's "research quality" code.
Getting feedback from more experienced coders can save you lots of time and prevent you from reinventing the wheel over and over.
On rare occasions it works:
http://www.codinghorror.com/blog/2009/02/dont-reinvent-the-wheel-unless-you-plan-on-learning-more-about-wheels.html
Experience with revision control is something you can put on your resume.
How?
Step 1: Understand the Basic Concepts
I
Repository
The [magical] place where the code revisions are stored and tracked.
"Any sufficiently advanced technology is indistinguishable from magic."
~Arthur C. Clarke
Workspace
The folder where you edit the code or files.
Remote and Local Repository
[Diagram: 'Clone' copies the Remote Repository on GitHub.com* into a Local Repo and Workspace]
* Other choices include: a remote repo on another machine, Gitorious, GitLab, Gitosis, Gitolite.
Local
[Diagram: Workspace, Staging Area, Local Repo]
Staging Area: Hidden area where files go when you are selecting what to 'commit' to the local repository.
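Life of a File: a.txt
[Diagram: a.txt sits in the Workspace, next to the Staging Area and the Local Repo]
Life of a File: a.txt
[Diagram: 'Add' moves a.txt from the Workspace to the Staging Area; 'Commit' moves a.txt from the Staging Area to the Local Repo]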
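The life of a.txt can be traced in a throwaway repository. This is only a minimal sketch: the scratch folder, the file contents and the placeholder identity are assumptions made for illustration and are not from the slides.

```shell
# Scratch repository in a temporary folder (location chosen by mktemp).
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity so the commit works on a fresh machine (assumption).
git config user.name "Demo User"
git config user.email "demo@example.com"

# 1) Workspace: the file exists only in the folder; git sees it as untracked.
echo "first line" > a.txt
git status --short          # a.txt is marked '??' (untracked)

# 2) Staging area: 'git add' selects a.txt for the next commit.
git add a.txt
git status --short          # a.txt is marked 'A' (staged)

# 3) Local repo: 'git commit' records the staged snapshot.
git commit -q -m "Add a.txt"
git log --oneline           # the history now holds one commit
```

Running 'git status' between the steps is the easiest way to see which of the three areas a change currently lives in.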
Life of a File: a.txt
Local --> Remote
[Diagram: 'Push' copies a.txt from the Local Repo to the Remote Repository on GitHub.com*]
Get the latest changes:
A new file: b.txt
[Diagram: 'Pull' copies a.txt and b.txt from the Remote Repository on GitHub.com* into the Local Repo and the Workspace]
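The push and pull steps can be tried without a GitHub account. In this sketch a bare repository on local disk stands in for the remote, and two clones play two collaborators; the folder names 'alice' and 'bob', the identities and the file contents are all hypothetical.

```shell
# A bare repository on local disk stands in for the remote on GitHub.com (assumption).
base=$(mktemp -d)
git init -q --bare "$base/remote.git"

# First collaborator: clone, add a.txt, commit, push.
git clone -q "$base/remote.git" "$base/alice"
cd "$base/alice"
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo "a" > a.txt
git add a.txt
git commit -q -m "Add a.txt"
git push -q -u origin HEAD   # first push; -u also remembers the upstream branch

# Second collaborator: clone (receives a.txt), add b.txt, commit, push.
git clone -q "$base/remote.git" "$base/bob"
cd "$base/bob"
git config user.name "Bob Example"
git config user.email "bob@example.com"
echo "b" > b.txt
git add b.txt
git commit -q -m "Add b.txt"
git push -q

# Back in the first clone: pull brings b.txt into the local repo and workspace.
cd "$base/alice"
git pull -q
ls                           # lists a.txt and b.txt
```

With GitHub the only difference is the clone URL; the add, commit, push and pull steps stay the same.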
Minimal Commands to Get Started
git clone
git status
git add
git commit
git push
git pull
Extra credit/useful:
git diff
git log
git blame
git init

Intermediate
git branch
git checkout
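Three of the "extra credit" commands (git init, git log, git blame) can be tried in a scratch folder. A minimal sketch, assuming a hypothetical notes.txt file and a placeholder identity:

```shell
# 'git init' turns an existing folder into a repository (scratch folder here).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity (assumption)
git config user.email "demo@example.com"

# Two commits, so there is some history to inspect.
printf 'alpha\nbeta\n' > notes.txt
git add notes.txt
git commit -q -m "Start notes"
printf 'alpha\nbeta\ngamma\n' > notes.txt
git add notes.txt
git commit -q -m "Add gamma to notes"

# 'git log' lists the commits, newest first.
git log --oneline

# 'git blame' shows, for each line, the commit and author that last changed it.
git blame notes.txt
```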
GitHub: git clone
1) Copy the clone URL from the repository page on GitHub.

2) In a console, type:
$ git clone <paste in copied url>
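After a clone it can help to check where the copy came from. In this sketch a bare repository on local disk stands in for the GitHub repository, and the folder name 'my_copy' is hypothetical; with GitHub you would pass the copied clone URL instead of the local path.

```shell
# Stand-in for the repository on GitHub (assumption): an empty bare repo on disk.
base=$(mktemp -d)
git init -q --bare "$base/remote.git"

# Clone it into a folder named 'my_copy'.
git clone -q "$base/remote.git" "$base/my_copy"
cd "$base/my_copy"

# 'origin' is the default name git gives to the repository you cloned from.
git remote -v
```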
A Git Tutorial Using GitHub
Create or log in to your GitHub.com account.
Go to:
https://github.com/kingb/revision_control_in_the_sciences
Fork the Repository

This will create your own 'clone' of the revision_control_in_the_sciences repository on GitHub.
Edit diffs/simple.txt
Add 'Ecology' in 'simple.txt' found in the 'diffs' folder.
simple.txt status
[Diagram: the Workspace copy of simple.txt has 'Ecology' in the file; the Local Repo copy does NOT have 'Ecology' in the file]
git status
Check the status of the repository:
$ git status
git diff <file_path> or
git diff
Show all changes:
$ git diff

Show one file change (relative to current working directory):
$ git diff simple.txt
git add <file_path(s)>
Stage simple.txt for commit:
$ git add simple.txt

View the status:
$ git status
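The status, diff and add steps above can be rehearsed on a stand-in file. This is a sketch under assumptions: the scratch repository and the starting contents of diffs/simple.txt are invented here and will differ from the real tutorial repository.

```shell
# Scratch repo with a stand-in diffs/simple.txt (contents are an assumption).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
mkdir diffs
printf 'Biology\nChemistry\n' > diffs/simple.txt
git add diffs/simple.txt
git commit -q -m "Add simple.txt"

# Edit the file in the workspace.
echo "Ecology" >> diffs/simple.txt

git status --short           # ' M diffs/simple.txt': modified, not yet staged
git diff                     # shows the added 'Ecology' line

git add diffs/simple.txt
git status --short           # 'M  diffs/simple.txt': staged for commit
git diff                     # prints nothing: no unstaged changes remain
git diff --staged            # shows the staged 'Ecology' change
```

Plain 'git diff' compares the workspace with the staging area, which is why it goes quiet after 'git add'; 'git diff --staged' compares the staging area with the last commit.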
simple.txt status
[Diagram: 'git add' copies simple.txt from the Workspace to the Staging Area. The Workspace copy has 'Ecology' in the file; the Local Repo copy does NOT have 'Ecology' in the file]
git commit
Commit all staged changes to the repository.

$ git commit -m "Added Ecology to the list."

Or with text editor for message:
$ git commit
simple.txt status
[Diagram: 'git add' copies simple.txt from the Workspace to the Staging Area; 'git commit' copies it from the Staging Area to the Local Repo]
Now the repository has both versions of simple.txt: the one that has 'Ecology' in the file and the one that does NOT have 'Ecology' in the file.
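Because both versions are stored, either one can be read back at any time. A minimal sketch, again on a scratch repository with invented file contents (assumptions, not the real tutorial data):

```shell
# Scratch repo with two committed versions of a stand-in diffs/simple.txt.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
mkdir diffs
printf 'Biology\nChemistry\n' > diffs/simple.txt
git add diffs/simple.txt
git commit -q -m "Add simple.txt"
echo "Ecology" >> diffs/simple.txt
git add diffs/simple.txt
git commit -q -m "Added Ecology to the list."

# The version without 'Ecology' (one commit before the latest).
git show HEAD~1:diffs/simple.txt

# The version with 'Ecology' (the latest commit).
git show HEAD:diffs/simple.txt

# The change between the two versions.
git diff HEAD~1 HEAD -- diffs/simple.txt
```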
Local and Remote Repository Status
[Diagram: versions of simple.txt in the Local Repo and in the Remote Repository on GitHub.com]
git push
Push the commits from the local repository to the remote repository:

$ git push
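Before pushing, it is possible to check which commits the remote does not have yet. A sketch, assuming a bare repository on local disk as a stand-in for GitHub and an invented simple.txt:

```shell
# Stand-in remote (assumption) and one working clone.
base=$(mktemp -d)
git init -q --bare "$base/remote.git"
git clone -q "$base/remote.git" "$base/work"
cd "$base/work"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "Biology" > simple.txt
git add simple.txt
git commit -q -m "Add simple.txt"
git push -q -u origin HEAD        # first push; -u remembers the upstream branch

# A new local commit that the remote does not have yet.
echo "Ecology" >> simple.txt
git add simple.txt
git commit -q -m "Added Ecology to the list."

git status --short --branch       # the branch line reports 'ahead 1'
git log --oneline '@{u}..HEAD'    # the commits that 'git push' would send

git push -q
git log --oneline '@{u}..HEAD'    # prints nothing: local and remote agree
```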
Local and Remote Repository Status
[Diagram: 'git push' copies the new version of simple.txt from the Local Repo to the Remote Repository on GitHub.com]
git pull
Update local repository with changes from the remote repository:

$ git pull
Local and Remote Repository Status
[Diagram: 'git pull' copies versions of simple.txt from the Remote Repository on GitHub.com to the Local Repo]
Summary
git clone
git status
git add
git commit
git push
git pull
Extra credit/useful:
git diff
git log
git blame
git init

Intermediate
git branch
git checkout
Used commands
Not too bad, right?
Unfortunately, you will run into conflicts... but they aren't that bad either with the right setup.
Controlled Conflict Resolution:
A Demo
Step 1: The Demo Repository
Make sure you have the revision_control_in_the_sciences repository forked and cloned as per the 'Fork the Repository' slide (#70).
'cd' to the top level directory, where you will find conflict.py
Detour: Install p4merge (free) or other visual merge tool
http://www.perforce.com/product/components/perforce-visual-merge-and-diff-tools
Set up p4merge
Set p4merge path:
$ git config --global mergetool.p4merge.path <path_to_application>
Set default merge tool to p4merge
$ git config --global merge.tool p4merge

Where <path_to_application> is something like (on Mac OS X):
/Applications/p4merge.app/Contents/Resources/launchp4merge
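If you would rather not touch your global settings while experimenting, the same two keys can be set for a single repository by leaving out --global. A sketch on a scratch repository; the path is the Mac OS X example from above and is an assumption about where p4merge is installed.

```shell
# Scratch repository to hold the per-repository settings.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Without --global, the settings go into this repository's .git/config only.
git config merge.tool p4merge
git config mergetool.p4merge.path /Applications/p4merge.app/Contents/Resources/launchp4merge

# Read the settings back to confirm what git will use here.
git config --get merge.tool
git config --get mergetool.p4merge.path
```

The same 'git config --get' check works after the --global commands, from any folder.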
Back to the Repository
The file we will be creating a conflict in.
Note, there are two branches (a concept we will cover later).
Indicates current branch.
master branch:
conflict_demo branch:
Conflict point
Conflict point
conflict.py
Create conflict: merging conflict_demo branch into master branch
Show the branches
Merge conflict_demo into current (master) branch
Oh look, conflict.
conflict.py in conflicted state
Conflict block 1
Conflict block 2
master branch
conflict_demo branch
It's possible to clean this up manually and mark it as resolved with 'git add'

but there is a better way...
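For reference, the manual route can be sketched on a scratch repository. The conflict.py below is an invented two-line stand-in (an assumption, not the file from the demo repository), so it produces one conflict block rather than two; the function names follow the renames used in this demo.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity (assumption)
git config user.email "demo@example.com"

# Original version: the function is named conflict().
printf 'def conflict(x, y):\n    return x + y\n' > conflict.py
git add conflict.py
git commit -q -m "Add conflict()"
git branch -M master                      # make sure the branch is called 'master'

# The conflict_demo branch renames the function to add().
git checkout -q -b conflict_demo
printf 'def add(x, y):\n    return x + y\n' > conflict.py
git commit -q -am "Rename conflict() to add()"

# The master branch renames the same function to create_conflict().
git checkout -q master
printf 'def create_conflict(x, y):\n    return x + y\n' > conflict.py
git commit -q -am "Rename conflict() to create_conflict()"

# Both branches changed the same line, so the merge stops with a conflict.
git merge conflict_demo || echo "merge stopped: conflict in conflict.py"
git status --short                        # 'UU conflict.py': both sides modified it
cat conflict.py                           # both names appear between conflict markers

# Manual resolution: write the version to keep, mark it resolved, then commit.
printf 'def add(x, y):\n    return x + y\n' > conflict.py
git add conflict.py
git commit -q -m "Merge conflict_demo into master"
```

Here the add() name was kept purely as an example; which side wins is your call, and git only needs the markers gone and the file staged with 'git add'.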
git mergetool
Oops, 2 conflicts
master branch renamed function from conflict() to create_conflict()
conflict_demo branch renamed function from conflict() to add()
original function name was conflict()
The resolved conflict.py view
click to choose conflict(x,y)
click to choose create_conflict(x,y)
click to choose add(x,y)
I made my choices
Resulting in these changes
Save & then quit
status and commit
The message you want to see.
A reminder of what to do next
Commit, and you're done.
Check the current status
Special Thanks
Prof. Robert Knight
Kelly Buchanan
Chris Holdgraf
Matar Haller
Stephanie Martin
Jochem W. Riger
Jamie Lubell
Kira Xie
Avgusta Shestyuk
Knight Lab, UC Berkeley
... and the rest of the lab who will be trying git shortly.
Prof. Barbara Wold
Diane Trout
Joe Roden
Chris Hart
Henry Amrhein
Wold Lab, Caltech
... and the rest of the lab.