git-annex
It is harder to manage large (100 MB or more) files in git, as checking them
in/out would take a much longer time. Also, it is undesirable to expose raw
data outside of the collaboration.
We use a git addon, git-annex, to manage large files. git-annex stores all
tracked files under <project_root>/.git/annex, and link/copy these files to
the expected locations.
A typical workflow to add a file to the annex, commit it, and sync the remotes, is
git annex add <file> ## Moves file to the annex, and replaces it with a soft link in the git repo
git add <file>
git commit -m "Committed <file> (well, a soft link to it)"
git push
git annex copy --to glacier <file> ## Copies the actual file to glacier
## Make sure you do not have uncommitted changes in the repo, because the sync commits everything
git annex sync
Initialize git-annex repository
Before you proceed
This needs to be done only once for each repository!
We have a private server1 that hosts git repositories with git-annex
capabilities.
After cloning the umd-lhcb/lhcb-ntuples-gen repository from github, add our private repository:
git remote add julian git@129.2.92.92:lhcb-ntuples-gen
Note
Please send us a SSH key so that we can give your read/write permission on
the git-annex repository.
Then we need to initialize the annex component:
git annex init --version=7
Warning
It is important to use a git-annex repository of v7 or newer!2
To upgrade your git-annex repository to at least v7, issue the
following command inside your git repository:
git annex upgrade
Note
Dropbox will not synchronize any symbolic links, so if the repository is placed within your Dropbox folder, and you have multiple computers, the symbolic links will be replaced by the actual files on all but the initial computers.
Add files
If you are adding large files that are unlikely to change in the future, such
as .dst data files, use the following command:
git annex add <path_to_file>
Note
You typically don't need this. It is left here for completeness.
git add will add files to the git repository, not git-annex
repository by default. Configuration is required to add only .root
files to git-annex, and the rest to git. This has been done for this
repository, in:
<project_root>/.gitattributes
See this article
for more information on configuring .gitattributes.
Change the content of annexed files
Files added via git annex add are read only. For example:
echo change > <path_to_annexed_file>
> bash: <annexed_file>: Permission denied
To change them, we need to unlock them first:
git annex unlock <path_to_annexed_file>
Now you can edit the unlocked file as you wish. After editing, use
git annex add to keep the changes and lock it again.
Note
When you commit, git-annex will notice that you are committing an
unlocked file, add its new content.
A pointer to that content is what gets committed to git; the actual
content will go to git-annex.
Warning
If you don't need to modify the file after all, or want to discard
modifications, use git annex lock.
Doing so will result in all modifications discarded. Proceed with care!
Files added via git add can be changed just like a regular file.
Change the name of annexed files
Once a file has been annexed with git annex add, the actual file will be
moved automatically by git-annex inside .git folder in your project, and
git-annex will create a symbolic link in-place pointing to that file.
So, if you just want to rename the annexed file, without changing its
content, just view that symbolic link as a regular file added to git.
Example
Consider the following example:
- Place
a.rootinfolderA/a.root. - Annex the file with:
git annex add folderA/a.root - Now
folderA/a.rootwill be just a symbolic link, and the actual root file is placed in.gitin your project root - Suppose you want to rename
a.roottob.root. In this case, you can:mv folderA/a.root folderA/b.root git add folderA/a.root folderA/b.root # <-- We are not using annex here!!! git commit -a
Synchronize files between local and remote repositories
First check that you have committed all changes:
git status
Make sure NO entry looks like this:
changes not staged for commit
If there are uncommitted local changes, commit then and write sensible
messages. This way, git annex sync won't make unwanted commits!
Before you proceed
Do git pull origin master to get latest changes from origin first.
Now, you can do this:
git annex sync
Note
The command above doesn't download the actual data; rather, it only download
the metadata so that git annex knows how to download the actual data.
The command above will also make sure your local master is now identical
to remote master. That's why it's better to do git pull origin master
beforehand to avoid surprises.
Error
By default, git annex sync will commit all previously uncommitted
changes before synchronizing!
This can be disabled on a per-repository basis by:
git annex config --set annex.autocommit false
Other clones will also be configured properly after they do a:
git annex sync.
If you want to download every single file from the git-annex repo (which is
probably a couple of GBs), add the --content flag in the second step and
download not only the metadata, but also the data:
git annex sync --content julian
Download and upload individual files
Downloading is simple:
git annex get <path_to_files>
So is uploading:
git annex copy --to julian <path_to_files>
git annex sync
Drop local files
The following command will remove the local copy of the file only, and will not delete the file from remote3:
git annex drop <path_to_files>
Note
git would still think the working directory is clean, i.e. no change has
been made.
Check annexed file size
For a single file this can be done via git annex info. For example:
$ git annex info ntuples/pre-0.9.0/Dst-std/Dst--19_09_05--std--data--2012--md.root
file: ntuples/pre-0.9.0/Dst-std/Dst--19_09_05--std--data--2012--md.root
size: 1.8 gigabytes
key: SHA256E-s1800364650--cb5222668f21032b81ede5f18eb86026e21188c54441917258e8aad4d072f791.root
present: false
For directories, we have a home-made wrapper script scripts/count_root_files.py. For example:
$ ./scripts/count_root_files.py ntuples
2 .root total: 171.30 MiB local: 0.00 KiB ntuples/0.9.1-partial_refit
2 .root total: 171.30 MiB local: 0.00 KiB ntuples/0.9.1-partial_refit/Dst_D0-cutflow_mc
5 .root total: 47.49 GiB local: 0.00 KiB ntuples/ref-rdx-run1
1 .root total: 397.92 MiB local: 0.00 KiB ntuples/ref-rdx-run1/Dst-mc
1 .root total: 29.62 GiB local: 0.00 KiB ntuples/ref-rdx-run1/D0-mix
1 .root total: 1.70 GiB local: 0.00 KiB ntuples/ref-rdx-run1/Dst-std
1 .root total: 1.60 GiB local: 0.00 KiB ntuples/ref-rdx-run1/D0-std
1 .root total: 14.18 GiB local: 0.00 KiB ntuples/ref-rdx-run1/Dst-mix
2 .root total: 0.98 GiB local: 0.00 KiB ntuples/0.9.0-cutflow
2 .root total: 0.98 GiB local: 0.00 KiB ntuples/0.9.0-cutflow/Dst-cutflow_mc
7 .root total: 37.90 GiB local: 0.00 KiB ntuples/pre-0.9.0
2 .root total: 46.24 MiB local: 0.00 KiB ntuples/pre-0.9.0/Dst-cutflow_mc
2 .root total: 17.50 GiB local: 0.00 KiB ntuples/pre-0.9.0/Dst-cutflow_data
1 .root total: 179.56 MiB local: 0.00 KiB ntuples/pre-0.9.0/Dst-mc
2 .root total: 20.19 GiB local: 0.00 KiB ntuples/pre-0.9.0/Dst-std
Info
If you are in the nix shell, count_root_files.py is added to PATH so
you can call it directly.
Check annexed file availability
We can use git annex list for this. For example:
$ git annex list ntuples/0.9.0-cutflow
here
|Julian
||origin
|||web
||||bittorrent
|||||
_X___ ntuples/0.9.0-cutflow/Dst-cutflow_mc/Dst--20_06_05--cutflow_mc--bare--MC_2011_Beam3500GeV-2011-MagDown-Nu2-Pythia8_Sim08h_Digi13_Trig0x40760037_Reco14c_Stripping20r1NoPrescalingFlagged_11874091_ALLSTREAMS.DST.root
_X___ ntuples/0.9.0-cutflow/Dst-cutflow_mc/Dst--20_06_05--cutflow_mc--bare--MC_2016_Beam6500GeV-2016-MagDown-Nu1.6-25ns-Pythia8_Sim09b_Trig0x6138160F_Reco16_Turbo03_Stripping26NoPrescalingFlagged_11874091_ALLSTREAMS.DST.root
-
As of now, the server is sitting on Yipeng's desktop. It is named
Julian, after Julian Schwinger. ↩ -
Deleting files from remote is dangerous! As the remote might be the last copy of the file so we may lose the file permanently.
Still, if you insist, please refer to the official guide. ↩