Submitting DaVinci jobs to GRID
Rationale
- The main limitation of the local `DaVinci` docker setup is that raw data files (`.dst`) need to be downloaded locally. Given that the size of `.dst` files is on the order of TBs, this method is only used for developing `DaVinci` scripts and first-order validation.
- On `lxplus`, several official `DaVinci` versions are provided. However, there are some drawbacks:
    - `.dst` files still need to be downloaded to some `lxplus` user directory.
    - While `DaVinci` is running, the connection to `lxplus` must be kept alive.

The LHCb collaboration provides a solution: submitting and running `DaVinci` jobs on the GRID. The advantages are:

- The GRID knows how to access `.dst` files directly, so there's no need to download them manually. Instead, users specify the LFNs (logical file names) of the desired `.dst` files.
- While GRID `DaVinci` jobs are running, there's no need to keep a connection to `lxplus` alive.
The rest of this doc is divided into two parts:

- GRID job preparation and submission on `lxplus`
- Offline slimming and annexing of the produced ntuples on a local machine, most likely `glacier`
GRID job preparation and submission on lxplus
Setup LHCb GRID certificate
Follow this twiki to:

- Apply for a GRID certificate
- Set up the certificate on `lxplus`
Note
The twiki claimed that a new user must physically go to CERN's user office to be able to apply for a new cert via ca.cern.ch. But I didn't have to do that.
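Once the certificate is issued, converting the browser-exported `.p12` bundle into the PEM pair that GRID tools expect usually goes like this (standard openssl commands; the `.p12` file name is an example):

mkdir -p ~/.globus
openssl pkcs12 -in myCertificate.p12 -clcerts -nokeys -out ~/.globus/usercert.pem
openssl pkcs12 -in myCertificate.p12 -nocerts -out ~/.globus/userkey.pem
chmod 444 ~/.globus/usercert.pem
chmod 400 ~/.globus/userkey.pem  # the key must be readable only by you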
Compile a local DaVinci on lxplus
We use some non-official TupleTools, so we need to compile DaVinci on lxplus ourselves.
First, figure out the appropriate platform for our DaVinci version, v45r6:
$ lb-run -L DaVinci/v45r6
WARNING:lb-run:Using default siteroot of /cvmfs/lhcb.cern.ch/lib
x86_64+avx2+fma-centos7-gcc9-opt
x86_64-centos7-gcc9-opt
x86_64-centos7-gcc9-dbg
x86_64-centos7-gcc9-do0
For v45r6, we pick the following platform: x86_64-centos7-gcc9-opt.
Note that DaVinci v45r6 only provides CentOS7-based platforms.
However, CentOS7 reached End-of-Life in June 2024 (see OTG0145248), and the CentOS7-based lxplus7 nodes are no longer available.
Consequently, our DaVinci jobs cannot be natively run on the available EL9-based lxplus nodes, so we need to use containers.
To set up a CentOS7 container, run the following commands (adapted from the Ganga FAQ):
apptainer exec --bind $PWD --env LBENV_SOURCED= --bind /cvmfs:/cvmfs:ro /cvmfs/lhcb.cern.ch/containers/os-base/centos7-devel/prod/amd64/ bash --rcfile /cvmfs/lhcb.cern.ch/lib/LbEnv
lb-set-platform x86_64-centos7-gcc9-opt
export LCG_hostos=x86_64-centos7
Then proceed to compile DaVinci from inside the container.
First, set the chosen platform as the default for lxplus by adding the following to your login shell config:
export CMTCONFIG=x86_64-centos7-gcc9-opt
export BINARY_TAG=$CMTCONFIG
Now, run this script
anywhere on lxplus (from inside the container) to build DaVinci with our customizations.
The compiled DaVinci will be available at `$HOME/build/DaVinciDev_v45r6`.
Once DaVinci is compiled, you can exit the container.
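For orientation, the core of such a build script typically follows the standard lb-dev workflow sketched below. The package and branch names here are placeholders, not our actual customizations; the script linked above is authoritative.

mkdir -p $HOME/build && cd $HOME/build
lb-dev DaVinci/v45r6                  # creates the DaVinciDev_v45r6 satellite project
cd DaVinciDev_v45r6
git lb-use Analysis                   # register the repo holding the custom TupleTool (placeholder)
git lb-checkout Analysis/run2-patches Phys/DecayTreeTuple  # placeholder branch/package
make                                  # builds for the platform set in $CMTCONFIG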
Note
Our Ganga script also requires specific settings for DaVinci to be run in a container on the grid (see the Ganga FAQ). If at some point we update our DaVinci version and the use of a container is no longer necessary, those settings should be removed.
Configure ganga
`ganga` is a command-line interface for the LHCb GRID. We need to configure the `ganga`
job output directory. On `lxplus`:

- Run `ganga` once. This should create a `.gangarc` in `$HOME`.
- Locate the `gangadir` option and point it to your larger AFS storage, for example: `gangadir = /afs/cern.ch/user/s/suny/work/gangadir`
- Go into your `gangadir` and create a `workspace` symlink pointing to somewhere in your EOS space, for example: `workspace -> /eos/home-s/suny/gangadir-workspace`
- Copy `ganga/ganga.sample.py` to `$HOME/.ganga.py` on `lxplus`.
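Concretely, the workspace symlink can be created like this (using the example paths above):

cd /afs/cern.ch/user/s/suny/work/gangadir
mkdir -p /eos/home-s/suny/gangadir-workspace
ln -s /eos/home-s/suny/gangadir-workspace workspace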
Submitting a job with ganga
Submitting jobs is very simple: you copy a previous script that has similar features (e.g. data, MC tracker-only), modify it appropriately, and run it. The detailed steps are:

- Log on to `lxplus` and get a proxy with `lhcb-proxy-init`.
- Clone `lhcb-ntuples-gen` and go to the appropriate `jobs` folder, for instance `lhcb-ntuples-gen/run2-rdx/jobs/`.
- Copy a script with similar characteristics and rename it with the current date and some description, for instance `cp 22_02_03-tracker_only_Bs.sh 22_02_03-tracker_only_ddx_22to25.sh`.
- Modify the new script appropriately, for instance adding the MC IDs that you want to run. The script automatically submits MagDown and MagUp jobs for each sample.
- Run the script, e.g. `./22_02_03-tracker_only_ddx_22to25.sh`.
- If there are no errors in the submission, commit the script to `git`.

    Note

    An error such as `GangaDiracError: All the files are only available on archive SEs. It is likely the data set has been archived. Contact data management to request that it be staged (consider --debug option for more information)` simply means that the files you are trying to run on do not exist, perhaps because you have a typo.

- Monitor the status of your jobs by entering interactive `ganga` (simply type `ganga` in `lxplus` after getting a proxy) and typing `jobs`.
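For example, a minimal monitoring session in the `ganga` shell looks like this (job index 63 is illustrative; these are the same registry operations used throughout this doc):

Ganga In [1]: jobs                                      # overview of all jobs
Ganga In [2]: jobs[63].subjobs                          # per-subjob view of one job
Ganga In [3]: jobs[63].subjobs.select(status="failed")  # list only the failed subjobs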
Usage of ganga_jobs.py
See this appendix.
Manage ganga job output
You need to keep `ganga` running in an `lxplus` session to get up-to-date info
about your jobs and to download the output of completed (sub)jobs.
Update subjob status and force status to be "failed" when necessary
Sometimes `ganga` gets stuck updating job statuses. To reset the status of "completing" and "failed" subjobs, do:
jobs[63].backend.reset(True)
If that still doesn't bring a job to a stable state (i.e. "finished" or "failed"), force the job to fail:
jobs[63].force_status("failed", force=True)
Handling failing subjobs
GRID jobs are split into subjobs, enabling parallel execution. Some subjobs may fail. Consider the following `jobs` output in a `ganga` shell:
[01:30:22]
Ganga In [2]: jobs
Ganga Out [2]:
Registry Slice: jobs (30 objects)
--------------
fqid | status | name | subjobs | application | backend | backend.actualCE | comment | subjob status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
62 | completed |Dst_D0--20 | 26 | GaudiExec | Dirac | None | | 0/0/0/26
63 | failed |Dst_D0--20 | 1007 | GaudiExec | Dirac | None | | 0/1/0/1006
66 | failed |Dst_D0--20 | 1 | GaudiExec | Dirac | None | | 0/1/0/0
73 | completed |Dst_D0--20 | 772 | GaudiExec | Dirac | None | | 0/0/0/772
Jobs 63 and 66 are marked as failed because all of their subjobs are either completed or failed.
To resubmit failed subjobs for, say, jobs[63]:
jobs[63].resubmit()
However, the above won't work unless all subjobs are either completed or failed. To resubmit only the failed subjobs right away:
jobs[63].subjobs.select(status="failed").resubmit()
If you forced a "failed" status, some subjobs may be in the "killed" state. A
simple `jobs[63].resubmit()` won't resubmit these killed subjobs. To resubmit them:
jobs[63].subjobs.select(status="killed").resubmit()
There's a helper function to print failed subjobs:
show_subjobs(63)
show_subjobs(63, status='running') # show subjobs of 'running' status
If some jobs keep failing, consider the backend site bad and ban it in one of the following places:

- The `ban_site_for_job` helper function in interactive `ganga`:

    ban_site_for_job(63, 'LCG.site.a')
    ban_site_for_job(63, ['LCG.site.a', 'LCG.site.b'])  # you can ban multiple sites at once

    Then `resubmit` the job using the methods described above.

- `ganga/ganga_jobs.py` in this repo: in `BannedSites`.
- `$HOME/.ganga.py` on `lxplus`: in the `remake_uncompleted_job` function parameter.
Sometimes it's necessary to remake jobs entirely (say, we want to change the number of files per subjob). To do so:
remake_uncompleted_job(63)
Info
The `remake_uncompleted_job` function creates a new job for each failed subjob,
and adds the subjob index to the `comment` property of the new job.
After the remade job has finished successfully, merge all of its output `.root`
files and place the merged file in the correct directory of the original failing job.
Ntuples can be merged with:
hadd -fk <name>.root */output/*.root
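For example, merging a remade job's output and putting it back in place might look like this (job index 70, original job 63, subjob 10, and the file name are all illustrative; the layout follows the gangadir workspace structure described above):

cd ~/eos/gangadir-workspace/suny/LocalXML/70  # the remade job
hadd -fk merged.root */output/*.root
mv merged.root ../63/10/output/               # original job 63, failed subjob 10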
Slimming and annexing of GRID ntuples
Offline slimming
Before you proceed
Clone `lhcb-ntuples-gen` on `glacier` and set up `git-annex` normally.
Also, don't forget to run `make install-dep` in the nix shell!
We prefer not to merge `.root` files at all. Still, we need to give the output files sane
names and slim them to remove unneeded branches.
Yipeng has prepared a `ganga` function that generates a bash script to aid this process.
To use it:
Before you proceed
The whole procedure takes a long time. It's recommended to do it inside a `tmux` session.
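For example, the standard tmux workflow:

tmux new -s slim     # start a named session and work inside it
# if disconnected, log back in and reattach with:
tmux attach -t slim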
1. Use `scp` to copy finished jobs to a folder `<glacierntuples>` under your `$HOME` on `glacier` (see the sketch after this list).

    Example

    Say your `gangadir` is at `~/eos/gangadir-workspace/suny/LocalXML` and you want to process job index `63`; then copy the folder `63` to `glacier`.

2. In a `ganga` shell, type in `hadd_completed_job_output(63)`, where `63` is the job index.

    Info

    This will generate a `batch_skim.sh` in `$HOME` on `lxplus`. The generated script contains commands to process the output `.root` files of all completed jobs with an index greater than or equal to the specified one. A sample `batch_skim.sh` is displayed in the appendix.

3. Copy `batch_skim.sh` to your `lhcb-ntuples-gen` root folder on `glacier`, then give it execute permission with `chmod +x batch_skim.sh`.

    Alternative slimming

    At some point, we decided to change the folder structure for storing our rdx tuples (to make files easier to find; note that this procedure won't work for other tuples, e.g. `JpsiK`). This can be achieved either by continuing with this procedure and following the (new) step 6, or by using `scripts/step1_slim.py`, which is easier (and also automatically keeps track of your total events and failed `ganga` subjobs!). To use this script, with your copied `batch_skim.sh` inside `lhcb-ntuples-gen/`, run `python scripts/step1_slim.py <ganga_output_dir> <new_slimmed_ntuple_dir> batch_skim.sh` (e.g. `python scripts/step1_slim.py ../ntuples_to_merge ntuples/0.9.12-all_years batch_skim.sh`).

    This script automatically chooses `postprocess/skims/rdx_mc.yml` for the slimming. If you use it, the rdx folder structure should already be correct, and you can skip steps 4-6.

4. Set the input folder and postprocessing rules as needed in `batch_skim.sh`:

    INPUT_DIR=<glacierntuples>
    SKIM_CONFIG=./postprocess/skims/rdx_mc.yml # NOTE: Make sure you pick the right one!!!

    Info

    If you don't want to remove any branches, then replace

    ./ganga/ganga_skim_job_output.py ${OUTPUT_DIR}/$2 ${INPUT_DIR}/$1 ${SKIM_CONFIG}

    with

    ./ganga/ganga_skim_job_output.py ${OUTPUT_DIR}/$2 ${INPUT_DIR}/$1 ${SKIM_CONFIG} --copy

    noting the use of the `--copy` flag.

5. Enter a nix shell in your `lhcb-ntuples-gen` with `nix develop`, then run the following in the project root:

    ./batch_skim.sh ntuples/<folder_to_ntuple_output>

    For example, `<folder_to_ntuple_output>` can be `0.9.6-2016-production/Dst_D0-mc-tracker-only`.

6. We decided to change our folder naming scheme, so before annexing (in principle this also works after annexing, but it's safer to do it before), if you did not follow the alternative slimming procedure above, run `python scripts/move_step1_ntuples.py <step1_tuple_dir_old_paths> <output_dir>` (e.g. `python scripts/move_step1_ntuples.py ntuples/0.9.100-new_rdx_tuples ntuples/0.9.100-new_rdx_tuples`).

    This script assumes your `rdx-run2-analysis` and `lhcb-ntuples-gen` repos live in the same directory; if they don't, you'll have to specify an additional parameter, `-s <path_to_rdx-run2-analysis>/fit/spec/histos.yml` (the script references our rdx templates to create the new folder naming scheme).
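For step 1, the copy can be as simple as this (run from lxplus; the glacier SSH alias and destination folder are assumptions):

scp -r ~/eos/gangadir-workspace/suny/LocalXML/63 glacier:~/ntuples/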
Annex ntuples
Info
We decided to use a pull-request-based workflow for ntuple annexation to minimize errors.

...but if you consider yourself a trusted user (i.e. you are comfortable adding tuples to the annex), you can ignore this section and just annex and commit the tuples on your own.
- Create a new branch in your `lhcb-ntuples-gen` project on `glacier`, then check out that branch:

    git branch <branch_name>
    git checkout <branch_name>

    Example

    git branch yipeng-ddx_part1
    git checkout yipeng-ddx_part1

- Go to the folder that contains your newly slimmed ntuples, then do `git annex add .` and commit the changes with `git commit . -m <comment>`.

    Example

    cd ntuples/0.9.6-2016-production/Dst_D0-mc-tracker-only
    git annex add .
    git commit . -m "Part 1 of DDX MC"

- Push this branch to GitHub with `git push origin <branch_name>`, then create a pull request.

    Info

    If the `git push origin <branch_name>` fails, check your `git` history and make sure you didn't add ntuples (large files) directly with `git add`. GitHub will refuse files larger than 100 MB.

    Once everything checks out, Manuel or Yipeng will merge the request.

- Once the request is merged, do the following:

    git checkout master
    git pull origin master

    Verify that your annexed ntuples are there, then do the following in the `lhcb-ntuples-gen` project root:

    git annex sync
    git annex copy . --to glacier
    git annex sync # YES, you need to do it twice: once before copying, and once after!

- Once you finish all these steps, inform Manuel or Yipeng.
Info
For more info on git-annex usage, review the
git-annex entry.
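As a quick sanity check that the annexed ntuples really ended up on glacier, the standard git-annex queries below help (the directory path is an example):

git annex whereis ntuples/0.9.6-2016-production/Dst_D0-mc-tracker-only  # shows which remotes hold each file
git annex find --in=glacier                                             # lists files whose content is present on glacier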
Appendix
Sample batch_skim.sh
#!/usr/bin/env bash
INPUT_DIR=~/ntuples # NOTE: Configure this before proceeding!!!
SKIM_CONFIG=./postprocess/skims/rdx_mc.yml # NOTE: Make sure you pick the right one!!!
OUTPUT_DIR=$1
MIN_NTUPLE_SIZE=500 # in KiB
RED='\033[0;31m'
NC='\033[0m' # No Color
function check_job () {
local error=0
local job_dir=${INPUT_DIR}/$1
local num_of_subjobs=$2
echo "Verifying output for Job $1, which has $2 subjobs..."
# might require GNU find
local folders=$(find $job_dir -mindepth 1 -maxdepth 1 -type d -regex ".*/[0-9]*" | wc -l)
if [[ $folders -ne $num_of_subjobs ]]; then
echo -e "${RED}Found $folders subjobs, which =/= $num_of_subjobs. Terminate now.${NC}"
exit 1
fi
for sj in $(ls $job_dir | grep -E "^[0-9].*$"); do
local file=$(find $job_dir/$sj/output -name '*.root')
if [[ -z $file ]]; then
let "error++"
echo -e "${RED}subjob $sj: ntuple missing!${NC}"
else
local size=$(du -b $file | awk '{print int($1 / 1024)}') # in KiB
if [ $size -lt ${MIN_NTUPLE_SIZE} ]; then
let "error++"
echo -e "${RED}subjob $sj: ntuple has a size of $size KiB, which is too small!${NC}"
fi
fi
done
if [ $error -gt 0 ]; then
echo -e "${RED}Job $1 output verification failed with $error error(s).${NC}"
exit $error # exit early to make errors easy to see
fi
return $error
}
function concat_job () {
check_job $1 $2
if [ $? -eq 0 ]; then
./ganga/ganga_skim_job_output.py ${OUTPUT_DIR}/$3 ${INPUT_DIR}/$1 ${SKIM_CONFIG}
fi
}
concat_job 180 111 Dst_D0--22_02_07--mc--tracker_only--MC_2016_Beam6500GeV-2016-MagDown-TrackerOnly-Nu1.6-25ns-Pythia8_Sim09k_Reco16_Filtered_11894600_D0TAUNU.SAFESTRIPTRIG.DST.root
General usage of ganga_jobs.py
For this repo, there is a central ganga job submitter that should handle
all job submissions for all analyses with all reconstruction
scripts in all folders.
The script is located at ganga/ganga_jobs.py. The general syntax is:
ganga_jobs.py <reco_script> <cond_files>
For instance, for run 1 \(R(D^{*})\) signal Monte Carlo:
ganga_jobs.py run1-rdx/reco_Dst_D0.py run1-rdx/cond/cond-mc-2012-md-sim08a.py -p mu -P Pythia6 -d 11574020
Note
- The `-p` and `-P` flags are optional; they have default values.
- The `-d` flag has a dummy default of `00000000`. For actual values, consult the data sources.
Info
The usage of ganga_jobs.py is described by:
ganga_jobs.py --help
Most submissions are wrapped in shell scripts. Some of them can be found here.
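As a reference, such a wrapper might look like the following sketch. The MagUp condition file name and the loop are assumptions for illustration; the actual scripts in the jobs folders are authoritative.

#!/usr/bin/env bash
# Hypothetical wrapper: submit one MC sample for both magnet polarities.
# cond-mc-2012-mu-sim08a.py (MagUp) is an assumed counterpart of the
# MagDown file shown above.
for cond in cond-mc-2012-md-sim08a.py cond-mc-2012-mu-sim08a.py; do
    ganga/ganga_jobs.py run1-rdx/reco_Dst_D0.py "run1-rdx/cond/${cond}" \
        -p mu -P Pythia6 -d 11574020
done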