Archiving
Archiving simulation inputs, scripts and output data is a common need for computational physicists. Here are some popular tools and workflows to make archiving easy.
HPC Systems: HPSS
A very common tape filesystem is HPSS, used, e.g., at NERSC or OLCF.
What’s in my archive file system?
hsi ls
Is there already something in my archive location?
hsi ls 2019/cool_campaign/
Let’s create a neat directory structure:
Create a new directory on the archive:
hsi mkdir 2021
Create sub-directories per campaign as usual:
hsi mkdir 2021/reproduce_paper
Create an archive of a simulation:
htar -cvf 2021/reproduce_paper/sim_042.tar /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042
This copies all files over to the tape filesystem and stores them as a single .tar archive. The first argument is the new .tar file on the archive file system; all following arguments (there can be multiple, separated by spaces) are paths to directories and files on the parallel file system. Don’t be confused: these tools also create an index .tar.idx file alongside it; just leave that file be and don’t interact with it.
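If you want to bundle multiple directories or files into one archive, you can pass several source paths. In this minimal sketch, the second simulation directory and the notes file are hypothetical placeholder names:
# sim_043 and notes_043.txt are hypothetical placeholder paths
htar -cvf 2021/reproduce_paper/sim_043.tar \
     /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_043 \
     /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/notes_043.txt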
Change the permissions of your archive so that your team can read your files:
Check the Unix permissions via:
hsi ls -al 2021/
hsi ls -al 2021/reproduce_paper/
Files must be group (g) readable (r):
hsi chmod g+r 2021/reproduce_paper/sim_042.tar
Directories must be group (g) readable (r) and group accessible (x):
hsi chmod -R g+rx 2021
Restore things:
mkdir here_we_restore
cd here_we_restore
htar -xvf 2021/reproduce_paper/sim_042.tar
This copies the .tar file back from tape to our parallel filesystem and extracts its content into the current directory.
Argument meaning: -c create; -x extract; -v verbose; -f tar filename.
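If you only want to inspect an archive’s contents without extracting it, htar also offers a table-of-contents mode (-t), analogous to tar:
htar -tvf 2021/reproduce_paper/sim_042.tar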
That’s it, folks!
Note
Sometimes, for large directories, htar takes a while.
You could then consider running it as part of a (single-node, single-CPU) job script, e.g., as sketched below.
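A minimal sketch of such a batch script, assuming a Slurm-based system; the job name, account and time limit are site-specific placeholders:
#!/bin/bash
#SBATCH --job-name=archive_sim_042
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=04:00:00
# placeholder: replace with your own project/account
#SBATCH --account=m1234

# archive one simulation directory to HPSS, as above
htar -cvf 2021/reproduce_paper/sim_042.tar \
     /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042
Submit it with sbatch, e.g., sbatch archive_sim_042.sh (the script file name is up to you).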
Desktops/Laptops: Cloud Drives
Even for small simulation runs, it is worth creating data archives. A good location for such an archive might be the cloud storage provided by one’s institution.
Tools like rclone can help with this, e.g., to quickly sync a large number of directories to a Google Drive.
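A minimal sketch with rclone, assuming a cloud remote named gdrive was configured once; the local and remote paths are hypothetical placeholders:
# one-time, interactive setup of the cloud remote (here named "gdrive")
rclone config
# copy a local campaign directory to the cloud drive (paths are placeholders)
rclone copy --progress 2021/reproduce_paper gdrive:archives/2021/reproduce_paper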
Asynchronous File Copies: Globus
The scientific data service Globus makes it easy to perform large-scale data copies, between HPC centers as well as local computers, through a graphical user interface. Copies can be kicked off asynchronously, often use dedicated internet backbones, and are verified once the transfer is complete.
Many HPC centers also add their archives as a storage endpoint, and one can download a client program to add one’s own desktop/laptop as an endpoint, too.
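Transfers can also be scripted via the Globus command-line interface; in this minimal sketch, the endpoint UUIDs and the destination path are placeholders you would look up for your centers:
globus login
# endpoint UUIDs and paths below are placeholders
globus transfer --recursive --label "archive sim_042" \
    SOURCE_ENDPOINT_UUID:/global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042 \
    DESTINATION_ENDPOINT_UUID:/archive/2021/reproduce_paper/sim_042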
Scientific Data for Publications
It is good practice to make computational results accessible, scrutinizable and ideally even reusable.
For data artifacts up to approximately 50 GB, consider using free services like Zenodo and Figshare to store supplementary materials of your publications.
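A minimal sketch for bundling supplementary material before uploading to such a service; the directory names are hypothetical placeholders:
# bundle inputs, analysis scripts and reduced data (placeholder directory names)
tar -czvf supplementary_sim_042.tar.gz inputs/ analysis_scripts/ reduced_data/
# record a checksum that can be quoted in the upload description
sha256sum supplementary_sim_042.tar.gz > supplementary_sim_042.sha256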
For more information, see the open science movement, open data and open access.
Note
More information, guidance and templates will be posted here in the future.