Archiving and compressing
In this video we will be working with a ZIP file that you can download and unpack with
$ wget http://bit.ly/bashfile -O bfiles.zip
$ unzip bfiles.zip
Unlike an SSD or a hard drive on your laptop, the filesystem on an HPC cluster is designed to store large files, ideally with parallel I/O. As a result, it handles large numbers of small I/O requests (reads or writes) very poorly, sometimes bringing the I/O system to a halt. For this reason, we strongly recommend that users do not store many thousands of small files; instead, pack them into a small number of large archives. This is where the archiving tool tar comes in handy.
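For example, if unpacking bfiles.zip leaves you with a directory containing many small files (here we use a hypothetical directory name manyFiles/), you could pack it into a single archive and later extract it like this:
$ tar cvf manyFiles.tar manyFiles/   # create (c) an archive file (f), listing files verbosely (v)
$ tar tf manyFiles.tar               # list (t) the contents of the archive
$ tar xvf manyFiles.tar              # extract (x) the archive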
Working with tar and gzip/gunzip (8 min)
Covered topics: tar and g(un)zip.
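As a quick reference for the commands covered in the video, here is a minimal sketch (the archive and directory names are just examples):
$ gzip manyFiles.tar                     # compress the tar file, producing manyFiles.tar.gz
$ gunzip manyFiles.tar.gz                # decompress it back to manyFiles.tar
$ tar cvzf manyFiles.tar.gz manyFiles/   # or create a gzip-compressed (z) archive in one step
$ tar xvzf manyFiles.tar.gz              # and extract it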
Managing many files with Disk ARchiver (DAR)
tar is by far the most widely used archiving tool on UNIX-like systems. Since it was originally designed for sequential writes and reads on magnetic tapes, it does not index data for random access to its contents. A number of third-party tools can add indexing to tar. However, there is a modern alternative to tar called DAR (short for Disk ARchiver) that has some nice features:
- each DAR archive includes an index for fast file list/restore,
- DAR supports full / differential / incremental backup,
- DAR has built-in compression on a file-by-file basis, which makes archives more resilient against data corruption and lets it skip already compressed files such as video,
- DAR supports strong encryption,
- DAR can detect corruption in both headers and saved data and recover with minimal data loss,
and so on. Learning DAR is not part of this course. If you want to know more about working with DAR, please watch our DAR webinar (scroll down to see it).
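Just to give you a flavour of DAR's command-line interface, here is a minimal sketch (the archive name sample and the directory manyFiles/ are hypothetical; dar writes numbered slices such as sample.1.dar):
$ dar -c sample -R manyFiles/ -z   # create a compressed (-z) archive of the contents of manyFiles/
$ dar -l sample                    # list the archive's contents using its built-in index
$ dar -x sample                    # extract the archive into the current directory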