Data management at Trillium/en
Understanding the various filesystems, and how to use them properly, is critical to optimizing your workflow and being a good SciNet citizen. This page describes the various Trillium filesystems, and how to properly use them.
Performance¶
The filesystems on SciNet, with the exception of /archive, are GPFS, a high-performance filesystem which provides rapid reads and writes to large datasets in parallel from many nodes. As a consequence of this design, however, the filesystem performs quite poorly at accessing data sets which consist of many, small files. For instance, you will find that reading data in from one 16MB file is enormously faster than from 400 40KB files. Such small files are also quite wasteful of space, as the blocksize for the scratch and project filesystems is 16MB. This is something you should keep in mind when planning your input/output strategy for runs on SciNet.
Warning
If you run multi-process jobs, having each process write to a file of its own is not a scalable I/O solution. A directory gets locked by the first process accessing it, so all other processes have to wait for it. Not only has the code just become considerably less parallel, chances are the filesystem will have a time-out while waiting for your other processes, leading your program to crash mysteriously. Consider using MPI-IO (part of the MPI-2 standard), which allows files to be opened simultaneously by different processes, or using a dedicated process for I/O to which all other processes send their data, and which subsequently writes this data to a single file.
Purpose of each filesystem¶
Trillium accesses several different filesystems. Note that not all of these filesystems are available to all users.
/home ($HOME)¶
/home is intended primarily for individual user files, common software or small datasets used by others in the same group, provided it does not exceed individual quotas. Otherwise you may consider /scratch or /project. /home is read-only on the compute nodes.
/scratch ($SCRATCH)¶
/scratch is to be used primarily for temporary or transient files, for all the results of your computations and simulations, or any material that can be easily recreated or reacquired. You may use scratch as well for any intermediate step in your workflow, provided it does not induce too much IO or too many small files on this disk-based storage pool, otherwise you should consider burst buffer (/bb). Once you have your final results, those that you want to keep for the long term, you may migrate them to /project or /archive. /scratch is purged on a regular basis and has no backups.
/project ($PROJECT)¶
/project is intended for common group software, large static datasets, or any material very costly to be reacquired or re-generated by the group.
Warning
Material on /project is expected to remain relatively immutable over time.
Temporary or transient files should be kept on scratch, not project. High data turnover induces stress and unnecessary consumption tapes on the TSM backup system, long after this material has been deleted, due to backup retention policies and the extra versions kept of the same file. Even renaming top directories is enough to trick the system into assuming a completely new directory tree has been created, and the old one deleted, hence think carefully about your naming convention ahead of time, and stick with it. Users abusing the project filesystem and using it as scratch will be flagged and contacted. Note that on Trillium /project is only available to groups with RAC allocation.
/bb ($BBUFFER)¶
/bb, the burst buffer, is a very fast, very high performance alternative to /scratch, made of solid-state drives (SSD). You may request this resource if you anticipate a lot of IOPs (Input/Output Operations) or when you notice your job is not performing well running on scratch or project because of I/O (Input/Output) bottlenecks. See here for more details.
/archive ($ARCHIVE)¶
/archive is a nearline storage pool, if you want to temporarily offload semi-active material from any of the above filesystems. In practice users will offload/recall material as part of their regular workflow, or when they hit their quotas on scratch or project. That material can remain on HPSS for a few months to a few years. Note that on Trillium /archive is only available to groups with RAC allocation.
/dev/shm (RAM)¶
On the Trillium nodes a ramdisk is available. Ramdisk is much faster than real disk, and faster than Burst Buffer. Up to 70 percent of the RAM on the node (i.e. 202GB) may be used as a temporary local filesystem. This is particularly useful in the early stages of migrating desktop-computing codes to a HPC platform such as Trillium, especially those that use a lot of file I/O (Input/Output). Using a lot of I/O is a bottleneck in large scale computing, especially on parallel filesystems (such as the GPFS used on Trillium), since the files are synchronized across the whole network.
$SLURM_TMPDIR (RAM)¶
For consistency with the general-purpose clusters, the environment variable $SLURM_TMPDIR will be set on Trillium compute jobs. Note that this variable will point to RAMdisk, not to local hard drives. The $SLURM_TMPDIR directory will be empty when your jobs starts and its content gets deleted after the job has finished.
Per-job temporary burst buffer space ($BB_JOB_DIR)¶
For every job on Trillium, the scheduler creates a temporary directory on the burst buffer called $BB_JOB_DIR. The $BB_JOB_DIR directory will be empty when your jobs starts and its content gets deleted after the job has finished. This directory is accessible from all nodes of a job.
$BB_JOB_DIR is intended as a place for applications that generate many small temporary files or that create files that are accessed very frequently (i.e., high IOPS applications), but that do not fit in ramdisk.
It should be emphasized that if the temporary files do fit in ramdisk, then that is generally a better location for them as both the bandwidth and iops of ramdisk far exceeds that of the burst buffer. To use ramdisk, you can either directly access /dev/shm or use the environment variable $SLURM_TMPDIR.
Note that Trillium compute nodes have no local disks so $SLURM_TMPDIR lives in memory (ramdisk), in contrast to general-purpose clusters like Fir or Nibi where this variable points to a directory on a node-local SSD.
Quotas and purging¶
You should familiarize yourself with the various filesystems, what purpose they serve, and how to properly use them. This table summarizes the various filesystems.
| Location | Quota (User) | Quota (Group) | Block Size | Expiration Time | Backed Up | On Login Nodes | On Compute Nodes |
|---|---|---|---|---|---|---|---|
| $HOME | 100 GB per user | N/A | 1 MB | yes | yes | read-only | |
| $SCRATCH | 25 TB per user (if group quota not reached) | Up to 4 users: 50 TB Up to 11 users: 125 TB Up to 28 users: 250 TB Up to 60 users: 400 TB Over 60 users: 500 TB |
16 MB | 2 months | no | yes | yes |
| $PROJECT | N/A | by group allocation | 16 MB | yes | yes | yes | |
| $ARCHIVE | N/A | by group allocation | dual-copy | no | no | ||
| $BBUFFER | 10 TB per user | N/A | 1 MB | very short | no | yes | yes |
- Inode vs. Space quota (PROJECT and SCRATCH)
- dynamic quota per group (SCRATCH)
- Compute nodes do not have local storage.
- Archive space is on HPSS, and is not accessible on the Trillium login, compute, or datamover nodes.
- Backup means a recent snapshot, not an archive of all data that ever was.
$BBUFFERstands for Burst Buffer, a faster parallel storage tier for temporary data.
How much disk space do I have left?¶
The diskUsage command, available on the login nodes and datamovers, provides information in a number of ways on the home, scratch, project and archive filesystems. For instance, how much disk space is being used by yourself and your group (with the -a option), or how much your usage has changed over a certain period ("delta information") or you may generate plots of your usage over time. Please see the usage help below for more details.
Usage: diskUsage [-h|-?| [-a] [-u <user>]
-h|-?: help
-a: list usages of all members on the group
-u <user>: as another user on your group
Did you know that you can check which of your directories have more than 1000 files with the /scinet/niagara/bin/topUserDirOver1000list command and which have more than 1GB of material with the /scinet/niagara/bin/topUserDirOver1GBlist command?
Note: information on usage and quota is only updated every 3 hours!
Scratch disk purging policy¶
In order to ensure that there is always significant space available for running jobs we automatically delete files in /scratch that have not been accessed or modified for more than 2 months by the actual deletion day on the 15th of each month. Note that we recently changed the cut out reference to the MostRecentOf(atime,ctime). This policy is subject to revision depending on its effectiveness. More details about the purging process and how users can check if their files will be deleted follows. If you have files scheduled for deletion you should move them to more permanent locations such as your departmental server or your /project space or into HPSS (for PIs who have either been allocated storage space by the RAC on project or HPSS).
On the first of each month, a list of files scheduled for purging is produced, and an email notification is sent to each user on that list. You also get a notification on the shell every time your login to Trillium. Furthermore, at/or about the 12th of each month a 2nd scan produces a more current assessment and another email notification is sent. This way users can double check that they have indeed taken care of all the files they needed to relocate before the purging deadline. Those files will be automatically deleted on the 15th of the same month unless they have been accessed or relocated in the interim. If you have files scheduled for deletion then they will be listed in a file in /scratch/t/todelete/current, which has your userid and groupid in the filename. For example, if user xxyz wants to check if they have files scheduled for deletion they can issue the following command on a system which mounts /scratch (e.g. a scinet login node): ls -1 /scratch/t/todelete/current |grep xxyz. In the example below, the name of this file indicates that user xxyz is part of group abc, has 9,560 files scheduled for deletion and they take up 1.0TB of space:
[xxyz@nia-login03 ~]$ ls -1 /scratch/t/todelete/current |grep xxyz
-rw-r----- 1 xxyz root 1733059 Jan 17 11:46 3110001___xxyz_______abc_________1.00T_____9560files
The file itself contains a list of all files scheduled for deletion (in the last column) and can be viewed with standard commands like more/less/cat - e.g. more /scratch/t/todelete/current/3110001___xxyz_______abc_________1.00T_____9560files
Similarly, you can also verify all other users on your group by using the ls command with grep on your group. For example: ls -1 /scratch/t/todelete/current |grep abc. That will list all other users in the same group that xxyz is part of, and have files to be purged on the 15th. Members of the same group have access to each other's contents.
Note
Preparing these assessments takes several hours. If you change the access/modification time of a file in the interim, that will not be detected until the next cycle. A way for you to get immediate feedback is to use the ls -lu command on the file to verify the ctime and ls -lc for the mtime. If the file atime/ctime has been updated in the meantime, coming the purging date on the 15th it will no longer be deleted.
Moving data¶
Data for analysis and final results need to be moved to and from Trillium. There are several ways to accomplish this.
Using rsync/scp¶
Move amounts less than 10GB through the login nodes.
- Trillium login nodes and datamovers are visible from outside SciNet.
- Use
scporrsyncto connect to any oftri-dm{2,3,4}.scinet.utoronto.ca - This will time out for amounts larger than about 10GB.
Move amounts larger than 10GB through the datamover nodes.
- From a Trillium login node,
sshtonia-datamover1ornia-datamover2. From there you can transfer to or from Trillium. - Alternatively, you may also login/scp/rsync directly to the datamovers from the outside:
nia-datamover1.scinet.utoronto.cania-datamover2.scinet.utoronto.ca - If you do this often, consider using Globus, a web-based tool for data transfer.
Using Globus¶
Please check the comprehensive documentation here and here.
The Trillium "endpoint" for globus is " alliancecan#trillium "
The HPSS "endpoint" for globus is " alliancecan#hpss "
Moving data to HPSS/Archive/Nearline¶
HPSS is for long-term storage of data.
- HPSS is a tape-based storage solution, and is SciNet's nearline a.k.a. archive facility
- Storage space on HPSS is allocated through the annual Resource Allocation Competition.
File ownership management and access control lists¶
- By default, at SciNet, users within the same group already have read permission to each other's files (not write)
- You may use access control list (ACL) to allow your supervisor (or another user within your group) to manage files for you (i.e., create, move, rename, delete), while still retaining your access and permission as the original owner of the files/directories. You may also let users in other groups or whole other groups access (read, execute) your files using this same mechanism.
Using mmputacl/mmgetacl¶
- You may use GPFS' native
mmputaclandmmgetaclcommands. The advantages are that you can set "control" permission and that POSIX or NFS v4 style ACL are supported. You will need first to create a/tmp/supervisor.aclfile with the following contents:
user::rwxc
group::----
other::----
mask::rwxc
user:[owner]:rwxc
user:[supervisor]:rwxc
group:[othegroup]:r-xc
Then issue the following two commands:
1) $ mmputacl -i /tmp/supervisor.acl /project/g/group/[owner]
2) $ mmputacl -d -i /tmp/supervisor.acl /project/g/group/[owner]
(every *new* file/directory inside [owner] will inherit [supervisor] ownership by default as well as
[owner] ownership, ie, ownership of both by default, for files/directories created by [supervisor])
$ mmgetacl /project/g/group/[owner]
(to determine the current ACL attributes)
$ mmdelacl -d /project/g/group/[owner]
(to remove any previously set ACL)
$ mmeditacl /project/g/group/[owner]
(to create or change a GPFS access control list)
(for this command to work set the EDITOR environment variable: export EDITOR=/usr/bin/vi)
NOTES:
* There is no option to recursively add or remove ACL attributes using a GPFS built-in command to existing files. You'll need to use the -i option as above for each file or directory individually. Here is a sample bash script you may use for that purpose.
-
mmputaclwill not overwrite the original Linux group permissions for a directory when copied to another directory already with ACLs, hence the "#effective:r-x" note you may see from time to time withmmgetacf. If you want to give rwx permissions to everyone in your group you should simply rely on the plain unix 'chmod g+rwx' command. You may do that before or after copying the original material to another folder with the ACLs. -
In the case of PROJECT, your group's supervisor will need to set proper ACL to the
/project/G/GROUPlevel in order to let users from other groups access your files. -
ACL's won't let you give away permissions to files or directories that do not belong to you.
-
We highly recommend that you never give write permission to other users on the top level of your home directory (
/home/G/GROUP/[owner]), since that would seriously compromise your privacy, in addition to disable ssh key authentication, among other things. If necessary, make specific sub-directories under your home directory so that other users can manipulate/access files from those.
For more information on using mmputacl or mmgetacl see their man pages.
Recursive ACL script¶
You may use/adapt this sample bash script to recursively add or remove ACL attributes using GPFS built-in commands
Courtesy of Agata Disks (http://csngwinfo.in2p3.fr/mediawiki/index.php/GPFS_ACL)