The Emory Hybrid High-Performance Computing Platform for Education and Research Community Cloud Cluster (HyPER C3) is a distributed computing system hosted on AWS at Emory. The cluster is designed to provide researchers with the computational resources they need to conduct research. The HyPER C3 cluster is HIPAA compliant, meaning it is designed to protect the privacy and security of patient health information.
A PI can request a new user account through the Application Form.
Once granted access to a cluster, a user must log in within 60 days, or their account will be automatically inactivated; there is a 30-day grace period before automatic deactivation. Account data for inactive users is automatically deleted 90 days from their last login (the 60-day account expiration plus the 30-day grace period). The account Sponsor will be notified before this happens.
First, connect to the Emory VPN, even if you are inside the Emory network.
Use the CA-Signed Host Key (only needs to be set up once).
Download the CA Public Key: download the ssh_ca.pub file and save it to your home directory.
Add the CA Key to Your Known Hosts. Please follow the instructions based on your operating system:
🔹 For Windows Users: Each known_hosts entry must be a single line of the form @cert-authority <hostname> <contents of ssh_ca.pub>. Open Command Prompt and run the following commands to append those lines (adjust the path as needed):
for /f "usebackq delims=" %k in ("C:\Users\YourUsername\ssh_ca.pub") do echo @cert-authority cirrostratus.it.emory.edu %k>> %USERPROFILE%\.ssh\known_hosts
for /f "usebackq delims=" %k in ("C:\Users\YourUsername\ssh_ca.pub") do echo @cert-authority ondemand.it.emory.edu %k>> %USERPROFILE%\.ssh\known_hosts
🔹 For Mac/Linux Users: Open a terminal and run the following commands (adjust the path as needed):
echo "@cert-authority cirrostratus.it.emory.edu $(cat /path/to/ssh_ca.pub)" >> ~/.ssh/known_hosts
echo "@cert-authority ondemand.it.emory.edu $(cat /path/to/ssh_ca.pub)" >> ~/.ssh/known_hosts
Log in using ssh by typing the following command in your terminal:
ssh <your_Emory_NetID>@cirrostratus.it.emory.edu
or ssh <your_Emory_NetID>@ondemand.it.emory.edu
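Optionally, you can add an entry to your local ~/.ssh/config so that you do not have to type the full hostname and NetID each time (the alias name hyperc3 below is arbitrary):
Host hyperc3
    HostName cirrostratus.it.emory.edu
    User <your_Emory_NetID>
After saving the file, ssh hyperc3 is enough to log in.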
The cluster includes both CPU and GPU nodes and currently has 11 partitions (OnDemand + Reserved). All compute partitions have a wall time limit of 7 days; the default job length is 3 days.
Login node: The login node is where cluster users log in and submit jobs from.
Resource requirements: Users must specify the number of GPUs, CPU cores, and memory required for their jobs. Slurm uses this information to schedule jobs and ensure they do not exceed the available resources (see the example after this overview).
Default partition: If a user does not specify a partition, their job will be submitted to the default partition "c64-m512". This is a general-purpose partition suitable for most jobs.
GPU partition: The GPU partition is only available for jobs that require GPUs. Jobs submitted to this partition will be scheduled on nodes that have GPUs available. Jobs that require GPUs cannot be submitted to non-GPU partitions.
GPU hours: The maximum number of GPU hours per user is 500 for the A100 and L40s partitions. This limit is in place to ensure that all users have fair access to the GPU resources.
Head node: The head node is the main node of the cluster. It is not intended for running jobs and can only be accessed by cluster admins.
Jobs are managed by SLURM.
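For example, an interactive session that specifies GPUs, CPU cores, and memory might be requested as follows (the partition name and resource values are only illustrative; pick a partition from the list on this page):
srun --partition=a10g-1-gm24-c32-m128 --gpus=1 --cpus-per-task=4 --mem=16G --time=2:00:00 --pty bash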
/users: Default home directory. This is a personal directory used to store user-specific files such as configuration files, login scripts, and personal data. Limited to 50GB, and purged when the user account becomes inactive and is deactivated. Backed up nightly.
/scratch: Uses FSx for Lustre storage. Users should use it for input/output data storage. Limited to 1TB per user. The directory itself is permanent, but its contents are purged after 60 days; it is strongly recommended to remove unused data after 30 days. Note that contents in /scratch directories are not backed up and cannot be restored administratively, so /scratch should only hold data that can be easily re-created or backed up someplace else.
/group: Shared storage accessible to all group members. It is used to store files shared by multiple users, such as data sets, software libraries, and documentation. This directory is permanent, with 1TB of storage provided for free per group; excess storage is billed to the sponsor of the group. Not backed up. Users have read and write access to their own files unless otherwise specified, and read-only access to their group's files unless given specific permission.
For security purposes, use the /users directory rather than the /group directory.
To check usage of the /users directory: du -sh
To check usage of the /scratch directory: lfs quota -u [netid] /scratch
To check usage of the /group directory: du -sh /group/[group directory]
The use of the community compute partitions, the home storage, and the scratch storage is at no cost to users. Group share storage over 1TB per Sponsor User (i.e., PI) will be charged at $0.30 per GB-month.
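For example, under this rate a group storing 1.5TB in its group share for a month would be billed only for the roughly 500GB above the free 1TB: 500 GB-months x $0.30 per GB-month is about $150 for that month (treating 1TB as 1000GB for illustration).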
Mount a cluster directory to your local computer with sshfs. For example, to mount the remote home directory /users/jyang51/ to the local mount directory /Users/jyang51/HYPERC3/:
mkdir -p /Users/jyang51/HYPERC3/ # create the mount directory on your local computer
sshfs jyang51@cirrostratus.it.emory.edu:/users/jyang51/ /Users/jyang51/HYPERC3/ -o auto_cache -ovolname=HYPERC3 -o follow_symlinks
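When you are done, unmount the directory; for example, on macOS:
umount /Users/jyang51/HYPERC3/
(On Linux, fusermount -u <mount_point> unmounts an sshfs mount instead.)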
The rsync command is recommended for transferring data:
rsync [options] [SOURCE] [DESTINATION]
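For example, a hypothetical transfer of a local directory to your scratch space on the cluster could look like:
rsync -avz --progress ./mydata/ [netid]@cirrostratus.it.emory.edu:/scratch/[netid]/mydata/ # -a preserves attributes, -v is verbose, -z compresses during transfer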
The general form of an scp command is as follows:
scp <path_to_file> <username>@<remote_system_name>:<destination_path>
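For example (the file name and destination path are placeholders):
scp results.tar.gz [netid]@cirrostratus.it.emory.edu:/scratch/[netid]/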
SFTP
sftp [netid]@cirrostratus.it.emory.edu
cd /scratch/[netid] # change to your scratch directory on the cluster
put [filename] # upload a file from your local machine
get [filename] # download a file to your local machine
Globus Data Transfer. To gain access to Globus / HySci, please email the Network Team at hysci.help@emory.edu.
AWS CLI v2
aws s3 cp [filename] s3://mybucket/[filename] # upload a file to an S3 bucket
aws s3 cp s3://mybucket/[filename] [filename] # download a file from an S3 bucket
man [command] gives the help page for a command, e.g., man rsync. Type q to exit from the help page.
rsync is recommended for copying data on the cluster.
cp also copies data within the cluster.
rm removes files or directories.
mkdir creates a directory.
mv moves or renames files.
ls lists directory contents; ls -l -h shows details with human-readable file sizes.
vi or nano to edit text files on the cluster.
less and cat to view text files.
gzip to compress files; zcat [file_name.gz] | less -S to view a compressed file without decompressing it.
man to see the user manual/instructions of Linux tools, e.g., man ls.
| (the pipe) takes the output of the command before | as the input for the command after |.
awk is highly recommended for handling large text files.
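For example, these tools can be combined through the pipe to inspect a large compressed file without writing a decompressed copy to disk (the file name and column numbers are hypothetical):
zcat mydata.txt.gz | awk '$3 > 100 {print $1, $3}' | less -S # print columns 1 and 3 of rows where column 3 exceeds 100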
tmux keeps terminal sessions alive across disconnects; see the tmux cheat sheet.
Available software: Anaconda (Conda), Spack, Singularity, R, Python, TensorFlow, PyTorch, Go, Apache MXNet, Nvidia CUDA 11
which [software_executable_name] shows whether a software executable is available on your path.
conda create -n myenv # create a new environment named myenv
conda env list # list existing environments
conda activate myenv # activate the environment
conda deactivate # leave the environment
conda env remove -n myenv # delete the environment
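A minimal sketch of building and testing an environment with a specific Python version and packages (the package list is only an example):
conda create -n myenv python=3.10 numpy pandas # create the environment with packages
conda activate myenv
python -c "import numpy; print(numpy.__version__)" # confirm the packages are visible
conda deactivate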
Run software by typing its command in the terminal.
Type R to start an R session. You may copy and paste your R commands from a text script into the session window. Since there is no graphical display over SSH, save plots to a file, e.g., ggsave("temp.pdf") or:
pdf("temp.pdf");
plot();
dev.off();
Type the command python in the session to start a Python session.
You can open a script stored on the cluster with a local editor (e.g., Sublime Text, Visual Studio Code) after mounting the cloud directory to your local machine, and copy/paste commands into these R or Python sessions.
Install user-level software and packages under ~/.local/ or in .conda under a virtual environment.
Set up aliases in the .bashrc file under your home directory to save typing of commands (see the alias sketch after this list).
Avoid using /users as a data input/output directory; use /scratch as temporary space for input/output data files.
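For example, a hypothetical alias in ~/.bashrc shortens checking your own jobs:
alias myjobs='squeue -u $USER' # afterwards, typing myjobs lists only your jobs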
Use sbatch to submit jobs. Put your commands and #SBATCH options in a wrapper shell (job submission) script; for example, you may use sbatch normal.sh with the following normal.sh script:
#!/bin/bash
#SBATCH --job-name=normal.R
#SBATCH --account=general # Use account='a100v100' for A100 GPUs
#SBATCH --nodes=1 # Number of nodes requested
#SBATCH --ntasks=1 # Number of tasks
#SBATCH --partition=a10g-1-gm24-c32-m128 # Name of the partition
#SBATCH --time=4:00:00 # Request 4 hours of run time; default is 3 days, maximum wall time is 7-00:00:00
## This puts all output files in a separate directory.
#SBATCH --output=Out/normal.%A_%a.out
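## Note: the Out/ and Err/ directories must already exist before submission
## (e.g., run mkdir -p Out Err once); Slurm will not create them automatically.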
#SBATCH --error=Err/normal.%A_%a.err
#SBATCH --gpus=1 # Number of GPUs needed
#SBATCH --mem=8G # Memory Needed
#SBATCH --mail-type=begin # send mail when job begins
#SBATCH --mail-type=end # send mail when job ends
#SBATCH --mail-type=fail # send mail if job fails
#SBATCH --mail-user=[netid]@emory.edu # Use your Emory email address only
## Submitting 11 array tasks (indices 0-10) of the commands listed below
#SBATCH --array=0-10
## The following are your commands
conda init bash > /dev/null 2>&1 #initializes Conda for Bash shell
source ~/.bashrc
conda activate myenv
python example.py
Rscript /users/<netid>/normal.R
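Because the script requests a job array, each array task can pick its own input via the SLURM_ARRAY_TASK_ID environment variable; a minimal sketch (the file naming and the --input option are hypothetical):
python example.py --input data_${SLURM_ARRAY_TASK_ID}.txt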
squeue displays information about the job queue. Useful options:
-a: Displays jobs in all partitions.
--help: Displays the help message.
-l: Reports more of the available information for the selected jobs.
-u [netid]: Displays only jobs by the specified user.
scancel [jobid] is used to cancel or kill a job.
scontrol is used to show information about running or pending jobs:
scontrol show job [jobid] to show system details of a submitted job
scontrol show node [node_name] to show node information
scontrol show partition [partition_name] to show partition configurations
srun runs a parallel job on the cluster managed by SLURM. To open an interactive shell on the compute node of a running job, use srun --jobid [jobid] --pty --preserve-env bash. After logging in to the compute node, use nvidia-smi to track GPU utilization, and type exit to leave the compute node.
sinfo is used to report the state of the cluster partitions and nodes.
CPU partitions are named c<no. of vCPU cores>-m<amount of memory>; for example, c64-m512 has 64 vCPU cores and 512 GB of memory.
GPU partitions are named <name of the GPU>-<no. of GPUs>-gm<amount of GPU memory>-c<no. of vCPU cores>-m<amount of memory>; for example, a10g-1-gm24-c32-m128 provides one A10G GPU with 24 GB of GPU memory, 32 vCPU cores, and 128 GB of memory.
Avoid installing Jupyter Notebook: Installing Jupyter Notebook has been restricted in the current upgrade.
Use VSCode for editing purposes only: VSCode can be installed on the login node for accessing files, and users can also use the VSCode terminal to submit jobs.
Avoid running jobs on Login Node.