Introduction to the New Human Genetics Compute Cluster (HGCC)

HGCC Cluster Structure

[Figure: HGCC cluster architecture]

Login to HGCC

  1. Log in to the Emory VPN.

  2. Log in using ssh by typing the following command in a bash shell terminal, with your Emory NetID and Emory password. Replace <Emory_NetID> with your Emory NetID:
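
    ssh <Emory_NetID>@hgcc.emory.edu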

  3. Set up SSH login without typing a password: repeat this setup on each of your local computers. It is highly recommended to set this up first; it will save you a lot of effort.

    1. Generate a pair of authentication keys under the home directory on your local computer: ssh-keygen -t rsa.

      • Your identification has been saved in ~/.ssh/id_rsa.
      • Your public key has been saved in ~/.ssh/id_rsa.pub.
    2. Run the following command on the remote HGCC to create the directory ~/.ssh under your HGCC home directory:

      • mkdir -p ~/.ssh
    3. Now run the following command on your local computer to append your local authentication key to ~/.ssh/authorized_keys on HGCC:

      • ssh-copy-id <Emory_NetID>@hgcc.emory.edu
      • If ssh-copy-id does not work, just run the following command on your local computer: cat ~/.ssh/id_rsa.pub | ssh <Emory_NetID>@hgcc.emory.edu 'cat >> ~/.ssh/authorized_keys'.
      • Enter your Emory password if you are prompted for it.
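
After this setup, running ssh <Emory_NetID>@hgcc.emory.edu from your local computer should log you in without a password prompt.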

Set up your ~/.bashrc file
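
A minimal sketch of common ~/.bashrc additions, drawn from the alias and PATH examples elsewhere on this page (adjust to your own workflow):

alias c='clear' # shortcut for the clear command
export PATH="$HOME/.local/bin:$PATH" # make executables installed under ~/.local/bin available in every session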

Mount Remote Cluster Home Directory to Local Computer

Visual Studio Code for both Windows and Mac systems (highly recommended).

  1. Install Visual Studio Code with Remote-SSH extension.

  2. Set up the SSH configuration file under your local home directory (see the sketch after this list). Type pwd in your local shell window to show your local home directory.

  3. Press Shift + Command + P on Mac (Ctrl + Shift + P on Windows), or type > in the search box at the top of your VSCode window, to open the command palette.

  4. See more information at Extension Documentation.
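
A minimal ~/.ssh/config entry for the Remote-SSH extension might look like the following sketch (the host alias hgcc is just an example name):

Host hgcc
    HostName hgcc.emory.edu
    User <Emory_NetID>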

Other useful extensions on VSCode: Python, Jupyter, Python Debugger, C/C++, PowerShell, Markdown All in One, vscode-pdf, Rtools, GitHub Copilot Chat

Other choices for mounting remote clusters
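
For example, if sshfs is installed on your local computer, you can mount your HGCC home directory to a local folder (the mount point ~/hgcc is just an example name):

mkdir -p ~/hgcc # create a local mount point
sshfs <Emory_NetID>@hgcc.emory.edu: ~/hgcc # mount your remote home directory at ~/hgcc
# to unmount, typically: umount ~/hgcc (Mac) or fusermount -u ~/hgcc (Linux)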

Now you can access all files on HGCC through the mounted local directory.

Transfer Data Files to/from Cluster

  1. Command rsync is recommended

    rsync [options] [SOURCE] [DESTINATION]

  2. Command scp can also do the job
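
For example (file and directory names below are placeholders; the hostname is the same one used for login above):

rsync -avz [local_file] <Emory_NetID>@hgcc.emory.edu:[remote_directory]/ # upload a local file to the cluster (-a archive, -v verbose, -z compress)
rsync -avz <Emory_NetID>@hgcc.emory.edu:[remote_file] [local_directory]/ # download a file from the cluster
scp [local_file] <Emory_NetID>@hgcc.emory.edu:[remote_directory]/ # scp uses a similar SOURCE DESTINATION syntax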

Useful Basic Linux Commands for Handling Data Files on Cluster

  1. Command rsync is recommended for moving data files within the cluster, between your local computer and the cluster, or between clusters.
  2. Command cp also copies data.
  3. Delete data or directory by rm.
  4. Make directories by mkdir.
  5. Move a directory by mv.
  6. List all files under a directory by ls.
  7. List all files with their sizes by ls -l -h.
  8. Use vi or nano to edit text files on the cluster. It is recommended to edit text files through the locally mounted directory.
  9. Read text file on the cluster by less, cat. less -S is recommended for viewing large text files on cluster.
  10. Consider compressing your text files with gzip to save space.
  11. Unzip gzipped text file by gunzip [file_name.gz].
  12. Use tar for archiving and extracting directories (see the example below).
  13. Command man to see user manual/instructions of Linux tools, e.g., man ls .
  14. Use the pipe | to take the output of the command before | as the input of the command after |. For example, cat xx_file.txt | head will print the first 10 rows of xx_file.txt in the bash window.
  15. Create aliases with alias to set up your own command shortcuts in the ~/.bashrc file. See Examples. Example alias commands, such as alias c='clear', can be seen in /home/jyang51/.bashrc; this sets up c as a shortcut for the command clear.

The command man [command] shows the help page for Linux commands, e.g., man rsync. Type q to exit the help page.
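
For example, a sketch of common tar usage (archive and directory names are placeholders):

tar -czvf [archive_name].tar.gz [directory_name]/ # compress a directory into a gzipped tar archive
tar -xzvf [archive_name].tar.gz # extract the archive into the current directory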

Strategies for Using HGCC

  1. Only use the headnode to log in and submit jobs. Never run large computation jobs on the headnode.
  2. Log in to the data transfer node by ssh dtn01 with your Emory NetID and Emory password for large data transfers.
  3. Log in to an interactive compute node with the command srun -N 1 -n 1 --pty --preserve-env bash.
  4. Wrap your jobs in shell scripts and submit them to compute nodes with sbatch from the headnode (see the example job script after this list).
  5. Use the scratch space (84 TB) /scratch/ to avoid extensive I/O between compute node memory and the storage disks.
    1. Create a temporary working directory under the scratch directory, mkdir -p /scratch/tmp_jyang51/, with a unique directory name such as tmp_jyang51.
    2. Copy data into the temporary working directory in the scratch space, rsync [path_to_data_file]/[data_file_name] /scratch/tmp_jyang51/.
    3. Write output files into the temporary working directory, /scratch/tmp_jyang51/.
    4. Copy output files back to your storage, rsync /scratch/tmp_jyang51/ [path_to_output_files]/.
    5. Delete the temporary working directory, rm -rf /scratch/tmp_jyang51.
    6. All files in the scratch space will be deleted after 30 days.
  6. Keep your working directories organized and clean.
    1. Do not make another copy of the same data unless you need to make changes.
    2. Delete files that you will no longer use.
  7. Write a README.md file for each data directory.
  8. Only request up to 7 nodes if you are submitting a large number of jobs, by specifying the requested node list: sbatch --nodelist node[01-07].
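
A minimal sketch of a job script following the scratch-space workflow above (the job name, log file name, and placeholder paths are examples, and %j in the log file name is the SLURM job ID; replace the analysis step with your own commands):

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=example_job_%j.log

# create a unique temporary working directory in the scratch space
mkdir -p /scratch/tmp_jyang51/
# copy input data into scratch
rsync [path_to_data_file]/[data_file_name] /scratch/tmp_jyang51/
# ... run your analysis here, writing output files to /scratch/tmp_jyang51/ ...
# copy output files back to your storage and clean up
rsync /scratch/tmp_jyang51/ [path_to_output_files]/
rm -rf /scratch/tmp_jyang51

Submit the script from the headnode with sbatch [script_name].sh.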

Using Software on HGCC

Check whether a software package is available on the cluster

  1. Use command spack find -x to list all installed software modules on HGCC.
  2. Use command spack find --loaded to list all loaded software modules.
  3. Check whether an executable file of an installed or loaded software package is available under ${PATH} by which [software_name].
  4. Add a symbolic link to the executable file under a directory in ${PATH} to avoid typing the full path of the executable file.
export PATH="$HOME/.local/bin:$PATH" # add the local bin directory to PATH; add this line to your ~/.bashrc file to avoid running it in each session
echo $PATH; # show current PATH directories
cd ~/.local/bin; # go to the local bin directory
ln -s [directory_to_executable_file]; # create a symbolic link to the software executable

Using software installed on the cluster by spack

  1. Load a software module into the current session by spack load [software_name].
  2. Type command R in the current session to start an R session after loading R by spack load r@4.4.0.
  3. Type command python in the session to start a python session after loading Anaconda3 by spack load anaconda3@2023.09-0.
  4. One can open scripts on HGCC with a local text/script editor (e.g., Sublime, Visual Studio Code) after mounting the cluster directory to the local machine, and copy/paste commands into these bash, R, or Python sessions.
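
For example, a typical session with an installed module might look like this (module name taken from the list above):

spack load r@4.4.0 # load the R 4.4.0 module into the current session
which R # confirm the R executable is now on PATH
R # start an interactive R session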

Install software without root access

  1. Install software under your home directory, e.g., ~/.local/, or your working directory on the lab data drive.
  2. Create a virtual Python environment with conda to install Python libraries inside that environment (see the sketch below).
  3. The total number of files is limited to 1,000,000 per group. Python libraries, environment files, and R library files can easily exceed this limit for a group. Once the limit is reached, logins for all group members may fail or lag.
  4. Submit a ticket through the HGCC website to request that a software package be installed by IT.

It is recommended to create Python environments on your lab data drive and install software to your lab data drive.
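
A minimal sketch, assuming Anaconda3 is loaded via spack as above; [lab_data_drive_path] and the environment name myenv are placeholders:

spack load anaconda3@2023.09-0 # load Anaconda3 to make conda available
conda create --prefix [lab_data_drive_path]/envs/myenv python=3.11 # create the environment on the lab data drive
conda activate [lab_data_drive_path]/envs/myenv # activate the environment by its full path
conda install [python_package] # libraries now install under the lab data drive, not your home directory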

Share installed software with the lab

cd /nfs/yangfss2/data/shared/bin # go to the shared bin directory on the lab data drive
ln -s [tool_directory] # create a symbolic link to the installed tool so lab members can use it

Monitoring storage usage

  1. Check storage space for all data drives on HGCC by df -h
  2. Check space used by current working directory by du -h -d1
  3. Check the number of files in a directory by find [directory_path] -type f | wc -l.

Additional Resources