Introduction to the New Human Genetics Compute Cluster (HGCC)

HGCC Cluster Structure

(Figure: HGCC cluster architecture diagram)

Login to HGCC

  1. Log in to the Emory VPN.

  2. Log in using ssh by typing the following command in the terminal, with your Emory NetID and Emory password: ssh <Emory_Net_ID>@hgcc.emory.edu

  3. SSH login without typing a password: repeat this setup on each of your local computers (see the sketch below).
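One common way to set up key-based login is sketched below; the ed25519 key type is an assumption and can be swapped for another supported type:

ssh-keygen -t ed25519 # generate an SSH key pair on your local machine (accept the default file location)
ssh-copy-id <Emory_Net_ID>@hgcc.emory.edu # copy the public key to your HGCC account; enter your Emory password once
ssh <Emory_Net_ID>@hgcc.emory.edu # subsequent logins should no longer prompt for a password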

Mount Remote Cluster Home Directory to Local Computer

Use WinSCP on Windows systems

Use sshfs on macOS

  1. Install the latest SSHFS.
  2. Install the latest macFUSE.
  3. Mount the remote HGCC home directory /home/<jyang51>/ to the local directory ~/HGCC/. Replace <jyang51> with your Emory NetID. sshfs <jyang51>@hgcc.emory.edu:/home/<jyang51>/ ~/HGCC/ -o auto_cache -ovolname=HGCC -o follow_symlinks
  4. Now you can access all files on HGCC through the mounted local directory.
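When finished, the mounted directory can be detached; a minimal sketch assuming the ~/HGCC mount point used above:

umount ~/HGCC # detach the mounted HGCC home directory
diskutil unmount ~/HGCC # alternative on macOS if the mount is reported as busy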

Transfer Data Files to/from Cluster

  1. Command rsync is recommended

    rsync [options] [SOURCE] [DESTINATION]

  2. Command scp can also do the job; see the examples below.
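Hedged examples of both commands, run from your local machine; the file and directory names are placeholders:

rsync -avz ./my_data.txt <Emory_Net_ID>@hgcc.emory.edu:/home/<Emory_Net_ID>/ # upload a local file to your HGCC home directory
rsync -avz <Emory_Net_ID>@hgcc.emory.edu:/home/<Emory_Net_ID>/results/ ./results/ # download a directory from HGCC (-a preserves attributes and recurses, -v is verbose, -z compresses in transit)
scp ./my_data.txt <Emory_Net_ID>@hgcc.emory.edu:/home/<Emory_Net_ID>/ # the equivalent single-file copy with scp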

Useful Basic Linux Commands for Handling Data Files on Cluster

  1. Command rsync is recommended for moving data files within the cluster, between a local machine and the cluster, or between clusters.
  2. Command cp also copies data.
  3. Delete data or directory by rm.
  4. Make directories by mkdir.
  5. Move a directory by mv.
  6. List all files under a directory by ls.
  7. List all files with their sizes by ls -l -h.
  8. Use vi or nano to edit text files on the cluster. Editing text files through the locally mounted directory is recommended.
  9. Read text files on the cluster with less or cat. less -S is recommended for viewing large text files on the cluster.
  10. Consider compressing your text files with gzip to save space.
  11. Unzip gzipped text file by gunzip [file_name.gz].
  12. Use tar for zipping and unzipping directories (see the examples after this list).
  13. Command man to see the user manual/instructions of Linux tools, e.g., man ls.
  14. Use the pipe | to take the output of the command before | as the input for the command after |. For example, cat xx_file.txt | head prints the top 10 rows of xx_file.txt in the terminal.
  15. Create aliases with alias to set up your own command shortcuts in the ~/.bashrc file. Example alias commands, such as alias c='clear', can be seen in /home/jyang51/.bashrc; this sets up c as a shortcut for the command clear.

Command man [command] gives the help page for a Linux command, e.g., man rsync. Type q to exit the help page.
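A few worked examples of the compression and archiving commands above; file and directory names are placeholders:

gzip xx_file.txt # compress to xx_file.txt.gz
gunzip xx_file.txt.gz # decompress back to xx_file.txt
tar -czvf my_dir.tar.gz my_dir/ # pack and compress a directory into a .tar.gz archive
tar -xzvf my_dir.tar.gz # unpack the archive again
zcat xx_file.txt.gz | head # pipe the decompressed content to head to view the top 10 rows without unzipping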

Strategies for Using HGCC

  1. Only use the head node to download/move data and submit jobs. Never run large computation jobs on the head node.
  2. Log in to an interactive compute node with the command srun -N 1 -n 1 --pty bash.
  3. Wrap your jobs in shell scripts and submit them to compute nodes with sbatch from the head node.
  4. Use the scratch space (84TB) under /scratch/ to avoid extensive I/O between compute node memory and the storage disks (a consolidated sketch follows this list).
    1. Create a temporary working directory under the scratch directory, mkdir -p /scratch/tmp_jyang51/, with a unique directory name tmp_jyang51.
    2. Copy data into the temporary working directory in the scratch space, rsync [path_to_data_file]/[data_file_name] /scratch/tmp_jyang51/.
    3. Write output files into the temporary working directory, /scratch/tmp_jyang51/.
    4. Copy output files back to your storage, rsync /scratch/tmp_jyang51/ [path_to_output_files]/.
    5. Delete the temporary working directory, rm -rf /scratch/tmp_jyang51.
    6. All files in the scratch space are deleted after 30 days.
  5. Do not make another copy of the same data unless you need to make changes.
  6. Delete files that you no longer use.
  7. Keep your working directories organized and clean.
  8. Write README.md file for each data directory.
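A consolidated sketch of the scratch workflow described above, as it might appear inside a job script; the tmp_jyang51 directory name and bracketed paths follow the placeholders used in the list:

mkdir -p /scratch/tmp_jyang51/ # create a uniquely named temporary working directory
rsync -av [path_to_data_file]/[data_file_name] /scratch/tmp_jyang51/ # stage the input data into scratch
cd /scratch/tmp_jyang51/ # run the computation here and write output files locally
# ... run your analysis commands ...
rsync -av /scratch/tmp_jyang51/ [path_to_output_files]/ # copy results back to permanent storage (-a recurses into the directory)
rm -rf /scratch/tmp_jyang51 # delete the temporary working directory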

Using Software on HGCC

Check if software is available on the cluster

  1. Use command spack find -x to list all installed software modules on HGCC.
  2. Use command spack find --loaded to list all loaded software modules.
  3. Check if an executable file of an installed or loaded software is available under ${PATH} by which [software_name].
  4. Add a symbolic link of the executable file under a directory in ${PATH} to avoid typing the full path of the executable file.
export PATH="~/.local/bin:$PATH" # add local directory to PATH. add this to ~/.bashrc file to avoid running this for each session.
echo $PATH; # show current PATH directories
cd ~/.local/bin; #
ln -s [directory_to_executible_file]; # create symbolic link to the software

Using software installed on the cluster by spack

  1. Load a software module into the current session by spack load [software_name].
  2. Type command R in the current session to start an R session after loading R by spack load r@4.4.0.
  3. Type command python in the session to start a python session after loading Anaconda3 by spack load anaconda3@2023.09-0.
  4. One can open text scripts on HGCC with a local text/script editor (e.g., Sublime Text, Visual Studio Code) after mounting the cluster home directory to the local machine, and copy/paste commands into these bash, R, or python sessions.
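For example, a typical interactive session with R might look like the following; the module version follows the example above:

spack find -x | grep r@ # confirm which R versions are installed
spack load r@4.4.0 # load R 4.4.0 into the current session
which R # verify that the R executable is now on $PATH
R # start an interactive R session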

Install software without root access

  1. Install software under your user home directory, e.g., ~/.local/. Include ~/.local/bin/ in the $PATH environment variable. Create a symbolic link to the software executable file under ~/.local/bin/.
  2. Create a virtual Python environment with conda and install Python libraries into that environment (see the sketch after this list).
  3. Submit a ticket through the HGCC website to request that software be installed by IT.
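A minimal sketch of option 2, assuming Anaconda3 is loaded via spack as above; the environment name my_env and the packages are placeholders (you may need to run conda init bash once before conda activate works):

spack load anaconda3@2023.09-0 # make conda available in the session
conda create -n my_env python=3.11 # create a named virtual environment
conda activate my_env # switch into the environment
pip install numpy pandas # install Python libraries into this environment only
conda deactivate # leave the environment when finished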

Share installed software with the lab

export PATH="/nfs/yangfss2/data/shared/bin:$PATH"; # include this line of command in your `~/.bashrc` file to automatically run this for each session.
cd /nfs/yangfss2/data/shared/bin
ln -s [tool_directory]

Set up your ~/.bashrc file
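A minimal sketch of what this file might contain, collecting the PATH exports and alias examples mentioned elsewhere in this guide; adjust the paths to your own account and lab share:

# ~/.bashrc: run for each new bash session on HGCC

# Personal and lab-shared executables
export PATH="$HOME/.local/bin:$PATH"
export PATH="/nfs/yangfss2/data/shared/bin:$PATH"

# Convenience aliases
alias c='clear'
alias ll='ls -l -h'

# Optionally load frequently used software modules automatically
# spack load r@4.4.0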

Submit Jobs by SLURM

Basic SLURM commands
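Commonly used SLURM commands; job_script.sh and <job_id> are placeholders:

sbatch job_script.sh # submit a job script to the scheduler
squeue -u $USER # list your pending and running jobs
scancel <job_id> # cancel a job by its ID
sinfo # show the status of nodes and partitions
sacct -j <job_id> # show accounting information for a finished job

An example SLURM array job script follows.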

#!/bin/bash
#SBATCH --job-name=normal_sim ## specify job name
#SBATCH --nodes=1 ## request 1 node 
#SBATCH --mem=8G ## request 8G memory
#SBATCH --cpus-per-task=4 ## request 4 cpus/cores per job
#SBATCH --time=24:00:00 ## specify job running time for 24 hrs
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out ## specify the slurm output file path
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err ## specify the slurm error file path
#SBATCH --array=1-10 ## submit 10 array tasks of the commands listed below, with task IDs 1..10


## The following commands will be run 10 times, once per array task, each with a unique task ID given by $SLURM_ARRAY_TASK_ID in {1..10}.

## Change working directory
cd /home/jyang51/yangfss2/public/ExampleData

## Create SLURM_OUT/ under the current working directory to save slurm output and error files
## (this directory should also exist under the submission directory before the job is submitted, so SLURM can write the .out/.err files)
mkdir -p ./SLURM_OUT/


## Print SLURM array task ID $SLURM_ARRAY_TASK_ID (1..10)
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID


## Load R software
spack load r@4.4.0


## Use the Rscript below to simulate a vector x from the standard normal distribution and write x to a text data file under /home/jyang51/yangfss2/public/ExampleData/
## Use the SLURM array task ID to save unique output data files, or to configure the job.
Rscript /home/jyang51/yangfss2/public/ExampleScripts/norm_sim.R /home/jyang51/yangfss2/public/ExampleData/ $SLURM_ARRAY_TASK_ID
Example R script norm_sim.R called by the job script above:

#!/usr/bin/env Rscript
Sys.setlocale('LC_ALL', 'C') # use the C locale for consistent sorting and formatting

###############
options(stringsAsFactors=F)


###############
## Parse command-line arguments: the output directory prefix and the simulation/task index
args = commandArgs(TRUE)
print(args)
if (length(args) == 0) {
  stop("Error: No arguments supplied!")
} else {
  out_prefix = args[[1]] # directory prefix for the output file
  n_sim = args[[2]] # simulation index (the SLURM array task ID)
}

## Simulate a vector of 100 values from the standard normal distribution
x = rnorm(100, mean = 0, sd = 1)

print(paste("mean of simulated data =", mean(x)))

print(paste("standard deviation of simulated data =", sd(x)))

print("Write simulated data to a text file:")

write.table(data.frame(x = x), file = paste0(out_prefix, "sim_", n_sim, ".txt"), quote = FALSE, sep = "\t", row.names = FALSE)
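To run this example, the job script could be submitted from the head node; a sketch assuming the job script above is saved as norm_sim_array.sh (a hypothetical file name):

sbatch norm_sim_array.sh # submit the array job (10 tasks)
squeue -u $USER # monitor the array tasks
less -S ./SLURM_OUT/normal_sim_<job_id>_1.out # inspect the output of task 1 after it finishes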

Monitoring storage usage

  1. Check storage space for all data drives on HGCC by df -h
  2. Check the space used by the current working directory by du -h -d1 (see the example below)
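For example, to see which subdirectories use the most space, the per-directory usage can be sorted; a simple sketch:

du -h -d1 | sort -h # list subdirectory sizes from smallest to largest
df -h /scratch # check free space on the scratch drive specifically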

Additional Resources