Introduction to Human Genetics Compute Cluster (HGCC)

Cluster Structure

Login

  1. Login to Emory VPN
  2. Log in using ssh by typing the following command in your terminal: ssh <hgcc_user_ID>@hgcc.genetics.emory.edu

  3. Log in without typing a password by setting up SSH keys
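
A minimal sketch of setting up SSH keys for passwordless login, assuming the standard OpenSSH client tools on your local machine (replace jyang with your HGCC user ID):

    ssh-keygen -t ed25519                       # generate a key pair on your local machine
    ssh-copy-id jyang@hgcc.genetics.emory.edu   # copy the public key over; enter your password once
    ssh jyang@hgcc.genetics.emory.edu           # later logins should no longer prompt for a password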

Mount Remote Cluster Home Directory to Local Computer

Use WinSCP for Windows systems

Use sshfs for macOS

  1. Install SSHFS 2.5.0
  2. Install macFUSE 4.1.0
  3. Create a mount point directory on the local computer: mkdir ~/RHPC/
  4. Mount the HGCC home directory /home/jyang/ to the local mount directory ~/RHPC/ (replace jyang with your HGCC user ID): sshfs jyang@hgcc.genetics.emory.edu:/home/jyang/ ~/RHPC/ -o auto_cache -ovolname=HGCC -o follow_symlinks
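
When finished, the mounted volume can be detached with the standard umount command (if macOS reports the resource as busy, diskutil unmount ~/RHPC/ is an alternative):

    umount ~/RHPC/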

Copying Data Files to the Cluster

  1. Command rsync is recommended (see the examples after this list)

    rsync [options] [SOURCE] [DESTINATION]

  2. Command scp can also do the job

    scp <path_to_file> <username>@hgcc.genetics.emory.edu:<destination_path_on_hgcc>
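
For example, to copy a local directory or file to the cluster (the paths are hypothetical; for rsync, -a preserves file attributes and recurses into directories, -v prints progress, and -z compresses data in transit):

    # Copy a local project directory into your HGCC home directory
    rsync -avz ./my_project/ jyang@hgcc.genetics.emory.edu:/home/jyang/my_project/

    # The equivalent single-file copy with scp
    scp ./my_project/data.txt jyang@hgcc.genetics.emory.edu:/home/jyang/my_project/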

Handle Data Files on Cluster

  1. Command rsync is recommended for copying data on cluster
  2. Command cp also copies data
  3. Delete data or directory by rm
  4. Make directories by mkdir
  5. Move a directory by mv
  6. List all files under a directory by ls
  7. List all files with their sizes by ls -l -h
  8. Use vi or nano to edit text files on the cluster
  9. Read text file by less, cat
  10. Consider compressing your text files to save space with gzip
  11. Open gzipped text file by zcat [file_name.gz] | less -S
  12. Command man to see user manual/instructions of Linux tools, e.g., man ls
  13. Use pipe | to take output from the command before | as the input for the command after |
  14. Create an alias with the alias command to set up your own command shortcuts. See the examples below.
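
For example (the alias names below are personal shortcuts, not cluster conventions):

    # Pipe: filter a directory listing to show only gzipped files
    ls -l -h | grep '.gz'

    # Aliases: define shortcuts, and add them to ~/.bashrc to make them permanent
    alias ll='ls -l -h'
    alias labdata='cd /mnt/YangFSS/data2'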

Strategies for Using HGCC

The command man [command] gives the help page for any of the commands below, e.g., man rsync. Type q to exit the help page.

  1. Only use the head node to download/move data and submit jobs (never run large computation jobs on the head node)
  2. Log in to an interactive compute node from the head node with the command qlogin, or specify a particular interactive compute node, e.g., qlogin -l h=node07.local.
  3. Test your code/jobs on interactive compute node.
  4. Two parallel interactive sessions are allowed per user.
  5. Wrap your jobs in shell scripts and submit them to compute nodes with qsub from the head node (see the example script after this list).
    1. Request two cores for a job with -pe smp 2: qsub -q b.q -cwd -j y -N [jobname] -pe smp 2 [job commands].
    2. Request an appropriate number of cores based on the memory (8GB per core) and the parallel computing ability of your job
    3. qsub is limited to at most 500 jobs
    4. Submit array jobs for a large number of jobs by qsub -t. Array job scripts need to be written to account for the task ID of each single job within the array. See Instructions.
    5. Think about breaking your one big job into multiple small jobs/steps and running those jobs/steps concurrently on multiple nodes. Multiple small jobs may be more efficient than a single large job.
    6. Default running time for a job is up to 240 hours
  6. Use the scratch space (2GB) /scratch/ on each compute node to avoid extensive I/O between compute nodes and the head node storage disks; the example script after this list follows these steps.
    1. Copy data into scratch space first
    2. Create a temporary working directory under the scratch directory
    3. Write into the temporary working directory
    4. Copy results back to hard disk storage
    5. Delete the temporary working directory
    6. Not suitable if your work requires >2GB of storage space
    7. Not needed if you are using our lab data storage spaces, /mnt/YangFSS/data, /mnt/YangFSS/data2
    8. Your home directory /home/[userid] is likely to be located on our lab data storage disks
  7. Check job status and node status by qstat -f
  8. Check the complete job queue by qstat -f -u '*'
  9. Delete a job by qdel [jobid] or all of your jobs by qdel -u [userid]
  10. Shared genomic data on HGCC are under /sw/hgcc/Data. Do not make another copy of the data unless you need to make changes.
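
As an illustration of items 5 and 6 above, the sketch below shows a hypothetical job script that stages data through the node-local scratch space, followed by qsub submission commands. The file names, paths, and loaded module are placeholders to adapt to your own analysis; $JOB_ID and $SGE_TASK_ID are environment variables that the scheduler sets for each job and array task.

    #!/bin/bash
    # example_job.sh - hypothetical job script; adapt paths and commands to your own analysis
    module load R/4.0.3                          # load whatever software the job needs

    # Create a temporary working directory under the node-local scratch space
    WORKDIR=/scratch/${USER}_${JOB_ID}
    mkdir -p ${WORKDIR}

    # Copy input data into the scratch space and work there
    cp /mnt/YangFSS/data2/jyang/input_data.txt ${WORKDIR}/
    cd ${WORKDIR}

    # Run the analysis, writing output into the temporary working directory
    Rscript /mnt/YangFSS/data2/jyang/scripts/analysis.R input_data.txt output_results.txt

    # Copy results back to hard disk storage, then delete the temporary working directory
    cp ${WORKDIR}/output_results.txt /mnt/YangFSS/data2/jyang/results/
    rm -rf ${WORKDIR}

Submit the script from the head node, for example:

    # Single job requesting two cores
    qsub -q b.q -cwd -j y -N myjob -pe smp 2 example_job.sh

    # Array job with 100 tasks; inside the script, ${SGE_TASK_ID} (1 to 100)
    # identifies the current task and can be used to pick that task's input
    qsub -q b.q -cwd -j y -N myarrayjob -t 1-100 example_job.sh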

Using Software

Check if software is available on the cluster

  1. Log onto an interactive node by qlogin first, and then use the command module avail to list all installed software modules on HGCC
  2. Check if an executable file of installed or loaded software is available under ${PATH} by which [software_name]
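
For example (on a typical environment-modules setup, module avail prints to standard error, hence the 2>&1 before piping):

    module avail 2>&1 | grep -i plink   # search the installed modules for a tool of interest
    which plink                         # after loading the module, confirm the executable is on PATH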

Using software installed on the cluster

  1. Load a software module into the current session by module load [software_name]
  2. List loaded software modules under the current session by module list
  3. Use the software's commands to run jobs (see the example session after this list)
    1. Type command R in the current session to start an R session after loading R by module load R/4.0.3
    2. Type command python in the session to start a python session after loading Anaconda3 by module load Anaconda3/5.2.0
    3. Check which python is called by command python with command which python
    4. One can open scripts on HGCC in a local editor (e.g., Sublime Text, Visual Studio Code) after mounting the cluster home directory to the local machine, and just copy/paste commands into these R or python sessions
    5. Type command plink -h to see the plink user manual after loading plink by module load plink/1.90b53
  4. Command which can also be used to check if a software executable is available under ${PATH} in the current session
  5. Unload a software module by module unload [software_name]
  6. Unload all software modules by module purge
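
For example, a typical interactive session might look like the following (module names and versions are taken from the items above; check module avail for what is currently installed):

    module load R/4.0.3           # load R
    which R                       # confirm which R executable will be called
    R                             # start an interactive R session; quit with q()

    module load Anaconda3/5.2.0   # load Anaconda python
    which python                  # confirm python now points to the Anaconda installation
    python                        # start a python session; exit with exit() or Ctrl-D

    module list                   # show currently loaded modules
    module purge                  # unload all modules when finished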

Install software without root access

  1. Install software under user home directory, e.g., ~/.local/
  2. Install python libraries by pip install --user. Or search for "install without root access" or "install with a specified installation directory". The installation directory should be one you have write access to.
  3. Install software after logging in to an interactive compute node by qlogin, as many C++ libraries are only available on compute nodes
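
A minimal sketch of installing into user space, assuming a typical autotools-style source package (the package name tool-1.0 and its files are placeholders):

    # Python libraries into your user site directory with pip
    module load Anaconda3/5.2.0
    pip install --user numpy

    # A C/C++ tool built from source and installed under the home directory
    qlogin                                      # build on an interactive compute node
    tar -xzf tool-1.0.tar.gz && cd tool-1.0     # unpack the downloaded source
    ./configure --prefix=$HOME/.local           # point the install at a writable prefix
    make && make install
    export PATH=$HOME/.local/bin:$PATH          # make the new binary findable in this session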

Share installed software with the lab

  1. Make a soft link of the executable tool under /mnt/YangFSS/data/bin by the following commands

    cd /mnt/YangFSS/data/bin
    ln -s [tool directory]

  2. Add /mnt/YangFSS/data/bin into the ${PATH} environment variable by including the following command line in your ~/.bashrc file

    export PATH=/mnt/YangFSS/data/bin:$PATH

Set up one's ~/.bashrc file
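
A minimal sketch of a ~/.bashrc, collecting the PATH setting and alias shortcuts mentioned above (the exact entries are personal preferences, not required settings):

    # ~/.bashrc: read by each new bash session
    export PATH=/mnt/YangFSS/data/bin:$HOME/.local/bin:$PATH   # shared lab tools and user-installed software
    alias ll='ls -l -h'                                        # list files with human-readable sizes
    alias wkdir='cd /mnt/YangFSS/data2/jyang'                  # jump to your working directory (replace jyang)

Run source ~/.bashrc (or start a new session) for the changes to take effect.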

Storage on the Cluster

Main HGCC data drive /mnt/icebreaker/data2/

Yang lab data drives

  1. Data Drive 1 /mnt/YangFSS/data/. All lab member home directories are located in this data drive. Consider using our data drive 2 if you need more storage space.
  2. Data Drive 2 /mnt/YangFSS/data2/.

    1. Large data sets are located in this data drive.
    2. Create a working directory on this data drive and organize all of your work under it.
    3. For example,
    mkdir -p /mnt/YangFSS/data2/jyang
    cd /mnt/YangFSS/data2/jyang
    ls

Monitoring storage usage

  1. Check storage space for all data drives on HGCC by df -h
  2. Check space used by current working directory by du -h -d1
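
For example (the paths are the lab data drives listed above; sort -h assumes GNU sort, which is standard on Linux):

    df -h /mnt/YangFSS/data /mnt/YangFSS/data2   # free and used space on the lab data drives
    du -h -d1 | sort -h                          # per-subdirectory usage, smallest to largest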

Files on the cluster are NOT backed up

Additional Resources