Introduction to the New Human Genetics Compute Cluster (HGCC)

HGCC Cluster Structure

[Figure: HGCC cluster architecture]

Login to HGCC

  1. Log in to the Emory VPN.

  2. Log in using ssh by typing the following command in a bash shell terminal, with your Emory NetID and Emory password. Replace <Emory_NetID> with your Emory NetID:
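
    ssh <Emory_NetID>@hgcc.emory.edu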

  3. Set up SSH login without typing a password: repeat this setup on each of your local computers. It is highly recommended to set this up first; it will save you a lot of effort.

    1. Generate a pair of authentication keys under the home directory on your local computer: ssh-keygen -t rsa.

      • Your identification has been saved in ~/.ssh/id_rsa.
      • Your public key has been saved in ~/.ssh/id_rsa.pub.
    2. Run the following command on the remote HGCC to create the directory ~/.ssh under your HGCC home directory:

      • mkdir -p ~/.ssh
    3. Now run the following command on your local computer to append your local authentication key to ~/.ssh/authorized_keys on HGCC:

      • ssh-copy-id <Emory_NetID>@hgcc.emory.edu
      • If ssh-copy-id does not work, just run the following command on your local computer: cat ~/.ssh/id_rsa.pub | ssh <Emory_NetID>@hgcc.emory.edu 'cat >> ~/.ssh/authorized_keys'.
      • Enter your Emory password if you are prompted for it.
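
After this setup, running ssh <Emory_NetID>@hgcc.emory.edu from your local computer should log you in without a password prompt.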

Set up your ~/.bashrc file
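
A minimal sketch of common ~/.bashrc additions, drawn from the alias and PATH examples elsewhere on this page (adjust to your own workflow):

alias c='clear' # shortcut for the clear command
export PATH="$HOME/.local/bin:$PATH" # make executables installed under ~/.local/bin available in every session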

Mount Remote Cluster Home Directory to Local Computer

Visual Studio Code for both Windows and Mac systems (highly recommended).

  1. Install Visual Studio Code with Remote-SSH extension.

  2. Set up the SSH configuration file under your local home directory (see the sketch after this list). Type pwd in your local shell window to show your local home directory.

  3. Press Shift + Command + P on Mac (Ctrl + Shift + P on Windows), or type > in the search box at the top of your VSCode window, to open the command palette.

  4. See more information at Extension Documentation.
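
A minimal ~/.ssh/config entry for the Remote-SSH extension might look like the following sketch (the host alias hgcc is just an example name):

Host hgcc
    HostName hgcc.emory.edu
    User <Emory_NetID>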

Other useful extensions on VSCode: Python, Jupyter, Python Debugger, C/C++, PowerShell, Markdown All in One, vscode-pdf, Rtools, GitHub Copilot Chat

Other choices for mounting remote clusters
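
For example, if sshfs is installed on your local computer, you can mount your HGCC home directory to a local folder (the mount point ~/hgcc is just an example name):

mkdir -p ~/hgcc # create a local mount point
sshfs <Emory_NetID>@hgcc.emory.edu: ~/hgcc # mount your remote home directory at ~/hgcc
# to unmount, typically: umount ~/hgcc (Mac) or fusermount -u ~/hgcc (Linux)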

Now you can access all files on HGCC through the mounted local directory.

Transfer Data Files to/from Cluster

  1. Command rsync is recommended

    rsync [options] [SOURCE] [DESTINATION]

  2. Command scp can also do the job
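
For example (file and directory names below are placeholders; the hostname is the same one used for login above):

rsync -avz [local_file] <Emory_NetID>@hgcc.emory.edu:[remote_directory]/ # upload a local file to the cluster (-a archive, -v verbose, -z compress)
rsync -avz <Emory_NetID>@hgcc.emory.edu:[remote_file] [local_directory]/ # download a file from the cluster
scp [local_file] <Emory_NetID>@hgcc.emory.edu:[remote_directory]/ # scp uses a similar SOURCE DESTINATION syntax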

Useful Basic Linux Commands for Handling Data Files on Cluster

  1. Command rsync is recommended for moving data files within the cluster, between your local computer and the cluster, or between clusters.
  2. Command cp also copies data.
  3. Delete data or directory by rm.
  4. Make directories by mkdir.
  5. Move a directory by mv.
  6. List all files under a directory by ls.
  7. List all files with their sizes by ls -l -h.
  8. Use vi or nano to edit text files on the cluster. It is recommended to edit text files through the locally mounted directory.
  9. Read text file on the cluster by less, cat. less -S is recommended for viewing large text files on cluster.
  10. Consider compressing your text files with gzip to save space.
  11. Unzip gzipped text file by gunzip [file_name.gz].
  12. Use tar for archiving and extracting directories (see the example below).
  13. Command man to see user manual/instructions of Linux tools, e.g., man ls .
  14. Use the pipe | to take the output of the command before | as the input of the command after |. For example, cat xx_file.txt | head will print the first 10 rows of xx_file.txt in the bash window.
  15. Create aliases with alias to set up your own command shortcuts in the ~/.bashrc file. See Examples. Example alias commands, such as alias c='clear', can be seen in /home/jyang51/.bashrc; this sets up c as a shortcut for the command clear.

The command man [command] shows the help page for Linux commands, e.g., man rsync. Type q to exit the help page.
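
For example, a sketch of common tar usage (archive and directory names are placeholders):

tar -czvf [archive_name].tar.gz [directory_name]/ # compress a directory into a gzipped tar archive
tar -xzvf [archive_name].tar.gz # extract the archive into the current directory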

Strategies for Using HGCC

  1. Only use the headnode to log in and submit jobs. Never run large computation jobs on the headnode.
  2. Log in to the data transfer node by ssh dtn01 with your Emory NetID and Emory password for large data transfers.
  3. Log in to an interactive compute node with the command srun -N 1 -n 1 --pty --preserve-env bash.
  4. Wrap your jobs in shell scripts and submit them to compute nodes with sbatch from the headnode (see the example job script after this list).
  5. Use the scratch space (84 TB) /scratch/ to avoid extensive I/O between compute node memory and the storage disks.
    1. Create a temporary working directory under the scratch directory, mkdir -p /scratch/tmp_jyang51/, with a unique directory name such as tmp_jyang51.
    2. Copy data into the temporary working directory in the scratch space, rsync [path_to_data_file]/[data_file_name] /scratch/tmp_jyang51/.
    3. Write output files into the temporary working directory, /scratch/tmp_jyang51/.
    4. Copy output files back to your storage, rsync /scratch/tmp_jyang51/ [path_to_output_files]/.
    5. Delete the temporary working directory, rm -rf /scratch/tmp_jyang51.
    6. All files in the scratch space will be deleted after 30 days.
  6. Keep your working directories organized and clean.
    1. Do not make another copy of the same data unless you need to make changes.
    2. Delete files that you will no longer use.
  7. Write a README.md file for each data directory.
  8. Only request up to 7 nodes if you are submitting a large number of jobs, by specifying the requested node list: sbatch --nodelist node[01-07].
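
A minimal sketch of a job script following the scratch-space workflow above (the job name, log file name, and placeholder paths are examples, and %j in the log file name is the SLURM job ID; replace the analysis step with your own commands):

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=example_job_%j.log

# create a unique temporary working directory in the scratch space
mkdir -p /scratch/tmp_jyang51/
# copy input data into scratch
rsync [path_to_data_file]/[data_file_name] /scratch/tmp_jyang51/
# ... run your analysis here, writing output files to /scratch/tmp_jyang51/ ...
# copy output files back to your storage and clean up
rsync /scratch/tmp_jyang51/ [path_to_output_files]/
rm -rf /scratch/tmp_jyang51

Submit the script from the headnode with sbatch [script_name].sh.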

Using Software on HGCC

Check whether a software package is available on the cluster

  1. Use command spack find -x to list all installed software modules on HGCC.
  2. Use command spack find --loaded to list all loaded software modules.
  3. Check whether an executable file of an installed or loaded software package is available under ${PATH} by which [software_name].
  4. Add a symbolic link to the executable file under a directory in ${PATH} to avoid typing the full path of the executable file.
export PATH="$HOME/.local/bin:$PATH" # add the local bin directory to PATH; add this line to your ~/.bashrc file to avoid running it in each session
echo $PATH; # show current PATH directories
cd ~/.local/bin; # go to the local bin directory
ln -s [directory_to_executable_file]; # create a symbolic link to the software executable

Using software installed on the cluster by spack

  1. Load a software module into the current session by spack load [software_name].
  2. Type command R in the current session to start an R session after loading R by spack load r@4.4.0.
  3. Type command python in the session to start a python session after loading Anaconda3 by spack load anaconda3@2023.09-0.
  4. One can open scripts on HGCC with a local text/script editor (e.g., Sublime, Visual Studio Code) after mounting the cluster directory to the local machine, and copy/paste commands into these bash, R, or Python sessions.
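
For example, a typical session with an installed module might look like this (module name taken from the list above):

spack load r@4.4.0 # load the R 4.4.0 module into the current session
which R # confirm the R executable is now on PATH
R # start an interactive R session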

Install software without root access

  1. Install software under your home directory, e.g., ~/.local/, or your working directory on the lab data drive.
  2. Create a virtual Python environment with conda to install Python libraries inside that environment (see the sketch below).
  3. The total number of files is limited to 1,000,000 per group. Python libraries, environment files, and R library files can easily exceed this limit for a group. Once the limit is reached, logins for all group members may fail or lag.
  4. Submit a ticket through the HGCC website to request that a software package be installed by IT.

It is recommended to create Python environments on your lab data drive and install software to your lab data drive.
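
A minimal sketch, assuming Anaconda3 is loaded via spack as above; [lab_data_drive_path] and the environment name myenv are placeholders:

spack load anaconda3@2023.09-0 # load Anaconda3 to make conda available
conda create --prefix [lab_data_drive_path]/envs/myenv python=3.11 # create the environment on the lab data drive
conda activate [lab_data_drive_path]/envs/myenv # activate the environment by its full path
conda install [python_package] # libraries now install under the lab data drive, not your home directory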

Share installed software with the lab

cd /nfs/yangfss2/data/shared/bin # go to the shared bin directory on the lab data drive
ln -s [tool_directory] # create a symbolic link to the installed tool so lab members can use it

Monitoring storage usage

  1. Check storage space for all data drives on HGCC by df -h
  2. Check space used by current working directory by du -h -d1
  3. Check the number of files in a directory by find [directory_path] -type f | wc -l.

Additional Resources