Run Computation Tools and Submit Jobs on HGCC
Run tools on interactive node
It is good practice to test your script/code on an interactive node first.
- Use the command `qlogin` to log in to one of the interactive computation nodes after you log in to the gateway node00
- Check available modules/tools: `module avail`
- Load the module/tool you want to use, e.g., `module load R/4.0.4`
- Check loaded modules: `module list`
- Unload a software module, e.g., `module unload R/4.0.4`
- Unload all software modules: `module purge`
- See the Module User Guide
- Write scripts using text editors like Visual Studio Code or Sublime on your local computer under mounted cluster directories
R
- Type `R` in the interactive session to initiate an R session
- Copy R commands from your R script and paste them into the R session
- Simple parallel computation can be enabled with the R packages `parallel` and `foreach`; see Quick Instructions of Parallel Computation
- Tips
  - Provide the full directory path to the data files you want to read, write, or plot
  - To view your plots, save figures as PDF to disk with the R function `ggsave("temp.pdf")` or with `pdf("temp.pdf"); plot(); dev.off()`
Python/Anaconda
- Load the Anaconda module, e.g., `module load Anaconda3/5.2.0`
- Interactive python session
  - Type `python` in the interactive session to initiate a python session
  - Copy python commands from your python script and paste them into the python session
- Use a virtual environment to get the access rights to install python libraries
  - Create a new virtual environment named myenv: `conda create --name myenv python=3.5 pandas numpy scipy scikit-learn statsmodels`
  - List available virtual environments: `conda env list`
  - Activate the environment named myenv: `conda activate myenv`
  - Install new libraries in an existing environment: `conda install -n myenv package`
  - Remember to deactivate your environment when you are done: `conda deactivate`
  - Delete a virtual environment when you no longer need it: `conda remove -n myenv --all`
- Install python packages locally: `pip install --user package`
- Set up a python virtual environment with `pip` and `virtualenv`
  - Install `pip` (already installed with Anaconda)
  - Install `virtualenv`: `python3 -m pip install --user virtualenv`
  - Create a virtual environment `myenv` under the current directory: `python3 -m venv myenv`
  - Activate the virtual environment `myenv`: `source myenv/bin/activate`
  - Confirm you are in the virtual environment with `which python`; it should point into the `myenv` directory
  - Deactivate the virtual environment with `deactivate`
  - Once you are in your virtual environment, you have the access rights to install packages: `python3 -m pip install package`
- Check your python path, which lists the directories python uses to look for libraries: `echo $PYTHONPATH`
- Add an additional directory to your python path. The following example command can be added to your `~/.bashrc.user` file:

```bash
export PYTHONPATH=$PYTHONPATH:/home/jyang/.local/lib/python3.5/site-packages
```
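As a quick sanity check that the export works, the sketch below prepends a directory to `PYTHONPATH` and confirms python picks it up. The directory `/tmp/mylibs` is a hypothetical stand-in for a real site-packages path:

```shell
# Hypothetical directory; substitute your own site-packages path.
export PYTHONPATH="/tmp/mylibs:${PYTHONPATH}"
# Python adds every PYTHONPATH entry to sys.path at startup.
found=$(python3 -c 'import sys; print("/tmp/mylibs" in sys.path)')
echo "$found"
```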
PLINK
PLINK is a powerful and useful genetic data analysis tool: it converts data formats, performs QC, calculates MAF and HWE p-values, kinship matrices, and top principal components, and conducts single-variant genetic association studies. Its common data format is BED/BIM/FAM.
- Load the plink module, e.g., `module load plink/1.90b53`
- Quick command to look at plink input arguments: `plink -h`
- See the PLINK 1.9 Manual
BCFTools
BCFTools is a fast tool to manipulate sorted/bgzipped/tabixed VCF files of genotype data.
- Load the Tabix module: `module load tabix/0.2.6`
- Load the bcftools module, e.g., `module load bcftools/1.9`
- Quick look at bcftools input arguments: `bcftools -h`
- See the BCFTools Manual
BEDTools
BEDTools utilities are a Swiss-army knife of tools that allow one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
- Load the bedtools module, e.g., `module load bedtools/2.27.0`
- Quick look at bedtools input arguments: `bedtools -h`
- Bedtools can intersect a SNP file with an annotation file to annotate SNPs with respect to genetic features
- See the BEDTools Manual
Submit Jobs by qsub
Create a bash script and submit a single job
Create an example bash script with `touch phase.sh` and write the following commands into this file (see the example shell script at `/home/jyang/Scripts/Shell_Scripts/phase.sh` on HGCC):
```bash
#!/bin/bash
# setup bash variables to take input argument contents
vcf_file=$1
out_prefix=$2
n_thread=$3
# print out variable contents and log messages
echo vcf file directory $vcf_file
echo output file prefix $out_prefix
echo number of threads to run $n_thread
echo Phase a VCF file by eagle v2.3
# command to run the phase tool eagle
/home/jyang/local/bin/eagle --geneticMapFile=/home/jyang/lib/Eagle_v2.3.5/tables/genetic_map_hg19_withX.txt.gz \
--vcfOutFormat=z --outPrefix=${out_prefix}.Phased --numThreads ${n_thread} \
--vcf=${vcf_file} > ${out_prefix}.eagle.out.txt 2>&1
exit
```
- This example bash script takes three input arguments and passes their contents to the variables `vcf_file`, `out_prefix`, and `n_thread`
- The cluster will only run the commands within the submitted shell script. Make sure to write your results to the hard drive.
- Use the command `echo` to print log messages or variable contents, which helps debug and check the job status
- Make this shell script executable: `chmod 755 phase.sh`
- Submit this shell script as a job requesting 4 cores, with three input arguments:

```bash
qsub -q b.q -cwd -j y -pe smp 4 /home/jyang/Scripts/Shell_Scripts/example_single_job_phase.sh [vcf_file_directory] [output_file_prefix] 4
```
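One subtlety in scripts like the one above is redirection order: `> file 2>&1` captures both stdout and stderr in the log file, whereas `2>&1 > file` would leave stderr on the terminal. A minimal, cluster-independent sketch:

```shell
log=$(mktemp)
# '>' first, then '2>&1': stderr follows stdout into the log file.
{ echo "to stdout"; echo "to stderr" >&2; } > "$log" 2>&1
captured=$(cat "$log")
echo "$captured"
rm -f "$log"
```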
Remarks
If one wants to use the `/scratch` space local to each computation node to improve computation efficiency by avoiding heavy I/O, the bash script needs to be updated to include commands that create a temporary directory `${TEMPDIR}` under the `/scratch` space (2GB per node).
```bash
#!/bin/bash
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`
# For single job
TEMPDIR="/scratch/${JOB_ID}_${TMP_NAME}"
if [ -e $TEMPDIR ]; then
    echo "Error. Scratch dir already exists on `hostname`"
    exit
else
    mkdir $TEMPDIR
fi
# setup bash variables to take input argument contents
vcf_file=$1
out_prefix=$2
n_thread=$3
# copy input data files into the temporary directory
cp /home/jyang/lib/Eagle_v2.3.5/tables/genetic_map_hg19_withX.txt.gz ${TEMPDIR}/genetic_map_hg19_withX.txt.gz
cp ${vcf_file} ${TEMPDIR}/temp.vcf.gz
# Run the phase tool eagle with data files under the temporary directory
# Write to the temporary directory
/home/jyang/local/bin/eagle --geneticMapFile=${TEMPDIR}/genetic_map_hg19_withX.txt.gz \
--vcfOutFormat=z --outPrefix=${TEMPDIR}/Phased --numThreads ${n_thread} \
--vcf=${TEMPDIR}/temp.vcf.gz > ${TEMPDIR}/eagle.out.txt 2>&1
# Copy results back to hard disk
cp ${TEMPDIR}/Phased.vcf.gz ${out_prefix}.Phased.vcf.gz
cp ${TEMPDIR}/eagle.out.txt ${out_prefix}.eagle.out.txt
# Remove temporary directory
rm -f -r ${TEMPDIR}
exit
```
- Copy input data files from the hard disk into the temporary directory with `cp` or `rsync`
- Run the analysis commands on the input data files under the temporary directory
- Write results to the temporary directory
- Copy results back to the hard disk
- Remove the created temporary directory
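The five steps above can be exercised locally as a minimal sketch. Here `/tmp` stands in for `/scratch`, `tr` stands in for the real analysis tool, and `JOB_ID` is set by hand (on HGCC the scheduler supplies it); all three are assumptions for illustration:

```shell
#!/bin/bash
JOB_ID=demo                                          # stand-in for the scheduler-provided $JOB_ID
TMP_NAME=$(mktemp -u XXXXXXXX)
TEMPDIR="/tmp/${JOB_ID}_${TMP_NAME}"                 # /tmp stands in for /scratch
mkdir "$TEMPDIR"
echo "input data" > "$TEMPDIR/input.txt"             # step 1: copy input data in (cp/rsync)
result=$(tr a-z A-Z < "$TEMPDIR/input.txt")          # step 2: run the analysis
echo "$result" > "$TEMPDIR/output.txt"               # step 3: write results to the temp dir
cp "$TEMPDIR/output.txt" ./scratch_demo_output.txt   # step 4: copy results back
rm -rf "$TEMPDIR"                                    # step 5: remove the temp dir
```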
Submit Array Jobs
Array jobs are a convenient way to submit multiple repetitive jobs that differ only by one input variable, e.g., 10,000 repetitive simulations, or an association study for all 20K genome-wide genes.
- The following example commands submit an array job that runs the same bash script `example_array_job_segment_VCF.sh` `1703` times, differing only in the task id `${SGE_TASK_ID}`. The `-tc 100` option allows at most `100` tasks in the job queue at the same time.
```bash
# WGS VCF file directory
vcf_dir=/mnt/YangFSS/data2/AMP-AD/ROSMAP/WGS_JointCall
# bed file directory
bed_file=/mnt/YangFSS/data2/AMP-AD/ROSMAP/WGS_JointCall/LDdetect_SegmentedVCF/lddetect_1KG_EUR.bed
# output directory
out_dir=/mnt/YangFSS/data2/AMP-AD/ROSMAP/WGS_JointCall/LDdetect_SegmentedVCF
# Submit array job
qsub -q b.q -j y -cwd -N segVCF -t 1-1703 -tc 100 \
/home/jyang/Scripts/Shell_Scripts/example_array_job_segment_VCF.sh ${vcf_dir} ${bed_file} ${out_dir}
```
- The shell script has the following contents:
```bash
#!/usr/bin/bash
vcf_dir=$1
seg_bed=$2
out_dir=$3
module load tabix/0.2.6
## Obtain the corresponding input filename from the ${seg_bed} input file with respect to the array task id ${SGE_TASK_ID}
line=`head -n ${SGE_TASK_ID} $seg_bed | tail -n1`
echo Segment VCF for $line
######## Create temp directory
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`
TEMPDIR="/scratch/${JOB_ID}_${SGE_TASK_ID}_${TMP_NAME}"
if [ -e $TEMPDIR ]; then
    echo "Error. Scratch dir already exists on `hostname`"
    exit
else
    mkdir $TEMPDIR
fi
######## Segment VCFs
## Write results to temp directory
## You may also copy your input data files to temp directory if needed
chr=$(echo ${line} | awk '{print $1}')
pos_start=$(echo ${line} | awk '{print $2}')
pos_end=$(echo ${line} | awk '{print $3}')
seg_name=$(echo ${line} | awk '{print $4}')
echo $chr $pos_start $pos_end $seg_name
echo VCF ${vcf_dir}/NIA_JG_1898_samples_GRM_WGS_b37_JointAnalysis01_2017-12-08_${chr}.recalibrated_variants.vcf.gz
## Grep header
tabix -h ${vcf_dir}/NIA_JG_1898_samples_GRM_WGS_b37_JointAnalysis01_2017-12-08_${chr}.recalibrated_variants.vcf.gz | grep "CHROM" > ${TEMPDIR}/${seg_name}.vcf
## Tabix genotype
tabix ${vcf_dir}/NIA_JG_1898_samples_GRM_WGS_b37_JointAnalysis01_2017-12-08_${chr}.recalibrated_variants.vcf.gz ${chr}:${pos_start}-${pos_end} >> ${TEMPDIR}/${seg_name}.vcf
bgzip -f ${TEMPDIR}/${seg_name}.vcf
tabix -f ${TEMPDIR}/${seg_name}.vcf.gz
######## Copy results back to output directory
echo copy results back to $out_dir
cp -f ${TEMPDIR}/${seg_name}.vcf.gz ${out_dir}/
cp -f ${TEMPDIR}/${seg_name}.vcf.gz.tbi ${out_dir}/
######## Remove temporary data directory
rm -fr ${TEMPDIR}
exit
```
- The example bash script selects the corresponding VCF segment information from the `${seg_bed}` input file by the line number given by the task id `${SGE_TASK_ID}`
- It then extracts the corresponding region from the input VCF file to create a bgzipped/tabixed segment VCF file
- This example writes outputs to the scratch space and then copies the output files back to the data drive afterwards. The script can be modified to write directly to the data drive.
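The head/tail/awk pattern that maps a task id to one line of the BED file can be tested outside the cluster. In this sketch a toy two-line BED file replaces the real `${seg_bed}`, and `SGE_TASK_ID` is set by hand (on HGCC the scheduler sets it for each array task):

```shell
seg_bed=$(mktemp)
printf 'chr1\t100\t200\tseg1\nchr2\t300\t400\tseg2\n' > "$seg_bed"
SGE_TASK_ID=2                                            # set per task by the scheduler
line=$(head -n "${SGE_TASK_ID}" "$seg_bed" | tail -n1)   # pick line number ${SGE_TASK_ID}
chr=$(echo ${line} | awk '{print $1}')
seg_name=$(echo ${line} | awk '{print $4}')
echo "$chr $seg_name"
rm -f "$seg_bed"
```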
Submit a R job
HGCC requires wrapping R scripts in a bash script and then submitting the bash script as a job. The following example command submits a job for an R script wrapped in a bash script.
```bash
qsub -q b.q -cwd -j y -pe smp 1 /home/jyang/Scripts/R_scripts/example_Rjob.sh
```
- The executable shell script `example_Rjob.sh` has the following content:
```bash
#!/usr/bin/bash
module load R/3.4.3
echo Run R script directly
/home/jyang/Scripts/R_scripts/example_Rjob.r 1 2 3
echo Run R script by Rscript command
Rscript /home/jyang/Scripts/R_scripts/example_Rjob.r 1 2 3
exit
```
- The wrapped R script `example_Rjob.r` can be run directly if it is made executable with `chmod 755 example_Rjob.r`, or by using the `Rscript` command tool
- The wrapped R script `example_Rjob.r` has the following content:
```r
#!/usr/bin/env Rscript
Sys.setlocale("LC_ALL", "C")
.libPaths("/home/jyang/R/x86_64-pc-linux-gnu-library/3.4")
options(stringsAsFactors=F)
library(glmnet)
### Load input arguments
args=(commandArgs(TRUE))
print(args)
if(length(args)==0) {
  stop("Error: No arguments supplied!")
} else {
  var1 = args[[1]]
  var2 = args[[2]]
  var3 = args[[3]]
}
print(list(var1 = var1, var2 = var2, var3 = var3))
```
Submit a Python job
- Similar rules apply as for submitting an R job. Replace the R script with a Python script `example.py` with content like:
```python
#!/usr/bin/python
import sys
import io
import os
import pandas as pd
import numpy as np
from numpy import *
from dfply import *
a = 2 + 4
print(a)
```
- Use the command tool `python` to run a non-executable python script: `python /home/jyang/Scripts/Python_scripts/example.py`
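As a quick sanity check of this pattern, the sketch below writes a throwaway script (a stand-in for `example.py`, containing the same arithmetic) and runs it with the python command:

```shell
script=$(mktemp)
printf 'a = 2 + 4\nprint(a)\n' > "$script"   # stand-in for example.py
out=$(python3 "$script")
echo "$out"
rm -f "$script"
```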
Number of cores to request
- Check the current availability of the cluster: `qstat -f -u '*'`
- Choose a number of requested cores no larger than the number of cores available on a node, so that your job is accepted by the scheduler sooner
- Each core has `8GB` of memory. Request an appropriate number of cores to accommodate the memory usage of your job, so that it will not fail in the middle.
- For tools that enable parallel computation with multiple threads, make sure to change the tool's default number of threads to match the number of cores you request
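For thread matching, SGE schedulers typically expose the number of granted slots in the `$NSLOTS` environment variable (an assumption about the HGCC setup, not stated above), so a script can pass it to the tool instead of hard-coding a thread count. Outside a job `NSLOTS` is unset, so defaulting to 1 keeps the snippet runnable anywhere:

```shell
# Inside a 'qsub -pe smp 4' job, NSLOTS would be 4; outside the scheduler it is unset.
n_thread=${NSLOTS:-1}
echo "running with $n_thread threads"
```

A tool invocation in the job script would then use, e.g., `--numThreads ${n_thread}`.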
Additional Resources
- Need help? Submit a ticket at https://hgcc.genetics.emory.edu:8443/ after logging in to the Emory VPN.
- The Linux Command Line book is recommended as preparatory reading