Run Computation Tools and Submit Jobs on HGCC

Run tools on an interactive node

It is good practice to test your script/code interactively before submitting jobs.

  1. Use the command qlogin to log in to one of the interactive computation nodes after you log in to the gateway node00 (see the example session after this list)
  2. Check available modules/tools with the command module avail
    • Load the module/tool you want to use, e.g., module load R/4.0.4
    • Check loaded modules with module list
    • Unload a software module, e.g., module unload R/4.0.4
    • Unload all software modules with module purge
    • Module User Guide
  3. Write scripts on your local computer with text editors such as Visual Studio Code or Sublime, working under the mounted cluster directories
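
A typical interactive session might look like the following (the R version is an example; check module avail for what is installed):

qlogin                  # from the gateway node00, opens a shell on an interactive node
module avail            # list available software modules
module load R/4.0.4     # load a module
module list             # confirm which modules are loaded
module unload R/4.0.4   # unload one module
module purge            # unload all modules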

R

pdf("temp.pdf");
plot(); 
dev.off()
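
To try this interactively on a qlogin node (module version from the list above):

module load R/4.0.4
R

Then paste the commands above into the R prompt; temp.pdf will be written to the current directory.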

Python/Anaconda

Add locally installed Python packages to the module search path, e.g.:

export PYTHONPATH=$PYTHONPATH:/home/jyang/.local/lib/python3.5/site-packages
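
To make this setting persist across logins, you can append the same line to your ~/.bashrc (adjust the path to your own Python version and user directory):

echo 'export PYTHONPATH=$PYTHONPATH:/home/jyang/.local/lib/python3.5/site-packages' >> ~/.bashrc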

PLINK

PLINK is a powerful genetic data analysis tool that can convert data formats, perform QC, calculate MAF and HWE p-values, compute kinship matrices and top principal components, and conduct single-variant genetic association studies. Its common data format is the binary BED/BIM/FAM file set.
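
A minimal sketch of common PLINK commands (the module name and the mydata file prefix are hypothetical; check module avail for the installed version):

module load plink                                      # hypothetical module name
plink --bfile mydata --freq --hardy --out mydata_qc    # MAF and HWE p-values
plink --bfile mydata --pca 10 --out mydata_pca         # top 10 principal components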

BCFTools

BCFTools is a fast tool for manipulating sorted, bgzipped, and tabix-indexed VCF files of genotype data.
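
For example, to extract a genomic region and index the result (the module name and file names are hypothetical):

module load bcftools                     # hypothetical module name
bcftools view -r 1:100000-200000 input.vcf.gz -Oz -o region.vcf.gz
bcftools index -t region.vcf.gz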

BEDTools

The BEDTools utilities are a Swiss-army knife that allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
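
For example, to report the intervals of one BED file that overlap another (the module name and file names are hypothetical):

module load bedtools                     # hypothetical module name
bedtools intersect -a a.bed -b b.bed > overlap.bed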

Submit Jobs by qsub

Create a bash script and submit a single job

Create an example bash script with touch phase.sh and write the following commands into it (see the example shell script /home/jyang/Scripts/Shell_Scripts/phase.sh on HGCC):

#!/bin/bash

# setup bash variables to take input argument contents
vcf_file=$1
out_prefix=$2
n_thread=$3

# print out variable contents and log message
echo vcf file directory $vcf_file
echo output file prefix $out_prefix
echo number of threads to run $n_thread

echo Phase a VCF file by eagle v2.3

# command to run the phase tool eagle
/home/jyang/local/bin/eagle --geneticMapFile=/home/jyang/lib/Eagle_v2.3.5/tables/genetic_map_hg19_withX.txt.gz \
--vcfOutFormat=z --outPrefix=${out_prefix}.Phased --numThreads ${n_thread} \
--vcf=${vcf_file} > ${out_prefix}.eagle.out.txt 2>&1

exit
  1. This example bash script takes three input arguments and passes their contents to the variables vcf_file, out_prefix, and n_thread
  2. The cluster only runs the commands within the submitted shell script, so make sure your script writes its results to disk.
  3. Use command echo to print out log messages or variable contents to debug and check the job status.
  4. Make this shell script executable: chmod 755 phase.sh
  5. Submit this shell script as a job requesting 4 cores, with three input arguments:
    qsub -q b.q -cwd -j y -pe smp 4 /home/jyang/Scripts/Shell_Scripts/phase.sh [vcf_file_directory] [output_file_prefix] 4
    Here -q b.q selects the queue, -cwd runs the job from the current working directory, -j y joins stderr into stdout, and -pe smp 4 requests 4 cores in the smp parallel environment.
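
After submission, standard SGE commands can be used to monitor the job (the job ID 123456 is hypothetical):

qstat              # list your pending and running jobs
qstat -j 123456    # detailed information for one job
qdel 123456        # cancel the job if needed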
    

Remarks

If one wants to use the /scratch space local to each computation node to improve computation efficiency by avoiding heavy I/O, the bash script needs to be updated with commands that create a temporary directory ${TEMPDIR} under the /scratch space (2GB per node):

#!/bin/bash

TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`

# For single job
TEMPDIR="/scratch/${JOB_ID}_${TMP_NAME}"

if [ -e ${TEMPDIR} ]; then
  echo "Error. Scratch dir already exists on `hostname`"
  exit
else
  mkdir ${TEMPDIR}
fi

# setup bash variables to take input argument contents
vcf_file=$1
out_prefix=$2
n_thread=$3

# copy input data files into the temporary directory
cp /home/jyang/lib/Eagle_v2.3.5/tables/genetic_map_hg19_withX.txt.gz ${TEMPDIR}/genetic_map_hg19_withX.txt.gz
cp ${vcf_file} ${TEMPDIR}/temp.vcf.gz

# Run the phase tool eagle with data files under the temporary directory
# Write to the temporary directory
/home/jyang/local/bin/eagle --geneticMapFile=${TEMPDIR}/genetic_map_hg19_withX.txt.gz \
--vcfOutFormat=z --outPrefix=${TEMPDIR}/Phased --numThreads ${n_thread} \
--vcf=${TEMPDIR}/temp.vcf.gz > ${TEMPDIR}/eagle.out.txt 2>&1

# Copy results back to hard disk
cp ${TEMPDIR}/Phased.vcf.gz ${out_prefix}.Phased.vcf.gz
cp ${TEMPDIR}/eagle.out.txt ${out_prefix}.eagle.out.txt

# Remove temporary directory
rm -f -r ${TEMPDIR}

exit

Submit Array Jobs

An array job is a convenient way to submit many repetitive jobs that differ by only one input variable, e.g., 10,000 repetitive simulations, or association studies of all ~20K genome-wide genes. For example, the following commands submit an array job that segments genome-wide VCF files into 1,703 regions:

# WGS VCF file directory
vcf_dir=/mnt/YangFSS/data2/AMP-AD/ROSMAP/WGS_JointCall

# bed file directory
bed_file=/mnt/YangFSS/data2/AMP-AD/ROSMAP/WGS_JointCall/LDdetect_SegmentedVCF/lddetect_1KG_EUR.bed

# output directory
out_dir=/mnt/YangFSS/data2/AMP-AD/ROSMAP/WGS_JointCall/LDdetect_SegmentedVCF

# Submit array job
qsub -q b.q -j y -cwd -N segVCF -t 1-1703 -tc 100 \
/home/jyang/Scripts/Shell_Scripts/example_array_job_segment_VCF.sh ${vcf_dir} ${bed_file} ${out_dir}
The wrapped script /home/jyang/Scripts/Shell_Scripts/example_array_job_segment_VCF.sh looks like the following:

#!/usr/bin/bash

vcf_dir=$1
seg_bed=$2
out_dir=$3

module load tabix/0.2.6

## Obtain the corresponding input line (segment coordinates) from the ${seg_bed} file, indexed by the array task ID ${SGE_TASK_ID}

line=`head -n ${SGE_TASK_ID} $seg_bed | tail -n1`
echo Segment VCF for $line

######## Create temp directory
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`
TEMPDIR="/scratch/${JOB_ID}_${SGE_TASK_ID}_${TMP_NAME}"

if [ -e $TEMPDIR ]; then
  echo "Error. Scratch dir already exists on `hostname`"
  exit
else
   mkdir $TEMPDIR
fi

######### Segment VCFs 
## Write results to temp directory
## You may also copy your input data files to temp directory if needed

chr=$(echo ${line} | awk '{print $1}')
pos_start=$(echo ${line} | awk '{print $2}')
pos_end=$(echo ${line} | awk '{print $3}')
seg_name=$(echo ${line} | awk '{print $4}')
echo $chr $pos_start $pos_end $seg_name

echo VCF ${vcf_dir}/NIA_JG_1898_samples_GRM_WGS_b37_JointAnalysis01_2017-12-08_${chr}.recalibrated_variants.vcf.gz

## Grep the header line
tabix -h ${vcf_dir}/NIA_JG_1898_samples_GRM_WGS_b37_JointAnalysis01_2017-12-08_${chr}.recalibrated_variants.vcf.gz | grep "CHROM" > ${TEMPDIR}/${seg_name}.vcf

## Extract genotypes within the segment by tabix, then bgzip and index the segment VCF
tabix ${vcf_dir}/NIA_JG_1898_samples_GRM_WGS_b37_JointAnalysis01_2017-12-08_${chr}.recalibrated_variants.vcf.gz ${chr}:${pos_start}-${pos_end} >> ${TEMPDIR}/${seg_name}.vcf
bgzip -f ${TEMPDIR}/${seg_name}.vcf
tabix -f ${TEMPDIR}/${seg_name}.vcf.gz

######## Copy results back to output directory  
echo copy results back to $out_dir
cp -f ${TEMPDIR}/${seg_name}.vcf.gz ${out_dir}/
cp -f ${TEMPDIR}/${seg_name}.vcf.gz.tbi ${out_dir}/

######## REMOVE temporary data directory
rm -fr ${TEMPDIR}

exit
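
The -t 1-1703 range should match the number of segments listed in the bed file; a quick sanity check before submitting:

wc -l ${bed_file}    # should print 1703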

Submit an R job

HGCC requires wrapping R scripts in a bash script and then submitting the bash script as a job. The following example commands submit a job for an R script wrapped in a bash script.

qsub -q b.q -cwd -j y -pe smp 1 /home/jyang/Scripts/R_scripts/example_Rjob.sh 
The wrapper script /home/jyang/Scripts/R_scripts/example_Rjob.sh looks like the following:

#!/usr/bin/bash

module load R/3.4.3

# running the R script directly requires it to be executable (chmod 755) with the Rscript shebang below
echo Run R script directly
/home/jyang/Scripts/R_scripts/example_Rjob.r 1 2 3

echo Run R script by Rscript command
Rscript /home/jyang/Scripts/R_scripts/example_Rjob.r 1 2 3

exit
The R script /home/jyang/Scripts/R_scripts/example_Rjob.r looks like the following:

#!/usr/bin/env Rscript

Sys.setlocale("LC_ALL", "C")
.libPaths("/home/jyang/R/x86_64-pc-linux-gnu-library/3.4")
options(stringsAsFactors=F)

library(glmnet)

### Load input arguments
args=(commandArgs(TRUE))
print(args)
if(length(args)==0) {
        stop("Error: No arguments supplied!")
} else {
        var1 = args[[1]]
        var2 = args[[2]]
        var3 = args[[3]]
}

print(list(var1 = var1, var2 = var2, var3 = var3))

Submit a Python job

An example Python script (saved here as example_job.py, a hypothetical name) looks like the following:

#!/usr/bin/python

import sys
import io
import os
import pandas as pd
import numpy as np
from dfply import *

# a trivial computation to confirm the script runs
a = 2 + 4
print(a)
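
As with R, wrap the Python script in a bash script and submit it with qsub. A minimal sketch, assuming the script above is saved as example_job.py (the module name python/3.x and the file names are assumptions; check module avail):

#!/usr/bin/bash

# hypothetical wrapper script, e.g., example_pyjob.sh
module load python/3.x   # assumed module name; check module avail
python example_job.py

exit

Submit it the same way as the other examples:

qsub -q b.q -cwd -j y -pe smp 1 example_pyjob.sh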

Number of cores to request

As a rule of thumb, request as many cores with -pe smp N as the threads your program will actually use (e.g., -pe smp 4 matches --numThreads 4 in the phasing example above), so the scheduler neither oversubscribes the node nor leaves requested cores idle.

Additional Resources