Run Computation Tools and Submit Jobs on HGCC

Test commands in an interactive session on a computation node

  1. Open an interactive bash session on a computation node with srun -N 1 -n 1 --pty --preserve-env bash.
  2. Load required software modules with spack load [software].
  3. Mount the remote HGCC directory on your local computer. Visual Studio Code is highly recommended for this; see the instructions in Introductions to HGCC.
  4. Write scripts with a local text editor such as Visual Studio Code or Sublime.
  5. Copy and paste commands into the interactive bash session, or use the terminal at the bottom of your VSCode window. An example session is shown below.
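For example, a minimal interactive session might look like the following (the plink module is just an illustration; load whatever software you need):

srun -N 1 -n 1 --pty --preserve-env bash   # get an interactive shell on a compute node
spack load plink                           # load a required module (example name)
plink --version                            # test your command
exit                                       # release the compute node when done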

R
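A minimal sketch of running R on a compute node; the r@4.4.0 version matches the examples later on this page, but check spack find r for the versions actually installed:

spack load r@4.4.0        # load R (version as used in the examples below)
R                         # start an interactive R console
Rscript my_analysis.R     # or run a script non-interactively (file name is hypothetical)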

Python/Conda
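A minimal sketch, assuming miniconda3@24.3.0 is available via spack (as in the examples below); the environment name and Python version are placeholders:

spack load miniconda3@24.3.0            # load conda
conda create -n myenv python=3.11       # one-time: create a virtual environment
conda activate myenv                    # activate it
python my_script.py                     # run your Python code
conda deactivate                        # deactivate when done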

PLINK

PLINK is a powerful genetic data analysis tool that can convert between data formats, perform QC, calculate MAF and HWE p-values, compute kinship matrices and top principal components, and conduct single-variant genetic association studies. Its common data format is the binary BED/BIM/FAM trio.
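A few typical PLINK 1.9 invocations, sketched with hypothetical file names:

plink --vcf input.vcf.gz --make-bed --out mydata                # convert VCF to BED/BIM/FAM
plink --bfile mydata --maf 0.01 --geno 0.05 --hwe 1e-6 \
      --make-bed --out mydata_qc                                # basic QC filters
plink --bfile mydata_qc --freq --hardy --out mydata_stats       # MAF and HWE p-values
plink --bfile mydata_qc --pca 10 --out mydata_pca               # top 10 principal components
plink --bfile mydata_qc --pheno pheno.txt --linear --out assoc  # single-variant association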

BCFTools

BCFTools is a fast toolkit for manipulating sorted, bgzipped, and tabix-indexed VCF/BCF files of genotype data.
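For example (file names are hypothetical; bgzip ships with htslib alongside bcftools):

bgzip genotypes.vcf                                   # bgzip-compress a VCF
bcftools index -t genotypes.vcf.gz                    # create a tabix (.tbi) index
bcftools view -r chr1:100000-200000 genotypes.vcf.gz  # extract records in a region
bcftools stats genotypes.vcf.gz > genotypes.stats     # per-file summary statistics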

BEDTools

The BEDTools utilities are a Swiss-army knife that allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool performs a relatively simple task (e.g., intersecting two interval files), quite sophisticated analyses can be conducted by combining multiple BEDTools operations on the UNIX command line.
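For example (interval and alignment files are hypothetical):

bedtools intersect -a a.bed -b b.bed > overlaps.bed            # intervals in a.bed overlapping b.bed
sort -k1,1 -k2,2n raw.bed | bedtools merge -i - > merged.bed   # merge overlapping intervals (sorted input)
bedtools coverage -a regions.bed -b reads.bam > coverage.txt   # read coverage per interval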

Submit Jobs by SLURM

Basic SLURM commands
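The everyday commands are sbatch (submit), squeue (monitor), sacct (accounting), and scancel (cancel); a quick sketch with a hypothetical script and job ID:

sbatch my_job.sh        # submit a batch script; prints the job ID (e.g., 12345)
squeue -u $USER         # list your pending and running jobs
sacct -j 12345          # accounting info (state, runtime) for a finished job
scancel 12345           # cancel a job; scancel 12345_3 cancels one array task

An example batch script for an array job: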

#!/bin/bash
#SBATCH --job-name=normal_sim ## specify job name
#SBATCH --nodes=1 ## request 1 node 
#SBATCH --mem=8G ## request 8G memory
#SBATCH --cpus-per-task=4 ## request 4 cpus/cores per job
#SBATCH --time=24:00:00 ## specify job running time for 24 hrs
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out ## slurm output file (%x = job name, %A = job ID, %a = array task ID)
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err ## slurm error file
#SBATCH --array=1-10 ## submit 10 instances of the commands below, with task IDs 1..10

## The following commands are run 10 times by 10 array tasks. Each task gets a unique ID in $SLURM_ARRAY_TASK_ID, here in {1..10}.

## Change working directory
cd /home/jyang51/yangfss2/public/ExampleData

## Create SLURM_OUT/ under the current working directory to save slurm output and error files.
## Note: SLURM needs this directory to exist at submission time to write the .out/.err files; the mkdir here only helps future submissions.
mkdir -p ./SLURM_OUT/

## Print SLURM array task ID $SLURM_ARRAY_TASK_ID (1..10)
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

## Load R software
spack load r@4.4.0

## Run an Rscript that simulates a vector x from the standard normal distribution and writes x to a text file under /home/jyang51/yangfss2/public/ExampleData/.
## Use the SLURM array task ID to create unique output file names, or to configure the job.
Rscript /home/jyang51/yangfss2/public/ExampleScripts/norm_sim.R /home/jyang51/yangfss2/public/ExampleData/ $SLURM_ARRAY_TASK_ID
The R script /home/jyang51/yangfss2/public/ExampleScripts/norm_sim.R called above:

#!/usr/bin/env Rscript
Sys.setlocale('LC_ALL', 'C')
options(stringsAsFactors=F)

############### Parse command-line arguments
args = commandArgs(TRUE)
print(args)
if (length(args) == 0) {
  stop("Error: No arguments supplied!")
} else {
  out_prefix = args[[1]]  # output directory/prefix
  n_sim = args[[2]]       # SLURM array task ID, used to make the output file name unique
}

# Simulate 100 draws from the standard normal distribution
x = rnorm(100, mean = 0, sd = 1)
print(paste("mean of simulated data =", mean(x)))
print(paste("standard deviation of simulated data =", sd(x)))
print("Write simulated data to a text file:")
write.table(data.frame(x = x), file = paste0(out_prefix, "sim_", n_sim, ".txt"), quote = FALSE, sep = "\t", row.names = FALSE)
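Assuming the batch script above is saved as, say, norm_sim_sbatch.sh (a hypothetical file name), it can be submitted and checked as follows; remember that SLURM_OUT/ must exist before submission:

mkdir -p ./SLURM_OUT/
sbatch norm_sim_sbatch.sh
squeue -u $USER
ls /home/jyang51/yangfss2/public/ExampleData/sim_*.txt   # one output file per array task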

Another example: submitting jobs to map raw RNA-seq data with STAR

First create the STAR genome index. Create an example bash script with touch star_genome_create.sh and write the following commands into this file (see the example shell script in /home/jyang51/yangfss2/public/ExampleScripts/star_genome_create.sh on HGCC):

#!/bin/bash
#SBATCH --job-name=STAR_genomeGenerate
#SBATCH --nodes=1
#SBATCH --mem=128G ## request memory
#SBATCH --cpus-per-task=16 ## request cpu/cores 
#SBATCH --array=1 ## a single task; remove this line if you are not submitting an array job
#SBATCH --time=24:00:00 ## specify job running time for 24 hrs
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out ## save slurm output 
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err ## save slurm errors

spack load miniconda3@24.3.0
conda activate /home/jyang51/jyang51/local/myenv # assuming STAR is installed in your virtual environment
## Or load the STAR module

### Create genome index
echo Running STAR genomeGenerate ...
STAR --runThreadN 16 --runMode genomeGenerate \
--genomeDir /home/jyang51/yangfss2/public/GenomeIndex/star_indexes/hg38 \
--genomeFastaFiles /home/jyang51/yangfss2/public/GenomeIndex/iGenome/hg38_2021/genome.fa \
--sjdbGTFfile /home/jyang51/yangfss2/public/GenomeIndex/iGenome/hg38_2021/gencode.v46.basic.annotation.gtf \
--sjdbOverhang 150

conda deactivate
exit
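The script can then be submitted from a directory containing SLURM_OUT/:

mkdir -p ./SLURM_OUT/
sbatch star_genome_create.sh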

Remarks on using the /scratch space for I/O-intensive jobs

If you want to use the /scratch space (84 TB, shared) to improve computation efficiency by avoiding heavy I/O on the data drives, the bash script needs to be updated to create a temporary directory ${TMPDIR} under /scratch, as in the example below, and to follow the workflow summarized after it.

#!/bin/bash

# Generate tmp directory name
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`

# Create tmp directory
# TMPDIR="/scratch/${SLURM_JOB_ID}_${TMP_NAME}" # include slurm job id in the name
TMPDIR="/scratch/${TMP_NAME}"
echo $TMPDIR
mkdir -p "$TMPDIR"

# Copy input data files into the temporary directory
rsync /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt ${TMPDIR}/

# Run the following command under the temporary directory
cd ${TMPDIR}/
paste Sample_ID.txt Sample_ID.txt  > output_sample_ID_2.txt

# Copy results back to hard disk
rsync ${TMPDIR}/output_sample_ID_2.txt /home/jyang51/yangfss2/public/ExampleData/

# Remove temporary directory
rm -f -r ${TMPDIR}
exit
In summary, the workflow is:

  1. Copy input data files from your data drive to the temporary directory by rsync.
  2. Run the analysis commands with the input data files under the temporary directory.
  3. Write results to the temporary directory.
  4. Copy results back to your data drive.
  5. Erase the created temporary directory.

Submit Array Jobs

Submit the example array job by:

sbatch /home/jyang51/yangfss2/public/ExampleScripts/star_sbatch.sh /home/jyang51/yangfss2/public/ExampleData

The script star_sbatch.sh takes the output directory as its first argument and contains the following:

#!/bin/bash
#SBATCH --job-name=MAP_STAR
#SBATCH --nodes=1
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
#SBATCH --array=1-52
#SBATCH --time=24:00:00
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err

#### Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

#### Take output directory from first input argument
output_dir=$1
echo "Output file directory: $output_dir".

## define data directory
data_dir=/home/jyang51/yangfss2/projects/Bill_Li_RNAseq/BC_Biomarker_Normal_Biopsy/LBI13454-118133

# Generate tmp directory name
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`

# Create tmp directory
# TMPDIR="/scratch/${SLURM_JOB_ID}_${TMP_NAME}" # include slurm job id in the name
TMPDIR="/scratch/${TMP_NAME}"
echo $TMPDIR
mkdir -p "$TMPDIR"
cd $TMPDIR

## Determine sample ID
sample=$(head -n ${SLURM_ARRAY_TASK_ID} /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt | tail -n1)

## Copy raw fastq files to the temp directory under /scratch
rsync ${data_dir}/RawData/H7JNJDSXC_s1_1_SM_${sample}.fastq.gz ${TMPDIR}/
rsync ${data_dir}/RawData/H7JNJDSXC_s1_2_SM_${sample}.fastq.gz ${TMPDIR}/

###### The following commands are run 52 times to map 52 samples.
## The sample ID is determined by the array task ID, from 1 to 52;
## the output directory is taken as the script's first input argument.

### Use STAR 2.7.11a: either load the STAR module with spack
# spack load star@2.7.11a
### Or install STAR with `conda install bioconda::star` in your virtual environment, and activate that environment.
spack load miniconda3@24.3.0
conda activate ~/.conda/envs/myenv


## create temp directory to save map output files
mkdir -p ${TMPDIR}/MAP_OUT/

# Allow STAR to use up to 64 GB of RAM for BAM sorting (matching --mem=64G)
STAR --genomeDir /home/jyang51/yangfss2/public/GenomeIndex/star_indexes/hg38/ \
--runThreadN ${SLURM_CPUS_PER_TASK} \
--limitBAMsortRAM  64000000000 --genomeLoad NoSharedMemory \
--readFilesIn ${TMPDIR}/H7JNJDSXC_s1_1_SM_${sample}.fastq.gz  ${TMPDIR}/H7JNJDSXC_s1_2_SM_${sample}.fastq.gz  \
--outFileNamePrefix ${TMPDIR}/MAP_OUT/${sample}. \
--readFilesCommand zcat \
--sjdbInsertSave All \
--quantMode TranscriptomeSAM GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped None \
--outSAMattributes Standard

## Copy output files to the output directory (rsync needs -a/-r to recurse into MAP_OUT/)
rsync -av ${TMPDIR}/MAP_OUT/ ${output_dir}/

conda deactivate

rm -fr ${TMPDIR}/

exit
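Once the 52-task array is submitted, individual tasks can be monitored or cancelled by job ID and task ID (the job ID below is hypothetical):

squeue -u $USER                                      # array tasks show as <jobid>_<taskid>
sacct -j 12345 --format=JobID,State,Elapsed,MaxRSS   # per-task accounting
scancel 12345_7                                      # cancel only array task 7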