Run Computation Tools and Submit Jobs on HGCC

Test commands in an interactive session on a computation node

  1. Open an interactive bash session on a computation node with srun -N 1 -n 1 --pty bash -i.
  2. Load required software modules with spack load [software].
  3. Mount the remote HGCC directory to your local computer.
  4. Write scripts using a local text editor such as Visual Studio Code or Sublime Text.
  5. Copy and paste commands into the interactive bash session to test them.
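The steps above might look like the following in practice (the sshfs mount point and hostname are placeholders for your own setup):

```shell
# 1. Request an interactive session on a computation node
srun -N 1 -n 1 --pty bash -i

# 2. Load the software you need, e.g. bcftools
spack load bcftools

# 3. (On your local machine) mount your HGCC home directory, e.g. with sshfs;
#    the mount point ~/hgcc and the hostname are placeholders
# sshfs username@hgcc-hostname:/home/username ~/hgcc

# 4-5. Edit scripts locally, then paste commands into the srun session to test them
```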

R
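A minimal sketch of running R on a computation node, assuming an R module is available through spack (the module name and script name are illustrative):

```shell
spack load r
Rscript my_analysis.R   # run an R script non-interactively
R                       # or open an interactive R console
```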

Python/Conda
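A sketch of setting up a personal conda environment with the miniconda3 module used elsewhere in this guide (the environment name myenv and the packages are examples):

```shell
spack load miniconda3@24.3.0

# One-time setup of a personal environment under your home directory
conda create -p ~/.conda/envs/myenv python=3.11 -y

conda activate ~/.conda/envs/myenv
pip install numpy pandas   # install whatever your analysis needs
conda deactivate
```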

PLINK

PLINK is a powerful and widely used genetic data analysis tool that can convert data formats, perform QC, calculate MAF and HWE p-values, compute kinship matrices and top principal components, and conduct single-variant genetic association studies. Its common data format is BED/BIM/FAM.
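For instance, a sketch of common PLINK operations on a BED/BIM/FAM fileset (mydata is a placeholder prefix for your own files):

```shell
# QC: keep variants with MAF >= 0.01 and HWE p >= 1e-6, write a new fileset
plink --bfile mydata --maf 0.01 --hwe 1e-6 --make-bed --out mydata_qc

# Report allele frequencies (MAF) per variant
plink --bfile mydata --freq --out mydata_freq
```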

BCFTools

BCFTools is a fast tool for manipulating sorted, bgzipped, and tabix-indexed VCF files of genotype data.
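For example, indexing and subsetting a bgzipped VCF (genotypes.vcf.gz and the region are placeholders):

```shell
# Build a tabix index for a bgzipped VCF
bcftools index -t genotypes.vcf.gz

# Extract a genomic region into a new bgzipped VCF
bcftools view -r chr1:1000000-2000000 genotypes.vcf.gz -Oz -o region.vcf.gz

# Quick summary statistics for the file
bcftools stats genotypes.vcf.gz
```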

BEDTools

BEDTools utilities are a Swiss-army knife of tools that allow one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple BEDTools operations on the UNIX command line.
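For example, two common operations chained on the command line (the input file names are placeholders):

```shell
# Report variants (VCF) that overlap annotated features (GFF)
bedtools intersect -a variants.vcf -b genes.gff > overlapping.vcf

# Merge overlapping intervals; merge requires input sorted by chrom, then start
sort -k1,1 -k2,2n regions.bed > regions.sorted.bed
bedtools merge -i regions.sorted.bed > merged.bed
```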

Submit Jobs by sbatch

Create a bash script and submit a single job

Create an example bash script with touch star_genome_create.sh and write the following commands into the file (see the example shell script at /home/jyang51/yangfss2/public/ExampleScripts/star_genome_create.sh on HGCC):

#!/bin/bash
#SBATCH --job-name=STAR_genomeGenerate
#SBATCH --nodes=1
#SBATCH --mem=128G ## request memory
#SBATCH --cpus-per-task=16 ## request cpu/cores 
#SBATCH --array=1 ## a single job; remove this line if not submitting an array job
#SBATCH --time=24:00:00 ## specify job running time for 24 hrs
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out ## save slurm output 
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err ## save slurm errors

spack load miniconda3@24.3.0
conda activate ~/.conda/envs/myenv

### Create genome index
echo Running STAR genomeGenerate ...
STAR --runThreadN 16 --runMode genomeGenerate \
--genomeDir /home/jyang51/yangfss2/public/GenomeIndex/star_indexes/hg38 \
--genomeFastaFiles /home/jyang51/yangfss2/public/GenomeIndex/iGenome/hg38_2021/genome.fa \
--sjdbGTFfile /home/jyang51/yangfss2/public/GenomeIndex/iGenome/hg38_2021/gencode.v46.basic.annotation.gtf \
--sjdbOverhang 150

conda deactivate

exit
  1. This example bash script generates the genome index of the human hg38 assembly for the alignment tool STAR, used for mapping RNAseq data.
  2. Use the echo command to print log messages or variable contents to debug and check job status.
  3. Make the shell script executable: chmod 755 star_genome_create.sh.
  4. Submit a job to run the shell script: sbatch /home/jyang51/yangfss2/public/ExampleScripts/star_genome_create.sh
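After submission, a job's status can be checked with standard SLURM commands (replace <jobid> with the ID printed by sbatch):

```shell
squeue -u $USER            # list your pending and running jobs
scontrol show job <jobid>  # detailed information for one job
scancel <jobid>            # cancel a job
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS  # accounting after completion
```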

Remarks about using the /scratch space for I/O-intensive jobs

If one wants to use the /scratch space to improve computation efficiency by avoiding heavy I/O, the bash script needs to be updated with commands that create a temporary directory ${TMPDIR} under the /scratch space (84TB, shared):

#!/bin/bash

# Generate tmp directory name
TMP_NAME=$(/usr/bin/mktemp -u XXXXXXXX)

# Create tmp directory
# TMPDIR="/scratch/${SLURM_JOB_ID}_${TMP_NAME}" # include slurm job id in the name
TMPDIR="/scratch/${TMP_NAME}"
echo $TMPDIR
mkdir -p "$TMPDIR"

# Copy input data files into the temporary directory
rsync /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt ${TMPDIR}/

# Run the following command under the temporary directory
cd ${TMPDIR}/
paste Sample_ID.txt Sample_ID.txt  > output_sample_ID_2.txt

# Copy results back to hard disk
rsync ${TMPDIR}/output_sample_ID_2.txt /home/jyang51/yangfss2/public/ExampleData/

# Remove temporary directory
rm -f -r ${TMPDIR}

exit
  1. Copy input data files from your data drive to the temporary directory with rsync.
  2. Run the analysis commands on the input data files under the temporary directory.
  3. Write results to the temporary directory.
  4. Copy the results back to your data drive.
  5. Remove the temporary directory.
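The same pattern can be tried outside the cluster by substituting /tmp for /scratch (the paths and sample IDs here are illustrative):

```shell
#!/bin/bash
# Sketch of the /scratch workflow, using /tmp so it runs anywhere
TMP_NAME=$(mktemp -u XXXXXXXX)
TMPDIR="/tmp/${TMP_NAME}"
mkdir -p "$TMPDIR"

# Stage a small input file in place of rsync-ing real data
printf 'S1\nS2\nS3\n' > "${TMPDIR}/Sample_ID.txt"

# Work inside the temporary directory
cd "$TMPDIR"
paste Sample_ID.txt Sample_ID.txt > output_sample_ID_2.txt
RESULT=$(head -n 1 output_sample_ID_2.txt)  # first line: the ID pasted twice
cd - > /dev/null

# Copying results back (rsync ...) would go here; then remove the temp directory
rm -rf "$TMPDIR"
```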

Submit Array Jobs

Submit an array job of 52 tasks (one per sample), passing the output directory as the first argument:

sbatch /home/jyang51/yangfss2/public/ExampleScripts/star_sbatch.sh /home/jyang51/yangfss2/public/ExampleData

The example script /home/jyang51/yangfss2/public/ExampleScripts/star_sbatch.sh:
#!/bin/bash
#SBATCH --job-name=MAP_STAR
#SBATCH --nodes=1
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
#SBATCH --array=1-52
#SBATCH --time=24:00:00
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err

#### Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID

#### Take output directory from first input argument
output_dir=$1
echo "Output file directory: $output_dir"

## define data directory
data_dir=/home/jyang51/yangfss2/projects/Bill_Li_RNAseq/BC_Biomarker_Normal_Biopsy/LBI13454-118133

# Generate tmp directory name
TMP_NAME=$(/usr/bin/mktemp -u XXXXXXXX)

# Create tmp directory
# TMPDIR="/scratch/${SLURM_JOB_ID}_${TMP_NAME}" # include slurm job id in the name
TMPDIR="/scratch/${TMP_NAME}"
echo $TMPDIR
mkdir -p "$TMPDIR"
cd $TMPDIR

## Determine sample ID
sample=$(head -n ${SLURM_ARRAY_TASK_ID} /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt | tail -n1)

## Copy raw fastq files to the temp directory under /scratch
rsync ${data_dir}/RawData/H7JNJDSXC_s1_1_SM_${sample}.fastq.gz ${TMPDIR}/
rsync ${data_dir}/RawData/H7JNJDSXC_s1_2_SM_${sample}.fastq.gz ${TMPDIR}/

###### This section runs 52 times, mapping one of the 52 samples per array task
## The sample ID is determined by the SLURM_ARRAY_TASK_ID, which ranges from 1 to 52;
## The output directory is taken as the first input argument of this script;

### Use STAR 2.7.11a: either load the STAR module via spack
# spack load star@2.7.11a
### Or install STAR with `conda install bioconda::star` in your virtual environment, then activate that environment.
spack load miniconda3@24.3.0
conda activate ~/.conda/envs/myenv


## create temp directory to save map output files
mkdir -p ${TMPDIR}/MAP_OUT/

# Allow STAR to use up to 64GB memory for BAM sorting
STAR --genomeDir /home/jyang51/yangfss2/public/GenomeIndex/star_indexes/hg38/ \
--runThreadN ${SLURM_CPUS_PER_TASK} \
--limitBAMsortRAM  64000000000 --genomeLoad NoSharedMemory \
--readFilesIn ${TMPDIR}/H7JNJDSXC_s1_1_SM_${sample}.fastq.gz  ${TMPDIR}/H7JNJDSXC_s1_2_SM_${sample}.fastq.gz  \
--outFileNamePrefix ${TMPDIR}/MAP_OUT/${sample}. \
--readFilesCommand zcat \
--sjdbInsertSave All \
--quantMode TranscriptomeSAM GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped None \
--outSAMattributes Standard

## Copy output files to the output directory (-a to recurse and preserve attributes)
rsync -av ${TMPDIR}/MAP_OUT/ ${output_dir}/

conda deactivate

rm -fr ${TMPDIR}/

exit
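The head | tail pattern used above to map SLURM_ARRAY_TASK_ID to a sample ID can be checked locally (the sample file and IDs here are made up):

```shell
# Simulate the sample lookup performed inside the array job
printf 'sampleA\nsampleB\nsampleC\n' > /tmp/Sample_ID_demo.txt
SLURM_ARRAY_TASK_ID=2   # SLURM sets this automatically inside an array job

# Line N of the sample list corresponds to array task N
sample=$(head -n ${SLURM_ARRAY_TASK_ID} /tmp/Sample_ID_demo.txt | tail -n1)
echo "Task ${SLURM_ARRAY_TASK_ID} maps to sample: ${sample}"   # sampleB

rm /tmp/Sample_ID_demo.txt
```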