Request an interactive session with srun -N 1 -n 1 --pty --preserve-env bash, and load software with spack load [software]. You can also work through Visual Studio Code; see instructions in Introductions to HGCC.
Load the R module by spack load r@4.4.0, then type R in the interactive session to initiate an R session.
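Putting these steps together, a minimal interactive R session looks like the following (a sketch using only the commands given above):
# Request a 1-node, 1-task interactive shell on a compute node
srun -N 1 -n 1 --pty --preserve-env bash
# Load R and start an interactive R session
spack load r@4.4.0
R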
For parallel computation with the R packages parallel and foreach, see Quick Instructions of Parallel Computation. With Rtools installed on VSCode, command + return on Mac will send your R command line to the terminal to run.
To use the shared R library, run the following .libPaths() command in every new R session:
.libPaths(c("/nfs/yangfss2/data/shared/Rlibs", .libPaths()))
Alternatively, put this .libPaths() command in your ~/.Rprofile file to set it up automatically whenever a new R session starts. You can also add export R_LIBS_USER=/nfs/yangfss2/data/shared/Rlibs:${R_LIBS} to your ~/.bashrc file to set it up in every new bash session, or add R_LIBS_USER=/nfs/yangfss2/data/shared/Rlibs:${R_LIBS} to your ~/.Renviron file, which likewise sets it up automatically when a new R session starts. Library paths are separated by : in these commands; the path listed first will be the system's first choice for installing libraries.
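For example, a one-time setup of these files could look like the following (a minimal sketch; the shared library path is the one given above):
# Add the shared library path to ~/.Rprofile for every new R session
echo '.libPaths(c("/nfs/yangfss2/data/shared/Rlibs", .libPaths()))' >> ~/.Rprofile
# Or export R_LIBS_USER in every new bash session
echo 'export R_LIBS_USER=/nfs/yangfss2/data/shared/Rlibs:${R_LIBS}' >> ~/.bashrc
source ~/.bashrc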
Connect to HGCC on Visual Studio Code and open your R script in the VSCode window.
Install the R Tools, R, and R Extension Pack extensions. Set up auto-login using an identity file so you do not need to type a password (a sketch is given below).
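A minimal sketch of this auto-login setup, run from your local machine; hgcc.example.edu is a hypothetical placeholder for the actual HGCC login host, and jyang51 stands for your own user name:
# Generate a key pair (once) and copy the public key to the cluster
ssh-keygen -t ed25519
ssh-copy-id jyang51@hgcc.example.edu
# Add a host entry so VSCode Remote-SSH can log in with the identity file
cat >> ~/.ssh/config <<'EOF'
Host hgcc
    HostName hgcc.example.edu
    User jyang51
    IdentityFile ~/.ssh/id_ed25519
EOF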
Click R in the bottom right of the VSCode window to attach a remote R session. By default this R session is launched on the gateway admin node. Install any extension or R libraries you are prompted to install for using this function. Run Sys.info() in the current R session to check the nodename it is running on. Use command + enter on Mac to send the R commands in your editor to the bottom R session. Click the R sign on the left side of the VSCode window to view information about the current R environment. Tips: save plots to files with ggplot2::ggsave("temp.pdf").
Python 3.11.7, anaconda3@2023.09-0, and miniconda3@24.3.0 are installed on the cluster. First load one of these modules.
Interactive Python session: run spack load miniconda3@24.3.0. Python 3.12.2 comes with the miniconda3@24.3.0 module. Run conda init; source ~/.bashrc; once before using conda. Type python in the interactive session to initiate a Python session.
Use a virtual environment so that you have the access rights to install Python libraries.
For example, create a virtual environment under /home/jyang51/jyang51/local/ with conda create --prefix /home/jyang51/jyang51/local/myenv numpy pandas. Now Python 3.14.0 comes with this virtual environment by default. List your environments with conda env list. Activate the environment with conda activate /home/jyang51/jyang51/local/myenv. Install packages into it with conda install -n myenv [package]. Deactivate it with conda deactivate, and remove it with conda remove -n myenv --all. See conda -h for more options.
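Putting the conda steps above together (a sketch; myenv, numpy, and pandas are just the examples used above):
spack load miniconda3@24.3.0
conda init; source ~/.bashrc   # one-time setup for conda
# Create, list, and activate a personal environment
conda create --prefix /home/jyang51/jyang51/local/myenv numpy pandas
conda env list
conda activate /home/jyang51/jyang51/local/myenv
python -c "import numpy, pandas; print(numpy.__version__)"   # quick check that packages load
conda deactivate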
Use Visual Studio Code to edit and test Python code on HGCC. Install the Python extension. Set up auto-login using an identity file so you do not need to type a password (see the SSH setup sketch above). Open your Python script (.py) in VSCode. Click python xx.xx in the bottom right of the VSCode window to select the Python version that you want to use; prefer the virtual environment you set up for yourself. By default, the system python3.11.7 is launched. Install any extension or Python libraries you are prompted to install for using this function. Use shift + enter on Mac to send the Python commands in your editor to the bottom Python session in the terminal.
PLINK is a powerful and useful genetic data analysis tool for converting data formats, performing QC, calculating MAF and HWE p-values, calculating kinship matrices, computing top principal components, and conducting single-variant genetic association studies. The common data format is BED/BIM/FAM.
Load PLINK2 with spack load plink2@2.00a4.3. Check available options with plink2 -h. Unload it with spack unload plink2@2.00a4.3.
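A sketch of common PLINK2 tasks mentioned above; example.vcf.gz and the output prefixes are hypothetical placeholders for your own files:
spack load plink2@2.00a4.3
# Convert a VCF to the PLINK binary format (BED/BIM/FAM)
plink2 --vcf example.vcf.gz --make-bed --out example
# Allele frequencies and HWE p-values
plink2 --bfile example --freq --hardy --out example_qc
# Top 10 principal components
plink2 --bfile example --pca 10 --out example_pca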
BCFTools is a fast tool to manipulate sorted/bgzipped/tabixed VCF files of genotype data.
Load tabix with spack load tabix@2013-12-16 and BCFTools with spack load bcftools@1.19. Check available options with bcftools -h.
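For example, to index and query a VCF (example.vcf is a hypothetical input; bgzip is assumed to come with the tabix module):
spack load tabix@2013-12-16
spack load bcftools@1.19
# bgzip and index the VCF so it can be queried by region
bgzip example.vcf
tabix -p vcf example.vcf.gz
# View records in a region and compute summary statistics
bcftools view -r chr1:1-1000000 example.vcf.gz | head
bcftools stats example.vcf.gz > example.stats.txt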
BEDTools utilities are a Swiss-army knife of tools that allow one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
Load BEDTools with spack load bedtools2@2.31.1. Check available options with bedtools -h.
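For instance, intersecting and merging interval files (a.bed and b.bed are hypothetical inputs):
spack load bedtools2@2.31.1
# Report intervals in a.bed that overlap intervals in b.bed
bedtools intersect -a a.bed -b b.bed > overlaps.bed
# Merge overlapping intervals after coordinate-sorting
sort -k1,1 -k2,2n a.bed | bedtools merge -i stdin > a.merged.bed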
Use sbatch to submit jobs; see more instructions at https://slurm.schedmd.com/sbatch.html. Set arguments to sbatch in a wrapper shell (job submission) script. For example, you may use the command sbatch norm_sim.sh to submit an array job that runs 10 simulations, with norm_sim.sh given as follows:
#!/bin/bash
#SBATCH --job-name=normal_sim ## specify job name
#SBATCH --nodes=1 ## request 1 node
#SBATCH --mem=8G ## request 8G memory
#SBATCH --cpus-per-task=4 ## request 4 cpus/cores per job
#SBATCH --time=24:00:00 ## specify job running time for 24 hrs
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out ## specify slurm output file directory
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err ## specify slurm error file directory
#SBATCH --array=1-10 ## submit 10 instances of the commands listed below, with task IDs 1 to 10
## The following commands will be run 10 times by 10 array jobs, each with its unique task ID given by $SLURM_ARRAY_TASK_ID, in {1..10}.
## Change working directory
cd /home/jyang51/yangfss2/public/ExampleData
## Create SLURM_OUT/ under the current working directory to save slurm output and error files
mkdir -p ./SLURM_OUT/
## Print SLURM array task ID $SLURM_ARRAY_TASK_ID (1..10)
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
## Load R software
spack load r@4.4.0
## Use Rscript to simulate a vector x from the standard normal distribution and write x to a text data file under /home/jyang51/yangfss2/public/ExampleData/
## Use the SLURM array task ID to save unique output data files, or to configure the job.
Rscript /home/jyang51/yangfss2/public/ExampleScripts/norm_sim.R /home/jyang51/yangfss2/public/ExampleData/ $SLURM_ARRAY_TASK_ID
The norm_sim.R script is given as follows:
#!/usr/bin/env Rscript
Sys.setlocale('LC_ALL', 'C')
options(stringsAsFactors=F)
###############
args=(commandArgs(TRUE))
print(args)
if(length(args)==0) {
stop("Error: No arguments supplied!")
} else {
out_prefix = args[[1]]
n_sim = args[[2]]
}
x = rnorm(100, mean = 0, sd = 1)
print(paste("mean of simulated data =", mean(x)))
print(paste("standard deviation of simulated data =", sd(x)))
print("Write simulated data to a text file:")
write.table(data.frame(x = x), file = paste0(out_prefix, "sim_", n_sim, ".txt"), quote = FALSE, sep = "\t", row.names = FALSE)
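To run this example end to end, you can submit the wrapper script and then inspect the logs and output files (a sketch; it assumes norm_sim.sh sits in your current working directory):
cd /home/jyang51/yangfss2/public/ExampleData
mkdir -p ./SLURM_OUT/
sbatch norm_sim.sh                          # submit the array job
squeue -u $USER                             # watch its progress
cat ./SLURM_OUT/normal_sim_*_1.out          # log of array task 1
head sim_1.txt                              # simulated data written by task 1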
Use squeue to display the current Slurm job queue. Use scancel [jobid] to cancel/kill a job. Use scontrol to show information about running or pending jobs; for example, scontrol show job [jobid] shows system details of a submitted job. Use sinfo to report the state of the cluster partitions and nodes.
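Typical monitoring commands, with 12345678 as a placeholder job ID:
squeue -u $USER              # list only your own jobs
scontrol show job 12345678   # detailed information for one job
scancel 12345678             # cancel that job
sinfo                        # state of the partitions and nodes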
Create an example bash script by touch star_genome_create.sh and write the following commands into this file (see the example shell script in /home/jyang51/yangfss2/public/ExampleScripts/star_genome_create.sh on HGCC):
#!/bin/bash
#SBATCH --job-name=STAR_genomeGenerate
#SBATCH --nodes=1
#SBATCH --mem=128G ## request memory
#SBATCH --cpus-per-task=16 ## request cpu/cores
#SBATCH --array=1 ## a single array task; or remove this line if you are not submitting an array job
#SBATCH --time=24:00:00 ## specify job running time for 24 hrs
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out ## save slurm output
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err ## save slurm errors
spack load miniconda3@24.3.0
conda activate /home/jyang51/jyang51/local/myenv # assumes STAR is installed under your virtual environment
## Or load the STAR module
### Create genome index
echo Running STAR genomeGenerate ...
STAR --runThreadN 16 --runMode genomeGenerate \
--genomeDir /home/jyang51/yangfss2/public/GenomeIndex/star_indexes/hg38 \
--genomeFastaFiles /home/jyang51/yangfss2/public/GenomeIndex/iGenome/hg38_2021/genome.fa \
--sjdbGTFfile /home/jyang51/yangfss2/public/GenomeIndex/iGenome/hg38_2021/gencode.v46.basic.annotation.gtf \
--sjdbOverhang 150
conda deactivate
exit
This bash script creates a genome index with STAR. Use echo to print out log messages or variable contents to debug and check the job status. Make the script executable with chmod 755 star_genome_create.sh.
Submit the job with:
sbatch /home/jyang51/yangfss2/public/ExampleScripts/star_genome_create.sh
Use the /scratch space for I/O-intensive jobs. If one wants to use the /scratch space to improve computation efficiency by avoiding heavy I/O, the bash script needs to be updated to include commands that create a temporary directory ${TMPDIR} under the /scratch space (84TB, shared).
#!/bin/bash
# Generate tmp directory name
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`
# Create tmp directory
# TMPDIR="/scratch/${SLURM_JOB_ID}_${TMP_NAME}" # include slurm job id in the name
TMPDIR="/scratch/${TMP_NAME}"
echo $TMPDIR
mkdir -p "$TMPDIR"
# Copy input data files into the temporary directory
rsync /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt ${TMPDIR}/
# Run the following command under the temporary directory
cd ${TMPDIR}/
paste Sample_ID.txt Sample_ID.txt > output_sample_ID_2.txt
# Copy results back to hard disk
rsync ${TMPDIR}/output_sample_ID_2.txt /home/jyang51/yangfss2/public/ExampleData/
# Remove temporary directory
rm -f -r ${TMPDIR}
exit
Copy input data files into the temporary directory, and copy results back to the hard disk, with rsync. Run the analysis commands with input data files under the temporary directory.
Array jobs are a convenient way to submit multiple repetitive jobs that differ only by one input variable, e.g., 10,000 repetitive simulations, or association studies for all ~20K genome-wide genes.
The following example command submits an array job that runs the same bash script star_sbatch.sh 52 times, with the only difference being the Slurm array task ID $SLURM_ARRAY_TASK_ID, which is used to look up the corresponding sample ID:
sbatch /home/jyang51/yangfss2/public/ExampleScripts/star_sbatch.sh /home/jyang51/yangfss2/public/ExampleData
star_sbatch.sh has the following contents:
#!/bin/bash
#SBATCH --job-name=MAP_STAR
#SBATCH --nodes=1
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
#SBATCH --array=1-52
#SBATCH --time=24:00:00
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err
#### Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
#### Take output directory from first input argument
output_dir=$1
echo "Output file directory: $output_dir".
## define data directory
data_dir=/home/jyang51/yangfss2/projects/Bill_Li_RNAseq/BC_Biomarker_Normal_Biopsy/LBI13454-118133
# Generate tmp directory name
TMP_NAME=`/usr/bin/mktemp -u XXXXXXXX`
# Create tmp directory
# TMPDIR="/scratch/${SLURM_JOB_ID}_${TMP_NAME}" # include slurm job id in the name
TMPDIR="/scratch/${TMP_NAME}"
echo $TMPDIR
mkdir -p "$TMPDIR"
cd $TMPDIR
## Determine sample ID
sample=$(head -n ${SLURM_ARRAY_TASK_ID} /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt | tail -n1)
## Copy raw fastq files to the temp directory under /scratch
rsync ${data_dir}/RawData/H7JNJDSXC_s1_1_SM_${sample}.fastq.gz ${TMPDIR}/
rsync ${data_dir}/RawData/H7JNJDSXC_s1_2_SM_${sample}.fastq.gz ${TMPDIR}/
###### Scripts that will be run 52 times to map 52 samples
## The sample ID will be determined with the given Array_Task_ID from 1 to 52;
## Three input variables are taken by the star_map.sh script;
### Use STAR 2.7.11a: either load the STAR module by spack
# spack load star@2.7.11a
### Or install STAR under your virtual environment, e.g., conda install bioconda::star, and activate your virtual python environment.
spack load miniconda3@24.3.0
conda activate ~/.conda/envs/myenv
## create temp directory to save map output files
mkdir -p ${TMPDIR}/MAP_OUT/
# Allow use of 64GB memory for BAM sorting
n_threads=8 ## number of threads; matches --cpus-per-task=8 requested above
STAR --genomeDir /home/jyang51/yangfss2/public/GenomeIndex/star_indexes/hg38/ \
--runThreadN ${n_threads} \
--limitBAMsortRAM 64000000000 --genomeLoad NoSharedMemory \
--readFilesIn ${TMPDIR}/H7JNJDSXC_s1_1_SM_${sample}.fastq.gz ${TMPDIR}/H7JNJDSXC_s1_2_SM_${sample}.fastq.gz \
--outFileNamePrefix ${TMPDIR}/MAP_OUT/${sample}. \
--readFilesCommand zcat \
--sjdbInsertSave All \
--quantMode TranscriptomeSAM GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped None \
--outSAMattributes Standard
## Copy output files to the output directory
rsync -av ${TMPDIR}/MAP_OUT/ ${output_dir}/
conda deactivate
rm -fr ${TMPDIR}/
exit
The sample ID for each array task is read from /home/jyang51/yangfss2/public/ExampleData/Sample_ID.txt using the Slurm array task ID $SLURM_ARRAY_TASK_ID, and all intermediate files are written to the temporary directory under /scratch/.