Passing $SCRATCH to an Input File

One of our users emailed me with this request:

I have been running lots of scripts on the HPC cluster and while most of them are working well, I have a problem with one of the cellranger pipelines, called cellranger multi. It requires two arguments, one is the job id and the other is a path to a configuration file, which contains a path to the genome reference and the path to the fastq files I’m processing. The problem I have is with the configuration file because I am listing the path as ${SCRATCH}/filename which cellranger can’t recognise because it doesn’t know what ${SCRATCH} is.

I was wondering if it’s possible to submit one script to create the ${SCRATCH} folder and copy the input data/genome across and then submit a second one transferring the conf file with the full path to the data/genome and using the cellranger multi command. Would this make sense, or do you know of a better way to do this?

One way to do this is to edit the configuration file from within your submission script using the “sed” stream editor.

In the example below our pretend user has a login of u778899 and the Job ID is 12345.hpcnode0.

The Task

This was the error they were getting when they submitted the job:

$ qsub cellranger-multi.sh
.......

Running preflight checks (please wait)..
Log message:
Your [gene-expression] reference does not contain the expected files, 
or they are not readable. Please check your reference folder on hpcnode04.

This was their submission script. Note: The example submission scripts here have been edited to remove parts that are not directly applicable to the problem.

user$ cat cellranger-multi.sh 
#!/bin/bash

#################################################
# Run Cell Ranger multi for Fixed RNA Profiling #
#################################################

# PBS Commands
#PBS -N Cellranger
#PBS -l ncpus=16
#PBS -l mem=128GB
#PBS -l walltime=06:00:00

# Create a unique scratch directory
SCRATCH="/scratch/778899_${PBS_JOBID%.*}"
mkdir ${SCRATCH}

# Change to the PBS working directory where qsub was started from.
cd ${PBS_O_WORKDIR}

# Copy input files from working directory to scratch directory.
cp multiconfig.conf ${SCRATCH}
cp Human_Transcriptome_GRCh38.csv ${SCRATCH}
cp -R fastqs ${SCRATCH}
cp -R refdata-GRCh38 ${SCRATCH}

# Change to the scratch directory
cd ${SCRATCH}

# Run cellranger multi
cellranger multi --id=output-multi \
    --conf=multiconfig.conf \
    --localcores=16 \
    --localmem=115

# Copy data back to the working directory.
cp -R output-multi ${PBS_O_WORKDIR}

# Change directory to working directory and remove scratch directory.
cd ${PBS_O_WORKDIR}
rm -r ${SCRATCH}

The above submission script is fine. The Cellranger error message says the “[gene-expression] reference does not contain the expected files, or they are not readable.” The “[gene-expression] reference” is set in the configuration file multiconfig.conf.

This was their configuration file.

user$ cat multiconfig.conf

[gene-expression]
ref, $SCRATCH/refdata-GRCh38
probe-set, $SCRATCH/Human_Transcriptome_GRCh38.csv
no-bam,true

[libraries]
fastq_id,fastqs,feature_types
Gene_Fixed, $SCRATCH/fastqs, Gene Expression

In the configuration file above, every path uses the $SCRATCH variable. And there lies the problem.

The $SCRATCH variable is set in the bash submission script, and whenever it’s used in that script the bash shell uses “variable expansion” to expand $SCRATCH to the full path, like /scratch/778899_12345. However, the text $SCRATCH in the configuration file is not expanded. It’s just treated as literal text. (Indeed, we would not want bash to change things in our files unless we explicitly command it to.)
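The difference can be seen with a small sketch run on the command line, outside PBS. The PBS_JOBID value is set by hand here to mimic the example Job ID, and the temporary file is a stand-in for the real configuration file; both are assumptions for illustration only.

```shell
#!/bin/bash
# Simulate the job ID that PBS would set inside a real job (assumed value).
PBS_JOBID="12345.hpcnode0"

# ${PBS_JOBID%.*} removes the shortest trailing ".something", leaving 12345.
SCRATCH="/scratch/778899_${PBS_JOBID%.*}"

# Inside the script, bash expands the variable.
echo "In the script: ${SCRATCH}"

# Write a one-line file containing the literal text $SCRATCH.
# The quoted 'EOF' stops bash from expanding it while writing the file.
demo_conf=$(mktemp)
cat > "${demo_conf}" <<'EOF'
ref, $SCRATCH/refdata-GRCh38
EOF

# Reading the file back shows the text unchanged; nothing expands it.
file_text=$(cat "${demo_conf}")
echo "In the file:   ${file_text}"
rm -f "${demo_conf}"
```

The first line printed shows the expanded path /scratch/778899_12345, while the second still shows the text $SCRATCH exactly as it was written. That unexpanded text is what cellranger reads from the configuration file.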

The Solution

We will use a very useful UNIX command called “sed” to edit a “template” configuration file and output a new configuration file which contains the correct path to our scratch directory. “sed” stands for stream editor. We have used sed previously, see Users Solutions / Submitting Multiple Jobs.

1. Write a Template Configuration File

Write a “template” for the configuration file called multiconfig.template. See the example below. Use a placeholder for the scratch directory. Here I have used the literal text “SCRATCH”. Note there is no $ sign.

user$ cat multiconfig.template

[gene-expression]
ref, SCRATCH/refdata-GRCh38
probe-set, SCRATCH/Human_Transcriptome_GRCh38.csv
no-bam,true

[libraries]
fastq_id,fastqs,feature_types
Gene_Fixed, SCRATCH/fastqs, Gene Expression

The “sed” command in the submission script will replace the literal text “SCRATCH” with the expanded value of the $SCRATCH variable and write the new file multiconfig.conf. When written it will look like this:

user$ cat multiconfig.conf

[gene-expression]
ref, /scratch/778899_12345/refdata-GRCh38
probe-set, /scratch/778899_12345/Human_Transcriptome_GRCh38.csv
no-bam,true

[libraries]
fastq_id,fastqs,feature_types
Gene_Fixed, /scratch/778899_12345/fastqs, Gene Expression

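You can see what sed does with this little command line test below. Run it and look at the output. The sed expression is in single quotes so that an interactive bash shell does not try to treat the ! as a history expansion.

$ cat multiconfig.template | sed 's/SCRATCH/HELLO!/'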
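If you do not have the template file to hand, the same test can be sketched with two sample lines fed straight to sed on standard input. The variable name result is only there for illustration.

```shell
#!/bin/bash
# Two sample lines taken from the template, fed to sed on standard input.
# Only the line that contains the text SCRATCH is changed.
result=$(sed 's/SCRATCH/HELLO!/' <<'EOF'
ref, SCRATCH/refdata-GRCh38
no-bam,true
EOF
)
echo "${result}"
```

The first line comes back as "ref, HELLO!/refdata-GRCh38" and the second line, which has no placeholder, passes through untouched.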

2. Modify our Submission Script

Your PBS submission script for this job will look like the example below.

user$ cat cellranger-multi.sh 

#!/bin/bash

#################################################
# Run Cell Ranger multi for Fixed RNA Profiling #
#################################################

# PBS Commands
#PBS -N Cellranger
#PBS -l ncpus=16
#PBS -l mem=128GB
#PBS -l walltime=06:00:00

# Create a unique scratch directory
SCRATCH="/scratch/778899_${PBS_JOBID%.*}"
mkdir ${SCRATCH}

# Change to the PBS working directory where qsub was started from.
cd ${PBS_O_WORKDIR}

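# Replace "SCRATCH" in the config file template with the expanded value
# of the $SCRATCH variable specified above. The sed delimiter is "|" rather
# than "/" because the value of $SCRATCH itself contains "/" characters.
cat multiconfig.template | sed "s|SCRATCH|${SCRATCH}|" > multiconfig.conf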

# Copy input files from working directory to scratch directory.
cp multiconfig.conf ${SCRATCH}
cp Human_Transcriptome_GRCh38.csv ${SCRATCH}
cp -R fastqs ${SCRATCH}
cp -R refdata-GRCh38 ${SCRATCH}

# Change to the scratch directory
cd ${SCRATCH}

# Run cellranger multi
cellranger multi --id=output-multi \
    --conf=multiconfig.conf \
    --localcores=16 \
    --localmem=115

# Copy data back to the working directory.
cp -R output-multi ${PBS_O_WORKDIR}

# Change directory to working directory and remove scratch directory.
cd ${PBS_O_WORKDIR}
rm -r ${SCRATCH}

This script differs from the original in just one place: the sed command that writes multiconfig.conf from the template.

When this submission script runs, it will all work :-)
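If you would like to rehearse the substitution before submitting the job, here is a minimal sketch that can be run on a login node. It assumes a hard-coded scratch path in place of the one PBS would build, and it works in a throwaway directory made with mktemp. One detail to watch when the replacement text is a path: the value of $SCRATCH contains / characters, so the sed expression here uses | as its delimiter.

```shell
#!/bin/bash
# Hard-coded stand-in for the path a real job would build (assumed value).
SCRATCH="/scratch/778899_12345"

# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cat > "${workdir}/multiconfig.template" <<'EOF'
[gene-expression]
ref, SCRATCH/refdata-GRCh38
probe-set, SCRATCH/Human_Transcriptome_GRCh38.csv
no-bam,true
EOF

# "|" is the sed delimiter because ${SCRATCH} contains "/" characters.
sed "s|SCRATCH|${SCRATCH}|" "${workdir}/multiconfig.template" > "${workdir}/multiconfig.conf"

# Keep the result in a variable, then report whether any placeholder survived.
conf_text=$(cat "${workdir}/multiconfig.conf")
if grep -q 'SCRATCH' "${workdir}/multiconfig.conf"; then
    echo "Placeholder still present, check the template." >&2
else
    echo "${conf_text}"
fi
rm -r "${workdir}"
```

The grep check is case sensitive, so the lowercase /scratch in the new path does not count as a leftover placeholder.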

References

You can read a bit about sed here. Unix crash course / Six glorious commands: https://astrobiomike.github.io/unix/six-glorious-commands#sed