ESMFold

ESMFold is another class of structure prediction tool, based on a Protein Language Model (PLM). It does not require a multiple sequence alignment and uses only the sequence of the protein of interest. It was developed by Meta (formerly Facebook).

ESMFold's main limitation is GPU memory, as predictions need a lot of it (see the GPU Memory section below).

ESMFold is *really fast*: seconds for small sequences (up to ~100 residues) and minutes for bigger ones (5-10 minutes for an 800-residue protein).

Version

The installed version is v1.0.3, available from the GitHub repository https://github.com/facebookresearch/esm

Resources

To learn more about ESMFold, I highly recommend reading:

the preprint: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1
the GitHub repo: https://github.com/facebookresearch/esm
the available notebook: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/ESMFold.ipynb

Installation

The installation follows the same process as AlphaFold.
ESMFold is available on nodes node061, node062, node063 and node081.

ESMFold depends on several Python packages, so a conda environment named esmfold was created for it.

Usage

Use the same queues as alphafold: alphafold or alphafold2

See here: http://www-lbt.ibpc.fr/wiki/doku.php?id=cluster-lbt:extra-tools:alphafold_tool#queues

Since the main limitation of ESMFold is GPU memory, you should always use half a node for predictions.

Input file

ESMFold only accepts a FASTA file as input.

For monomer predictions, you can supply a multi-FASTA file; the sequences are processed as a batch (one after another).

For multimer predictions, supply a FASTA file in which the complex is written as a single sequence, with chains separated by a “:” character.
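
As an illustration, the two input layouts could be written as follows. This is only a sketch: the file names (monomers.fasta, dimer.fasta), the record names and the short sequences are made-up placeholders, not real proteins.

```shell
#!/bin/bash
# Hypothetical file names and toy sequences, for illustration only.

# Monomer batch: a multi-FASTA file, one record per protein.
cat > monomers.fasta <<'EOF'
>protein_A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>protein_B
GSHMLEDPVDAFKQELNRLS
EOF

# Multimer: a single record, chains joined by ":" on the sequence line.
cat > dimer.fasta <<'EOF'
>complex_AB
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ:GSHMLEDPVDAFKQELNRLS
EOF

# Quick sanity check: number of records, and number of chains in the multimer.
echo "monomer records: $(grep -c '^>' monomers.fasta)"
echo "chains in dimer: $(grep -v '^>' dimer.fasta | tr -cd ':' | wc -c | awk '{print $1 + 1}')"
```

Either file can then be passed to esmfold_inference.py with the -i option.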

GPU Memory

ESMFold uses a lot of GPU memory for predictions (like OmegaFold):

~500 MB for a 70-residue protein
~27 GB for an 800-residue protein

The GPUs installed in nodes node06X have ~10 GB of memory and those in node081 have ~48 GB.

However, ESMFold has an option called --chunk-size that decreases memory usage, at the cost of a longer prediction time.
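
Because memory usage grows with sequence length, it can help to know the longest sequence in a FASTA file before choosing a chunk size. The helper below is a minimal sketch and not part of ESMFold: the function name longest_seq_length is hypothetical, and the input is assumed to be a plain FASTA file such as query.fasta.

```shell
#!/bin/bash
# Hypothetical helper (not part of ESMFold): print the length of the longest
# sequence in a FASTA file. Sequences may span several lines; ":" chain
# separators are not counted as residues.
longest_seq_length() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; next }   # start of a new record
        { gsub(/[ \t\r:]/, ""); len += length($0) }        # accumulate residues
        END { if (len > max) max = len; print max + 0 }
    ' "$1"
}

# Example call, guarded so the snippet does nothing if the file is missing.
if [ -f query.fasta ]; then
    echo "longest sequence: $(longest_seq_length query.fasta) residues"
fi

# For a long sequence on a ~10 GB GPU, one of the values recommended in the
# help text (128, 64 or 32) could then be passed, for example:
#   esmfold_inference.py -i query.fasta -o outputdir/ --chunk-size 64
```

Which chunk size is actually needed for a given length has to be found by trial; no threshold is implied here.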

Running

The first time you use ESMFold, it downloads 3 weight files (esmfold_3B_v1.pt, esm2_t36_3B_UR50D.pt and esm2_t36_3B_UR50D-contact-regression.pt) and stores them in the ~/.cache/torch/hub/checkpoints directory.
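
To see whether the weights are already cached before submitting a job, a small check along these lines may be useful. It is only a sketch: the function name check_esmfold_weights is hypothetical, while the directory and file names are the ones given above.

```shell
#!/bin/bash
# Sketch: report which of the three ESMFold weight files are already cached.
# check_esmfold_weights is a hypothetical name; an alternative directory can
# be given as first argument.
check_esmfold_weights() {
    local dir="${1:-$HOME/.cache/torch/hub/checkpoints}"
    local missing=0
    local f
    for f in esmfold_3B_v1.pt \
             esm2_t36_3B_UR50D.pt \
             esm2_t36_3B_UR50D-contact-regression.pt; do
        if [ -f "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    # Return non-zero when at least one file is absent.
    [ "$missing" -eq 0 ]
}

check_esmfold_weights || echo "Some weights will be downloaded on the first run."
```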

A script called esmfold_inference.py is available:

(esmfold) [santuz@node081 ~]$ esmfold_inference.py -h
usage: esmfold_inference.py [-h] -i FASTA -o PDB [--num-recycles NUM_RECYCLES] [--max-tokens-per-batch MAX_TOKENS_PER_BATCH] [--chunk-size CHUNK_SIZE] [--cpu-only] [--cpu-offload]
 
optional arguments:
  -h, --help            show this help message and exit
  -i FASTA, --fasta FASTA
                        Path to input FASTA file
  -o PDB, --pdb PDB     Path to output PDB directory
  --num-recycles NUM_RECYCLES
                        Number of recycles to run. Defaults to number used in training (4).
  --max-tokens-per-batch MAX_TOKENS_PER_BATCH
                        Maximum number of tokens per gpu forward-pass. This will group shorter sequences together for batched prediction. Lowering this can help with out of memory issues, if these occur on short sequences.
  --chunk-size CHUNK_SIZE
                        Chunks axial attention computation to reduce memory usage from O(L^2) to O(L). Equivalent to running a for loop over chunks of of each dimension. Lower values will result in lower memory usage at the cost of speed. Recommended values: 128, 64, 32. Default: None.
  --cpu-only            CPU only
  --cpu-offload         Enable CPU offloading

Submission script

Below is an example of a submission script to run ESMFold computations.

Script version 21/11/2022

job_ESMFold.sh

#!/bin/bash
#PBS -S /bin/bash
#PBS -N ESMFold
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
 
#Half node always
#PBS -l nodes=1:ppn=8
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_hn
 
#script version 21.11.2022
 
### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###
WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)
if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
        echo "At least one required variable is not defined. Please contact your manager about it."
        exit 1
else
        if [ $NUM_NODES -le 1 ]; then
                WORKDIR+="scratch/"
                export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
                mkdir $WORKDIR
                rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
 
                # if you need to check your job output during execution (example: each hour) you can uncomment the following line
                # /shared/scripts/ADMIN__auto-rsync.example 3600 &
        else
                export WORKDIR=$PBS_O_WORKDIR
        fi
fi
 
echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)
 
cd $WORKDIR
 
# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
        export PSM_DEVICES=self,shm
        export OMPI_MCA_mtl=^psm
        export OMPI_MCA_btl=shm,self
else
# Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
# we have to share PSM contexts between processes
# CIN is here the number of cores in node
        CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
        if [ $(($CIN/16)) -ge 2 ]; then
                PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
                if [ $CIN -eq 40 ]; then
                        export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
                elif [ $CIN -eq 32 ]; then
                        export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
                else
                        echo "This computing node is not supported by this script"
                fi
                echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
        else
                echo "no PSM_SHAREDCONTEXTS_MAX to define"
        fi
fi
 
function get_gpu-ids() {
        if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
                echo "0,1" && return
        fi
 
        if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
                FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
        elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
                FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
        else
                FILE=""
        fi
 
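        # FILE is only set when a cpuset file exists; test for non-empty so an unset FILE falls through to "0,1"
        if [ -n "$FILE" ]; then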
                if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
                        echo "0" && return
                else
                        echo "1" && return
                fi
        else
                echo "0,1" && return
        fi
}
 
gpus=$(get_gpu-ids)
 
 
## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest
 
 
conda activate esmfold
 
#Run
cd $WORKDIR/
 
d1=$(date +%s)
echo $(date)
 
esmfold_inference.py -i query.fasta -o outputdir/
 
 
d2=$(date +%s)
echo $(date)
 
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"
 
## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, all produced data is copied back by synchronizing the workdir folder with your starting job folder, and the temporary folder (workdir) is deleted
if [ $NUM_NODES -le 1 ]; then
        cd $PBS_O_WORKDIR
        rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
        rm -rf $WORKDIR
fi
## END-DO

Benchmarks

Troubleshooting

In case of trouble, you can contact me at : hubert.santuz[at]ibpc.fr