AlphaFold
AlphaFold 2[1] (AF2) has been installed on the Baal cluster, currently on nodes node061, node062, node063 and node081.
This installation, version 2.3.1, includes the new multimer feature (AlphaFold-Multimer[2]).
It is based on this GitHub repo: https://github.com/deepmind/alphafold
Below you will find an example script and some benchmarks to give you an idea of the time needed to perform the predictions.
Versions
AlphaFold version 2.3.1: from 13/06/2023 to XXX
AlphaFold version 2.2: from 15/04/2022 to 13/06/2023
AlphaFold version 2.1: from 26/01/2022 to 15/04/2022
Installation
The installation process was taken from here: https://github.com/kalininalab/alphafold_non_docker
It is not the official process, because the official one uses Docker, which is banned on most HPC centers (including ours).
Since it is not the official way, bugs may occur.
The repository of the AlphaFold installation is located in /scratch-nv/software/alphafold/alphafold_repo
The databases needed are located in /scratch-nv/software/alphafold/alphafold_db (note that 2.2 TB is required for all the databases).
Additionally, a conda environment called alphafold has been created to install the dependencies.
The installation process is identical on all nodes.
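As a quick sanity check (an illustrative snippet, not part of the official procedure), you can verify from one of the AlphaFold nodes that the paths and the conda environment listed above are accessible:

# Sanity check of the installation (paths and environment name as documented above)
module load miniconda-py3/latest
conda env list | grep alphafold                              # the 'alphafold' environment should be listed
ls /scratch-nv/software/alphafold/alphafold_repo/run_alphafold.sh
du -sh /scratch-nv/software/alphafold/alphafold_db           # should report roughly 2.2 TB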
Utilization
Queues
There are 2 queues for AlphaFold predictions.
For each, there are 2 suffixes which indicate whether you want the whole node (_1n) or half of it (_hn).
Queue *alphafold*
This is the first queue set in place; it includes nodes node061, node062 and node063, each with 16 cores, 64 GB of RAM and 2 GTX 1080 Ti.
It comes in two variants:
alphafold_hn with -l nodes=1:ppn=8
alphafold_1n with -l nodes=1:ppn=16
It should be used in priority, for small to medium predictions (up to 1000-1500 residues in total).
Queue *alphafold2*
This is the new queue for the new node node081. This node has 48 cores, 192 GB of RAM and 2 RTX A6000.
It comes in two variants:
alphafold2_hn with -l nodes=1:ppn=24
alphafold2_1n with -l nodes=1:ppn=48
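For illustration, this is how a queue and its matching core count would be requested in a PBS job header (a sketch only; pick the queue/ppn pair from the descriptions above, e.g. alphafold_hn with ppn=8 for half a node):

# Illustrative PBS directives: full node on the alphafold2 queue
#PBS -q alphafold2_1n
#PBS -l nodes=1:ppn=48
#PBS -l walltime=24:00:00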
Running Alphafold
To run AF2, you will use the bash script called run_alphafold.sh located in the install folder (source).
Usage: run_alphafold.sh <OPTIONS>
Required Parameters:
-d <data_dir>             Path to directory of supporting data
-o <output_dir>           Path to a directory that will store the results.
-f <fasta_path>           Path to a FASTA file containing sequence. If a FASTA file contains multiple sequences, then it will be folded as a multimer
-t <max_template_date>    Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-g <use_gpu>              Enable NVIDIA runtime to run with GPUs (default: true)
-r <run_relax>            Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage (default: true)
-e <enable_gpu_relax>     Run relax on GPU if GPU is enabled (default: true)
-n <openmm_threads>       OpenMM threads (default: all available cores)
-a <gpu_devices>          Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-m <model_preset>         Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')
-c <db_preset>            Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')
-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')
-l <num_multimer_predictions_per_model> How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer (default: 5)
-b <benchmark>            Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'false')
In our case, here are the corresponding commands:
# Modules needed
module load gcc/8.3.0
module load miniconda-py3/latest

# All dependencies of AF2 are in this conda environment.
conda activate alphafold

# Path of the installation and the databases
AF2_path="/scratch-nv/software/alphafold/alphafold_repo"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db"

bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta
Required arguments
-d: path of the databases. Use the ${AF2_db_path} variable.
-o: output directory for the results. You can use . or any directory you want.
-f: path of the FASTA file which contains the sequence(s) of the protein you want to fold.
-t: maximum release date for the template search. Leave 2021-12-01 for now.
Monomer Calculations
To determine a single monomer protein structure, you can use half of the node if its sequence is shorter than 1000 residues (for the *alphafold* queue).
Use the full node otherwise.
This is mainly due to the RAM used by AF2 when performing the multiple sequence alignments (MSAs).
It takes around 1-10 hours to perform the predictions.
The command line will look like this:
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta -t 2021-12-01 -m monomer
query.fasta contains one and only one sequence, that of your protein.
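Using the same placeholder convention as the multimer examples below, query.fasta would simply look like:

>my_protein
<SEQUENCE>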
Multimer Calculations
To determine a multimer structure, you have to use the full node.
This is again due to the RAM usage: the multimer version tends to use a lot (at least ~30 GB).
It takes around 48-96 hours to perform the predictions.
The command line will look like this:
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o multimer_test -f sequences.fasta -t 2021-12-01 -m multimer
sequences.fasta contains the different sequences of the multimer with their corresponding copies.
For example, for a homomer of 2 copies of the same sequence <SEQUENCE>, sequences.fasta will contain:
>sequence_1
<SEQUENCE>
>sequence_2
<SEQUENCE>
For a heteromer of 2 copies of the sequence A <SEQUENCE_A> and 3 copies of the sequence B <SEQUENCE_B>, sequences.fasta will contain:
>sequence_1
<SEQUENCE_A>
>sequence_2
<SEQUENCE_A>
>sequence_3
<SEQUENCE_B>
>sequence_4
<SEQUENCE_B>
>sequence_5
<SEQUENCE_B>
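Before submitting, a simple sanity check is to count the FASTA headers; the number of '>' entries must match the total number of chains, copies included (5 for the heteromer example above):

# Count the chains declared in the multimer input
grep -c '^>' sequences.fasta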
Submission script
You can find below an example of a submission script to perform AlphaFold2 computations.
Note that you need to launch one job per protein sequence or multimer prediction.
Script version 20/10/2022
- job_AF2.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
# Half node for a sequence < 1000 residues
# Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n

# script version 20.10.2022

### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER PART ONLY ###

WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE | uniq | wc -l)

if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
    echo "At least one variable is needed but not defined. Please contact your administrator."
    exit 1
else
    if [ $NUM_NODES -le 1 ]; then
        WORKDIR+="scratch/"
        export WORKDIR+=$(echo $PBS_O_HOME | sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
        mkdir $WORKDIR
        rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
        # If you need to check your job output during execution (example: each hour), you can uncomment the following line
        # /shared/scripts/ADMIN__auto-rsync.example 3600 &
    else
        export WORKDIR=$PBS_O_WORKDIR
    fi
fi

echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE | wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE | uniq | while read line; do printf "%s" "$line "; done)

cd $WORKDIR

# If you're using only one node, it's counterproductive to use the IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
    export PSM_DEVICES=self,shm
    export OMPI_MCA_mtl=^psm
    export OMPI_MCA_btl=shm,self
else
    # Since we are using a single IB card per node, which can initiate only up to a maximum of 16 PSM contexts,
    # we have to share PSM contexts between processes.
    # CIN is here the number of cores in the node
    CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
    if [ $(($CIN/16)) -ge 2 ]; then
        PPN=$(grep $HOSTNAME $PBS_NODEFILE | wc -l)
        if [ $CIN -eq 40 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
        elif [ $CIN -eq 32 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
        else
            echo "This computing node is not supported by this script"
        fi
        echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
    else
        echo "no PSM_SHAREDCONTEXTS_MAX to define"
    fi
fi

# Return the GPU ids assigned to this job (both GPUs if the whole node was requested)
function get_gpu-ids() {
    if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
        echo "0,1" && return
    fi

    if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
    elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
    else
        FILE=""
    fi

    if [ -e "$FILE" ]; then
        if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
            echo "0" && return
        else
            echo "1" && return
        fi
    else
        echo "0,1" && return
    fi
}

gpus=$(get_gpu-ids)

## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest
conda activate alphafold

nb_cores=$(cat $PBS_NODEFILE | wc -l)

# Run
cd $WORKDIR/
AF2_path="/scratch-nv/software/alphafold/alphafold_repo"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db"

d1=$(date +%s)
echo $(date)

# Use either one by uncommenting the command line.
# Just change the name of the fasta file.

# Monomer
#bash ${AF2_path}/run_alphafold.sh -n $nb_cores -a $gpus -d ${AF2_db_path} -o . -f myfasta.fasta -t 2021-12-01

# Multimer
#bash ${AF2_path}/run_alphafold.sh -n $nb_cores -a $gpus -d ${AF2_db_path} -o . -f sequences.fasta -t 2021-11-01 -m multimer

d2=$(date +%s)
echo $(date)
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"

## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, you need to get back all produced data by synchronizing the workdir folder
# with your starting job folder and deleting the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
    cd $PBS_O_WORKDIR
    rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
    rm -rf $WORKDIR
fi
## END-DO
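Assuming the script above is saved as job_AF2.sh next to your FASTA file, submission and monitoring rely on the standard Torque/PBS commands:

# Submit from the directory containing job_AF2.sh and the FASTA file
qsub job_AF2.sh
# Follow the job state (Q = queued, R = running)
qstat -u $USER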
Outputs
All of the outputs will be in the directory set with the option -o.
Detailed explanations are available here: https://github.com/deepmind/alphafold#alphafold-output
Basically, you will have:
ranked_X.pdb: PDB files containing the predicted structures, ranked from 0 (best) to 4 (worst)
*.pkl: pickle files containing the features of the deep learning process
msas/: folder containing the various MSAs performed by AF2 on the several databases
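For instance, to locate the predicted structures of a finished run (here with the dummy_test output directory used in the examples above):

# List the ranked models, best one (ranked_0.pdb) first
find dummy_test -name "ranked_*.pdb" | sort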
Benchmarks
Here are some benchmarks made with our AF2 installation, with protein sequences coming from CASP14 or from colleagues.
Monomer
Time spent (in hours):

Size (in residues) | alphafold_1n | alphafold2_1n | alphafold2_hn
141                | 1            | 0.3           | 0.3
262                | 1.5          | x             | x
580                | 3            | x             | x
833                | 6            | 2             | 2
2202               | crash?       | 5             | 5
Multimer
Size (in residues)     | Time spent (in hours) on alphafold_1n
724+1068 (tot: 1792)   | 45
1024+548 (tot: 1572)   | 60
140×3+281×2 (tot: 982) | 57
Older versions
Older versions are still available and usable; you just need to change the AF2_path and AF2_db_path values inside the job script:
For version 2.1.0
conda activate alphafold_2.2
[...]
AF2_path="/scratch-nv/software/alphafold/alphafold_repo_v2.1.0"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db_v2.1.0"
For version 2.2
conda activate alphafold_2.2
[...]
AF2_path="/scratch-nv/software/alphafold/alphafold_repo_v2.2"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db_v2.2"
Bibliography
[1] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
[2] Richard Evans et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021.10.04.463034; doi: https://doi.org/10.1101/2021.10.04.463034
Troubleshooting
In case of trouble, you can contact me at: hubert.santuz[at]ibpc.fr