====== AlphaFold ======
AlphaFold 2[[alphafold#bibliography|[1]]] (AF2) has been installed on the Baal cluster, currently **on nodes node061, node062, node063 and node081**.
This installation, version **2.3.1**, includes the new **multimer** feature (AlphaFold-Multimer[[alphafold#bibliography|[2]]]).\\
It is based on this GitHub repo: https://github.com/deepmind/alphafold
**To use it, you will find an [[alphafold#submission_script|example script]] and [[alphafold#benchmarks|some benchmarks]] to give you an idea of the time needed to perform the predictions.**
===== Versions =====
AlphaFold version 2.3.1: **from 13/06/2023 to XXX**\\
AlphaFold version 2.2: **from 15/04/2022 to 13/06/2023**\\
AlphaFold version 2.1: **from 26/01/2022 to 15/04/2022**\\
===== Installation =====
The installation process was taken from here: https://github.com/kalininalab/alphafold_non_docker\\
It is not the official process because the official one uses Docker, which is banned on most HPC centers (including ours).
Since it is not the official way, bugs may occur.
The repository of the Alphafold installation is located in ''/scratch-nv/software/alphafold/alphafold_repo''\\
The databases needed are located in ''/scratch-nv/software/alphafold/alphafold_db'' (note that 2.2 TB of disk space is required for all the databases).\\
Additionally, a [[conda]] environment called ''alphafold'' has been created to install the dependencies.
The installation process is identical between nodes.
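If you want to quickly check that the installation is reachable, a minimal sanity check (assuming you are logged in on one of the AlphaFold nodes) could be:
  # Check that the repository and the databases are present
  ls /scratch-nv/software/alphafold/alphafold_repo
  du -sh /scratch-nv/software/alphafold/alphafold_db   # may take a while on 2.2 TB
  # Check that the conda environment exists
  module load miniconda-py3/latest
  conda env list | grep alphafold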
===== Utilization =====
==== Queues ====
There are **2 queues** for AlphaFold predictions.
For each, there are 2 suffixes which indicate whether you want the whole node (''_1n'') or half the node (''_hn'').
=== Queue ''alphafold'' ===
This is the first queue set in place and it includes nodes **node061**, **node062**, **node063**, each with 16 cores, 64 GB of RAM and 2 GTX 1080 Ti GPUs.\\
It comes in two variants:
  * ''alphafold_hn'' with ''-l nodes=1:ppn=8''
  * ''alphafold_1n'' with ''-l nodes=1:ppn=16''
**It should be your first choice, for small to medium predictions (up to 1000-1500 residues in total).**
For monomer predictions, please use half the node. As you can see in the benchmarks at the end of the page, it takes the same amount of time on half a node as on the full node.
=== Queue ''alphafold2'' ===
This is the new queue for the new node **node081**. This node is composed of 48 cores, 192 GB of RAM and 2 RTX A6000 GPUs.
It comes in two variants:
  * ''alphafold2_hn'' with ''-l nodes=1:ppn=24''
  * ''alphafold2_1n'' with ''-l nodes=1:ppn=48''
**It should be used only for large and very large predictions (starting from 1500 residues).
Between 1500 and 3200 residues, please use half of the node. Above 3200 residues, please use the full node.**
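To summarize the sizing rules above, here is a minimal bash sketch that picks the queue and core count from the total residue count (the thresholds come from the guidelines above; the variable names are purely illustrative):
  # Total number of residues across all chains (illustrative value)
  n_res=800
  if [ "$n_res" -le 1500 ]; then
      # small/medium prediction: alphafold queue (node061-node063)
      queue="alphafold_hn"; ppn=8        # monomer: half a node is enough
  elif [ "$n_res" -le 3200 ]; then
      # large prediction: half of node081
      queue="alphafold2_hn"; ppn=24
  else
      # very large prediction: full node081
      queue="alphafold2_1n"; ppn=48
  fi
  echo "#PBS -q $queue"
  echo "#PBS -l nodes=1:ppn=$ppn"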
==== Running AlphaFold ====
You will find a full submission script to use at the end of this section.
To run AF2, you will use the bash script called ''run_alphafold.sh'' located in the install folder ([[https://github.com/kalininalab/alphafold_non_docker#running-alphafold-v211|source]]).
Usage: run_alphafold.sh
Required Parameters:
-d Path to directory of supporting data
-o Path to a directory that will store the results.
-f Path to a FASTA file containing sequence. If a FASTA file contains multiple sequences, then it will be folded as a multimer
-t Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-g Enable NVIDIA runtime to run with GPUs (default: true)
-r Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage (default: true)
-e Run relax on GPU if GPU is enabled (default: true)
-n OpenMM threads (default: all available cores)
-a Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-m Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')
-c Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')
-p Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')
-l How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer (default: 5)
-b Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'false')
In our case, here are the commands to use:
# Modules needed
module load gcc/8.3.0
module load miniconda-py3/latest
# All dependencies of AF2 are in this conda environment.
conda activate alphafold
# Path of the installation and the databases
AF2_path="/scratch-nv/software/alphafold/alphafold_repo"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db"
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta -t 2021-12-01
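If you want a faster (but potentially less accurate) MSA stage, you can switch to the smaller genetic databases with the ''-c'' flag described above, for example:
  bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta -t 2021-12-01 -c reduced_dbs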
==== Required arguments ====
  * ''-d'': path of the databases. Use the ''${AF2_db_path}'' variable.
  * ''-o'': output directory for the results. You can use ''.'' or whatever you want.
  * ''-f'': path of the FASTA file which contains the sequence(s) of the protein you want to fold.
  * ''-t'': maximum release date for the template search. Leave ''2021-12-01'' for now.
==== Monomer Calculations ====
To determine the structure of a single monomer protein, you can use **half of the node if its sequence is < 1000 residues** (for the ''alphafold'' queue).\\
Use the full node otherwise.
Pay attention to the size (in residues) of your protein. The first tests showed that a run with a 2000-residue protein crashed the node06X nodes. It is fine on node081.
This is mainly due to the RAM used by AF2 when performing multiple sequence alignments (MSAs).
It takes around 1-10 hours to perform the predictions.
The command line will look like this:
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta -t 2021-12-01 -m monomer
''query.fasta'' contains one and only one sequence: that of your protein.
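For example, a minimal ''query.fasta'' (the header and the sequence shown here are placeholders) can be created like this:
  cat > query.fasta << EOF
  >my_protein
  MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR
  EOF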
==== Multimer Calculations ====
To determine a multimer structure, **you have to use the full node**.\\
This is again due to RAM usage: the multimer version tends to use a lot (at least ~30 GB).
It takes around 48-96 hours to perform the predictions.
The command line will look like this:
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o multimer_test -f sequences.fasta -t 2021-12-01 -m multimer
''sequences.fasta'' contains the different sequences of the multimer with their corresponding copies.\\
For example, for a homomer of 2 copies of the same sequence ''<SEQUENCE A>'', ''sequences.fasta'' will contain:
>sequence_1
<SEQUENCE A>
>sequence_2
<SEQUENCE A>
For a heteromer of 2 copies of the sequence A ''<SEQUENCE A>'' and 3 copies of the sequence B ''<SEQUENCE B>'', ''sequences.fasta'' will contain:
>sequence_1
<SEQUENCE A>
>sequence_2
<SEQUENCE A>
>sequence_3
<SEQUENCE B>
>sequence_4
<SEQUENCE B>
>sequence_5
<SEQUENCE B>
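Writing the file by hand gets tedious with many copies; here is a small illustrative helper (placeholder sequences, adapt the copy counts) that generates the heteromer example above:
  # Build sequences.fasta with 2 copies of chain A and 3 copies of chain B
  seq_A="MKTAYIAKQRQISFVKSHFSRQLEERLGL"   # placeholder sequence A
  seq_B="GDGTQDNLSGAEKAVQVKVKALPDAQFEV"   # placeholder sequence B
  > sequences.fasta
  i=1
  for s in "$seq_A" "$seq_A" "$seq_B" "$seq_B" "$seq_B"; do
      echo ">sequence_$i" >> sequences.fasta
      echo "$s" >> sequences.fasta
      i=$((i+1))
  done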
==== Submission script ====
You can find below an example of a submission script to perform AlphaFold2 computations.\\
**Note that you need to launch one job per protein sequence or multimer prediction.**
**Script version 20/10/2022**
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
#Half node for a sequence < 1000 residues
#Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n
#script version 20.10.2022
### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###
WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)
if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
    echo "At least one variable is needed but not defined. Please contact your system administrator about it."
    exit 1
else
    if [ $NUM_NODES -le 1 ]; then
        WORKDIR+="scratch/"
        export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
        mkdir $WORKDIR
        rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
        # if you need to check your job output during execution (example: each hour), you can uncomment the following line
        # /shared/scripts/ADMIN__auto-rsync.example 3600 &
    else
        export WORKDIR=$PBS_O_WORKDIR
    fi
fi
echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)
cd $WORKDIR
# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
    export PSM_DEVICES=self,shm
    export OMPI_MCA_mtl=^psm
    export OMPI_MCA_btl=shm,self
else
    # Since we are using a single IB card per node, which can initiate only up to a maximum of 16 PSM contexts,
    # we have to share PSM contexts between processes
    # CIN is here the number of cores in the node
    CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
    if [ $(($CIN/16)) -ge 2 ]; then
        PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
        if [ $CIN -eq 40 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
        elif [ $CIN -eq 32 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
        else
            echo "This computing node is not supported by this script"
        fi
        echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
    else
        echo "no PSM_SHAREDCONTEXTS_MAX to define"
    fi
fi
# Returns the GPU id(s) usable by this job, based on the cpuset allocated by Torque
function get_gpu-ids() {
    # Full node requested: use both GPUs
    if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
        echo "0,1" && return
    fi
    if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
    elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
    else
        FILE=""
    fi
    # Quoting keeps the test false when FILE is empty, so we fall back to both GPUs
    if [ -e "$FILE" ] && [ -n "$FILE" ]; then
        # Half node: pick the GPU matching the allocated cpuset (first core 0 -> GPU 0, else GPU 1)
        if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
            echo "0" && return
        else
            echo "1" && return
        fi
    else
        echo "0,1" && return
    fi
}
gpus=$(get_gpu-ids)
## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest
conda activate alphafold
nb_cores=$(cat $PBS_NODEFILE|wc -l)
#Run
cd $WORKDIR/
AF2_path="/scratch-nv/software/alphafold/alphafold_repo"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db"
d1=$(date +%s)
echo $(date)
# Use either one by uncommenting the command line.
# Just change the name of the fasta file.
#Monomer
#bash ${AF2_path}/run_alphafold.sh -n $nb_cores -a $gpus -d ${AF2_db_path} -o . -f myfasta.fasta -t 2021-12-01
#Multimer
#bash ${AF2_path}/run_alphafold.sh -n $nb_cores -a $gpus -d ${AF2_db_path} -o . -f sequences.fasta -t 2021-11-01 -m multimer
d2=$(date +%s)
echo $(date)
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"
## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, you need to retrieve all produced data by synchronizing the workdir folder with your starting job folder, then delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
    cd $PBS_O_WORKDIR
    rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
    rm -rf $WORKDIR
fi
## END-DO
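Assuming you saved the script above as ''my_af2_job.sh'' (the name is up to you) in the folder containing your FASTA file, submission follows the usual Torque workflow:
  qsub my_af2_job.sh
  # Monitor the job
  qstat -u $USER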
===== Outputs =====
All of the output will be in the directory set with the ''-o'' option.\\
Detailed explanations are available here: https://github.com/deepmind/alphafold#alphafold-output
Basically, you will have:
  * ''ranked_X.pdb'': PDB files containing the predicted structures, ranked from 0 (best) to 4 (worst)
  * ''*.pkl'': pickle files containing the features of the deep learning process.
  * ''msas/'': folder containing the various MSAs performed by AF2 against the several databases.
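For instance, assuming the output directory was ''dummy_test'' and the input file ''query.fasta'', the results end up in a subfolder named after the FASTA file and can be inspected like this:
  # Results are written in a subfolder named after the target (here "query")
  ls dummy_test/query/
  # Best-ranked model:
  head -n 5 dummy_test/query/ranked_0.pdb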
===== Benchmarks =====
Here are some benchmarks made with our AF2 installation, using protein sequences coming from CASP14 or from colleagues.
==== Monomer ====
^ ^ Time spent (in hours) ^^^
^ Size (in residues) ^ alphafold_1n ^ alphafold2_1n ^ alphafold2_hn ^
| 141 | 1 | 0.3 | 0.3 |
| 262 | 1.5 | x | x |
| 580 | 3 | x | x |
| 833 | 6 | 2 | 2 |
| 2202 | crash | 5 | 5 |
==== Multimer ====
^ Size (in residues) ^ Time spent (in hours) - alphafold_1n ^
| 724+1068 (tot: 1792) | 45 |
| 1024+548 (tot: 1572) | 60 |
| 140x3+281x2 (tot: 982) | 57 |
===== Older versions =====
Older versions are still available and usable; you just need to change the conda environment and the ''AF2_path'' and ''AF2_db_path'' paths inside the job script:
For **version 2.1.0**:
conda activate alphafold_2.1
[...]
AF2_path="/scratch-nv/software/alphafold/alphafold_repo_v2.1.0"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db_v2.1.0"
For **version 2.2**:
conda activate alphafold_2.2
[...]
AF2_path="/scratch-nv/software/alphafold/alphafold_repo_v2.2"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db_v2.2"
===== Bibliography =====
To know more about AlphaFold and what you can expect from it, here is a seminar given recently by Thomas Terwilliger:
https://www.renafobis.fr/seminaires-web-renafobis/alphafold-changes-everything-incorporating-predicted-models-in-x-ray-and-cryo-em-structure-determination
[1] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2\\
[2] Richard Evans et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021.10.04.463034; doi: https://doi.org/10.1101/2021.10.04.463034
===== Troubleshooting =====
In case of trouble, you can contact me at: ''hubert.santuz[at]ibpc.fr''