AlphaFold 2[1] (AF2) has been installed on the Baal cluster, currently on nodes node061, node062, node063 and node081.
This installation, version 2.3.1, includes the new multimer feature (AlphaFold-Multimer[2]).
It's based on this GitHub repo: https://github.com/deepmind/alphafold
To use it, you will find below an example script and some benchmarks to give you an idea of the time needed to perform the predictions.
- AlphaFold version 2.3.1: from 13/06/2023 to XXX
- AlphaFold version 2.2: from 15/04/2022 to 13/06/2023
- AlphaFold version 2.1: from 26/01/2022 to 15/04/2022
The installation process was taken from here: https://github.com/kalininalab/alphafold_non_docker
It is not the official process: the official one uses Docker, which is banned on most HPC centers (including ours).
Since it's not the official way, bugs may occur.
The repository of the Alphafold installation is located in /scratch-nv/software/alphafold/alphafold_repo
The databases needed are located in /scratch-nv/software/alphafold/alphafold_db
(Note that 2.2 TB of disk space is required for all the databases.)
Additionally, a conda environment called `alphafold` has been created to install the dependencies.
The installation process is identical on all nodes.
There are two queues for AlphaFold predictions. For each, there are two suffixes indicating whether you want the whole node (`_1n`) or half the node (`_hn`).
This is the first queue set in place; it includes nodes node061, node062 and node063, each with 16 cores, 64 GB of RAM and 2 GTX 1080 Ti. It comes in:

- `alphafold_hn` with `-l nodes=1:ppn=8`
- `alphafold_1n` with `-l nodes=1:ppn=16`

It should be used in priority, for small to medium predictions (up to 1000-1500 residues in total).
This is the new queue for the new node node081. This node has 48 cores, 192 GB of RAM and 2 RTX A6000. It comes in:

- `alphafold2_hn` with `-l nodes=1:ppn=24`
- `alphafold2_1n` with `-l nodes=1:ppn=48`
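For example, to request half of node081 you would submit to the `alphafold2_hn` queue with the matching resource request; a quick sketch, where `job_af2.sh` is a placeholder for your job script (these options can also be set as `#PBS` directives inside the script, as in the example further below):

```bash
# Request half of node081 (24 cores) on the alphafold2 queue
qsub -q alphafold2_hn -l nodes=1:ppn=24 job_af2.sh
```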
To run AF2, you will use the bash script called `run_alphafold.sh` located in the installation folder (source). Its usage is:
```
Usage: run_alphafold.sh <OPTIONS>

Required Parameters:
-d <data_dir>          Path to directory of supporting data
-o <output_dir>        Path to a directory that will store the results.
-f <fasta_path>        Path to a FASTA file containing sequence. If a FASTA file contains
                       multiple sequences, then it will be folded as a multimer
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e.
                       YYYY-MM-DD). Important if folding historical test sets

Optional Parameters:
-g <use_gpu>           Enable NVIDIA runtime to run with GPUs (default: true)
-r <run_relax>         Whether to run the final relaxation step on the predicted models.
                       Turning relax off might result in predictions with distracting
                       stereochemical violations but might help in case you are having
                       issues with the relaxation stage (default: true)
-e <enable_gpu_relax>  Run relax on GPU if GPU is enabled (default: true)
-n <openmm_threads>    OpenMM threads (default: all available cores)
-a <gpu_devices>       Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES'
                       (default: 0)
-m <model_preset>      Choose preset model configuration - the monomer model, the monomer
                       model with extra ensembling, monomer model with pTM head, or
                       multimer model (default: 'monomer')
-c <db_preset>         Choose preset MSA database configuration - smaller genetic database
                       config (reduced_dbs) or full genetic database config (full_dbs)
                       (default: 'full_dbs')
-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING:
                       This will not check if the sequence, database or configuration have
                       changed (default: 'false')
-l <num_multimer_predictions_per_model> How many predictions (each with a different random
                       seed) will be generated per model. E.g. if this is 2 and there are 5
                       models then there will be 10 predictions per input. Note: this FLAG
                       only applies if model_preset=multimer (default: 5)
-b <benchmark>         Run multiple JAX model evaluations to obtain a timing that excludes
                       the compilation time, which should be more indicative of the time
                       required for inferencing many proteins (default: 'false')
```
In our case, the commands are the following:

```bash
# Modules needed
module load gcc/8.3.0
module load miniconda-py3/latest

# All dependencies of AF2 are in this conda environment.
conda activate alphafold

# Paths of the installation and of the databases
AF2_path="/scratch-nv/software/alphafold/alphafold_repo"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db"

bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta -t 2021-12-01
```
With:

- `-d`: path of the databases. Use the `${AF2_db_path}` variable.
- `-o`: output directory for the results. You can use `.` or whatever you want.
- `-f`: path of the FASTA file which contains the sequence of the protein you want to fold.
- `-t`: maximum release date for the template search. Leave `2021-12-01` for now.
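The optional flags listed in the usage above can be appended in the same way. For example, a sketch (the file and directory names are placeholders):

```bash
# Use the smaller genetic databases (faster MSAs) and run on GPU 0
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o . -f query.fasta -t 2021-12-01 -c reduced_dbs -a 0
```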
To determine a single monomer protein structure, you can use half a node if the sequence is shorter than 1000 residues (on the *alphafold* queue); use the full node otherwise.
This is mainly due to the RAM used by AF2 when performing the multiple sequence alignments (MSAs).
The predictions take around 1 to 10 hours.
The command line will look like this:

```bash
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o dummy_test -f query.fasta -t 2021-12-01 -m monomer
```
`query.fasta` contains one and only one sequence: the one of the protein you want to fold.
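For instance, a minimal `query.fasta` could look like this (both the header name and the sequence are placeholders):

```
>my_protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ
```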
To determine a multimer structure, you have to use the full node.
This is again due to RAM usage: the multimer version tends to use a lot (at least ~30 GB).
The predictions take around 48 to 96 hours.
The command line will look like this:

```bash
bash ${AF2_path}/run_alphafold.sh -d ${AF2_db_path} -o multimer_test -f sequences.fasta -t 2021-12-01 -m multimer
```
`sequences.fasta` contains the different sequences of the multimer, each repeated as many times as its number of copies.
For example, for a homomer of 2 copies of the same sequence `<SEQUENCE>`, `sequences.fasta` will contain:
```
>sequence_1
<SEQUENCE>
>sequence_2
<SEQUENCE>
```
For a heteromer of 2 copies of sequence A `<SEQUENCE_A>` and 3 copies of sequence B `<SEQUENCE_B>`, `sequences.fasta` will contain:
```
>sequence_1
<SEQUENCE_A>
>sequence_2
<SEQUENCE_A>
>sequence_3
<SEQUENCE_B>
>sequence_4
<SEQUENCE_B>
>sequence_5
<SEQUENCE_B>
```
You can find below an example of a submission script to perform AlphaFold2 computations.
Note that you need to launch one job per protein sequence or multimer prediction.
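If you have several sequences to fold, you can submit one job per FASTA file. A hedged sketch, assuming a job script (here named `job_af2.sh`, a placeholder) that reads the file name from a `FASTA_FILE` environment variable (also an assumption, not part of the script below):

```bash
# Submit one AF2 job per FASTA file in the current directory
for fasta in *.fasta; do
    qsub -v FASTA_FILE="$fasta" -N "AF2_${fasta%.fasta}" job_af2.sh
done
```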
Script version 20/10/2022
```bash
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
#Half node for a sequence < 1000 residues
#Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n

#script version 20.10.2022

### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER PART ONLY ###

WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)

if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
    echo "At least one variable is needed but not defined. Please contact your administrator about it."
    exit 1
else
    if [ $NUM_NODES -le 1 ]; then
        WORKDIR+="scratch/"
        export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
        mkdir $WORKDIR
        rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
        # if you need to check your job output during execution (example: each hour), you can uncomment the following line
        # /shared/scripts/ADMIN__auto-rsync.example 3600 &
    else
        export WORKDIR=$PBS_O_WORKDIR
    fi
fi

echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)

cd $WORKDIR

# If you're using only one node, it's counterproductive to use the IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
    export PSM_DEVICES=self,shm
    export OMPI_MCA_mtl=^psm
    export OMPI_MCA_btl=shm,self
else
    # Since we are using a single IB card per node, which can initiate only up to a maximum of 16 PSM contexts,
    # we have to share PSM contexts between processes
    # CIN is here the number of cores in the node
    CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
    if [ $(($CIN/16)) -ge 2 ]; then
        PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
        if [ $CIN -eq 40 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
        elif [ $CIN -eq 32 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
        else
            echo "This computing node is not supported by this script"
        fi
        echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
    else
        echo "no PSM_SHAREDCONTEXTS_MAX to define"
    fi
fi

# Returns the GPU ids usable by this job (both GPUs for a full node, one GPU for half a node)
function get_gpu-ids() {
    if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
        echo "0,1" && return
    fi
    if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
    elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
    else
        FILE=""
    fi
    if [ -e "$FILE" ]; then
        if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
            echo "0" && return
        else
            echo "1" && return
        fi
    else
        echo "0,1" && return
    fi
}

gpus=$(get_gpu-ids)

## USER part
module load gcc/8.3.0
module load miniconda-py3/latest
conda activate alphafold

nb_cores=$(cat $PBS_NODEFILE|wc -l)

# Run
cd $WORKDIR/
AF2_path="/scratch-nv/software/alphafold/alphafold_repo"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db"

d1=$(date +%s)
echo $(date)

# Use either one by uncommenting the command line.
# Just change the name of the fasta file.

#Monomer
#bash ${AF2_path}/run_alphafold.sh -n $nb_cores -a $gpus -d ${AF2_db_path} -o . -f myfasta.fasta -t 2021-12-01

#Multimer
#bash ${AF2_path}/run_alphafold.sh -n $nb_cores -a $gpus -d ${AF2_db_path} -o . -f sequences.fasta -t 2021-12-01 -m multimer

d2=$(date +%s)
echo $(date)
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"

## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, you need to get back all produced data by synchronizing the workdir
# folder with your starting job folder, and delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
    cd $PBS_O_WORKDIR
    rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
    rm -rf $WORKDIR
fi
## END-DO
```
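The script is then submitted with `qsub` (here assuming it was saved as `job_af2.sh`, a placeholder name):

```bash
qsub job_af2.sh    # submit the job to the queue set in the #PBS directives
qstat -u $USER     # check the status of your jobs
```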
All of the output will be in the directory set with the `-o` option.
Detailed explanations are available here: https://github.com/deepmind/alphafold#alphafold-output
Basically, you will have:

- `ranked_X.pdb`: PDB files containing the predicted structures, ranked from 0 (best) to 4 (worst)
- `*.pkl`: pickle files containing the features of the deep learning process
- `msas/`: folder containing the various MSAs performed by AF2 on the several databases
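For example, a quick way to look at the results (hedged: the paths assume `-o dummy_test` and an input file named `query.fasta`; AF2 writes its outputs in a sub-directory named after the FASTA file):

```bash
# Best model according to AF2's ranking
ls dummy_test/query/ranked_0.pdb

# Per-model confidence values and ranking order
cat dummy_test/query/ranking_debug.json
```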
Here are some benchmarks made with our AF2 installation, with protein sequences coming from CASP14 or from colleagues.

Time spent (in hours):

| Size (in residues) | alphafold_1n | alphafold2_1n | alphafold2_hn |
|---|---|---|---|
| 141 | 1 | 0.3 | 0.3 |
| 262 | 1.5 | x | x |
| 580 | 3 | x | x |
| 833 | 6 | 2 | 2 |
| 2202 | crash? | 5 | 5 |
And for multimer predictions:

| Size (in residues) | Time spent (in hours) - alphafold_1n |
|---|---|
| 724+1068 (tot: 1792) | 45 |
| 1024+548 (tot: 1572) | 60 |
| 140×3+281×2 (tot: 982) | 57 |
Older versions are still available and usable; you just need to change the conda environment and the `AF2_path` and `AF2_db_path` variables inside the job script:
For version 2.1.0:

```bash
conda activate alphafold_2.1.0
[...]
AF2_path="/scratch-nv/software/alphafold/alphafold_repo_v2.1.0"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db_v2.1.0"
```
For version 2.2:

```bash
conda activate alphafold_2.2
[...]
AF2_path="/scratch-nv/software/alphafold/alphafold_repo_v2.2"
AF2_db_path="/scratch-nv/software/alphafold/alphafold_db_v2.2"
```
[1] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
[2] Richard Evans et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021.10.04.463034; doi: https://doi.org/10.1101/2021.10.04.463034
In case of trouble, you can contact me at: hubert.santuz[at]ibpc.fr