ColabFold
ColabFold is a modified version of AlphaFold focused on speeding up predictions while requiring fewer compute resources.
It achieves this by replacing the AF2 MSA generation with a fast MMseqs2 search performed on a remote server, among other optimizations. It still relies on AF2 to build the 3D models.
The other nice feature of ColabFold is that its command line offers more options to tweak and personalize the predictions:
- Use a custom MSA as an input
- Stop the calculations below a custom threshold
- etc.
It also provides graphs at the end to judge prediction quality: one with the sequence coverage and one with the predicted lDDT (confidence score), both along the sequence.
It could be a good complement to the original AlphaFold implementation.
Version
Since 22/05/2023, the installed version is 1.5.2 (https://github.com/sokrypton/ColabFold).
Resources
To learn more about ColabFold, I highly recommend reading:
- the preprint: https://www.biorxiv.org/content/10.1101/2021.08.15.456425v2
- the GitHub repo: https://github.com/sokrypton/ColabFold
- the available notebook: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
Installation
The installation follows the same process as AlphaFold.
It is available on nodes node061, node062, node063 and node081.
The installation process was taken from here: https://github.com/YoshitakaMo/localcolabfold
It is installed in the folder /scratch-nv/software/colabfold.
Additionally, a conda environment called colabfold-conda has been created to install the dependencies.
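If you want to test ColabFold interactively on one of these nodes, the environment can be activated by hand. A minimal sketch, reusing the module and path names from the submission script shown further below:

module load gcc/8.3.0
module load miniconda-py3/latest
# activating the environment by its full path puts colabfold_batch on the PATH
conda activate /scratch-nv/software/colabfold/colabfold-conda
colabfold_batch --help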
Utilization
Use the same queues as alphafold: alphafold or alphafold2.
See here: http://www-lbt.ibpc.fr/wiki/doku.php?id=cluster-lbt:extra-tools:alphafold_tool#queues
Input file
ColabFold supports different types of input: a directory of fasta/a3m files, a csv file, or a single fasta/a3m file.
For a monomer prediction, you can give several sequences and they will be treated as a batch, predicted one after another (see the example below).
For a multimer prediction, you cannot supply a multi-record fasta.
You need a fasta file containing the sequences of each monomer separated by a colon (:):
>Multimer with XX and YY sequences
XXXX:YYYY
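For instance (hypothetical names and invented sequences), a batch of two monomer predictions can be given as a single multi-fasta file:

>protein_A
MKTAYIAKQRQISFVK
>protein_B
MADEEKLPPGWEKRM

whereas a heterodimer of the same two chains must be a single record with the chains joined by a colon:

>complex_AB
MKTAYIAKQRQISFVK:MADEEKLPPGWEKRM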
Custom MSA
You can also choose to provide a Multiple Sequence Alignment file for your prediction yourself. This file needs to be in .a3m format.
Here are the steps to generate one:
- Create a multi fasta file with your target sequence and the sequences to be aligned.
- Use an alignment tool to generate an MSA file with FASTA output format (you can use Clustal Omega, available at https://www.ebi.ac.uk/Tools/msa/clustalo/).
- Use FormatSeq (https://toolkit.tuebingen.mpg.de/tools/formatseq) to convert the MSA from FASTA format to A3M format.
- Adapt the A3M file so it is correctly parsed by ColabFold. The file must start with a header line marked by a #. The header consists of two lists separated by a tab: the first list contains the sequence length of each chain and the second its cardinality (see https://github.com/sokrypton/ColabFold/issues/76).
  - For a monomer: #12 1, where 12 is the sequence length.
  - For a homodimer: #12 2, with 2 indicating the dimer.
  - For a heterodimer: #12,10 1,1, with 12,10 the sequence lengths of each monomer and 1,1 the stoichiometry.
- The first sequence, which is the target sequence, needs to be renamed 101 in the case of a monomer/homodimer, and 101 102 for a heterodimer.
Then, use the .a3m file instead of the fasta file as the input. You can also use the --max-msa option to choose the number of sequences to use for the MSA.
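For illustration, here is what a minimal adapted .a3m file could look like for a hypothetical 12-residue monomer with a single homolog (invented sequences; the two header fields are separated by a tab):

#12	1
>101
MKTAYIAKQRQI
>homolog_1
MKTAYLAKQRQI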
Running
A bash script called colabfold_batch is available:
[santuz@node063 ~]$ /scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch --help
usage: colabfold_batch [-h] [--stop-at-score STOP_AT_SCORE] [--num-recycle NUM_RECYCLE]
                       [--recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE]
                       [--num-ensemble NUM_ENSEMBLE] [--num-seeds NUM_SEEDS]
                       [--random-seed RANDOM_SEED] [--num-models {1,2,3,4,5}]
                       [--recompile-padding RECOMPILE_PADDING] [--model-order MODEL_ORDER]
                       [--host-url HOST_URL] [--data DATA]
                       [--msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}]
                       [--model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}]
                       [--amber] [--num-relax NUM_RELAX] [--templates]
                       [--custom-template-path CUSTOM_TEMPLATE_PATH]
                       [--rank {auto,plddt,ptm,iptm,multimer}]
                       [--pair-mode {unpaired,paired,unpaired_paired}]
                       [--sort-queries-by {none,length,random}] [--save-single-representations]
                       [--save-pair-representations] [--use-dropout] [--max-seq MAX_SEQ]
                       [--max-extra-seq MAX_EXTRA_SEQ] [--max-msa MAX_MSA]
                       [--disable-cluster-profile] [--zip] [--use-gpu-relax] [--save-all]
                       [--save-recycles] [--overwrite-existing-results]
                       [--disable-unified-memory]
                       input results

positional arguments:
  input                 Can be one of the following: Directory with fasta/a3m files, a csv/tsv file, a fasta file or an a3m file
  results               Directory to write the results to

options:
  -h, --help            show this help message and exit
  --stop-at-score STOP_AT_SCORE
                        Compute models until plddt (single chain) or ptmscore (complex) > threshold is reached. This can make colabfold much faster by only running the first model for easy queries.
  --num-recycle NUM_RECYCLE
                        Number of prediction recycles. Increasing recycles can improve the quality but slows down the prediction.
  --recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE
                        Specify convergence criteria. Run until the distance between recycles is within specified value.
  --num-ensemble NUM_ENSEMBLE
                        Number of ensembles. The trunk of the network is run multiple times with different random choices for the MSA cluster centers.
  --num-seeds NUM_SEEDS
                        Number of seeds to try. Will iterate from range(random_seed, random_seed+num_seeds).
  --random-seed RANDOM_SEED
                        Changing the seed for the random number generator can result in different structure predictions.
  --num-models {1,2,3,4,5}
  --recompile-padding RECOMPILE_PADDING
                        Whenever the input length changes, the model needs to be recompiled. We pad sequences by specified length, so we can e.g. compute sequence from length 100 to 110 without recompiling. The prediction will become marginally slower for the longer input, but overall performance increases due to not recompiling. Set to 0 to disable.
  --model-order MODEL_ORDER
  --host-url HOST_URL
  --data DATA
  --msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}
                        Using an a3m file as input overwrites this option
  --model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}
                        predict strucutre/complex using the following model. Auto will pick "alphafold2_ptm" for structure predictions and "alphafold2_multimer_v3" for complexes.
  --amber               Use amber for structure refinement. To control number of top ranked structures are relaxed set --num-relax.
  --num-relax NUM_RELAX
                        specify how many of the top ranked structures to relax using amber.
  --templates           Use templates from pdb
  --custom-template-path CUSTOM_TEMPLATE_PATH
                        Directory with pdb files to be used as input
  --rank {auto,plddt,ptm,iptm,multimer}
                        rank models by auto, plddt or ptmscore
  --pair-mode {unpaired,paired,unpaired_paired}
                        rank models by auto, unpaired, paired, unpaired_paired
  --sort-queries-by {none,length,random}
                        sort queries by: none, length, random
  --save-single-representations
                        saves the single representation embeddings of all models
  --save-pair-representations
                        saves the pair representation embeddings of all models
  --use-dropout         activate dropouts during inference to sample from uncertainity of the models
  --max-seq MAX_SEQ     number of sequence clusters to use
  --max-extra-seq MAX_EXTRA_SEQ
                        number of extra sequences to use
  --max-msa MAX_MSA     defines: `max-seq:max-extra-seq` number of sequences to use
  --disable-cluster-profile
                        EXPERIMENTAL: for multimer models, disable cluster profiles
  --zip                 zip all results into one <jobname>.result.zip and delete the original files
  --use-gpu-relax       run amber on GPU instead of CPU
  --save-all            save ALL raw outputs from model to a pickle file
  --save-recycles       save all intermediate predictions at each recycle
  --overwrite-existing-results
  --disable-unified-memory
                        if you are getting tensorflow/jax errors it might help to disable this
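As an illustration of these options, a hypothetical run that uses 3 recycles and stops as soon as a model reaches a pLDDT of 85 (the file names are placeholders; the options are taken from the help above):

# stop early once the first sufficiently good model is found
/scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch \
    --num-recycle 3 \
    --stop-at-score 85 \
    query.fasta outputdir/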
Submission script
You can find below an example of a submission script to perform ColabFold computations. Note that you need to launch one job per protein sequence or multimer prediction.
Script version 22/05/2023
- job_ColabFold.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
#Half node for a sequence < 1000 residues
#Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n

#script version 22.05.2023

### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###

WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)

if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
    echo "At least one variable is needed but not defined. Please touch your manager about."
    exit 1
else
    if [ $NUM_NODES -le 1 ]; then
        WORKDIR+="scratch/"
        export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
        mkdir $WORKDIR
        rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
        # if you need to check your job output during execution (example: each hour) you can uncomment the following line
        # /shared/scripts/ADMIN__auto-rsync.example 3600 &
    else
        export WORKDIR=$PBS_O_WORKDIR
    fi
fi

echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)

cd $WORKDIR

# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
    export PSM_DEVICES=self,shm
    export OMPI_MCA_mtl=^psm
    export OMPI_MCA_btl=shm,self
else
    # Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
    # we have to share PSM contexts between processes
    # CIN is here the number of cores in node
    CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
    if [ $(($CIN/16)) -ge 2 ]; then
        PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
        if [ $CIN -eq 40 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
        elif [ $CIN -eq 32 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
        else
            echo "This computing node is not supported by this script"
        fi
        echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
    else
        echo "no PSM_SHAREDCONTEXTS_MAX to define"
    fi
fi

function get_gpu-ids() {
    if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
        echo "0,1" && return
    fi
    if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
    elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
    else
        FILE=""
    fi
    if [ -e $FILE ]; then
        if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
            echo "0" && return
        else
            echo "1" && return
        fi
    else
        echo "0,1" && return
    fi
}
gpus=$(get_gpu-ids)

## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest

COLABFOLDDIR=/scratch-nv/software/colabfold
conda activate $COLABFOLDDIR/colabfold-conda

nb_cores=$(cat $PBS_NODEFILE|wc -l)

#Run
cd $WORKDIR/
d1=`date +%s`
echo $(date)

# Use either one by uncommenting the command line.
# Just change the name of the fasta file.

#Monomer
#${COLABFOLDDIR}/colabfold-conda/bin/colabfold_batch query.fasta outputdir/

#Multimer
#${COLABFOLDDIR}/colabfold-conda/bin/colabfold_batch multi_fasta.fasta outputdir --model-type alphafold2_multimer_v3

d2=$(date +%s)
echo $(date)
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"

## DO NOT MODIFY THE PART OF SCRIPT: you will be accountable for any damage you cause
# At the term of your job, you need to get back all produced data synchronizing workdir folder with you starting job folder and delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
    cd $PBS_O_WORKDIR
    rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
    rm -rf $WORKDIR
fi
## END-DO
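Once adapted (fasta file name, resources, uncommented run line), the script is submitted like any other job on the cluster; a minimal sketch, assuming Torque's standard qsub command (the queue is already set by the #PBS -q directive inside the script):

qsub job_ColabFold.sh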
Benchmarks
Not many benchmarks have been done for ColabFold yet, but first tests show runs of only a few minutes for proteins of around 200-300 residues.
Older versions
Older versions are still available and usable; you just need to use the corresponding conda environment and submission script.
For version 1.3.0, use this script:
- job_ColabFold.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
#Half node for a sequence < 1000 residues
#Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n

#script version 11.10.2022 for colabfold 1.3.0

### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###

WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)

if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
    echo "At least one variable is needed but not defined. Please touch your manager about."
    exit 1
else
    if [ $NUM_NODES -le 1 ]; then
        WORKDIR+="scratch/"
        export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
        mkdir $WORKDIR
        rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
        # if you need to check your job output during execution (example: each hour) you can uncomment the following line
        # /shared/scripts/ADMIN__auto-rsync.example 3600 &
    else
        export WORKDIR=$PBS_O_WORKDIR
    fi
fi

echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)

cd $WORKDIR

# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
    export PSM_DEVICES=self,shm
    export OMPI_MCA_mtl=^psm
    export OMPI_MCA_btl=shm,self
else
    # Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
    # we have to share PSM contexts between processes
    # CIN is here the number of cores in node
    CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
    if [ $(($CIN/16)) -ge 2 ]; then
        PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
        if [ $CIN -eq 40 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
        elif [ $CIN -eq 32 ]; then
            export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
        else
            echo "This computing node is not supported by this script"
        fi
        echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
    else
        echo "no PSM_SHAREDCONTEXTS_MAX to define"
    fi
fi

function get_gpu-ids() {
    if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
        echo "0,1" && return
    fi
    if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
    elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
        FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
    else
        FILE=""
    fi
    if [ -e $FILE ]; then
        if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
            echo "0" && return
        else
            echo "1" && return
        fi
    else
        echo "0,1" && return
    fi
}
gpus=$(get_gpu-ids)

## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest

COLABFOLDDIR=/scratch-nv/software/colabfold_1.3.0
conda activate $COLABFOLDDIR/colabfold-conda

nb_cores=$(cat $PBS_NODEFILE|wc -l)

#Run
cd $WORKDIR/
d1=`date +%s`
echo $(date)

# Use either one by uncommenting the command line.
# Just change the name of the fasta file.

#Monomer
#bash ${COLABFOLDDIR}/bin/colabfold_batch --data ${COLABFOLDDIR}/colabfold query.fasta outputdir/

#Multimer
#bash ${COLABFOLDDIR}/bin/colabfold_batch --data ${COLABFOLDDIR}/colabfold multi_fasta.fasta outputdir --model-type AlphaFold2-multimer-v2

d2=$(date +%s)
echo $(date)
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"

## DO NOT MODIFY THE PART OF SCRIPT: you will be accountable for any damage you cause
# At the term of your job, you need to get back all produced data synchronizing workdir folder with you starting job folder and delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
    cd $PBS_O_WORKDIR
    rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
    rm -rf $WORKDIR
fi
## END-DO
Troubleshooting
In case of trouble, you can contact me at: hubert.santuz[at]ibpc.fr
Error: KeyError: 'data'
If you encounter this error:
Traceback (most recent call last):
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 883, in __getitem__
    field = self._fields[key]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 807, in __getattr__
    return self[attribute]
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 889, in __getitem__
    raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'data'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch", line 8, in <module>
    sys.exit(main())
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1336, in main
    zip_results=args.zip,
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1092, in run
    prediction_callback=prediction_callback,
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 197, in predict_structure
    use_templates,
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 132, in batch_input
    eval_cfg = model_config.data.eval
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 809, in __getattr__
    raise AttributeError(e)
This means your fasta file is incorrectly formatted for the multimer option of ColabFold.
As stated above, the fasta must be in the form:
>Multimer with XX and YY sequences
XXXX:YYYY
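In other words, with hypothetical placeholder sequences, a file like this one (two separate records) will trigger the error:

>chain_A
XXXX
>chain_B
YYYY

while the correct version puts both chains in a single record, separated by a colon:

>complex_AB
XXXX:YYYY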