
ColabFold

ColabFold is a modified version of AlphaFold focused on speeding up predictions while requiring fewer compute resources.
It achieves this by replacing the AF2 MSA generation with a fast MMseqs2 search performed on a remote server, among other optimizations. It still relies on AF2 to build the 3D models.

Another nice feature of ColabFold is that its command line offers more options to tweak and personalize the predictions.

It also provides graphs at the end to judge the prediction quality: one with the sequence coverage and one with the predicted lDDT (confidence score), both plotted along the sequence.

It could be a good complement to the original AlphaFold implementation.

This installation has not been as thoroughly tested as the AlphaFold one. You may encounter some bugs when using the CLI options.

Version

Since 22/05/2023, the installed version is 1.5.2 (https://github.com/sokrypton/ColabFold).

Resources

To learn more about ColabFold, I highly recommend reading the ColabFold paper and the documentation in the GitHub repository linked above.

Installation

The installation follows the same process as AlphaFold.
It is available on nodes node061, node062, node063 and node081.

The installation process was taken from here: https://github.com/YoshitakaMo/localcolabfold
It's installed in this folder: /scratch-nv/software/colabfold.
Additionally, a conda environment called colabfold-conda has been created to install the dependencies.
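
For a quick interactive test on one of those nodes, you can load the environment the same way the submission script further down this page does (module names are taken from that script):

module load gcc/8.3.0
module load miniconda-py3/latest
conda activate /scratch-nv/software/colabfold/colabfold-conda
colabfold_batch --help   # the executable should now be on your PATH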

Utilization

Use the same queues as alphafold: alphafold or alphafold2

See here: http://www-lbt.ibpc.fr/wiki/doku.php?id=cluster-lbt:extra-tools:alphafold_tool#queues

Input file

ColabFold supports different types of input. You can give a directory containing FASTA/A3M files, a CSV file, or a single FASTA/A3M file.

For a monomer prediction, you can give several sequences and they will be treated as a batch (one after another).
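
For example, a single FASTA file with several entries (placeholder names and sequences) will produce one prediction per entry:

>sequence_1
XXXX
>sequence_2
YYYY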

For a multimer prediction, you cannot supply a multi-FASTA file. You need a single FASTA record containing the sequences of each monomer separated by a colon:

>Multimer with XX and YY sequences
XXXX:YYYY
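
A homo-oligomer can be specified the same way by repeating the sequence; for instance, a homodimer of the XXXX sequence would be written as:

>Homodimer with two XX sequences
XXXX:XXXX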

Custom MSA

You can also choose to provide your own Multiple Sequence Alignment file for your prediction. This file needs to be in .a3m format.
Here are the steps to generate one:

  1. Create a multi-FASTA file with your target sequence and the sequences to be aligned.
  2. Use an alignment tool to generate an MSA file in FASTA output format (you can use Clustal Omega, available at https://www.ebi.ac.uk/Tools/msa/clustalo/).
  3. Use FormatSeq (https://toolkit.tuebingen.mpg.de/tools/formatseq) to convert the MSA from FASTA format to A3M format.
  4. Adapt the A3M file so that it is correctly parsed by ColabFold: the file must start with a header line marked by a #. The header consists of two lists separated by a tab; the first list contains the sequence length of each chain and the second its cardinality (see https://github.com/sokrypton/ColabFold/issues/76).
    • For a monomer: #12 1, where 12 is the sequence length
    • For a homodimer: #12 2, where 2 indicates the dimer
    • For a heterodimer: #12,10 1,1, where 12,10 are the sequence lengths of each monomer and 1,1 the stoichiometry
  5. The first sequence, which is the target sequence, needs to be renamed to 101 in the case of a monomer/homodimer and to 101 102 for a heterodimer (see the sketch after this list).
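
As an illustration (placeholder sequences; the separators are tabs), the first lines of a custom A3M for a heterodimer made of a 12-residue and a 10-residue chain would look roughly like this, with the two target sequences concatenated after the renamed header and the aligned sequences following as normal A3M entries:

#12,10	1,1
>101	102
XXXXXXXXXXXXYYYYYYYYYY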

Then, use the .a3m file instead of the FASTA file as input. You can also use the --max-msa option to choose the number of sequences to use for the MSA.
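
For instance (file and directory names are placeholders), a run using a custom MSA and a reduced MSA size could look like:

/scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch --max-msa 256:512 custom_msa.a3m outputdir/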

Running

An executable called colabfold_batch is available:

[santuz@node063 ~]$ /scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch --help
 
usage: colabfold_batch [-h] [--stop-at-score STOP_AT_SCORE] [--num-recycle NUM_RECYCLE] [--recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE] [--num-ensemble NUM_ENSEMBLE] [--num-seeds NUM_SEEDS] [--random-seed RANDOM_SEED] [--num-models {1,2,3,4,5}] [--recompile-padding RECOMPILE_PADDING] [--model-order MODEL_ORDER]
                       [--host-url HOST_URL] [--data DATA] [--msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}] [--model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}] [--amber] [--num-relax NUM_RELAX] [--templates]
                       [--custom-template-path CUSTOM_TEMPLATE_PATH] [--rank {auto,plddt,ptm,iptm,multimer}] [--pair-mode {unpaired,paired,unpaired_paired}] [--sort-queries-by {none,length,random}] [--save-single-representations] [--save-pair-representations] [--use-dropout] [--max-seq MAX_SEQ] [--max-extra-seq MAX_EXTRA_SEQ]
                       [--max-msa MAX_MSA] [--disable-cluster-profile] [--zip] [--use-gpu-relax] [--save-all] [--save-recycles] [--overwrite-existing-results] [--disable-unified-memory]
                       input results
 
positional arguments:
  input                 Can be one of the following: Directory with fasta/a3m files, a csv/tsv file, a fasta file or an a3m file
  results               Directory to write the results to
 
options:
  -h, --help            show this help message and exit
  --stop-at-score STOP_AT_SCORE
                        Compute models until plddt (single chain) or ptmscore (complex) > threshold is reached. This can make colabfold much faster by only running the first model for easy queries.
  --num-recycle NUM_RECYCLE
                        Number of prediction recycles.Increasing recycles can improve the quality but slows down the prediction.
  --recycle-early-stop-tolerance RECYCLE_EARLY_STOP_TOLERANCE
                        Specify convergence criteria.Run until the distance between recycles is within specified value.
  --num-ensemble NUM_ENSEMBLE
                        Number of ensembles.The trunk of the network is run multiple times with different random choices for the MSA cluster centers.
  --num-seeds NUM_SEEDS
                        Number of seeds to try. Will iterate from range(random_seed, random_seed+num_seeds)..
  --random-seed RANDOM_SEED
                        Changing the seed for the random number generator can result in different structure predictions.
  --num-models {1,2,3,4,5}
  --recompile-padding RECOMPILE_PADDING
                        Whenever the input length changes, the model needs to be recompiled.We pad sequences by specified length, so we can e.g. compute sequence from length 100 to 110 without recompiling.The prediction will become marginally slower for the longer input, but overall performance increases due to not recompiling.
                        Set to 0 to disable.
  --model-order MODEL_ORDER
  --host-url HOST_URL
  --data DATA
  --msa-mode {mmseqs2_uniref_env,mmseqs2_uniref,single_sequence}
                        Using an a3m file as input overwrites this option
  --model-type {auto,alphafold2,alphafold2_ptm,alphafold2_multimer_v1,alphafold2_multimer_v2,alphafold2_multimer_v3}
                        predict strucutre/complex using the following model.Auto will pick "alphafold2_ptm" for structure predictions and "alphafold2_multimer_v3" for complexes.
  --amber               Use amber for structure refinement.To control number of top ranked structures are relaxed set --num-relax.
  --num-relax NUM_RELAX
                        specify how many of the top ranked structures to relax using amber.
  --templates           Use templates from pdb
  --custom-template-path CUSTOM_TEMPLATE_PATH
                        Directory with pdb files to be used as input
  --rank {auto,plddt,ptm,iptm,multimer}
                        rank models by auto, plddt or ptmscore
  --pair-mode {unpaired,paired,unpaired_paired}
                        rank models by auto, unpaired, paired, unpaired_paired
  --sort-queries-by {none,length,random}
                        sort queries by: none, length, random
  --save-single-representations
                        saves the single representation embeddings of all models
  --save-pair-representations
                        saves the pair representation embeddings of all models
  --use-dropout         activate dropouts during inference to sample from uncertainity of the models
  --max-seq MAX_SEQ     number of sequence clusters to use
  --max-extra-seq MAX_EXTRA_SEQ
                        number of extra sequences to use
  --max-msa MAX_MSA     defines: `max-seq:max-extra-seq` number of sequences to use
  --disable-cluster-profile
                        EXPERIMENTAL: for multimer models, disable cluster profiles
  --zip                 zip all results into one <jobname>.result.zip and delete the original files
  --use-gpu-relax       run amber on GPU instead of CPU
  --save-all            save ALL raw outputs from model to a pickle file
  --save-recycles       save all intermediate predictions at each recycle
  --overwrite-existing-results
  --disable-unified-memory
                        if you are getting tensorflow/jax errors it might help to disable this
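
As an illustration of how some of these options combine (input and output names are placeholders), a monomer run with more recycles and Amber relaxation of the top-ranked model could look like:

/scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch --num-recycle 5 --amber --num-relax 1 --use-gpu-relax query.fasta outputdir/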

Submission script

You can find below an example of a submission script to perform ColabFold computations. Note that you need to launch one job per protein sequence or multimer prediction.
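
Assuming the script below is saved as job_ColabFold.sh in the directory that contains your input file, each prediction is then submitted as a separate Torque job:

qsub job_ColabFold.sh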

Script version 22/05/2023

job_ColabFold.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
 
#Half node for a sequence < 1000 residues
#Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n
 
#script version 22.05.2023
 
### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###
WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)
if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
        echo "At least one variable is needed but not defined. Please touch your manager about."
        exit 1
else
        if [ $NUM_NODES -le 1 ]; then
                WORKDIR+="scratch/"
                export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
                mkdir $WORKDIR
                rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
 
                # if you need to check your job output during execution (example: each hour) you can uncomment the following line
                # /shared/scripts/ADMIN__auto-rsync.example 3600 &
        else
                export WORKDIR=$PBS_O_WORKDIR
        fi
fi
 
echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)
 
cd $WORKDIR
 
# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
        export PSM_DEVICES=self,shm
        export OMPI_MCA_mtl=^psm
        export OMPI_MCA_btl=shm,self
else
# Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
# we have to share PSM contexts between processes
# CIN is here the number of cores in node
        CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
        if [ $(($CIN/16)) -ge 2 ]; then
                PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
                if [ $CIN -eq 40 ]; then
                        export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
                elif [ $CIN -eq 32 ]; then
                        export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
                else
                        echo "This computing node is not supported by this script"
                fi
                echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
        else
                echo "no PSM_SHAREDCONTEXTS_MAX to define"
        fi
fi
 
function get_gpu-ids() {
        if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
                echo "0,1" && return
        fi
 
        if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
                FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
        elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
                FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
        else
                FILE=""
        fi
 
        if [ -e $FILE ]; then
                if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
                        echo "0" && return
                else
                        echo "1" && return
                fi
        else
                echo "0,1" && return
        fi
}
 
gpus=$(get_gpu-ids)
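# Optional: the GPU id(s) selected above are not used automatically by the commands below;
# if you want to pin ColabFold/JAX to them, you can export them yourself, e.g.:
# export CUDA_VISIBLE_DEVICES=$gpus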
 
 
## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest
 
COLABFOLDDIR=/scratch-nv/software/colabfold
 
conda activate $COLABFOLDDIR/colabfold-conda
 
 
nb_cores=$(cat $PBS_NODEFILE|wc -l)
 
#Run
cd $WORKDIR/
 
d1=`date +%s`
echo $(date)
 
 
# Use either one by uncommenting the command line.
# Just change the name of the fasta file.
#Monomer
#${COLABFOLDDIR}/colabfold-conda/bin/colabfold_batch  query.fasta outputdir/
 
#Multimer
#${COLABFOLDDIR}/colabfold-conda/bin/colabfold_batch  multi_fasta.fasta outputdir --model-type alphafold2_multimer_v3
 
 
d2=$(date +%s)
echo $(date)
 
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"
 
## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, you need to retrieve all produced data by synchronizing the workdir folder with your starting job folder, then delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
        cd $PBS_O_WORKDIR
        rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
        rm -rf $WORKDIR
fi
## END-DO

Benchmarks

Not many benchmarks have been done for ColabFold, but first tests show runs of only a few minutes for proteins of around 200-300 residues.

Older versions

Older versions are still available and usable; you just need to use the correct conda environment and the correct submission script.
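
For example (the path matches the 1.3.0 script below), the older environment can be activated with:

module load miniconda-py3/latest
conda activate /scratch-nv/software/colabfold_1.3.0/colabfold-conda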

For version 1.3.0, use this script:

job_ColabFold.sh
#!/bin/bash
#PBS -S /bin/bash
#PBS -N AF2
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
 
#Half node for a sequence < 1000 residues
#Full node otherwise and for the multimer version.
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_1n
 
#script version 11.10.2022 for colabfold 1.3.0
 
### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###
WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)
if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
        echo "At least one variable is needed but not defined. Please touch your manager about."
        exit 1
else
        if [ $NUM_NODES -le 1 ]; then
                WORKDIR+="scratch/"
                export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
                mkdir $WORKDIR
                rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
 
                # if you need to check your job output during execution (example: each hour) you can uncomment the following line
                # /shared/scripts/ADMIN__auto-rsync.example 3600 &
        else
                export WORKDIR=$PBS_O_WORKDIR
        fi
fi
 
echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)
 
cd $WORKDIR
 
# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
        export PSM_DEVICES=self,shm
        export OMPI_MCA_mtl=^psm
        export OMPI_MCA_btl=shm,self
else
# Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
# we have to share PSM contexts between processes
# CIN is here the number of cores in node
        CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
        if [ $(($CIN/16)) -ge 2 ]; then
                PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
                if [ $CIN -eq 40 ]; then
                        export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
                elif [ $CIN -eq 32 ]; then
                        export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
                else
                        echo "This computing node is not supported by this script"
                fi
                echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
        else
                echo "no PSM_SHAREDCONTEXTS_MAX to define"
        fi
fi
 
function get_gpu-ids() {
        if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
                echo "0,1" && return
        fi
 
        if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
                FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
        elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
                FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
        else
                FILE=""
        fi
 
        if [ -e $FILE ]; then
                if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
                        echo "0" && return
                else
                        echo "1" && return
                fi
        else
                echo "0,1" && return
        fi
}
 
gpus=$(get_gpu-ids)
 
 
## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest
 
COLABFOLDDIR=/scratch-nv/software/colabfold_1.3.0
 
conda activate $COLABFOLDDIR/colabfold-conda
 
 
nb_cores=$(cat $PBS_NODEFILE|wc -l)
 
#Run
cd $WORKDIR/
 
d1=`date +%s`
echo $(date)
 
 
# Use either one by uncommenting the command line.
# Just change the name of the fasta file.
#Monomer
#bash ${COLABFOLDDIR}/bin/colabfold_batch --data ${COLABFOLDDIR}/colabfold query.fasta outputdir/
 
#Multimer
#bash ${COLABFOLDDIR}/bin/colabfold_batch --data ${COLABFOLDDIR}/colabfold multi_fasta.fasta outputdir --model-type AlphaFold2-multimer-v2
 
 
d2=$(date +%s)
echo $(date)
 
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"
 
## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, you need to retrieve all produced data by synchronizing the workdir folder with your starting job folder, then delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
        cd $PBS_O_WORKDIR
        rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
        rm -rf $WORKDIR
fi
## END-DO

Troubleshooting

In case of trouble, you can contact me at: hubert.santuz[at]ibpc.fr

Error: KeyError: 'data'

If you encounter this error:

Traceback (most recent call last):
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 883, in __getitem__
    field = self._fields[key]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 807, in __getattr__
    return self[attribute]
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 889, in __getitem__
    raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'data'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch-nv/software/colabfold/colabfold-conda/bin/colabfold_batch", line 8, in <module>
    sys.exit(main())
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1336, in main
    zip_results=args.zip,
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1092, in run
    prediction_callback=prediction_callback,
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 197, in predict_structure
    use_templates,
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 132, in batch_input
    eval_cfg = model_config.data.eval
  File "/scratch-nv/software/colabfold/colabfold-conda/lib/python3.7/site-packages/ml_collections/config_dict/config_dict.py", line 809, in __getattr__
    raise AttributeError(e)

This means your fasta file is incorrectly formatted for the multimer option of Colabfold.

As stated above, the FASTA file must be in the form:

>Multimer with XX and YY sequences
XXXX:YYYY