====== OmegaFold ======
OmegaFold is another class of structure prediction tool, based on a Protein Language Model (PLM). It doesn't require any multiple sequence alignment and uses solely the sequence of the protein of interest.
**For now, it does not support multimer predictions.**
OmegaFold's main limitation is GPU memory, as the predictions require a lot of it (see below).
OmegaFold is //really fast//: seconds for small sequences (up to ~100 residues) and minutes for bigger ones (5-10 minutes for an 800-residue protein).
===== Version =====
It uses version 1.1.0, available from the GitHub repository: https://github.com/HeliXonProtein/OmegaFold
===== Resources =====
To know more about OmegaFold, I highly recommend reading:
* the preprint : https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1
* the GitHub repo: https://github.com/HeliXonProtein/OmegaFold
* the available notebook: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/beta/omegafold.ipynb
===== Installation =====
The installation follows the same process as AlphaFold.\\
**It's available on nodes node061, node062, node063 and node081**
The installation only requires a single Python package. A conda environment named **omegafold** was created for this purpose.
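For a quick interactive check that the environment works, you can load the same modules used by the submission script further down this page and activate the environment (a minimal sketch, assuming interactive access to one of the nodes listed above):
<code bash>
module load gcc/8.3.0
module load miniconda-py3/latest
conda activate omegafold
omegafold -h   # should print the help output shown in the "Running" section
</code>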
===== Utilization =====
**Use the same queues as alphafold: ''alphafold'' or ''alphafold2'' **
See here: http://www-lbt.ibpc.fr/wiki/doku.php?id=cluster-lbt:extra-tools:alphafold_tool#queues
Since the main limitation of OmegaFold is GPU memory, you should **always** use half of a node for the predictions.
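For reference, a half-node request corresponds to the following PBS directives (taken from the submission script further down, which uses the ''alphafold_hn'' queue; adapt the queue, walltime and project to your needs):
<code bash>
#PBS -l nodes=1:ppn=8      # half node
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_hn
</code>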
==== Input file ====
OmegaFold supports only a FASTA file as input.
For several predictions, you can provide a multi-FASTA file; the sequences will be treated as a batch (one after another), as in the example below.
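For example, a multi-FASTA file with two (purely illustrative) sequences produces one PDB file per entry, named after its FASTA identifier:
<code>
>protein_A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR
>protein_B
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
</code>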
==== GPU Memory ====
OmegaFold uses a lot of GPU memory for the predictions:
* ~500 MB for a 70-residue protein
* ~27 GB for an 800-residue protein
The GPUs installed in the node06X nodes have ~10 GB of memory and the ones in node081 have ~48 GB.
However, OmegaFold has an option called ''--subbatch_size'' to decrease the memory used (and thus increase the prediction time). Here is the explanation taken from the [[https://github.com/HeliXonProtein/OmegaFold#setting-subbatch|GitHub]], followed by an illustrative command:
> Subbatch makes a trade-off between time and space. One can greatly reduce the space requirements by setting --subbatch_size very low. The default is the number of residues in the sequence and the lowest possible number is 1. For now we do not have a rule of thumb for setting the --subbatch_size, but we suggest half the value if you run into GPU memory limitations.
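For example, for an ~800-residue sequence on a ~10 GB card, you could halve the subbatch size from its default (the sequence length) and halve it again if the job still runs out of memory. The value below is only an illustrative starting point:
<code bash>
# default subbatch size = sequence length (~800 here); halve it until the job fits in GPU memory
omegafold --subbatch_size 400 query.fasta outputdir/
</code>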
==== Running ====
The first time you use OmegaFold, it will download a weights file (''model.pt'') and copy it into the ''~/.cache/omegafold_ckpt'' directory.
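You can check that the weights are in place with the following (a sketch; the exact file size may vary between versions):
<code bash>
ls -lh ~/.cache/omegafold_ckpt/
# should list model.pt
</code>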
A command called ''omegafold'' is available; its help output is shown below:
<code>
(omegafold) [santuz@node061 simple_dimere]$ omegafold -h
usage: omegafold [-h] [--num_cycle NUM_CYCLE] [--subbatch_size SUBBATCH_SIZE] [--device DEVICE] [--weights_file WEIGHTS_FILE] [--weights WEIGHTS]
                 [--pseudo_msa_mask_rate PSEUDO_MSA_MASK_RATE] [--num_pseudo_msa NUM_PSEUDO_MSA] [--allow_tf32 ALLOW_TF32]
                 input_file output_dir

Launch OmegaFold and perform inference on the data. Some examples (both the input and output files) are included in the Examples folder, where each
folder contains the output of each available model from model1 to model3. All of the results are obtained by issuing the general command with only model
number chosen (1-3).

positional arguments:
  input_file            The input fasta file
  output_dir            The output directory to write the output pdb files. If the directory does not exist, we just create it. The output file name
                        follows its unique identifier in the rows of the input fasta file"

optional arguments:
  -h, --help            show this help message and exit
  --num_cycle NUM_CYCLE
                        The number of cycles for optimization, default to 10
  --subbatch_size SUBBATCH_SIZE
                        The subbatching number, the smaller, the slower, the less GRAM requirements. Default is the entire length of the sequence. This
                        one takes priority over the automatically determined one for the sequences
  --device DEVICE       The device on which the model will be running, default to the accelerator that we can find
  --weights_file WEIGHTS_FILE
                        The model cache to run
  --weights WEIGHTS     The url to the weights of the model
  --pseudo_msa_mask_rate PSEUDO_MSA_MASK_RATE
                        The masking rate for generating pseudo MSAs
  --num_pseudo_msa NUM_PSEUDO_MSA
                        The number of pseudo MSAs
  --allow_tf32 ALLOW_TF32
                        if allow tf32 for speed if available, default to True
</code>
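In its simplest form (the file names below are placeholders), a run looks like this:
<code bash>
omegafold query.fasta outputdir/
# writes one PDB file per sequence in query.fasta, named after its FASTA identifier
</code>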
==== Submission script ====
You can find below an example of a submission script to perform OmegaFold computations.
**Script version 18/11/2022**
<code bash>
#!/bin/bash
#PBS -S /bin/bash
#PBS -N omegafold
#PBS -o $PBS_JOBID.out
#PBS -e $PBS_JOBID.err
#Half node always
#PBS -l nodes=1:ppn=8
#PBS -l walltime=24:00:00
#PBS -A simlab_project
#PBS -q alphafold_hn
#script version 18.11.2022
### FOR EVERYTHING BELOW, I ADVISE YOU TO MODIFY THE USER-part ONLY ###
WORKDIR="/"
NUM_NODES=$(cat $PBS_NODEFILE|uniq|wc -l)
if [ ! -n "$PBS_O_HOME" ] || [ ! -n "$PBS_JOBID" ]; then
echo "At least one variable is needed but not defined. Please touch your manager about."
exit 1
else
if [ $NUM_NODES -le 1 ]; then
WORKDIR+="scratch/"
export WORKDIR+=$(echo $PBS_O_HOME |sed 's#.*/\(home\|workdir\)/\(.*_team\)*.*#\2#g')"/$PBS_JOBID/"
mkdir $WORKDIR
rsync -ap $PBS_O_WORKDIR/ $WORKDIR/
# if you need to check your job output during execution (example: each hour) you can uncomment the following line
# /shared/scripts/ADMIN__auto-rsync.example 3600 &
else
export WORKDIR=$PBS_O_WORKDIR
fi
fi
echo "your current dir is: $PBS_O_WORKDIR"
echo "your workdir is: $WORKDIR"
echo "number of nodes: $NUM_NODES"
echo "number of cores: "$(cat $PBS_NODEFILE|wc -l)
echo "your execution environment: "$(cat $PBS_NODEFILE|uniq|while read line; do printf "%s" "$line "; done)
cd $WORKDIR
# If you're using only one node, it's counterproductive to use IB network for your MPI process communications
if [ $NUM_NODES -eq 1 ]; then
export PSM_DEVICES=self,shm
export OMPI_MCA_mtl=^psm
export OMPI_MCA_btl=shm,self
else
# Since we are using a single IB card per node which can initiate only up to a maximum of 16 PSM contexts
# we have to share PSM contexts between processes
# CIN is here the number of cores in node
CIN=$(cat /proc/cpuinfo | grep -i processor | wc -l)
if [ $(($CIN/16)) -ge 2 ]; then
PPN=$(grep $HOSTNAME $PBS_NODEFILE|wc -l)
if [ $CIN -eq 40 ]; then
export PSM_SHAREDCONTEXTS_MAX=$(($PPN/4))
elif [ $CIN -eq 32 ]; then
export PSM_SHAREDCONTEXTS_MAX=$(($PPN/2))
else
echo "This computing node is not supported by this script"
fi
echo "PSM_SHAREDCONTEXTS_MAX defined to $PSM_SHAREDCONTEXTS_MAX"
else
echo "no PSM_SHAREDCONTEXTS_MAX to define"
fi
fi
function get_gpu-ids() {
if [ $PBS_NUM_PPN -eq $(cat /proc/cpuinfo | grep -cE "^processor.*:") ]; then
echo "0,1" && return
fi
if [ -e /dev/cpuset/torque/$PBS_JOBID/cpus ]; then
FILE="/dev/cpuset/torque/$PBS_JOBID/cpus"
elif [ -e /dev/cpuset/torque/$PBS_JOBID/cpuset.cpus ]; then
FILE="/dev/cpuset/torque/$PBS_JOBID/cpuset.cpus"
else
FILE=""
fi
if [ -e $FILE ]; then
if [ $(cat $FILE | sed -r 's/^([0-9]).*$/\1/') -eq 0 ]; then
echo "0" && return
else
echo "1" && return
fi
else
echo "0,1" && return
fi
}
gpus=$(get_gpu-ids)
## USER Part
module load gcc/8.3.0
module load miniconda-py3/latest
conda activate omegafold
#Run
cd $WORKDIR/
d1=`date +%s`
echo $(date)
# restrict the run to the GPU(s) selected by get_gpu-ids above (assumption: the job should only see the GPUs of its half node)
export CUDA_VISIBLE_DEVICES=$gpus
omegafold query.fasta outputdir/
d2=$(date +%s)
echo $(date)
diff=$((($d2 - $d1)/60))
echo "Time spent (min) : ${diff}"
## DO NOT MODIFY THIS PART OF THE SCRIPT: you will be accountable for any damage you cause
# At the end of your job, you need to get back all produced data by synchronizing the workdir folder with your starting job folder, then delete the temporary one (workdir)
if [ $NUM_NODES -le 1 ]; then
cd $PBS_O_WORKDIR
rsync -ap $WORKDIR/ $PBS_O_WORKDIR/
rm -rf $WORKDIR
fi
## END-DO
</code>
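Assuming the script is saved as ''run_omegafold.pbs'' (a file name chosen here for illustration) next to your ''query.fasta'', submit it with ''qsub'' and follow the job status with ''qstat'':
<code bash>
qsub run_omegafold.pbs
qstat -u $USER
</code>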
==== Benchmarks ====
===== Troubleshooting =====
In case of trouble, you can contact me at: ''hubert.santuz[at]ibpc.fr''
==== RuntimeError: CUDA out of memory. ====
If you encounter this error:
<code>
Traceback (most recent call last):
  File "/shared/compilers/conda-py3/latest/envs/omegafold/bin/omegafold", line 8, in <module>
    sys.exit(main())
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/__main__.py", line 74, in main
    output = model(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/model.py", line 175, in forward
    result, prev_dict = self.omega_fold_cycle(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/model.py", line 89, in forward
    prev_node, edge_repr, node_repr = self.geoformer(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/geoformer.py", line 175, in forward
    node_repr, edge_repr = block(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/geoformer.py", line 122, in forward
    edge_repr += layer(edge_repr, mask[..., 0, :], fwd_cfg=fwd_cfg)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/modules.py", line 677, in forward
    out = self._get_attended(edge_repr, mask, fwd_cfg)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/modules.py", line 607, in _get_attended
    attended[s:e] = self.attention(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/modules.py", line 431, in forward
    attn_out = self._get_attn_out(q_inputs, kv_inputs, fwd_cfg, bias)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/modules.py", line 455, in _get_attn_out
    attn_out, _ = attention(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/modules.py", line 156, in attention
    res, attn = _attention(
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/omegafold/modules.py", line 93, in _attention
    logits = torch.einsum("...id, ...jd -> ...ij", query * scale, key)
  File "/shared/compilers/conda-py3/latest/envs/omegafold/lib/python3.9/site-packages/torch/functional.py", line 360, in einsum
    return _VF.einsum(equation, operands) # type: ignore[attr-defined]
RuntimeError: CUDA out of memory. Tried to allocate 14.09 GiB (GPU 0; 10.92 GiB total capacity; 9.36 GiB already allocated; 747.38 MiB free; 9.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
</code>
It means your prediction uses more GPU memory than the card can handle. Try playing with the ''--subbatch_size'' option (as explained [[cluster-lbt:extra-tools:omegafold#gpu_memory|here]]) to reduce the memory used.