===== Performance tips =====

To help you get the best performance out of your computations, here are a few tips and pieces of advice:

==== For single node jobs only ====

=== IO tunings ===

For single node jobs only: if you keep your own compiled software in your workdir, I advise you to tar its top directory and untar it into /scratch/ when you run your jobs. Local node disk I/O performs better because you avoid the I/O contention on the shared work directory.

For example, assuming you have an **appdir** directory in your own environment, each time you compile or install something in it, you just have to [re]create or update the tar file by running, on the master node:

  $ tar rf $(/shared/scripts/getWorkdir.sh)/packages.tar appdir

Then, when you run a single node job, you can add the following lines to your job script in the USER-part section:

  [...]
  # tar xf /workdir///packages.tar -C $PWD
  export ROOT_WORKDIR=/$(/shared/scripts/getWorkdir.sh|awk -F'/' '{print $2"/"$3"/"$4}')
  tar xf $ROOT_WORKDIR/packages.tar -C $WORKDIR/
  export LD_LIBRARY_PATH=$WORKDIR/appdir/lib:$LD_LIBRARY_PATH
  export PATH=$WORKDIR/appdir/bin:$PATH
  mpirun [...]

=== MPI tunings ===

When you use OpenMPI to launch single node jobs, you had better avoid using the network for communication: intra-node communications (shared memory) perform faster than network ones. To do so, in interactive sessions or in your job scripts, you can either:

  * set a few environment variables:

  $ export PSM_DEVICES=self,shm
  $ export OMPI_MCA_mtl=^psm
  $ export OMPI_MCA_btl=shm,self
  $ mpirun ...

  * or pass the corresponding options directly to **mpirun**:

  $ mpirun -mca mtl ^psm -mca btl shm,self -x PSM_DEVICES=self,shm ...

==== More general run-time tunings ====

In general, a good way to get the best application performance is to properly tune processor and memory affinity. Here are two ways to proceed:

  * with OpenMPI, you can pass one of the following options to **orterun**, **mpirun** or **mpiexec**:
    * **--bind-to-core**: bind processes to cores (my favorite one)
    * **--bind-to-socket**: bind processes to processor sockets
    * **--bind-to-none**: do not bind processes

  $ mpirun --bind-to-core simulation.x

  * without any parallel launcher, you can bind your processes yourself using one of the following commands:
    * **taskset**
    * **numactl**
    * **pin_t2c**

=== taskset ===

  $ taskset -c 0,1,2,3 simulation.x

=== numactl ===

  $ numactl --physcpubind=0,1,2,3 simulation.x

Here you run your **simulation.x** application on the first four cores of the first processor socket.

Because we use the **CGROUP/CPUSET** Linux kernel features to confine jobs to dedicated sets of cores, before using the **taskset** or **numactl** command you should retrieve the core IDs Torque assigned to your job (see the sketch at the end of this section):\\
**$ cat /dev/cpuset/torque/$PBS_JOBID/cpus**

=== pin_t2c ===

You can also use a homemade tool named **[[cluster-lbt:pin_t2c|pin_t2c]]** (from the eponymous module) which provides an easy way to pin the threads of a running process to hardware resources:

  $ module load pin_t2c
  $ myapp & pin_t2c --pid $!

//Without any other optional argument, it takes care of the complete CGROUP/CPUSET allocation provided by Torque; but you can manually specify CPU sockets, cores, etc. to fine-tune your process.//
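As an illustration of the **taskset**/**numactl** approach, here is a minimal sketch that reads the core IDs Torque granted to the job from the CPUSET shown above and binds **simulation.x** to exactly those cores (the values in the comment are only examples):

  # core IDs allocated to this job by Torque, e.g. "8-11" or "0,2,4,6"
  CORES=$(cat /dev/cpuset/torque/$PBS_JOBID/cpus)
  # bind simulation.x to exactly those cores
  taskset -c $CORES simulation.x
  # numactl --physcpubind=$CORES simulation.x would work the same way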
==== Embarrassingly parallel jobs ====

For those who need to run embarrassingly parallel jobs (non-MPI processes), you have two options:

  * with **OpenMPI** and the **mpirun**/**mpiexec** binary. Example, inside the USER-part section of your submission script:

  [...]
  module load openmpi
  export PSM_DEVICES=self,shm
  export OMPI_MCA_mtl=^psm
  export OMPI_MCA_btl=shm,self
  mpirun --bind-to core simulation.x
  [...]

If the task is identical on all processor cores, you can simply use this without any flourish; but if each processor core should be doing a different task, then you can make use of an environment variable called **$OMPI_COMM_WORLD_RANK** (a kind of global core ID) which is different for every running task (see the first sketch at the end of this section).

  * with the **Torque** bundled utilities: you can use the **pbsdsh** tool contained in the torque client module. Example, inside the USER-part section of your submission script:

  [...]
  module load torque
  pbsdsh -v $WORKDIR/taskscript.sh
  [...]

and in **taskscript.sh**:

  #!/bin/bash -l
  source /usr/share/Modules/init/bash 2> /dev/null # to be able to load some modules
  module load simulation.x
  simulation.x

If the task is identical on all processor cores, you can simply use this without any flourish; but if each processor core should be doing a different task, then you can make use of an environment variable called **$PBS_VNODENUM** (a kind of global core ID) which is different for every running task (see the second sketch at the end of this section).

Keep in mind that a script invoked by **pbsdsh** starts in a very basic environment and does not inherit your own environment. So, don't forget to reconstruct the shell environment you need in the task script.
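To make the per-task dispatch concrete, here is a minimal sketch for the **mpirun** option: instead of launching **simulation.x** directly, launch a small wrapper script so that each task picks its own work from **$OMPI_COMM_WORLD_RANK**. The wrapper name and input/output file naming are only assumptions for the example:

  #!/bin/bash
  # wrapper.sh -- launch with: mpirun --bind-to core ./wrapper.sh
  # OpenMPI sets OMPI_COMM_WORLD_RANK differently for each launched task
  simulation.x input.$OMPI_COMM_WORLD_RANK > output.$OMPI_COMM_WORLD_RANK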
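And the equivalent sketch for the **pbsdsh** option: a **taskscript.sh** that reconstructs its environment (as warned above) and uses **$PBS_VNODENUM** to select each task's work; here again the input/output file naming is only an assumption:

  #!/bin/bash -l
  # pbsdsh does not inherit your environment: rebuild it here
  source /usr/share/Modules/init/bash 2> /dev/null
  module load simulation.x
  # pbsdsh sets PBS_VNODENUM differently for each spawned task
  simulation.x input.$PBS_VNODENUM > output.$PBS_VNODENUM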