Computation resources architecture

Below is the architecture model of our computation and storage resources:

[Figure: HPC architecture]

All LBT computing resources (HPC, storage and IS clusters) are designed and built in-house, based on open-source software and solutions.

Currently the cluster room hosts 2 HPC clusters (a.k.a. supercomputers), called Hades and Baal, composed as follows:

  • Hades (348 cores and 1.2TB RAM):
    • 29x Intel 12-core nodes for parallel jobs
  • Baal (616 CPU cores, ~115k GPU cores and ~2TB RAM):
    • 1x Intel 8-core node for test jobs (20 minutes max)
    • 2x Intel 24-core nodes for post-processing jobs
    • 3x Intel 16-core nodes, each with two NVIDIA GPUs (dedicated to GPU jobs)
    • 13x Intel 40-core nodes, each with two NVIDIA GPUs (dedicated to GPU jobs)
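
As a quick sanity check, the totals can be recomputed from the node inventory above. A minimal sketch in Python; note that Baal's headline figure of 616 CPU cores appears to cover the production nodes only, excluding the 8-core test node (an inference from the arithmetic, not an official statement):

  # Back-of-the-envelope check of the core counts listed above,
  # expressed as (node count, cores per node) pairs per node family.
  hades = [(29, 12)]
  baal_test = [(1, 8)]
  baal_production = [(2, 24), (3, 16), (13, 40)]

  def total_cores(nodes):
      """Sum cores over (node_count, cores_per_node) pairs."""
      return sum(count * cores for count, cores in nodes)

  print(total_cores(hades))                        # 348, matching the stated figure
  print(total_cores(baal_production))              # 616, the headline Baal figure
  print(total_cores(baal_test + baal_production))  # 624 with the test node included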

Since computing power is of little use without storage capacity, the following storage volumes are also available:

  • /workdir: ~49TB (usable capacity) on the Hades cluster, ~146TB on Baal
  • /workdir/ibpc_team: ~19TB (usable capacity), dedicated to non-LBT members
  • /archive: ~112TB (usable capacity), replicated every week
  • /archive/ibpc_team: ~33TB (usable capacity), dedicated to non-LBT members
  • /scratch: node-local space for temporary computing files (mainly used for single-node jobs); most nodes have around 200GB each, except the first 3 GPU nodes and the post-processing nodes (around 12TB each) and all other GPU nodes (1.6TB of SSD each)
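
Before staging large data sets, it can be worth checking how much space is left on a volume. A minimal sketch using only Python's standard library (the paths are those listed above; run it from a machine where the volumes are mounted):

  import shutil

  # Print free vs. total space for the shared volumes listed above.
  # Note that /archive is not mounted on compute nodes (see below),
  # so run this from a machine where the volume is actually available.
  for volume in ("/workdir", "/archive"):
      usage = shutil.disk_usage(volume)  # named tuple: total, used, free (bytes)
      print(f"{volume}: {usage.free / 1e12:.1f}TB free of {usage.total / 1e12:.1f}TB")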

Some other disk spaces are not listed above because they are kept in reserve.

Except for the *ibpc_team and /scratch volumes, all of the above volumes are replicated and/or distributed on a storage cluster currently composed of 11 servers.

The /scratch volume on each node serves only to store temporary computing files. Because it is cleared every night (old jobs' directories and any malformed directories are deleted), you should not try to chain jobs through a single, dedicated scratch directory.
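
The intended pattern is instead one fresh scratch directory per job, with results staged out before the job ends. A minimal sketch of that pattern (the JOB_ID environment variable, the my_solver command, and the /workdir result path are illustrative assumptions, not site-specific conventions):

  import os
  import shutil
  import subprocess

  # Per-job scratch usage: create a dedicated directory, compute,
  # stage results out, then clean up the node-local space.
  # JOB_ID is assumed to be provided by the batch scheduler; adapt as needed.
  job_id = os.environ.get("JOB_ID", "manual-run")
  user = os.environ["USER"]
  scratch = f"/scratch/{user}-{job_id}"          # one directory per job
  results = f"/workdir/{user}/results-{job_id}"  # persistent destination

  os.makedirs(scratch, exist_ok=True)
  try:
      # Hypothetical solver writing its temporary files under `scratch`.
      subprocess.run(["my_solver", "--tmpdir", scratch], check=True)
      shutil.copytree(scratch, results)  # copy results out before the nightly cleanup
  finally:
      shutil.rmtree(scratch, ignore_errors=True)  # free the node-local space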

The archive volume (/archive) is not available on compute nodes.

To get the best performance possible, we chose a high-throughput, low-latency network technology: InfiniBand QDR (40Gb/s).

Below is the year-by-year evolution of computing performance and storage:

[Figure: CPU computing power evolution]
[Figure: GPU-accelerated computing power evolution]
[Figure: Storage evolution]
