Troubleshooting

Below are common issues that you can solve yourself:

SSH Remote "failed"

If you see this kind of output when you try to connect remotely to the clusters:

$ ssh baal
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
8a:51:9a:2c:78:03:51:03:39:f3:03:1f:aa:2f:56:c7.
Please contact your system administrator.
Add correct host key in /home/ibpcadmin/.ssh/known_hosts to get rid of this message.
Offending key in /home/ibpcadmin/.ssh/known_hosts:5
RSA host key for baal.lbt.ibpc.fr has changed and you have requested strict checking.
Host key verification failed.

This is probably because the server's host key has changed. The easiest way to solve this is to remove the old key, so that your SSH client recreates the entry in your .ssh/known_hosts file the next time you connect.

For security reasons, if you have any doubt, don't hesitate to contact me directly to learn more about this kind of message.

To remove it:
$ sed -i.bak '<line number>d' ~/.ssh/known_hosts

For example, the message above shows that the offending key is on line 5 of the known_hosts file. To remove it (the -i.bak option saves a backup copy as ~/.ssh/known_hosts.bak):

$ sed -i'.bak' '5d' ~/.ssh/known_hosts
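
The line number does not have to be read off by eye. As a minimal sketch (the function name offending_line and the scratch file are illustrative, not part of any cluster tooling), the snippet below extracts the number from the warning text and rehearses the deletion on a throwaway copy, so nothing in your real ~/.ssh/known_hosts is touched:

```shell
# Hypothetical helper: print the line number from an "Offending ... key in <file>:<n>" line.
offending_line() {
  sed -n 's/^Offending .*key in .*:\([0-9][0-9]*\)$/\1/p'
}

# The warning line copied from the example message above.
warning='Offending key in /home/ibpcadmin/.ssh/known_hosts:5'
line=$(printf '%s\n' "$warning" | offending_line)

# Build a scratch file with six placeholder entries (not real host keys).
scratch=$(mktemp -d)
for i in 1 2 3 4 5 6; do
  echo "host$i ssh-rsa PLACEHOLDER$i"
done > "$scratch/known_hosts"

# Same sed command as above, pointed at the scratch file instead of ~/.ssh/known_hosts.
sed -i.bak "${line}d" "$scratch/known_hosts"

# The .bak file still has all six lines; the edited file has lost only the offending one.
diff "$scratch/known_hosts.bak" "$scratch/known_hosts" || true
```

Once the result looks right, the same sed command can be run against ~/.ssh/known_hosts itself.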

Alternatively, you can remove the entry by host name instead of by line number:

$ ssh-keygen -f ~/.ssh/known_hosts -R "baal.lbt.ibpc.fr"

Jobs that stay blocked in queue

Sometimes, when you submit a job, you may notice that it stays blocked in the queue for no apparent reason. In this case, the first thing to do is to check the job status:

$ checkjob -vv <job-ID>

$ checkjob -vv 18049


checking job 18049 (RM job '18049.torque1.cluster.lbt')

State: Idle  EState: Deferred
Creds:  user:admin  group:admin_team  account:baaden_project  class:monop  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Mon Oct 12 17:20:09
  (Time Queued  Total: 00:00:00  Eligible: 00:00:00)

StartDate: 00:00:01  Mon Oct 12 17:20:10
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [smp-nodes]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
job is deferred.  Reason:  BankFailure  (cannot debit job account)
Holds:    Defer  (hold reason:  BankFailure)
PE:  1.00  StartPriority:  24
cannot select job 18049 for partition DEFAULT (job hold active)

In the output above, note the line “BankFailure (cannot debit job account)”. This message means that your credit is completely used up, has expired, or does not exist (a typing error in the account name?), or that you are not a member of this credit account.
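
Because the checkjob output is long, it can help to filter it down to the lines that explain the blockage. The following is only a sketch: the function name job_block_reason is made up for illustration, and the patterns are based solely on the sample output shown above, so other failure modes may use different wording.

```shell
# Hypothetical helper: keep only the lines that say why a job is not starting.
job_block_reason() {
  grep -E 'Reason:|^Holds:|^State:'
}

# On the cluster, pipe the real command into it:
#   checkjob -vv 18049 | job_block_reason

# Offline demonstration using lines copied from the sample output above.
sample_output="State: Idle  EState: Deferred
job is deferred.  Reason:  BankFailure  (cannot debit job account)
Holds:    Defer  (hold reason:  BankFailure)
PE:  1.00  StartPriority:  24"

reason_lines=$(printf '%s\n' "$sample_output" | job_block_reason)
printf '%s\n' "$reason_lines"
```

If the reason shown is BankFailure, check the account name you submitted with and your remaining credit before resubmitting.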

cluster-lbt/troubleshooting.txt · Last modified: 2020/09/30 17:06 by 127.0.0.1
