The XGE cannot start the VNodesManager
Question: When I launch the XGE, it aborts because it cannot start the VNodesManager. I see messages similar to the following on the console:
Password:
libvir: Remote error : cannot recv data: Connection reset by peer
[ERROR] Failed to open connection to the Backend hypervisor
[ERROR] Could not start VNodesManager. Abort.
Answer: Please provide privileged (aka root) passwordless SSH access to:
- All physical machines
- All VMs
- The headnode itself (this means login to 127.0.0.1)
See the Configuration chapter for further instructions.
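One way to confirm the access listed above is to attempt a non-interactive root login to every host. The following is a minimal sketch and not part of the XGE: the function names are made up for illustration, and the host list is something you have to supply yourself. It relies only on the standard OpenSSH client options BatchMode and ConnectTimeout, which make the login fail instead of prompting for a password.

```python
import subprocess


def ssh_check_command(host, user="root", timeout=5):
    """Build an ssh command that fails instead of asking for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt for a password
        "-o", "ConnectTimeout=%d" % timeout,  # give up on unreachable hosts
        "%s@%s" % (user, host),
        "true",                               # harmless remote no-op
    ]


def passwordless_login_works(host, user="root", timeout=5):
    """Return True if `user` can log in to `host` without any prompt."""
    return subprocess.call(ssh_check_command(host, user, timeout)) == 0


def hosts_without_access(hosts):
    """Return the subset of `hosts` that still need to be set up."""
    return [h for h in hosts if not passwordless_login_works(h)]
```

Run from the head node, hosts_without_access(["127.0.0.1", ...]) with your physical machines and VMs filled in would list the hosts that still need attention. A host whose key has not been accepted yet is likely to be reported as failing too, because batch mode cannot ask for confirmation.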
The XGE refuses to start
Question: I have libvirt installed, but the XGE won't start.
Answer: Please make sure that the Xen daemon (aka xend) is running on all machines. libvirt refuses to run if it cannot talk to the backend hypervisor. In all versions after 2010.1, the xged should detect this on the head node and refuse to run.
I cannot start VMs and see a message about the network bridge
Question: When I try to start VMs with the XGE, it aborts with the following message:
libvir: Xen Daemon error : POST operation failed: xend_post: error from xen daemon: (xend.err 'Device 0 (vif) could not be connected. Could not find bridge device xenbr0')
Answer: Check whether you specified the correct bridge name in xge.conf (xenbr0 in this example). Your distribution's bridge name might be different from the default.
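To find out which bridge name actually exists on a machine, you can ask the kernel. The sketch below is not part of the XGE and assumes a Linux host with sysfs mounted at /sys, where every bridge interface exposes a "bridge" subdirectory under /sys/class/net. The function names are illustrative only.

```python
import os


def bridge_exists(name, sysfs_root="/sys/class/net"):
    """Return True if `name` is a bridge interface known to the kernel."""
    # Bridge interfaces have a "bridge" subdirectory in sysfs;
    # ordinary interfaces such as eth0 do not.
    return os.path.isdir(os.path.join(sysfs_root, name, "bridge"))


def list_bridges(sysfs_root="/sys/class/net"):
    """Return the sorted names of all bridge interfaces on this machine."""
    if not os.path.isdir(sysfs_root):
        return []
    return sorted(
        name for name in os.listdir(sysfs_root)
        if bridge_exists(name, sysfs_root)
    )
```

If bridge_exists("xenbr0") is false on a physical machine, list_bridges() shows the candidates for the name to put into xge.conf instead.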
The XGE won't receive any jobs from Torque
Question: I submit dozens of jobs through Torque, but nothing happens.
Answer: The XGE only recognizes jobs in a specific queue called virtual by default. If you submit jobs to another queue, they won't be recognized by the XGE. Check the log file in /opt/xge/jobs/log for entries similar to the following:
pbs_prolog.sh node02c1 152.int12909 virtual testjob.sh matthias users
pbs_epilog.sh testjob.sh 152.int12909 matthias users cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:00 virtual
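If the log is long, a few lines of Python can pick out the relevant entries. This is only a sketch: it assumes that each entry sits on its own line, starts with the script name (pbs_prolog.sh or pbs_epilog.sh), and contains the queue name as a separate word, as in the sample above. The function name is made up for illustration.

```python
def torque_hook_entries(lines, queue="virtual"):
    """Return the prolog/epilog log entries that mention `queue`."""
    hits = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        is_hook = fields[0] in ("pbs_prolog.sh", "pbs_epilog.sh")
        if is_hook and queue in fields:
            hits.append(line.rstrip("\n"))
    return hits
```

Passing the opened log file to torque_hook_entries() returns the matching entries. An empty result after submitting jobs suggests that the jobs went to a queue the XGE does not watch.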
The XGE aborts with ImportError: No module named libtorrent
Question: The XGE aborts with a message similar to the following:
  File "/opt/xge/modules/server/main.py", line 15, in <module>
    from GridWatchdog import GridWatchdog
  File "/home/matthias/stuff/xge/modules/server/GridWatchdog.py", line 14, in <module>
    import Job, HLTransfer, common.XGESocketServer
  File "/home/matthias/stuff/xge/modules/server/Job.py", line 14, in <module>
    import VMManager, XgeMessage
  File "/home/matthias/stuff/xge/modules/server/VMManager.py", line 21, in <module>
    from VNodesManager import *
  File "/home/matthias/stuff/xge/modules/server/VNodesManager.py", line 23, in <module>
    from ImageManager import ImageManager
  File "/home/matthias/stuff/xge/modules/server/ImageManager.py", line 13, in <module>
    import libtorrent as lt
ImportError: No module named libtorrent
Answer: Install a recent version of libtorrent-rasterbar including (!) the Python bindings.
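After installing the package, you can verify that the bindings are visible to Python before starting the XGE again. The sketch below is a generic import check and not part of the XGE; the helper name is made up. Run it with the same Python interpreter that starts the XGE, since a module installed for one interpreter is not visible to another.

```python
import importlib


def module_available(name):
    """Return True if the module `name` can be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def report(name="libtorrent"):
    """Return a one-line status message for the module `name`."""
    state = "found" if module_available(name) else "MISSING"
    return "%s: %s" % (name, state)
```

If report() says the module is missing although libtorrent-rasterbar is installed, the Python bindings were probably left out or were built for a different interpreter.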
All LXGEds are running, but the XGE aborts with Connection refused
Question: The XGE is running and reports that all LXGEds are running, and I can see the processes on the local systems. However, when the XGE tries to deploy a VM disk image, I see a Connection refused exception.
Answer: Make sure that the /etc/hosts file on every compute node does not contain an entry that maps the node's own host name to 127.0.0.1.
All LXGEds try to resolve their own IP address based on the host name. If the host name maps to 127.0.0.1, the LXGEd binds to localhost and is therefore not reachable by the XGE.
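The check described above can be sketched in a few lines of Python to run on each compute node. This is not part of the XGE, and the function names and the host name node01 in the usage note are hypothetical. As an assumption beyond the text, the sketch flags every 127.x.x.x address rather than only 127.0.0.1, because some distributions use another loopback address for the host name. It matches the host name exactly as returned by the system and ignores IPv6 entries.

```python
import socket


def loopback_entries(hosts_text, hostname):
    """Return hosts-file lines that map `hostname` to a 127.x.x.x address."""
    offending = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue                 # blank, comment-only, or malformed line
        address, names = fields[0], fields[1:]
        if address.startswith("127.") and hostname in names:
            offending.append(raw.strip())
    return offending


def check_local_hosts_file(path="/etc/hosts"):
    """Check this machine's hosts file against its own host name."""
    with open(path) as handle:
        return loopback_entries(handle.read(), socket.gethostname())
```

For example, loopback_entries("127.0.0.1 node01\n", "node01") returns the offending line, while a hosts file that maps node01 to its real network address yields an empty list.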