= Design of the Xen Grid Engine = [[TOC]] {{{ #!html The information on this page is out of date! Please have a look at our recent publications (listed on the main page) for in-depth information about the XGE! }}} == Architecture == The XGE is divided into three modules: Job Watchdog, Job Handler and VM Handler. [[Image(xge.png, 300px)]] The Job Watchdog is the core component watching for new jobs announced by the XGE prolog script. If a new job is detected, it is registered within a global job list. If a job is finished, the Job Watchdog is also notified by the XGE epilog script. The Job Handler manages all aspects of a job, e.g. registering, deleting, changing etc. It also offers a network interface for VMs and a special client program for the user. The VM handler is responsible for VM management, including deployment of VM images, starting, stopping and deleting VMs. == VM Deployment == Before the resource manager (RM) (here SGE/Torque) can execute a job on the nodes its scheduler has chosen, the XGE must deploy the user VM image to all compute nodes scheduled to execute the job. The XGE has a hook in the RM where it can see which nodes are chosen for scheduling. The XGE then pauses the RM and deploys the user image to the chosen nodes. Currently, the XGE supports three different types of image deployment: * ''Local images'': If all images reside on the local hard disk of the XGE node, the XGE can copy the required images from the local disk to the chosen compute nodes. * ''SSH deployment'': Images created using the ICS are stored in an image pool which is accessible via SSH. The required image is copied onto the chosen compute node using Secure Copy (''scp''). The overhead created by the encryption process is necessary since the image can contain sensitive data which needs to be protected. * ''Pre-deployed images'': Before deploying an image, the XGE checks if the required image is already on the compute node and skips deployment if this is the case. This mechanism is mainly used for testing purposes to save the deployment time. It can, however, also be used for image caching if consecutive jobs require the same image. === Deployment algorithm === The time needed to deploy the VM images to a number of nodes is crucial. If it takes to long a user could assume a possible error and cancel his/her job not knowing that a lengthy deployment process is in progress. Furthermore deploying a large number of VM images over a network consumes a lot of bandwith and could lead to network congestion. Image if a 5 GB VM image gets copied from one machine to 1000 remote machines. All copy processes have to share the bandwith between the source machine and the connected switch port. The above stated problems need clever algorithms to deploy the VM image to a large number of nodes without exhausting the infrastructure. The algorithm used by default is a binary tree deployment algorithm. The VM image is copied from the Image Pool to the first node, if we have an odd number of compute nodes for this job. If we have an even number of compute nodes the VM image is also copied to the "last" compute node. If a reasonable number of bytes arrive at the first node, the VM image is copied in parallel to both children nodes. This recursive procedure ensures that the VM image arrives on all compute nodes without exhausting the network. Furthermore there is a reasonable time gain over serial copying. A schematic overview over the binary tree is given in the following figure: [[Image(xge_bin_tree.png)]] === Future work === This algorithm scales well for a good number of compute nodes. If we advance to a large number of nodes (e.g. > 1000) better algorithms are needed. We investigate in a number of possible algorithms including P2P techniques where all compute nodes also act as "seeder" for VM images. Furthermore deployment over multicast is evaluated. Multicast has a lot of advantages, but also a lot of disadvantages. A clear evaluation including performance measurements, security considerations and multi-site support capabilities will help us to resolve unanswered questions. Stay tuned for more. == Placeholders and Real Virtual Machines == To make the XGE extension transparent to the RM, the RM should not realize that it is operating on VMs, since it would then see worker nodes appearing and disappearing. This would break RM's ability to properly schedule jobs, since the nodes allocated to the job queues change all the time (when nodes disappear for a while, the RM believes the nodes have crashed and cannot be used for further jobs and thus reschedules everything). To prevent this from happening, our solution uses placeholder VMs which are registered with the RM job queues. Every placeholder VM has a RM execution daemon (either sge_execd or pbs_mom) installed to let the RM know about a number of compute nodes and make scheduling decisions. When a job is scheduled, the placeholder VM is transparently exchanged with the user VM, so that the RM does not notice any change. === Placeholder design === A placeholder is in fact a standard Linux VM with minor modifications. No shared storage volume (e.g. NFS) is mounted and most system partitions (`/`, `/usr`, `/etc` and `/home`) are mounted read only. The remaining partitions, `/tmp` and `/var` are located on a ramdisk and fresh initialized on every startup. If we would have a complete read-write mounted virtual hard disk, the filesystem could suffer from data corruption through continuous writes of log files, temporary files etc. Of course, this method does not prevent that the image file holding the filesystem itself gets corrupted, but the likelihood that the filesystem gets corrupted through extensive usage is decreased. Furthermore no filesystem checks during startup are needed which speeds up booting time and we can destroy the placeholder immediately without having to properly unmount the disk. Measurements have proofed that the cleanup time for a placeholder could be decreased from about 8s to 1-2s. == Dynamic firewalls == The XGE secures both software and data from being accessed directly by unauthorized users. However, the network and thus the connected resources are fully accessible to a user if no firewalling technology is used. Since users can install software with root privileges (within their own VM), it is easy to install packet sniffing tools to monitor network traffic. It is also possible to probe for remote vulnerabilities in other users' images to gain access illegally. Thus, it is desirable to prevent network access between images of different users. It is also desirable to allow different users to have different network settings concerning the Internet. Traditional Grid applications do not require Internet access, but some commercial applications must contact a licensing server (e.g. FlexLM) to run and thus must at least be able to make outgoing connections. Fully service-oriented applications or interactive applications might even need to be reachable from the Internet and thus require incoming connections to be possible. System wide firewall configuration would either restrict applications with high connectivity requirements or unnecessarily endanger applications which normally operate in a private network. Since users already have their own operating system, we add to this a user based firewalling approach located in the ''Xen0'' controlling the user image to facilitate the different requirements. The extent a user is allowed to open or close ports for his or her image is specified by the administrator of the site either on a per user and/or a per virtual organization (VO) basis, relying on the amount of trust a user or a VO has. An overview about the dynamic firewalls is presented in the following figure: [[Image(fw.png, 500px)]] == Client/Server architecture == To ease communication with the XGE, one part acts as server providing various information about jobs. Furthermore the server accepts a series of commands to perform actions (delete jobs, shut down VMs, ...) on jobs. The XGE client (''xgec'')is a command line program used to communicate with the server. An overview of the communication protocol used by client and server can be found in the [wiki:Communication] section.