Friday, 9 June 2006

An Interview with Jeff Dike - The creator of User Mode Linux

Jeff Dike is the creator and maintainer of User Mode Linux (UML) - a virtual machine which runs on Linux. In recent times, UML has gained a lot of significance after Linus Torvalds incorporated the UML patch into the official Linux kernel source tree. Nowadays Jeff works full time for Intel, devoting his time to further development of UML. He has also authored a book titled "User Mode Linux", published by Prentice Hall. After reading his book and running UML on my own machine, I wanted to ask him a few questions about UML and how it fares when compared with other virtualization technologies. Jeff very kindly agreed to take time off from his busy work schedule to answer my queries. Without further ado, here are the questions I posed to Jeff Dike along with his replies.

Question: There are a lot of virtualization technologies other than UML, such as VMware, Xen and QEMU. What are the relative strengths of UML which would lead a person to choose it over its counterparts?

Jeff Dike: The reason varies according to the technology that you're comparing UML to. With qemu and other instruction emulators, the attraction of UML is speed. Emulators let you boot a kernel on a machine with a different architecture, e.g. a ppc kernel on an i386 host. When the architecture of the virtual machine is the same as the host, there are few reasons to take the overhead of instruction emulation, even if the emulator is optimized in this case to just virtualize instructions.

With hypervisor-based technologies such as VMWare ESX or Xen, the advantage of UML is simplicity. There are two aspects of this. The less important one is that you can have UML up and running by downloading a UML kernel and a filesystem, and running a shell command. This makes it very easy to be up and running with UML quickly.

The more important aspect of simplicity is that UML is conceptually simple. That is, for the host's system administrator, UML introduces relatively few new concepts. You don't have to learn how to administer a hypervisor, since, with UML, the hypervisor is Linux. A UML instance is a set of Linux processes, which every Linux admin knows how to examine and control. All of the Linux diagnostic tools, such as ps, top, and everything in /proc work as well for diagnosing problematic UML instances as they do for any other process on the system. When something goes wrong with a virtual machine and it has to be fixed quickly, UML allows all of your Linux tools, techniques, and experience to be applied to the problem. There's no need to introduce another OS, with which you have limited experience, in order to run some virtual machines.
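To illustrate the point about /proc, here is a minimal Python sketch (mine, not Jeff's, and not part of UML) that scans /proc for processes whose command line contains a given string. The function name `find_processes` and the `proc_root` parameter are made up for this example, and it assumes you know some string that appears on the UML instance's command line.

```python
import os


def find_processes(needle, proc_root="/proc"):
    """Return sorted (pid, command line) pairs for processes whose
    command line contains `needle`. `proc_root` is a hypothetical
    parameter added so the scan can be pointed at another directory."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only the numeric directories describe processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited, or we may not read it
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if needle in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)
```

For instance, if the UML kernel binary was started as `./linux` (an assumption about your setup), `find_processes("./linux")` would list the host processes belonging to it, which you can then inspect with ps, top, or the rest of /proc like any other process.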

Question: Are there any drawbacks of UML?

Jeff Dike: The main complaint about UML with respect to other technologies is speed. There is a noticeable amount of overhead with common workloads running under UML. This is a combination of Linux not being a perfect hypervisor and UML not being as well optimized as it should be. These are both being fixed. A number of hypervisor-related improvements have gone into Linux recently, including PTRACE_SYSEMU, which greatly improves system call virtualization performance, and MADV_REMOVE, which allows UML to implement hotplug memory.

On the UML side, a relatively recent change made enabling and disabling interrupts much more efficient. This made a surprising performance difference, with a kernel build inside UML on my laptop being 25% faster than before.

The ongoing container effort also promises to bring UML performance much closer to native. This project is adding virtualization infrastructure to the kernel in order to support lightweight containers such as OpenVZ and vserver. It turns out that UML can use this support in order to allow its process system calls to run directly on the host rather than being intercepted and emulated by the UML kernel. I implemented a container for the system time and used it to bring UML's gettimeofday performance to around 99% of native. LWN has done excellent coverage of this. Other containers will do the same for many other common system calls.

Question: In the book you have written on User Mode Linux, you describe the difficulties you faced in getting Linus to merge the UML code into the official Linux kernel source tree. Do you feel that the process of getting new features incorporated into the official Linux kernel source is too tiresome? Should Linus simplify this procedure to some extent? What are your thoughts on this?

Jeff Dike: I don't see that UML makes a good case for changing how easy it is to get new things into the Linux kernel. There should be some reluctance to incorporate new code. It should be fairly well-understood, especially when it affects other parts of the kernel. It should also be maintained and have an identifiable user base. All of these things take time to demonstrate, so any new project should spend some time being maintained outside of Linus' tree.

UML spent its initial life being maintained out-of-tree before being incorporated for the first time. I also describe a period after that in which it was difficult to get UML patches into mainline, and UML more or less went back to being maintained out-of-tree. This wasn't entirely Linus' fault, although my changes affect only the UML part of the kernel tree and I am the maintainer of that portion of the tree, so he should have just waved them through. It didn't take too long for my accumulated changes to essentially merge into a small number of very large patches. Submitting patches such as these is contrary to normal kernel practice, which is to have each patch contain a discrete identifiable change.

This deadlock was broken by Paolo Giarrusso, who recognized that UML was better off in-tree than out-of-tree, and sent a large catch-up patch to Andrew Morton, who forwarded that to Linus. This large patch contained all of the changes that I had accumulated in my own tree, and getting that into mainline synchronized Linus' tree with mine.

Any procedure can be improved, and getting code into the kernel is no exception. However, I don't see a case for anything drastic. Maybe it should be slightly easier to submit new code, or maybe it should be slightly harder, I dunno.

Question: There are Linux OSes which run from within Windows. CoLinux (www.colinux.org) comes to mind, which runs cooperatively alongside Windows on a single machine. Is it possible to run UML inside such Linux distributions which are run from within Windows?

Jeff Dike: If the Linux-inside-Windows is a complete and reasonably bug-free Linux, then UML should run fine. However, while UML is a completely normal process, it is a demanding one, and tends to expose kernel bugs that aren't seen anywhere else. So, UML should run inside something like CoLinux, but I wouldn't be surprised if it hit bugs when that is first tried.

UML is known to run inside VMWare, which isn't much of a surprise considering that VMWare virtualizes the hardware and runs the same Linux kernel as the host.

There is also the possibility of porting UML directly to Windows, or some other OS. There was a Windows port done a number of years ago (in part by the author of CoLinux) and it was very nearly completely working. There were screenshots on the project's web site (umlwin32.sourceforge.net) of UML/Windows running X, but they seem to have disappeared.

Question: With the increase in processor speed and the fall in memory prices, virtualization technology has come within reach of the average computer user. Naturally this has opened avenues which were not available in the past, and many OS companies are taking a keen interest in providing virtualization. For example, Apple has already released software (Boot Camp) which is used to run other OSes from inside OS X. (Update: Boot Camp is not a virtualization technology, but Apple is rumored to be working on building virtualization technology into its upcoming OS, code-named Leopard.) What, in your opinion, is the future of virtualization, and what significant role will UML play in it?

Jeff Dike: I see huge potential in application-level virtualization, in which applications gain some of the attributes of an operating system. In the final chapter of the book, I use the example of clusterized applications, in which an application, by incorporating a clustering technology, essentially becomes a cluster. By doing so, it allows multiple users to share a single instance of the application and to simultaneously work on whatever the application lets them work on.

For example, a clusterized word processor would allow many people to work on different parts of a large document at the same time, with the cluster technology within the word processor making sure that everyone sees the same data. The users would all be working on an up-to-date copy of the document, seeing real-time updates of changes made by other users. In a case like this, a cluster filesystem is likely to be the basis of the clustering. So, the rest of the filesystem infrastructure will need to have been incorporated into the application. This provides our word processor with a full internal filesystem, with a permission system, that can be used to store a large document in a directory hierarchy which reflects the organization of the document. This is only a matter of how the document is stored within the application and would not affect how it appears to the user. However, this representation does make it possible to use the permission system that the word processor has incorporated to assign parts of the document to individuals or groups and to enforce those assignments by setting ownerships and permissions on the internal files into which the document has been divided.

Clusterizing an application would make it possible for many people to work on a document, spreadsheet, presentation, or almost anything else as though it were a wiki. The question is where this application-level clustering will come from. Here's where UML comes in. There is a fair amount of kernel-level clustering available now. UML makes that technology available in userspace, by virtue of the fact that UML is a userspace port of the Linux kernel.

Almost everything in the Linux kernel is available in userspace via UML. A filesystem internal to the application is also interesting because it provides some consistency guarantees about the data stored within it, providing some crash-resiliency to the application. The SMP scaling work that has gone into the Linux kernel is the equivalent of threading support in a process.

Applications are coming under increasing pressure to become threaded as CPUs are built with an increasing number of cores. UML offers all of these things already running in a process. There will be work needed in order to incorporate any of this into an application, but that is likely to be easier than writing it from scratch.

Question: You have authored the book User Mode Linux (Read the review of the book), which I found a really interesting and informative read. It is unusual to find people who have created popular software and then sat down to write a book about it, but you have excelled in both fields. On this note, how difficult is it to write a book? Have you found writing a book easier than writing code, or vice versa?

Jeff Dike: For me, writing the book was much harder than writing code. Writing prose comes much less naturally to me than writing code. On top of that, writing a book comes with other constraints such as meeting a schedule and making sure that everything you write is well-structured at all levels, from correct spelling and grammar to the manuscript being a consistent and coherent whole.

My less-than-optimal work habits contributed somewhat to the problem. Generally, I had a chapter due every 3-4 weeks. The actual writing of a chapter tended to be done in the week before the deadline, and in some cases, the day or two beforehand. This led to the year 2005 being a cycle consisting of relaxation and good feeling immediately after completing a chapter, followed by two weeks or so of working on other things while an increasingly loud voice in the back of my head reminded me that I wasn't writing. This, of course, was followed by the aforementioned writeathon.

The result of this was that most of the time, I was racked by guilt over needing to write a chapter, but not doing so. Better work habits would have had me writing one chapter immediately after sending in the previous one, and polishing it in a leisurely manner until its deadline.

This situation was further complicated by mishaps such as losing about half of chapter 7 (which owners of this fine book will immediately recognize as being The Long One) during a laptop theft in France. In a classic case of closing the barn door after the horses have fled, I did institute a more careful backup procedure after this.

Question: Can you give a few examples of where UML has been put to use in a production setup?

Jeff Dike: You can rent a UML instance from a number of ISPs. linode.com is one that I am reasonably familiar with. A completely different area is embedded development - a number of companies use UML internally to simulate devices so that development can proceed before hardware is available. These companies tend to keep quiet about their activities - an exception is accenia.com, which sells an embedded development toolkit, one part of which is UML.

Question: When a person - especially one with a programming background - comes across the acronym UML, he immediately associates it with "Unified Modeling Language". Why did you opt for the name UML for this project, and do you foresee a name change?

Jeff Dike: I opted for the name because of a complete lack of imagination. If I had had any imagination, it would have been called Zeus or Willow or something equally spiffy-sounding and undescriptive. As for the acronym, I consider this to be similar to trademarks - collisions are OK as long as you're not confusing anyone. UML (the VM) and UML (the language), despite both being computer-related, are so dissimilar that no one is going to be confused by the clash. No one is going to go looking for a virtualization technology and get side-tracked by the language, or vice-versa.

Question: Are you entirely responsible for UML?

Jeff Dike: No! I am the principal maintainer of UML and therefore get the credit for it, but many other people have contributed to the project. Paolo Giarrusso, a college student in Italy, has been my second-in-command for a while, making a large number of contributions to UML, in the form of code, support on the UML mailing lists, and documentation. The UML user base has been most supportive, with many UML features owing their existence to requests, and occasionally to patches, from users. I would like to single out Bill Stearns for his support for the project in many ways since almost the beginning. Last, but not least, Intel has contributed greatly to the project since 2004 when they hired me to work full-time on UML.
