wigblog

ramblings of a philomathic polymath

Saturday, January 28, 2006

For those who are interested in what I do…

So I’ve been wanting to document, in a little more technical detail, what I have done and what I am doing in my work. I think there are some folks out there who might be interested… I don’t know. If nothing else, it will be a good reference piece for future resumes and whatnot.

Starting in late 1996, my career path gave me the opportunity to become generally focussed on web development. Prior to then, the only internet related work I had done was for an ISP dealing primarily with customers getting connectivity (read: phone support) and the like. I did some personal web development in my spare time… this was in the days when you used a simple text editor and typed raw code (still my preferred development method)… none of this point-and-click Dreamweaver crap. :-)

At this time in 1996, I was commissioned to put together a group of developers that would do award winning website design and implementation. I started out having to do a lot of the development work, but as we brought more people on board I got to turn my attention to where I was really interested… server side development, databases, server management and configuration, etc. It was this type of work, as it relates to internet facing infrastructure, that I really focussed on up until a few months ago.

In the world of internet facing servers, the idea is to not implement anything that stresses a server too heavily. You more or less want to see the server sitting as idle as possible at all times, so that in the event of some spike in traffic, there will be ample headroom to handle said spike. This means paying particular attention to server side code you write, database queries, etc., making sure they are as optimized as possible to induce as little computation power as possible, but still provide the required functionality. Kind of a funny way to think… spend all this money on powerful servers to do complicated logic as quickly as possible, but try to keep them 99% idle. Sounds like a waste, doesn’t it?

Then a few months ago I took a job opportunity with a new company much closer to home. I had previously had to drive long distances to work… prior to my current 8-12 minute commute, I think the best I previously had was about 30 minutes. The previous 3.5 years’ commute averaged about 50 minutes whether I drove or took the commuter train, and sometimes was as bad as an hour and 30 minutes.

If you’ve read any other computer related content on this website, or engaged in a computer related conversation with me, it’s probably no secret to you that I’m a big fan of free/open source software (FOSS). Things like Linux, Apache, MySQL, PHP, etc. These tools are pervasive through the majority of internet/web architecture (despite any marketing-speak you might hear from Microsoft) and have been fundamental tools I’ve used throughout my career. With this job change, came an industry change as well. I had already been with a civil engineering firm, an ISP, a 3D game development company, a software development company, and advertising agency, a biotech, and I would now be working in the semiconductor industry. The great thing about this I discovered, is that they were even more dependent on Linux and FOSS technologies than any industry I had previously worked with, obviously something exciting for me. Another significant change in taking this position was that my prior work had primarily focussed on internet facing applications whereas this new position would focus on LAN based applications. No more worrying about how some remote person on the other side of the world would perceive the availability and usability of the products I was supporting, but instead strictly the people in the same office I was working (and one branch office in Canada).

Another fundamental shift is that in the semiconductor industry, the typical computing tasks of all the different departments I’ve worked with so far are computationally intensive. Some of this comes from the typical compilation of source code into object code… developer build processes, nightly build processes, testing, etc. The other more common task comes from simulations. There are two types of simulations I’ve come across so far, and I don’t yet understand their nature completely, but they are basically tasks that consume 100% of a CPU for long durations. So I’ve found it to be an interesting shift from an environment where I was trying to keep servers as idle as possible, to an environment where I’m trying to keep them as busy as possible. From what I understand, there are an infinite number of these simulations that can be run, but the fact that some of these simulations take *days* to run simply means they can’t all be run. It’s really about making sure you’ve run the most pertinent simulations to replicate conditions that might be found in the real world and know how your product is going to behave. While it might be interesting, it’s of no value to run simulations that emulate conditions that will never occur in the real world.

That said, even running the simulations emulating real-world conditions can take a very long time to do. Especially when you’re fine tuning certain logic that requires these tests to be re-run a number of times. The way most of these companies do this kind of work is on a large group of computers called a cluster.

Clusters are not a new technology, but have only in recent years really come into prominence. It used to be that big fast computers were single units with a large number of processors and some obscene amount of RAM, storage and cost an even more obscene amount of money (think millions of dollars). It used to be that the folks who had these large, computationally intensive problems to solve were either government research labs (atomic explosion modeling, climate prediction), or educational institutions that could buy these systems through government grants and the like.

But with the advent of industries like biotech (gene sequencing) and others, the need to execute computationally intensive tasks worked their way down into smaller and smaller companies. The folks who needed to run these computationally intensive processes realized that it would be much more cost effective if they could figure out a way to use easily acquired, cheap, desktop PC type systems, working together in some fashion to solve these large problems. That’s where clusters come in. A cluster is essentially a group of computers, tied together in some fashion to be used cooperatively to solve some big, computationally intensive problem.

Now I had read plenty about clusters in all of my regular tech industry related reading, but alas, I had never really had a need to build and work with one. At the biotech I worked for, I was told there might be a need one day for some of the research scientists to have a cluster available, but it never materialized. So I went along in my internet focussed environment where servers are typically divided by task and not necessarily working together on a common problem, but isolated and executing very different functions as designated by the server’s particular role.

On arrival at the new job, the need for a cluster was long overdue, and everyone knew it. They needed it “yesterday”. I was excited…. finally a valid reason and an opportunity to build out my own cluster! The “cluster” they had in place when I got there was not really a cluster at all, but five independent machines that were generally available to anyone and everyone as a “work pool” of computers on which they could execute their jobs. There was no management system that kept anyone from overloading any of the machines, no balancing the needs of the different departments, etc. When users wanted to run a job on one of these systems, they would have to check each machine to find which was the least busy where they could start their task… an annoying pain in the butt. And an essentially infinite number of jobs could be started on any one of the systems, which could either overload the machine and potentially cause a crash, or more typically just starved the already running tasks of CPU resources meaning that all the jobs ended up taking longer to complete. But… things were working for them more or less. Fortunately the folks using these systems were smart enough not to overload them and they were getting by. Five more machines had been purchased and were on the way to be added in to the work pool, but they were really desiring a true cluster setup with these machines. Thus the stage was set for me to set off learning about clusters and how to implement them.

There are two kinds of cluster computing: High Performance Computing (HPC) and Batch Job Processing (BJP). (I know “HPC” is an official acronym you’re likely to see somewhere else… I’m not sure about “BJP”.) HPC computing is where one program is launched across all machines in a cluster and each machine works on a subset of the same particular problem. BJP computing is where there are many individual tasks to be run and each processor in the cluster is assigned a new job when it finishes the previous job. In both of these, the goal is to keep the computers as busy as possible 100% of the time. HPC type applications typically have to be written specifically to be executed in a parallel fashion as indicated, where BJP can run any type of application, but only as fast as a single processor can run it. HPC is about solving a single, big problem as fast as possible, where BJP is more about running as many separate, individual, varying types of jobs as possible, as quickly and efficiently as possible.

The need in my work environment is for the BJP type of cluster. The folks I work with use primarily common-off-the-shelf (COTS) software whether that be FOSS or a specialized piece of software purchased from another company, and they just need to run many, separate instances of this software with varying parameters as quickly and efficiently as possible. Fortunately the type of cluster computing required does not necessarily dictate the type of cluster to be built. It’s quite possible (and easy) to build a cluster than can do both HPC and BJP.

Knowing what types of cluster computing there are, let’s talk a bit now about cluster types. There are multiple ways to setup and manage a cluster. There’s probably more than what I’ll cover here, as I’m only going to cover the types with which I’m familiar. The first is what’s called a “single system image” (SSI) cluster. This design is built on the idea that the cluster operates off (and users only interact with) a single head node (the master node), and that additional computers added to the cluster (slave nodes) simply appear as additional processors and RAM to this machine. This essentially makes the cluster appear as one of those large, monolithic machines with lots of CPUs and RAM and it’s very cheap and easy to expand it by simply adding machines. Jobs running on this cluster are all started on the master node, and then automagically load balanced across all slave nodes in the cluster. There are pros and cons to this type of setup. Some pros:

  • This cluster type is very easy to setup. It’s really as easy as inserting a CD installer into your master node, booting from it and following the on-screen instructions.
  • Assures balanced utilization of all your hardware since the load balancing is all done transparently behind the scenes.
  • Very easy to expand this cluster type, especially when you have your slave nodes booting from a copy of the operating system stored on the master node. (Slave nodes in this case require no local hard disk.)

Some cons:

  • Good load balancing is really only achievable for jobs that execute longer than 5 seconds. There is a certain overhead involved in migrating a process to a slave node from the master node. This works fine for those long running simulations, but software build/compilations that consist of many sub-5 second operations on small files end up executing entirely on the head node (which is also tasked with coordinating the duties of all the slave nodes).
  • This SSI functionality is limited to version 2.4 of the Linux kernel. There is experimental support the 2.6 kernel (now at least 2 years old), but you don’t find any ready made installation distributions that use the 2.6 kernel. This can be a limitation of you require certain 2.6 functionality or hardware support. I only found it a problem as it meant I would be running a significantly outdated Linux distribution lacking certain “standard” utilities making integration into the network environment more difficult.
  • All users login directly to and use the master node to launch jobs. This presents the opportunity for users to overload the cluster by launching too many jobs at once. There’s no default limitation on how many jobs can be launched by a single user though standard system utilities to limit this could conceivably be used.
  • No intelligence / queueing / accounting system in place for job management. While one could probably be installed atop the environment, there is no software that can intelligently manage and queue jobs on the system by default. Probably wouldn’t make much difference anyway with users directly logged in to the master node there’s nothing to keep them from launching programs directly on the master node bypassing any job management that might be in place. In addition, the fact that there’s no queueing system in place also means the system is incapable of being an effective BJP platform.
  • With all the slave nodes being ‘dumb’, and the system appearing as a single, big SMP system from the head node, this makes SSI an ineffective platform for HPC applications as well.

Then besides the SSI cluster described above, you have what I like to call the “true cluster”. There’s probably a more official name for this type of cluster, I either just can’t think of it as I write this or I haven’t heard a succinct name for it. This cluster type deals with many of the shortcomings of the SSI cluster type. Needs we know need to be overcome:

  • Users logging in directly to the master node (or any cluster node for that matter). We probably want to avoid users touching any of the cluster nodes directly so they don’t have any type of control over where and when their jobs are launched.
  • A setup that will work effectively for long and short-running processes
  • The ability to use current software (kernel and distribution)
  • Some kind of an intelligent management system that will organize utilization of the cluster, queue jobs, provide accounting information, and enable BJP applications to run effectively
  • An environment in which HPC applications can be run effectively

Fortunately for me, the two Linux magazines I subscribe to (Linux Magazine and Linux Journal) had both recently been featuring articles on clusters. There’s also a fairly recent O’Reilly book on building clusters as well. With a basic, ground zero level of understanding, my first goal was to get all that hardware powered up and available for processing. I decided the best plan of attack would be to first setup an SSI cluster that would take away the manual process of load balancing jobs across the machines and be fairly quick to deploy, while I kept a couple machines for myself to work on building out what I envisioned for the long term: a cluster that met all of the needs outlined above.

For the SSI cluster, I turned to a product that seems to be used fairly commonly and was touted in a magazine article I had as well as covered in the O’Reilly book: ClusterKnoppix. Knoppix, for those who haven’t heard of it, gained notoriety as one of the first, well organized “live CD based” distributions. What this means is that you can boot a system off of a Knoppix CD and have a completely usable system without requiring you to change the contents of any local storage (hard disk). This is a great way for those unfamiliar with Linux to be able to work with it, without having to wipe out or reshuffle the contents of their hard disk. ClusterKnoppix is a distribution built off the base Knoppix distro, but adds tools to create an SSI cluster very simply using the most common Linux SSI cluster technology: OpenMosix. OpenMosix is the tool that manages the load balancing of all the processes in the cluster across all the nodes in the cluster transparently in the background. Getting this cluster setup is pretty simple: download and burn the CD image, insert the CD into your master node and boot the machine, follow the on-screen instructions and it will install the distribution to your hard disk. While possible, it would be inefficient to run all of your cluster nodes booted from CD’s as input/output (I/O) to and from the CD’s is much slower than to/from a hard disk. And as these computers are dedicated for usage in the cluster, it makes sense to put everything straight on the hard disk. One nice thing about ClusterKnoppix is that it can be used on a bunch of machines at night, when they would otherwise be idle, and returned to their functional state in the morning with the contents of the hard disk never having been touched.

With a little bit of extra configuration, you can get things setup for your slave nodes to network boot from the master node, eliminating the need to do any kind of installation on your slave nodes. They just load their files from the master node and are instantly added in to the cluster. I set this up initially using 3 of the new machines that had been ordered. I decided to leave the 5 existing systems in place as they were being used until I had a viable option ready for everyone to which they could easily move. The two remaining new machines I set aside to build my next revision of the cluster that would be my long term solution.

Getting the SSI/OpenMosix/ClusterKnoppix based cluster up and running was indeed quick and easy, and once I had it in what I felt was a usable state, let one of the “power users” loose on it to run a bunch of simulations. These proved to work nicely and over time, the other users migrated to this cluster and were able to use it as well without issue. Eventually all 5 of the existing machines were moved over into this cluster setup, giving us a total of 8 nodes that we easily able to load balance all processes that ran longer than 5 seconds (Under OpenMosix, a process only becomes eligible for migration to another node after it has been running for 5 seconds.)

My Fortune 500 IT experience has made me overly cautious in my planning and deployments. Which is good in a sense as the quality of my work is very high (I like to think), but perhaps slower than it might be if I threw some of that caution to the wind. Working in a startup again has been a challenge as I’m essentially trying to help these guys think and operate more like a big company, yet still retain the typical speed of change and agility for which startups are known. So adhering to the “act like a big company” idea, I don’t migrate people to a new environment prior to it being fully tested and “burned in”. I’m happy to say that I underestimated myself and by the time I had the 5 existing systems migrated into the SSI I already had the next cluster configuration in an almost operational status. I think it was only one weekend that the SSI cluster had a full 8 nodes in it, I started migrating the nodes to the new configuration for testing right away.

For the next cluster configuration, I knew I had a lot of tools to learn that I hadn’t dealt with before. Fortunately for me, others had already laid the ground work to get a cluster of my desired type up and running quickly. A Linux based cluster essentially consists of a fairly standard OS installation at the base, with a combination of software packages installed on top of it that implement and assist in managing the clustering functionality. In my research, I came across two projects that had the tools pre-bundled in an easy to use package: OSCAR and Rocks. I chose OSCAR for my cluster though I’m actually interested in trying to build a Rocks cluster as well.

Where Rocks comes as a complete OS installation package, the OSCAR package is essentially a collection of software that enables clustering functionality, bundled with installation and management software tools. You can install your choice of any of the supported distributions (”supported” meaning that the installation and management scripts are tested and should run without issue on the distro), and then you install the OSCAR package over it. The benefit here is that you can start with a base OS install you might already be familiar with (which I was), rather than go with a completely customized install that others have tweaked, the particulars of which you might not be familiar with.

The installation starts with the master node. I am quite familiar with the Fedora distribution and OSCAR really seems to lean in the Red Hat direction. (Other supported distros are all Red Hat related more or less: RHEL, Fedora, CentOS, Whitebox, ScientificLinux, etc.) I went with Fedora Core 3 (FC3), the latest version of Fedora supported at the time I was ready to do my install. While not the latest version of Fedora, I knew it inside and out and knew this version would be more than adequate. So, per the instructions, I did a very basic, standard install of FC3 and went through the OSCAR installation after that. OSCAR comes with a nice point-and-click installation interface (not that I require one), as well as some very good documentation complete with screen shots that make the process quite easy. Once the master node is setup, you follow a few more instructions to work on getting your slave node(s) setup. Slave nodes are installed over the network from the master node. This is nice as it’s a very quick process (much quicker than a CD/DVD based installation) and the slave node is already tweaked and customized how it needs to be to participate in the cluster.

With the basics of the cluster installation complete is was time to start getting familiar with the various packages involved with operating a cluster. This really ended up revolving around the resource management and scheduling tools. It was difficult for me to understand where the line was drawn between these two functions and the line is still pretty blurry actually. These tools essentially keep track of what hardware exists in the cluster, what jobs need to be run on it, and takes care of allocating the necessary resources to execute the jobs in the queue to be run. There are several of these systems available, some free and open source, and some commercial offerings. OSCAR bundles a package called TORQUE for resource management. For scheduling, a very basic “first in - first out” (FIFO) scheduler comes with TORQUE. OSCAR also includes a package called Maui, that is a much more complex scheduling tool. So while TORQUE manages the queue(s) into which jobs to be run are submitted, manages the available hardware in the cluster (knows what’s there and what’s busy or not busy), the scheduler then just tells the resource manager “when and where” to run the next job according to it’s configuration. The FIFO scheduler is very simple in that it provides just enough functionality to tell the resource manager to run the jobs in the same order they are submitted. Almost seems like it’s only there just so that TORQUE can operate alone. In any type of a true clustering environment, I can’t imagine a case where one would not want to configure some kind of customized scheduling depending on the types of jobs needing to be run. Knowing in advance my needs would be more complex than the FIFO scheduler could offer, I enabled the Maui scheduler right away.

Out of the gate, the Maui scheduler essentially operates as a FIFO scheduler, it must be configured to behave how you want it to… and that’s been no trivial task. Perhaps I should explain what my needs are like. At this point, I have 4 known job types that will need to be run on the cluster: Long-running simulations, short-running simulations, Matlab jobs (I’ll leave you to Google Matlab if you’re curious), and software compilation / build jobs. Each of these different job types has varying priority. The group that runs the long-running simulations wants to run as many of these simulations as possible all around the clock. These jobs typically take several days to complete, and while they want the results ASAP, because they know there is a wait time of several days anyway, it’s not essential that the job runs 24/7 through to completion. The short-running simulations involve licensed software that costs something on the order of $25K per seat, and there is immediacy to the jobs as the user will typically sit there and wait for it to complete. These jobs take somewhere on the order of 5-10 minutes to run. Matlab is a licensed piece of software, but the jobs take some time to run… somewhere around 12 hours I believe. The software compilation jobs can take anywhere from a few minutes to a couple of hours. It’s the configuration of the Maui scheduler that determines how and when these jobs will run, the “policy” if you will.

The policy I’ve setup in the cluster is to allow as many of the long-running jobs to run as there are available processors to handle them. Then when any other job type comes up, the long-running job is ’suspended’, or “put on pause” while the other job is allowed to run. (Some of you at this point maybe wondering or might start wondering: “well aren’t these great computers today supposed to be able to multi-task and run multiple jobs simultaneously?” While that’s true, don’t forget that two jobs competing for 100% of a single processor will only get 50% of it and then both jobs will take twice as long to complete than if they were the only job running on the processor. In my case, it’s better to stop the one that’s going to continue running for days and execute the one that won’t run very long right away.) This gives the user waiting on the short-running job a sense that his needs are being met with immediacy (or close to it), while still maximizing usage of the cluster resources. I don’t have to have idle processors sitting around waiting for the short running jobs all the time. It is possible to configure this scenario as well, but like I said, I’m all about maximizing resource usage… I want 100% of my processors working 100% of the time, as long as there are jobs to be run anyway.

Now while it’s very cool that I can have the cluster management software put one job on pause and run another job in its place and then resume the job when the short-running one has completed, suspending a job means that the job is still consuming memory and other non-processor related resources (file handles, pipes, network connections, etc.) on the node where the job is running. What would be even better is to stop this job completely, have it release *all* of it’s consumed resources, yet still have the ability to restart it from the exact point it was stopped as if it was never interrupted. This kind of functionality is called “checkpointing”. Checkpointing is definitely possible, but is a procedure that must happen at the OS level, and while it comes standard on some of the commercial Unix flavors out there, there is no standard checkpointing functionality available in Linux at this time. There does appear to be some experimental implementations available which I’ll be checking out sometime soon. I’d like to have this functionality as it would then actually allow a process on one node to be stopped and then, if necessary, migrated to a different node in the cluster where it can be restarted and continue to completion.

The configuration I’ve done thus far as been a challenge. Both TORQUE and Maui are complex systems with their own set of commands with which to configure and manage each piece of software. And making changes in one package can affect things in the other since they work so closely. It’s hard to know what troubleshooting whether to look at things from the TORQUE end or the Maui end. Some of the data you get is different depending on which tool you use to look at it, and sometimes getting to the data you’re looking for can be a frustrating adventure. Unfortunately these tools are not documented as well as would be hoped, so experimentation, testing, and patience are required to get it working how you want it to. And even when you think you’ve finally got things right, something will happen that you just have no idea on. One of my next tasks will be to devise some type of job simulation tool to load up the system with dummy jobs and then try to recreate these strange conditions to figure out why they happen…. and of course I’m doing this on a production system since I don’t have spare clusters lying around. Dummy jobs don’t have any potential to cause any damage, so I’m not to worried about it.

Another nice aspect of using the TORQUE/Maui software is that jobs can be submitted to the cluster from remote machines. This means that cluster users never have to “touch” any of the cluster nodes. Jobs are submitted to the cluster from remote systems and the resource management / scheduler determined where and when they are run, and the results of the job are sent back to the use on the system they were submitted from. It’s nice to keep users off the cluster nodes as this assures the resources of the cluster will be utilized according to how the policies are set in the scheduler.

As it stands at the time of this writing, I have the 5 original machines, the 5 machines that were purchased shortly before my arrival on the job, 3 machines borrowed from our parent company, and one spare desktop PC for a total of 14 nodes in the cluster. The group that needs to run the long-running simulations is really driving the expansion of this cluster. They tell me they can easily keep 50 machines busy (I believe ‘em), and I’ve even heard a long term number of 100 machines being in this cluster. In the short term, we’ve been targeting to add 20 more machines and evaluated several architectures so we could spend our money as wisely as possible. We ended up deciding on dual processor AMD Opteron systems. We had tested a system loaned to us by our parent company and were quite pleased with it. After coming to a final system configuration specification and final pricing with our vendor, I purchased one just as a final test before ordering the 19 others. I dropped this single machine into the cluster and it integrated seamlessly so the remaining 19 were ordered and should be ready for delivery when I get back from my vacation. I’m looking forward to getting all those systems setup! When I do that, I’ll probably take the 3 systems on loan out along with that lone desktop PC after which I should have the following hardware in the cluster:

  • 10: Pentium 4 @ 3GHz, 2GB RAM, Gigabit ethernet
  • 20: Dual Opteron 246 @ 2GHz, 4GB RAM, Gigabit ethernet

Pretty fun having all that power available… but how do I really know how much power I have? How do I know that the system as a whole is operating as efficiently and effectively as it should be? This can be evaluated by running a benchmark on the system that will stress its most critical components: the processors, RAM, and interconnect. (Ahh yes “interconnect”…. I haven’t talked much about the technical aspects of cluster construction, but one important consideration in building a cluster that can effectively run HPC type applications is the speed at which the nodes can communicate with each other. The medium by which this communication happens is what’s called the cluster “interconnect”. In my case, I’ve chosen to use the gigabit ethernet interfaces on these machines which is also used for network access.) There are several standard benchmarks that are used to test clusters, but none probably more popular than the Linpack HPL benchmark, an HPC type application, that is used to rank clusters on the Top 500 list, a list of the 500 most powerful computers on the planet (or at least the ones that execute this benchmark the best). Figuring out how to run this benchmark is an undertaking in and of itself, but I’ve been running it and measuring how the cluster is doing as it’s been growing.

From what I’ve been able to glean from all the reading I’ve done related to this benchmark, it’s generally considered “good” if your cluster can achieve a sustained performance level 50% of that which should be a system’s theoretical peak. So in reading up on the Top 500 list, you’ll note two numbers listed for each system: Rmax and Rpeak. Rpeak being the theoretical peak performance of the system, and Rmax being the measured performance level. Divide Rmax by Rpeak and you have your efficiency ratio. For the Linpack HPL benchmark which is strictly about floating point computation, the theoretical peak performance of a system is determined by multiplying the number of floating point calculations a processor can execute in a single clock cycle by the frequency of the processor. For example, a Pentium 4 chip can execute 2 floating point operations per cycle, so at a frequency of 3GHz (3 billion cycles per second), that gives a theoretical peak performance (or Rpeak) of 6 billion floating point operations per second (FLOPS), or in industry parlance: 6 Gigaflops. The last benchmark I ran on the cluster was with the 10 Pentium 4 nodes and the 3 borrowed from our parent company… so 13 nodes. 6 Gigaflops per node * 13 nodes = 78 Gigaflops is my Rpeak. I came out with an Rmax of around 54.4 Gigaflops, or an efficiency rating of 69.8%… quite respectable if I do say so myself. Some of you might be wondering how well that ranks against the systems on the Top 500 list. If I remember correctly, that rating would have placed 251st on the list from June of 2000. I still have along ways to go if I’m going to com close to getting a spot on the list!

It’s very interesting running the Linpack HPL benchmark on the cluster. The typical jobs we run on the cluster are definitely processor intensive, but RAM and network/interconnect utilization stay pretty low. OSCAR includes a piece of software that helps you gauge the utilization of these components on your cluster called Ganglia… this is something I watch all throughout the day to keep an eye on the cluster. Its interesting to watch all the graphs on there go up to the top when running the benchmark. And if you’re in the same room as the cluster when that benchmark starts, you’ll hear all those fans kick into high gear…. sounds like a jet engine more or less. Another aspect I haven’t mentioned in all of this is that when you get a lot of computers running in one place, you have to start thinking about how much power they are consuming and how much heat they are putting out. I’d be able to comfortably fit about 7 of those Pentium 4 machines on a single household sized circuit (15A). At the time of this writing I have 15 systems running, soon to be at least 30…. that means I have to spread that power load across multiple circuits (3-4 industrial sized circuits (20A)). And then the cooling requirements… when these things are spinning up there at 100% they definitely put out some heat. Dedicated cooling for the room where the cluster lives is required as well as decent air circulation. I won’t get into the formula to determine how much cooling is required…. you can ask me if you’re really that interested.

With the OSCAR cluster setup, I’ve been able to address the needs that were not met in the SSI configuration: jobs are submitted by users remotely, long and short-running processes are dealt with appropriately by the TORQUE/Maui combo, a relatively current base Linux distribution is being used, and the TORQUE/Maui combo enables both BJP and HPC applications to run effectively on the cluster.

That’s more or less what I’ve been focussing on at work for the past few months. And it’s going to be a continuing process… adding in more features and functionality to even further enhance utilization of the cluster, as well as adding more hardware as the company buys it. And continually fine tuning the resource management / scheduling software to make sure things run efficiently, responsively and effectively. It’s doing pretty darn well right now, but it can get better.

posted by jwigdahl at 7:50 pm  

1 Comment »

  1. James:

    Cool stuff. I’m assuming that cluster software isn’t really available/affordable/user-friendly enough for mainstream desktop OS’s (OSX or XP), but how cool would it be to have that setup on a home network? Using your wife’s/kid’s computers to help speed up a video encoding session would be pretty sweet. I haven’t got a clue as to what technical difficulties that may pose, but I’m suprised it hasn’t been implemented in some fashion, especially on a Mac. I do know that Logic Pro has a “distributed computing” mode that allows additional computers to be added to speed up processing, but OS-level integration could allow any process (from any program) to be “outsourced” to the cluster (of course, that would mean that all of these processes would have to be fully multi-threaded, which a lot of programs aren’t, but as we move more and more towards multi-core, I’m sure we’ll see quite a few more multi-threaded apps). One day, sending a task out to a cluster could be as user-friendly as accessing a network drive.

    -Bryan

    Comment by Bryan — 2/15/2006 @ 12:13 am

RSS feed for comments on this post. TrackBack URI

Leave a comment

Powered by WordPress