How Much Linux Cluster for $6000?
A recent article on Slashdot asked about assembling a parallel computing platform on the cheap. The budget was noted at £4000, which is conversationally about $6000 US. The application was described as CPU intensive, but not written for GPU support. The final note (of concern here) was the suggestion of using a boring Linux distro and Sun (now Oracle) Grid Engine, with storage done via NFS mounts.
Of course, the easiest and most popular response was to use services like Amazon’s Elastic Compute Cloud. There’s analysis that would need to be done to determine whether the size of the cloud used would be cost-effective. One thing to remember is that pricing is per hour per instance, so while sixty-eight cents an hour for their Cluster Compute Linux Instance looks cheap, that’s about $16 a day per instance. The $6000 budget buys roughly a year of one instance. Six months for two, then. Or about one day on about 300 nodes.
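To make that arithmetic concrete, here’s a quick back-of-the-envelope sketch using the sixty-eight-cent hourly rate mentioned above (actual EC2 pricing varies by instance type and region):

```python
# Rough EC2 cost math using the ~$0.68/hour figure quoted above.
BUDGET = 6000.00                    # dollars available
RATE_PER_HOUR = 0.68                # per instance
RATE_PER_DAY = RATE_PER_HOUR * 24   # roughly $16/day per instance

for instances in (1, 2, 300):
    days = BUDGET / (RATE_PER_DAY * instances)
    print(f"{instances:>3} instance(s): about {days:,.0f} day(s) of runtime")
```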
As a thought experiment, I mused about what could be done to make a cluster to run “locally,” given their requirements, and what some of the potentially unforeseen obstacles would be. How would that compare to the Amazon solution?
Looking at the requirements, I think it’d be possible to make a pretty fair cluster if that $6000 went entirely into equipment, meaning later utilities and manpower come from other sources, especially if a boring Linux distro is at the core. The mention was that the processing was CPU intensive, so it seems it’d need some fat processors. There wasn’t a mention of RAM requirements for the processing, but storage was noted as being remote to each server, which drastically cheapens each slave in our cluster. The boring Linux distribution means our OS cost is zero. As far as I can see, even the Oracle Grid Engine is free.
A battle-line gets drawn when starting any build-it-yourself PC project. Arguments over whether AMD or Intel is better, blah, blah. For the purposes of building on the cheap, though, it’s hard to beat AMD. Visit a discount computer supplier, like Newegg, and search for 6-core processors. AMD comes out a clear dollar winner, with Phenom II X6 processors ranging in price from $139 to $189, while Intel’s two i7 processors are $598 and $999. Additionally, the motherboards to support those processors start as low as $40 for the AMD, and from $150 for the Intel CPU. That’s a minimum build-point of $179 for AMD or $737 for Intel; or about 4 times as many AMD CPU-and-motherboard pairs for each Intel. Other options come up, too, like trying to put together a few quad- or dual-core systems for each six-core system, but in pure core count per dollar, the six beats them every time.
There’s a big battle point in that the Intel CPUs run just a little more than 3GHz, while the AMDs run a little less. Also, the Intel CPUs support Hyper-threading, giving each core potentially two threads of execution. To the first I’ll answer that four times six times just less than three is a lot more than one times six times just more than three. For the second, even assuming the Hyper-threading is always fully utilized, four times six times just less than three is still more than one times twelve (threads, not cores) times just more than three.
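For illustration, here’s that core-count argument as a sketch, assuming clocks of about 2.8GHz for the AMD and 3.2GHz for the Intel (the exact figures aren’t critical to the point):

```python
# The "four times six times just less than three" argument, with assumed
# clocks of ~2.8GHz (AMD) and ~3.2GHz (Intel) standing in for the real parts.
amd_core_ghz     = 4 * 6 * 2.8    # four 6-core AMD builds
intel_thread_ghz = 1 * 12 * 3.2   # one 6-core/12-thread Intel build

print(f"AMD:   {amd_core_ghz:.1f} aggregate core-GHz across 4 systems")
print(f"Intel: {intel_thread_ghz:.1f} aggregate thread-GHz across 1 system")
```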
Where this gets further tweaked is that each motherboard requires RAM and a PSU (at a minimum).
For these configurations, where the CPU draws about 125W and the hard drives are remote, each system should chug along nicely on more-or-less the same inexpensive 300W PSU; a quality one can be had for about $30. A benefit to cheap PSUs is that if they go out, they’re an easy thing to replace without digging into a tight budget. Additionally, for those not wanting a bunch of bare motherboards and power supplies just lying about, there are cheap cases that come with PSUs at that level for about the same price, so we don’t need to further consider cases, either. Each motherboard needs one (barring any kind of sharing scheme), so let’s conversationally amp our AMD price to $209 and our Intel price to $769.
RAM is a little different. The AMD motherboards use their RAM in pairs, while the Intel motherboards use triples (at least the ones supporting the six-core processors). At least conversationally, the individual RAM sticks cost about the same, and each platform supports DDR3 in more-or-less the same speed ranges. Picking a nice middle-of-the-road but still zippy 1333 speed, 2GB sticks run about $14 and 4GB sticks run about $20. To keep the pricing fair, and since RAM wasn’t cited as a heavy need, putting a pair of 4GB sticks in the AMDs and a trio of 2GB sticks in the Intel boxes gives about the same price; call it $40 (which is rounding nicely for the Intel).
This makes our base $249 for the AMD and $809 for the Intel. Any other additions would be identical per motherboard: whatever we add for GPU or storage or…whatever. So this gets us about three AMD machines for each Intel, or 18 cores and 24GB of RAM on the AMD side, with about $50 to spare, for each 6 cores and 6GB of RAM on the Intel side.
Going back to the Hyper-threading and CPU speed, this is 18 cores running at just less than 3GHz compared to 12 threads running at just more than 3GHz for about the same dollars. Depending on how strongly a group might be tied to Intel processors, that rounding error might be acceptable, and instead of having 3 systems to administer, there’d be one. That alone could make the difference for some.
So for this thought experiment, we’re going to use commodity 6-core AMD rigs with 8GB of RAM each and a 300W PSU, with or without a case. Just dividing that $6000 by $250 gives us a little group of 24 systems. A group of 24 systems gives us 144 cores of processing power, resting in 192GB of RAM.
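The division works out like this (a trivial sketch, using the rounded $250-per-node figure):

```python
# Slave-node budget math: ~$250 per 6-core AMD box with 8GB RAM and a 300W PSU.
BUDGET = 6000
NODE_COST = 250
CORES_PER_NODE = 6
RAM_PER_NODE_GB = 8

nodes = BUDGET // NODE_COST
print(f"{nodes} systems, {nodes * CORES_PER_NODE} cores, "
      f"{nodes * RAM_PER_NODE_GB}GB RAM")
# -> 24 systems, 144 cores, 192GB RAM
```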
This pokes at one assumption that’s been made: that the application will perform better on a multi-core system. If the grid-computing application was written in a multi-core or multi-threaded fashion to take advantage of this, then these bulkier systems make sense. Even if the application is written such that it’s trapped in one core, but multiple copies can be run on one system, these systems make sense.
What about the case where the application is not multi-core/thread friendly, and only one instance can be run on a system at a time (forgetting about virtual machines for the moment)? Could we build a bigger cluster of single-core systems to do better for the same money?
Looking again at Newegg, we can see they have AMD Sempron and Intel Celeron combo-packs with motherboards for about $70. Atom and VIA CPUs might run as low as $50. These run with pairs of memory, but (curiously) more expensive DDR2, so we can only get a pair of 2GB sticks for the same $40. Add in the same PSU (with or without case), and we’ve got $140 per core. This gets us 42 systems in our $6000 budget. That’s 42 cores and 168GB of RAM.
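Side by side, using the ballpark per-node prices above, the two build styles come out roughly like this:

```python
# Core and RAM counts for the two build styles, using the rough prices above.
BUDGET = 6000
builds = {
    "single-core": {"cost": 140, "cores": 1, "ram_gb": 4},
    "six-core":    {"cost": 250, "cores": 6, "ram_gb": 8},
}
for name, b in builds.items():
    n = BUDGET // b["cost"]
    print(f"{name:12s}: {n} systems, {n * b['cores']} cores, {n * b['ram_gb']}GB RAM")
# single-core : 42 systems,  42 cores, 168GB RAM
# six-core    : 24 systems, 144 cores, 192GB RAM
```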
Only if the application is core-bound does this make sense. If we’re constrained to running one instance of the application on the system, and that instance isn’t written in a way that allows for utilizing more than one core, then it’s better to have 42 instances running than 24, especially if those 24 are letting five of their cores spin. It might take some small system experimentation, as it may be the case that even with the extra cores, the single-core systems perform better.
However, even if the application is core-bound and system-tying, we have the ability to squeeze virtual machines onto our multi-core systems. We could theoretically squeeze five virtual machines onto each host and run six instances of the CPU-hogging application per box, giving us 144 instances of the application. Even if we can only utilize five of our six cores for the application, leaving one to handle the host, that’s still 120 instances of the application. This might affect the bottom line a little bit, as the additional virtual machines might require a bit more RAM in each system, or even upgraded motherboards that support more DIMMs as well as larger ones.
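Counting instances under those assumptions (and ignoring the extra RAM the guests would want):

```python
# Instance counts if the app is single-core and system-bound, but we run
# extra copies inside virtual machines on the 6-core hosts.
SYSTEMS = 24
CORES_PER_SYSTEM = 6

every_core = SYSTEMS * CORES_PER_SYSTEM          # host + 5 guests all run the app
spare_core = SYSTEMS * (CORES_PER_SYSTEM - 1)    # one core reserved for each host
print(every_core, "instances if every core runs the application")   # 144
print(spare_core, "instances if one core per host is left alone")   # 120
```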
We’ll assume more cores is better, and go with the six-core boxes.
We’ll have to give a little bit of that back, as we haven’t made our systems do anything, and we’ve forgotten an essential piece of the puzzle. We have no storage! Let’s make it nice and round, and take back four systems to give us $1000 of pocket money to make the rest of that work. So we’re going to have 120 cores and 160GB of RAM.
Before we get into the center of the network, we need to boot each of our systems. We could work this really cheap and set up network booting, letting each system run completely diskless. There are packages that can be added easily to our boring Linux distribution to allow this. The next cheapest step up would be to buy 20 4GB USB thumb drives for about $5 each, install our boring distro on them, and let each machine boot from USB. I’m ambivalent about which solution is better. There’s a bit more systems administration and one more piece of software involved in making a network-boot cluster, plus there’s the matter of additional network traffic. On the other side, there’s the time involved in making bootable USB drives and configuring 20 separate workstations (just cloning the drives would result in 20 machines with the same name, for one small obstacle), but there’s the reduction in bandwidth just to boot and run, and replacing cheap drives is, well, cheap.
Considering that the storage would be networked, either a NAS or a dedicated server could be used. If we’re going to consider building systems anyway, perhaps a big box should be put together to hold the project’s storage. The simple rig we’ve built so far would probably make a good NAS or NFS server. Additionally, that rig could double as the network boot server, the grid master, and home for a number of other tasks that wouldn’t be wanted on the cluster slaves. Because of this, I might consider adding a bit of RAM, just to allow running the additional services. Even doubling the RAM to 16GB only adds $40, assuming we can just put another pair of sticks on our motherboard, or that it supports the twice-as-expensive 8GB DIMMs.
Even the cheapest motherboards support four hard drives. Stepping up to a motherboard that supports six SATA drives only adds ten or twenty dollars, and jumping to one that supports eight drives is still only a hundred dollars. For the sake of making this a monster, let’s jump up to eight drives, and 16 GB of RAM. We have to drop $100 for the motherboard, still $140-ish for our six-core CPU, and $80 for our RAM, bringing our base for this server up to $320. We need a bigger power supply, to support all of those hard drives, so we’ll probably have to spring for a 750W or even a 1KW PSU, which will mean dropping another $80-100. Let’s be optimistic, and call this a $400 base.
The hard drives can be put into a RAID configuration for redundancy, striping, or both. Thankfully, we can rest assured that almost all of the $100-range motherboards have BIOS RAID. Even if they didn’t, we could use software RAID when we install our boring Linux distribution.
We need a little bit of that extra to buy a case big enough to hold all of those drives, so springing for a full-size tower (because this is one machine we don’t want to crate or lay out on a table) will run us closer to $100. This brings our base price to $500.
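As a rough bill of materials for that box, using the ballpark prices above (all of them on the optimistic side):

```python
# Rough bill of materials for the storage/master server, before hard drives.
server_parts = {
    "8-port SATA motherboard": 100,
    "6-core AMD CPU":          140,
    "16GB RAM (4 x 4GB)":       80,
    "750W-1kW PSU":             80,   # the optimistic end of the $80-100 range
    "full-size tower case":    100,
}
print("server base: about $", sum(server_parts.values()))   # ~$500
```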
Now we need 8 hard drives to put into the box. Cheap hard drives run about $40 for 160GB or so. Big hard drives run about 3TB for around $130. If we filled our 8-port box with these, we’d spend anywhere between $320 for 1.2TB and $1040 for 24TB of storage. If we’ve spent $600 of our spare $1000 on USB drives and the rest of this server, we only have $400 left, so we’ve got to err on the cheaper side. Neatly in the middle, 1TB drives cost about $50, which would take just more than half of our remaining dollars and give us 8TB of storage.
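Filling the eight bays with those three drive options works out roughly like this:

```python
# Cost and capacity of filling 8 SATA ports, using the ballpark drive prices above.
BAYS = 8
drives = {                 # (price per drive, capacity in TB)
    "cheap 160GB": (40, 0.16),
    "1TB":         (50, 1.0),
    "big 3TB":     (130, 3.0),
}
for name, (price, tb) in drives.items():
    print(f"{name:12s}: ${BAYS * price:>4} for {BAYS * tb:.1f}TB")
```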
We’ve got about $150 left of the original $6000. There are 21 systems, all 6-core AMD boxes. One has 16GB of RAM; the rest have 8GB. The big one has 8TB of storage that will be used by the remaining 20, which boot off of inexpensive USB drives. If we chose to network boot instead, we’ve got $250 left.
What to do with that? Well, one thing we haven’t quite got licked is how we get the systems talking to each other. Every one of those motherboards comes with a network adapter, and in all likelihood we can find them with gigabit ports. Certainly the larger system has one or more gigabit ports. We’ve got 21 systems that need connecting, and quite probably one or more ports will be needed for systems not in the cluster, or at least one gateway port. That’s a minimum of 22 ports.
Newegg sells commodity 24-port 10/100/1000Mb unmanaged switches for less than $150. The rest will probably have to go toward buying 22 patch cables, or a spool and ends to make them.
No keyboard or monitor is included. It’s assumed these could be borrowed for the purposes of installing the OS and getting at least the systems booting, and the rest of the configuration could be done from other workstations.
We’ve now got 21 computers to assemble, and a switch to attach them to. Twenty-one boring Linux installations, plus configuring the grid, plus adding the application. What’s possibly left?
Power, for one. We’ll go with the assumption that a UPS is available for at least the server; even an inexpensive $30 one could be scraped up to allow that system to spin down neatly. If the power goes out on a cluster node, we’ll have to hope the grid and application can handle the loss, and perhaps restart the work unit, or whatever.
But, where to plug-in the UPS and 20 other systems and the switch?
The typical outlet circuit is probably tied to a 15 or 20 amp fuse. A 20 amp circuit, at the 120V our power supplies draw, caps us at 2400 watts before the fuse blows; for 15 amps, that’s 1800W. Since we don’t know which we’re dealing with, we’ll play it safe and say 15 amps. That means we shouldn’t be plugging in more than 6 of our 300W power supplies per circuit. Fastidious readers will counter that that’s peak wattage, and that our drive-less systems will likely draw much less than that, even running the app at full speed. I’ll counter with the idea that we shouldn’t be maxing out the circuit anyway, so we’ll consider that less-than-peak usage our buffer, or else we need to drop to 5 systems per circuit to be safe.
Note that’s a circuit, not an outlet! Most outlets are on a chain of outlets on one circuit. Surely nearly everyone has had that fuse blow while using the microwave, dishwasher, and refrigerator at the same time, or something similar. They’re likely plugged in to different outlets, but those outlets are connected to each other and then to the main power junction.
If we’re limiting each circuit to 1800W or less, or 6 systems, we need 4 or more separate circuits. The server is bigger than the others, and the switch smaller, but 4 circuits gives us room for 24 systems. Note that at peak, this means we’ll be drawing about 7200W of power at a given time.
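The circuit math, as a sketch (assuming every node actually hits its 300W peak, which it won’t):

```python
# Circuit budgeting: 120V outlets on 15A or 20A circuits, 300W peak per node.
NODE_PEAK_W = 300
VOLTS = 120

for amps in (15, 20):
    cap_w = VOLTS * amps
    print(f"{amps}A circuit: {cap_w}W cap, up to {cap_w // NODE_PEAK_W} nodes at peak")

nodes = 21
per_circuit = 6                      # the conservative 15A figure
circuits = -(-nodes // per_circuit)  # ceiling division
print(f"{circuits} circuits needed; worst case about {circuits * 1800}W total")
```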
In a university or larger office setting, it’s possible that there’s a room already set up with this in mind. Higher voltage power is brought in and distributed over more circuits, perhaps even with larger amperage limits, so more systems on a single circuit will work.
Imagine for a moment the demands of laying out such a setting for 42 single-core systems! Plus the additional switch that would be needed, and the extra floor space.
In addition to figuring out where the outlets on separate circuits are in a small office (or worse, a home office), there needs to be room for 21 computers somewhere. Making a three-tall stack of seven-wide systems in mid-tower cases will take up about as much space as a desk. Unlike a desk, though, each will probably be kicking off a few BTUs of heat.
The heat concerns some people too much. Yeah, the room can get warm, but even if it heated all the way to the nearly 150°F maximum our 6-core CPUs can reach and still function, the CPUs won’t be damaged or perform any worse because of it. Nothing in the room will burst into flames. Ice might melt fast, and people may sweat, but that’s about all that will happen. Open a window or a door or something and the heat won’t be so bad.
It might be the case that to find enough circuits to connect everything to power, the systems are spread out in little groups, and the heat won’t be such a problem. This will require different cabling schemes, but longer (to a point) network cables won’t affect this simple cluster that much.
Now that everything is up and running, what have we got?
Twenty cluster/grid slaves, each with 6 cores running around 3GHz. Dig around a little bit and you’ll find benchmarks showing around 70-80 million instructions per second for the mid-range AMD 6-core processors. That seems like a whole lot less, but it’s instructions per second that we can translate into how well our software will run. We’ve got twenty of those, so 1.4-1.6 billion instructions per second (potential).
Backtracking a little bit, had we built our cluster out of those more expensive Intel processors, the same sites show better benchmarks, around 100 million instructions per second. But had we gone that route, we’d only have maybe seven of those, or 700 million instructions per second (potential). While the more expensive systems each offer a lot more computing, grouped like this, the additional systems more than make up for that. Same money, twice the throughput, but three times the systems. If the power or heat is a problem, or the administration of the extra systems is a bother, the Intel slaves might be the way to go (keep the AMD master server, though, ’cause it’s doing file and grid management, not “real” computing).
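Tallying that up with the benchmark figures quoted above (ballpark numbers as stated, not measurements of my own):

```python
# Aggregate throughput using the per-CPU benchmark figures quoted above.
amd_nodes,   amd_per_cpu   = 20,  75_000_000   # ~70-80 million instructions/sec
intel_nodes, intel_per_cpu = 7,  100_000_000

print(f"AMD cluster:   {amd_nodes * amd_per_cpu / 1e9:.1f} billion instructions/sec")
print(f"Intel cluster: {intel_nodes * intel_per_cpu / 1e9:.1f} billion instructions/sec")
```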
These will be connected to 8TB of storage, a pretty generous amount for data. They’ll be connected by a gigabit network, isolated at least by the switch, so each system should be able to talk pretty close to full-speed to the server. Of course, network throughput depends on the size of the data and how frequently it’s shuttled between work unit and server.
Compare that now with the Amazon cloud solution initially suggested. We’ve got twenty nodes in our cluster, so if that’s all we used in EC2, at $16 per day each or $320 a day for the 20 instances, we’d have only about 18 days to run.
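Or, in the same back-of-the-envelope terms as before:

```python
# The same $6000 spent on 20 EC2 instances at roughly $16/day each.
BUDGET = 6000
INSTANCES = 20
DAILY_COST = 16 * INSTANCES          # about $320/day
print(f"About {BUDGET / DAILY_COST:.1f} days of runtime")   # ~18.8 days
```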
This is similar logic to what I used to build our cluster at Hampshire, but then at some point I discovered how cheap used Dell C6100’s were on eBay, and started buying them instead, because that’s even cheaper.
Thanks a lot for the detailed document, this is actually the most helpful (and recent) documentation of the process required to design a cluster. Very helpful!
Do you have any ideas for how the network topology could be improved with respect to increasing the bandwidth whilst keeping costs low?
As a general question, Berty, it’s really hard to suggest improvements to network bandwidth without an understanding of where the actual bottleneck is.
The example wasn’t written with a particular solution in mind, and that would weigh heavily into the network solution. As mentioned, for a lot of CPU-intensive work, a robust gigabit switch should do the trick. The bottleneck here might be at the main server, which may have just the one connection to the switch and have to serve the needs of the other twenty-some machines. This is likely to be a bottleneck only in the cases of large, long-lasting payloads, or frequent simultaneous payloads or broadcasts.
This kind of bottleneck might be mitigated by giving the server multiple connections to the switch, and having sets of clients connect to the server through those different connections (i.e., most simply, put a 4-port gigabit network card in the server, give each connection a separate IP, and configure 1/4 of the clients to use each of those IPs, or use a round-robin approach to better divide the work to the server at the switch).
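Here is a minimal sketch of that static round-robin split, with made-up addresses and hostnames (the real assignment could just as easily live in a config file or DNS):

```python
# Sketch: statically spread 20 client nodes across four server IPs, one per
# NIC on the storage/grid master. Addresses and hostnames are hypothetical.
SERVER_IPS = ["192.168.10.1", "192.168.10.2", "192.168.10.3", "192.168.10.4"]
CLIENTS = [f"node{i:02d}" for i in range(1, 21)]

def assign_round_robin(clients, server_ips):
    """Map each client hostname to one of the server's IPs, cycling in order."""
    return {host: server_ips[i % len(server_ips)] for i, host in enumerate(clients)}

if __name__ == "__main__":
    for host, ip in assign_round_robin(CLIENTS, SERVER_IPS).items():
        # e.g. each client would NFS-mount its assigned address
        print(f"{host}: mount {ip}:/export/work at /work")
```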
If file access or network delivery of work units is a bandwidth issue, or it competes with other client-to-client communication needs, you could choose clients with multiple NICs and connect those to “dedicated” networks. For example, one network would be for storage and the other for cluster communications. This adds cost for additional NICs or motherboards with multiple NICs, but that cost can still be pretty low.
If the traffic is largely due to chatty clients, sharing work units for example, you could further divide the clients into other smaller physical networks this way, too; perhaps in groups of 4 or 6 nodes per network, where the work units make sense. This could go either way with the cost as it might be less expensive to have a number of smaller switches than large switches.
Another simple solution might be to have a multi-port NIC in each system and channel-bond the ports to the same network, effectively multiplying the throughput. This might break the “costs low” criterion, though, as it would then require more ports on the switches, and switches that can handle the channel bonding, but it is a quick, and sometimes less expensive, way to get 2Gb or 4Gb networking.
A final example, which will certainly not keep the costs low, is to jump to a faster network. A 10Gb switch and NICs would probably break the $6000 budget of the example without leaving any room to purchase systems.