[UCI-Linux] RFC on RCS proposal for cheap, reliable storage

Harry Mangalam harry.mangalam at uci.edu
Wed Jun 14 10:18:11 PDT 2006


Hi All, 

This is a long preview of one project we (the Research Computing Support group 
at NACS) are working on. We're hoping that this strawman proposal will result 
in comments that will help us refine and improve the project.  

The first part of the project is the specification of a high-quality commodity 
Linux box built with an eye toward storage.  The second part concerns the 
software required to do various tasks.  The third part is what you can do with 
multiples of such a device, and the fourth is how it might be applicable to 
other organizations in the UCs.  Only the first two are dealt with in any 
depth here.

From discussions with Computer Support personnel in other Schools at UCI, 
storage is a higher priority than CPU cycles.  Not only is more storage 
required, but the kind of storage required is moving toward longer-term, more 
robust storage.  Similarly, the minimum useful size has moved beyond the GB 
range, well into the TB range.  As such, our interest is in a device of at 
least 10TB, which can be ganged into larger units.

In light of that, and the cost of such storage from large commercial vendors 
such as Network Appliance, Sun, EMC, etc., I examined the base cost of devices 
that could provide this storage.  Rather than try to replicate their methods, 
which are often expensive and proprietary (Fibre Channel controllers through 
high-end switches, driving small but very fast 10-15K RPM disks), I looked at 
the sweet spot of such storage: direct-connect SATA RAID5, using generic 
500GB-750GB disks in hot-swap trays, with hot and cold spares and redundant 
power supplies (PS) to provide robustness.

HARDWARE
========
A rackmount server (5U chassis with 2xOpterons, 4GB RAM, 3yrs onsite 
warranty, redundant PS, 2200VA UPS, mirrored system disks, 2x12-port 3ware SATA 
RAID controllers, and 24 hotswap trays, but no data disks) is about $6500.  You 
can cram up to 15TB (usable) into one of these units (24x750GB disks configured 
as 2x(11 disks in RAID5 + 1 hotspare)), which costs an additional ~$12K at 
current consumer prices.  This would still bring the entire 15TB system in at 
~$20K with tax and shipping.  

Using more economical 500GB disks ($300 each), you can get 10TB of usable 
space from such a device for $7200 in disks, for a total of less than $15K.  
These prices are roughly 1/6 to 1/20 the cost of comparably sized devices from 
major vendors (though this does not address the software issue).  If you wanted 
to set up 2 complete servers to provide failover backup, it would still cost 
far less than the Sun or NetApp price for a single system (but there are 
acknowledged problems with setting up such replicated systems).
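
To make the arithmetic explicit, here is the capacity and cost math as a quick 
sketch in shell arithmetic (the per-disk prices are the current consumer prices 
quoted above and will obviously drift):

    # usable space: 2 RAID5 sets, each 11 disks minus 1 disk of parity;
    # the hotspares contribute no capacity
    echo $(( 2 * (11 - 1) * 750 ))   # 15000 GB usable with 750GB disks
    echo $(( 2 * (11 - 1) * 500 ))   # 10000 GB usable with 500GB disks
    # disk cost for filling all 24 trays
    echo $(( 24 * 300 ))             # $7200  for 24 x 500GB at ~$300 each
    echo $(( 24 * 500 ))             # $12000 for 24 x 750GB at ~$500 each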

This system easily supports in-box speeds of up to 100MB/s writes and 200MB/s 
reads, far exceeding what a 1Gb ethernet backbone can carry (~125MB/s 
theoretical, somewhat less in practice).  The system comes with two Gb ethernet 
interfaces, and more could be added quite cheaply.  If it were dual-homed to 
two Gb nets, full bandwidth on both ports would start to saturate the disk IO, 
but such demand would rarely be seen in practice.
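
For reference, the rough line-rate numbers behind that claim (real-world 
throughput will be somewhat lower than the theoretical ceilings):

    echo $(( 1000 / 8 ))   # ~125 MB/s theoretical ceiling per Gb ethernet link
    echo $(( 2 * 125 ))    # ~250 MB/s with both Gb ports flat out, which is
                           # where the ~200MB/s in-box read speed becomes the limit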

I'm assuming that this is the type of physical storage that most people want.  
Correct me if this is wrong.

The above configuration places the controller PC with the disks.  Should you 
want the controller separate, this can be done as well for another ~$400.

Part of the design depends on what level of failure risk the individual 
organizations are willing to accept.  Here are some scenarios.

Q: Can it suffer a disk failure without losing data?  
A: Yes. RAID5 + hotspare allows the RAID to keep working through a disk 
failure, recruiting the hotspare to rebuild the array, although with some loss 
of performance during the rebuild.

Q: Can it suffer 2 simultaneous disk failures?  
A: Yes, if they occur in different RAID5s, but not in the same RAID5.  
However, it could survive up to 2 simultaneous disk failures in each array if 
the arrays were configured as RAID6 (at the cost of one more disk's worth of 
capacity per array); a software-RAID sketch of both layouts appears after this 
Q&A list.

Q: Can it suffer a Power Supply failure and maintain data ? 
A: Yes. It has redundant PSs.

Q: Can it suffer 2 simultaneous PS failures and maintain data? 
A: Yes. Data integrity would be maintained by the battery-backed array 
controllers, but the array would be offline until the PSs were replaced.

Q: Could it suffer a PC or controller system failure?  
A: Yes, but see answer immediately above.

Q: Can it be set up for complete failover (completely separate, replicated 
machines set to mirror each other)?
A: Tentatively, yes.  The failover mechanism is easy to set up, but the 
mirroring could be tricky.  Depending on the services that needed to be 
replicated and the speed at which they needed to be synchronized, there are 
some utilities that could do this (rsync and unison for normal files; databases 
and special-purpose files would need to be handled separately), and one such 
approach is sketched below.  However, all such utilities have edge cases where 
they could fail.  If your systems need to be this robust, my solution is 
probably inappropriate for you until we have more data on the behavior of such 
utilities.
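
As an aside on the RAID5/RAID6 question above: for anyone who wants to 
experiment with these layouts without a 3ware card, Linux software RAID (mdadm) 
can model the same tradeoffs.  This is only an illustrative sketch with made-up 
device names, not how the 3ware arrays themselves would be configured:

    # one array as proposed above: RAID5 over 11 disks plus 1 hotspare
    mdadm --create /dev/md0 --level=5 --raid-devices=11 \
          --spare-devices=1 /dev/sd[b-m]
    # the same 12 disks as RAID6: survives 2 simultaneous failures per array,
    # at the cost of one more disk's worth of capacity
    mdadm --create /dev/md1 --level=6 --raid-devices=12 /dev/sd[n-y]
    # watch rebuild progress after a failure
    cat /proc/mdstat

And for the file-level part of the mirroring question, the kind of thing I have 
in mind is a periodic rsync push from the primary to the standby.  Again, only 
a sketch - the hostname and paths are made up, and as noted above, databases 
and open files need their own treatment:

    # mirror the export tree to the standby, preserving ownership/permissions
    # and deleting files that have been removed on the primary
    rsync -a --delete -e ssh /export/raid1/ standby.uci.edu:/export/raid1/
    # from cron, e.g. hourly:
    # 0 * * * *  rsync -a --delete -e ssh /export/raid1/ standby.uci.edu:/export/raid1/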



SOFTWARE
========
Software is always harder.  The cost is the easy part - it's all free; there 
may be commercial best-of-breed solutions for some of these functions, but I 
haven't examined them yet.  I recommend using the Ubuntu base distro for fast 
and easy setup and admin.  Others may prefer RHEL.  In my experience, the 
Debian-based distros stay closer to the mainstream kernel and development 
lines, but RHEL is internally consistent over the long term and better 
supported by commercial software that has to be run.

The 'webmin' web-based administration tool (now part of the Open Management 
Consortium) can be used for many administrative functions (see 
http://www.webmin.com).  The 3ware RAID system can also be managed via a web 
interface, although it runs separately from the webmin tools.  I've used the 
Areca controllers as well, and while they have some advantages, they're not 
yet in the mainline kernel and their supporting utilities are not up to the 
level of the 3ware utilities. 

Some of the requirements we've heard from computer support coordinators simply 
may not map well onto such a system, but many of them are supported.  The 
system can obviously be maintained remotely, so it can be supported centrally 
or locally as desired.  One crucial point that came out of the cross-school 
discussions was that local administrators wanted to have oversight of the 
system: to enable local users, change permissions, change filesystem quotas, 
etc.  Certainly this can be done, but it will require an understanding of 
shared responsibilities if NACS is involved with the system at all.
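
To give a flavor of the local-admin operations meant here, this is roughly what 
enabling a user and setting a quota would look like from the command line 
(webmin exposes the same functions through the browser); the user, group, path, 
and limits below are purely for illustration:

    # add a local user and give a lab group access to its project tree
    useradd -m -G researchlab jdoe
    chgrp -R researchlab /export/raid1/researchlab
    chmod -R g+rwX /export/raid1/researchlab
    # ~50GB soft / 55GB hard block quota (in 1KB blocks); 0 = no inode limit
    setquota -u jdoe 50000000 55000000 0 0 /export/raid1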

In terms of filesystems and protocols, Linux supports more than any other OS, 
including at least 4 well-debugged journaling filesystems (ext3, JFS, XFS, 
reiserfs).  It can export SMB shares as well as, or better than, native Windows 
servers.  Similarly, it can make storage available via NFS, AppleShare IP, 
DAVfs, subversion & CVS version control systems, and even the Andrew File 
System (AFS) if required.  Extensive volume management can be done in a number 
of ways, but IBM's open-sourced EVMS system seems to cover most of them 
(http://evms.sourceforge.net).
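
As a concrete (and entirely hypothetical) example of offering the same volume 
over NFS and SMB at once - the path, address range, share name, and group below 
are placeholders:

    # /etc/exports -- NFS export of the RAID volume to a campus subnet
    /export/raid1   128.200.0.0/16(rw,sync,root_squash)
    # after editing, reload the export table with:  exportfs -ra

    # /etc/samba/smb.conf -- the same tree as an SMB share
    [research]
        path = /export/raid1
        read only = no
        valid users = @researchlab
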
Authentication to most services can be done by Kerberos (and so it can use 
UCINETIDs), local login, LDAP, or combinations thereof, although mixing and 
matching will involve more complexity.

I have not done a serious examination of virus-scanning software for Linux, 
except to note that clamav is fairly well-regarded and is included in most 
distros.  This was a highly rated function for many of the Schools.
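
For reference, the basic clamav workflow is just two commands (the scan path is 
a placeholder):

    freshclam                      # update the virus signature database
    clamscan -r -i /export/raid1   # recursive scan, report only infected files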

In terms of backups, I've only considered a single type so far: automated 
disk-based backups that would span up to a few months.  The system is called 
BackupPC (http://backuppc.sf.net), and after about a month of irregular 
testing, reading, and posting to its fairly active mailing list, it seems to 
have most of what a good backup system should have.  I'll expand on this in 
another posting shortly; this kind of system requires a lot of detail.
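
To give a sense of the shape of its configuration, a few of the knobs in 
BackupPC's config.pl (which is plain Perl) look roughly like this - the values 
are examples for illustration, not recommendations, and I'll go into real 
settings in the follow-up posting:

    $Conf{XferMethod}  = 'rsync';  # pull backups over rsync/ssh
    $Conf{FullPeriod}  = 6.97;     # a full backup roughly weekly
    $Conf{IncrPeriod}  = 0.97;     # incrementals roughly daily
    $Conf{FullKeepCnt} = 12;       # keep about three months of weekly fulls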

[Please post your followup questions or write to me directly - we'd appreciate 
some peer-review.  Gotchas, facts or experience in support or rebuttal 
(preferably with pointers to data), utilities I omitted, challenges to 
unsupported statements, other requirements, etc.]

Thanks
Harry
-- 
Harry Mangalam - Research Computing at NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824 0084(o), 949 285 4487(c) harry.mangalam at uci.edu

