[UCI-Linux] RFC on RCS proposal for cheap, reliable storage
Harry Mangalam
harry.mangalam at uci.edu
Wed Jun 14 10:18:11 PDT 2006
Hi All,
This is a long preview of one project we (the Research Computing Support group
at NACS) are working on. We're hoping that this strawman proposal will result
in comments that will help us refine and improve the project.
The first part of the project is the specification of a high quality commodity
linux box built with an eye to storage. The second part concerns the
software required to do various tasks. The third part is what you can do with
multiples of such a device and the fourth is how it might be applicable
to other organizations in the UCs. Only the first two are dealt with in any
depth here.
Discussions with Computer Support personnel in other Schools at UCI suggest
that storage is more important than CPU cycles. Not only is more storage
required, but the kind of storage required is moving towards longer-term,
more robust storage. Similarly, the minimum useful size has moved beyond
the GB range, well into the TB range. As such, our interest is in a
minimum 10TB device, which can be ganged into larger units.
In light of that and the cost of such storage from large commercial vendors
such as Network Appliance, Sun, EMC, etc., I examined the base cost of devices
that could provide this storage. Rather than try to replicate their methods,
which are often expensive and proprietary (Fibre Channel controllers through
high-end switches, driving small but very fast 10-15K RPM disks), I looked at
the sweet spot of such storage: direct-connect SATA RAID5, using generic
500GB-750GB disks in hot-swap trays, with hot and cold spares and redundant
power supplies (PSs) to provide robustness.
HARDWARE
========
A rackmount server (5U container with 2xOpterons, 4GB RAM, 3yrs onsite
warranty, redundant PSs, 2200VA UPS, mirrored system disks, 2x12-port 3ware
SATA RAID controllers, 24 hotswap trays, but no disks) is about $6500. You
can cram up to 15TB (usable) into one of these units (24x750GB disks
configured as 2x(11 disks in RAID5 + 1 hotspare)), which costs an additional
$12K at current consumer prices. This would still bring the entire 15TB
system in at ~$20K with tax and shipping.
Using more economical 500GB disks ($300 each), you can get 10TB of usable
space from such a device for $7200, for a total of less than $15K. These
prices are roughly 1/6 to 1/20 the cost of comparably sized devices from
major vendors (but do not address the software issue). If you wanted to set
up 2 complete servers to provide failover backup, it would still cost far
less than the Sun or NetApp price for a single system (though there are
admitted problems with setting up such replicated systems).
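A quick sanity check of the capacity and price arithmetic above (all figures
are the ones quoted in this post; disk prices are current consumer prices):

```shell
# 24 trays split as 2 x (11-disk RAID5 + 1 hotspare).
# Each 11-disk RAID5 loses one disk to parity, so 10 data disks per array.
disk_gb=750
data_disks=$(( 10 * 2 ))                        # data disks across both arrays
echo "usable: $(( disk_gb * data_disks )) GB"   # 15000 GB, i.e. ~15 TB

# Same layout with 500 GB disks at ~$300 each:
echo "usable: $(( 500 * data_disks )) GB"       # 10000 GB, i.e. ~10 TB
echo "disk cost: \$$(( 300 * 24 ))"             # $7200 for all 24 drives
```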
This system easily supports in-box speeds of up to 100MB/s writes and 200MB/s
reads, far exceeding 1Gb ethernet backbone speeds. The system comes with
2 Gb ethernet interfaces, and more could be added quite cheaply. If it were
dual-homed to 2 Gb nets, the maximum bandwidth on both ports would start to
saturate the disk IO, but in practice such bandwidth demands would rarely be
seen.
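The saturation claim follows from back-of-envelope arithmetic: a Gb link
carries at best ~125 MB/s of payload, so two fully loaded ports slightly
exceed the quoted ~200 MB/s in-box read speed.

```shell
# Two fully loaded Gb interfaces vs. the ~200 MB/s read speed quoted above.
per_port_MBps=$(( 1000 / 8 ))                         # 1 Gb/s ~= 125 MB/s
echo "dual-homed peak: $(( 2 * per_port_MBps )) MB/s" # 250 MB/s > 200 MB/s
```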
I'm assuming that this is the type of physical storage that most people want.
Correct me if this is wrong.
The above configuration places the controller PC with the disks. Should you
want the controller separate, this can be done as well for another ~$400.
Part of the problem depends on what level of failure the individual
organizations are willing to risk. Here are some scenarios.
Q: Can it suffer a disk failure without losing data?
A: Yes. RAID5 + hotspare allows the RAID to keep working through a disk
failure, recruiting the hotspare to rebuild the array, although with some
loss of performance.
Q: Can it suffer 2 simultaneous disk failures?
A: Yes, if they are in different RAID5s, but not in the same RAID5. However,
it could survive up to 2 simultaneous disk failures in each array if
configured as RAID6 (at the cost of one more disk's capacity per array).
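To make the capacity trade-off concrete, here is the per-half arithmetic for
the 12-tray halves of the 24-tray chassis (the RAID6 figures are my
extrapolation, not a configuration quoted above):

```shell
disk_gb=750
# RAID5 half: 1 hotspare + 11-disk array with 1 parity disk -> 10 data disks
echo "RAID5 usable: $(( disk_gb * 10 )) GB"   # 7500 GB per 12-tray half
# RAID6 half: 1 hotspare + 11-disk array with 2 parity disks -> 9 data disks
echo "RAID6 usable: $(( disk_gb * 9 )) GB"    # 6750 GB per 12-tray half
```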
Q: Can it suffer a Power Supply failure and maintain data?
A: Yes. It has redundant PSs.
Q: Can it suffer 2 simultaneous PS failures and maintain data?
A: Yes. Data integrity would be maintained via battery-backed array
controllers but the array would be offline until the PS was replaced.
Q: Could it suffer a PC or controller system failure?
A: Yes, but see answer immediately above.
Q: Can it be set up for complete failover? (completely separate replicate
machines set to mirror each other).
A: Tentatively yes. The failover mechanism is easy to set up, but the
mirroring could be tricky. Depending on the services that needed to be
replicated and the speed at which they needed to be synchronized, there are
some utilities that could do this (rsync and unison for normal files;
databases and special-purpose files would need to be handled separately).
However, all such utilities have edge cases where they could fail. If your
systems need to be this robust, my solution is probably inappropriate for you
until we have more data on the behavior of such utilities.
SOFTWARE
========
Software is always harder. The cost is easiest - it's all free; there may be
commercial best-of-breed solutions for some of these functions, but I haven't
examined them yet. I recommend using the Ubuntu base distro for fast
and easy setup and admin. Others may prefer RHEL. In my experience, the
Debian-based distros stay closer to the mainstream kernel and development
lines, but RHEL is internally consistent over the long term and better
supported by commercial software that has to be run.
The 'webmin' Web-based administration tool (now part of the Open Management
Consortium) can be used for many administrative functions (see
http://www.webmin.com). The 3ware RAID system can also be managed via a web
interface, although it runs separately from the webmin tools. I've used the
Areca controllers as well and while they have some advantages, they're not
yet in the mainline kernel and their supporting utilities are not up to the
level of the 3ware utilities.
Some of the requirements we've heard from computer support coordinators simply
may not map well onto such a system, but many of them are supported. The
system obviously can be maintained remotely so it can be supported centrally
or locally as desired. One crucial point that came out of the cross-school
discussions was that local administrators wanted to have oversight on the
system to enable local users, change permissions, change filesystem quotas,
etc. Certainly this can be done, but it will require an understanding of
shared responsibilities if NACS is involved with the system at all.
In terms of types of filesystems and protocols, Linux supports more than any
other OS, including at least 4 well-debugged journaling filesystems (ext3,
JFS, XFS, reiserfs). It can export SMB shares as well as, or better than,
native Windows servers. Similarly, it can make storage available as NFS,
AppleshareIP, DAVfs, subversion & CVS version control systems, and even
Andrew FS if required. Extensive volume management can be done in a number
of ways, but IBM's open-sourced EVMS system seems to cover most of them
(http://evms.sourceforge.net).
Authentication to most services can be done by Kerberos (and so it can use
UCINETIDs), local login, LDAP, or combinations thereof, although mixing &
matching will involve more complexity.
I have not done a serious examination of virus-scanning software for Linux,
except that clamav is fairly well-regarded and is included in most distros.
This was a highly rated function for many of the Schools.
In terms of backups, I've only considered a single type so far - automated
disk-based backups that would span up to a few months. The system is called
BackupPC (http://backuppc.sf.net) and after about a month of irregular
testing, reading, and posting to the fairly active list, it seems to have
most of what a good backup system should have. I'll expand on this in
another posting in a bit; this kind of system requires a lot of detail.
[Please post your followup questions or write to me directly - we'd appreciate
some peer-review. Gotchas, facts or experience in support or rebuttal
(preferably with pointers to data), utilities I omitted, challenges to
unsupported statements, other requirements, etc.]
Thanks
Harry
--
Harry Mangalam - Research Computing at NACS, E2148, Engineering Gateway,
UC Irvine 92697 949 824 0084(o), 949 285 4487(c) harry.mangalam at uci.edu