[UCI-Linux] RFC on RCS proposal for cheap, reliable storage

Dan Stromberg strombrg at dcs.nac.uci.edu
Wed Jun 14 11:11:15 PDT 2006


Hi Harry.

What do the RAIDs look like in your sizings below?  I could probably
paper-and-pencil it based on the raw capacity and your figures, but it
might be easier to just get it from you :)

If it's a single RAID volume, that seems like a pretty aggressive ratio,
even for RAID 6 with spares.

If not, then there can be significant performance and/or storage
penalties - e.g., a RAID 5 of RAID 5s will slow down a good bit, and a
RAID 10 will provide only about half the raw storage.  A stripe or
concatenation of RAID 5s might be pretty comfortable, though...
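
To make those tradeoffs concrete, here's a quick back-of-the-envelope
in Python - the 24-disk, 750GB shelf is just an assumed example, not
necessarily the layout you actually have in mind:

# Usable capacity under a few layouts, for an assumed 24 x 750GB shelf.
disks, size_gb = 24, 750
raw_tb = disks * size_gb / 1000

# Two 11-disk RAID5 arrays plus 2 hotspares: each array gives 1 disk to parity.
raid5_pair = 2 * (11 - 1) * size_gb / 1000

# The same layout as RAID6: each array gives 2 disks to parity.
raid6_pair = 2 * (11 - 2) * size_gb / 1000

# RAID 10: everything mirrored, so roughly half the raw capacity.
raid10 = disks // 2 * size_gb / 1000

for name, tb in [("2 x RAID5 + spares", raid5_pair),
                 ("2 x RAID6 + spares", raid6_pair),
                 ("RAID 10", raid10)]:
    print(f"{name:20s} {tb:5.1f} TB usable of {raw_tb:.1f} TB raw")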

I like the idea of doing this sort of stuff with commodity hardware.
One needs to be especially careful in such a scenario, though, that
things are as reliable as "needed" (each use case has its own
definition) if forgoing vendor support.

Then there's the theoretical problem with RAID 5 (and, I assume, RAID 6
as well) that rewriting a data block plus its parity isn't atomic at
the block level.  Supposedly Sun's new "RAID-Z" has such atomicity, and
it seems to me it should be possible to journal the writes to get
around this in RAID 5 too - but I haven't heard of anyone actually
doing that with RAID 5.  A journaled filesystem on top of the RAID
might conceivably help with this.  Then again, it might not....
In fact, it "feels" like it wouldn't.
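
Here's a toy Python sketch of why that non-atomicity matters - it just
models a stripe as data blocks plus XOR parity, and shows what a
rebuild would hand back if power died between the data write and the
matching parity write (purely illustrative; no real controller works at
this level):

from functools import reduce

def parity(blocks):
    # XOR parity across equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A 3-data-disk stripe and its parity block.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
par = parity(stripe)

# Rewrite block 0: the new data and the new parity are two separate writes.
stripe[0] = b"XXXX"
# ...power fails here, before the matching parity write lands...

# Later, "reconstruct" block 1 from the stale parity, as a rebuild would:
rebuilt = parity([stripe[0], stripe[2], par])
print(rebuilt == b"BBBB")   # False - the stripe is silently inconsistent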

You might find this of interest - it's just something I bookmarked a
while back, not something I've looked at much:
http://slashdot.org/palm/21/06/05/30/1453254_1.shtml .  It's based on
FreeBSD.

Supposedly xfs isn't good at ensuring the integrity of metadata - I
believe it was Ted Ts'o who suggested using xfs for valuable data only
if you have battery backup.  Recent versions of reiserfs have a
decidedly theoretical flavor to them, and IINM, it's proving difficult
to get a newer Reiser into the mainline Linux kernel.  However, I'm
doing my home backups with an oldish Reiser, and it's been working well
- I don't miss ext3's fsck'ing one bit.

Supposedly EVMS and LVM both sit on top of the Linux kernel's "Device
Mapper" subsystem, and you can use EVMS to manage something configured
with LVM, and vice versa - but supposedly EVMS may set some options
that'll close the door to managing that set of volumes with LVM again.

I agree that BackupPC looks interesting.  I also see some promise in
http://dcs.nac.uci.edu/~strombrg/plucker-urls/ .  Also, unadorned rsync
has an option (--link-dest) for creating distinct trees joined by
hardlinks.  Each tree feels like both a fullsave and an incremental,
but the collection takes roughly the space of a single fullsave plus n
incrementals, and when you remove "the fullsave", you can think of one
of the incrementals as "becoming" the new fullsave.  I'm not sure how
BackupPC relates to that, but rdiffweb appears to be more storage
efficient than that.  IINM, rdiffweb stores the changes using rsync's
binary diff'ing algorithm, while BackupPC lets rsync do the diff'ing
only for data in transit - but that's a pretty wild guess.  :)
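
As a concrete illustration of the rsync hardlink-tree idea, here's a
small Python wrapper around rsync's --link-dest option (the directory
layout and paths are made up for the example):

import datetime
import os
import subprocess

def snapshot(source, backup_root):
    # Make a dated snapshot, hardlinking unchanged files into the previous one.
    dest = os.path.join(backup_root, datetime.date.today().isoformat())
    prev = os.path.join(backup_root, "latest")  # symlink to the newest snapshot

    cmd = ["rsync", "-a", "--delete"]
    if os.path.isdir(prev):
        # Unchanged files become hardlinks into the previous tree, so each
        # snapshot reads like a fullsave but costs about one incremental.
        cmd.append("--link-dest=" + os.path.realpath(prev))
    cmd += [source.rstrip("/") + "/", dest + "/"]
    subprocess.run(cmd, check=True)

    # Repoint "latest" at the snapshot we just made.
    tmp = prev + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(dest, tmp)
    os.replace(tmp, prev)

# e.g. snapshot("/home", "/backups/home")  - hypothetical paths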

Good that you're staying below 16 terabytes on this - pretty sure you
know that the rules change a bit at that point (ext3, for one, tops out
at 16TB with 4K blocks).

IBRIX appears to be a very interesting way of aggregating a bunch of
backend NASes into a single, huge filesystem.

Jacob Farmer (who presented at Usenix in April 2005 about large
storage systems) is, I've been told, able to give free lectures about
large storage through his company.  Here are my notes from that Usenix
presentation:
http://dcs.nac.uci.edu/~strombrg/Usenix-talk-Next-Generation-Storage-Networking.html

Interesting stuff - thanks for taking the discussion relatively public.

On Wed, 2006-06-14 at 10:18 -0700, Harry Mangalam wrote:
> Hi All, 
> 
> This is a long preview of one project we (the Research Computing Support group 
> at NACS) are working on. We're hoping that this strawman proposal will result 
> in comments that will help us refine and improve the project.  
> 
> The first part of the project is the specification of a high-quality 
> commodity Linux box built with an eye to storage.  The second part concerns 
> the software required to do various tasks.  The third part is what you can do 
> with multiples of such a device, and the fourth is how it might be applicable 
> to other organizations in the UCs.  Only the first two are dealt with in any 
> depth here.
> 
> From discussions with Computer Support personnel in other Schools at UCI, 
> storage is more important than CPU cycles.  Not only is more storage 
> required, but the kind of storage required is moving towards longer-term, 
> more robust storage.  Similarly, the minimum size of storage has moved beyond 
> the GB range, well into the TB range.  As such, our interest is in a 
> minimum 10TB device, which can be ganged into larger units.
> 
> In light of that and the cost of such storage from large commercial vendors 
> such as Network Appliance, Sun, EMC, etc., I examined the base cost of devices 
> that could provide this storage.  Rather than try to replicate their methods 
> of providing such storage, which are often expensive and proprietary - Fibre 
> Channel controllers through high-end switches, driving small but very fast 
> 10-15K RPM disks - I looked at the sweet spot of such storage: direct-connect 
> SATA RAID5, using generic 500GB-750GB disks in hot-swap trays, with hot and 
> cold spares and redundant power supplies (PS) to provide robustness. 
> 
> HARDWARE
> ========
> A rackmount server (a 5U container with 2xOpterons, 4GB RAM, 3yrs onsite 
> warranty, redundant PS, 2200VA UPS, mirrored system disks, 2x12port 3ware SATA 
> RAID controllers, and 24 hotswap trays, but no disks) is about $6500.  You can 
> cram up to 15TB (usable) into one of these units (24x750GB disks configured as 
> 2x(11 disks in RAID5 + 1 hotspare)), which costs an additional $12K at current 
> consumer prices.  This would still bring the entire 15TB system in at 
> ~$20K with tax and shipping.  
> 
> Using more economical 500GB disks ($300 each), you can get 10TB of usable 
> space from such a device for $7200 in disks, for a total of less than $15K.  
> These prices are roughly 1/6 to 1/20 the cost of comparably sized devices from 
> major vendors (though that does not address the software issue).  If you 
> wanted to set up 2 complete servers to provide failover backup, it would still 
> cost far less than the Sun or NetApp price for a single system (but there are 
> admitted problems with setting up such replicated systems).
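> 
> For anyone who wants to check the arithmetic, here's a rough sketch in 
> Python (usable space is 2 arrays x 10 data disks each; the ~$500/disk 
> figure for the 750GB drives is just inferred from the $12K total, and 
> totals are before tax and shipping):
> 
> chassis = 6500                  # 5U box, controllers, trays, UPS - no disks
> for size_gb, per_disk in [(750, 500), (500, 300)]:
>     usable_tb = 2 * 10 * size_gb / 1000   # 2 x (11-disk RAID5 + hotspare)
>     disk_cost = 24 * per_disk
>     print(f"{size_gb}GB disks: {usable_tb:.0f}TB usable, "
>           f"${disk_cost} in disks, ~${chassis + disk_cost} total")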
> 
> This system easily supports in-box speeds of up to 100MB/s writes and 200MB/s 
> reads, far exceeding 1Gb ethernet backbone speeds.  The system comes with 
> two Gb ethernet interfaces, and more could be added quite cheaply.  If it were 
> dual-homed to two Gb nets, the maximum bandwidth on both ports would start to 
> saturate the disk IO, but such bandwidth demands would rarely be seen in 
> practice.
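> 
> To put rough numbers on that: a single gigabit port tops out around 
> 125MB/s of payload (less in practice), so:
> 
> gbe = 1000 / 8          # ~125 MB/s theoretical per gigabit port
> disk_read = 200         # in-box read speed quoted above, in MB/s
> print(f"one port : {gbe:.0f} MB/s wire vs {disk_read} MB/s reads -> network-bound")
> print(f"two ports: {2*gbe:.0f} MB/s wire vs {disk_read} MB/s reads -> disk IO starts to saturate")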
> 
> I'm assuming that this is the type of physical storage that most people want.  
> Correct me if this is wrong.
> 
> The above configuration places the controller PC with the disks.  Should you 
> want the controller separate, this can be done as well for another ~$400.
> 
> Part of the problem depends on what level of failure the individual 
> organizations are willing to risk.  Here are some scenarios.
> 
> Q: Can it suffer a disk failure without losing data?  
> A: Yes. RAID5 + hotspare allows the RAID to keep working through a disk 
> failure, recruiting the hotspare to rebuild the RAID, although with some loss 
> of performance.
> 
> Q: Can it suffer 2 simultaneous disk failures?  
> A: Yes, if they're in different RAID5s, but not if they're in the same RAID5. 
> However, configured as RAID6, it could survive up to 2 simultaneous disk 
> failures in each array (at the cost of one more disk's worth of capacity per 
> array).
> 
> Q: Can it suffer a Power Supply failure and maintain data? 
> A: Yes. It has redundant PSs.
> 
> Q: Can it suffer 2 simultaneous PS failures and maintain data? 
> A: Yes. Data integrity would be maintained via battery-backed array 
> controllers but the array would be offline until the PS was replaced.
> 
> Q: Could it suffer a PC or controller system failure?  
> A: Yes, but see answer immediately above.
> 
> Q: Can it be set up for complete failover (completely separate replicated 
> machines set to mirror each other)?  
> A: Tentatively yes.  (The failover mechanism is easy to set up, but the 
> mirroring could be tricky.  Depending on the services that need to be 
> replicated and the speed at which they need to be synchronized, there are 
> some utilities that could do this - rsync and unison for normal files, though 
> databases and special-purpose files would need to be handled separately.  
> However, all such utilities have edge cases where they could fail.  If your 
> systems need to be this robust, my solution is probably inappropriate for you 
> until we have more data on the behavior of such utilities.)
> 
> 
> 
> SOFTWARE
> ---------
> Software is always harder.  The cost is easiest - it's all free; there may be 
> commercial best-of-breed solutions for some of these functions, but I haven't 
> examined them yet.  I recommend using the Ubuntu base distro for fast 
> and easy setup and admin.  Others may have a preference for RHEL.  In my 
> experience, the Debian-based distros stay closer to the mainstream kernel and 
> development lines, but RHEL is internally consistent over the long term and 
> better supported by commercial software that has to be run.
> 
> The 'webmin' Web-based administration tool (now part of the Open Management
> Consortium) can be used for many administrative functions (see
> http://www.webmin.com).  The 3ware RAID system can also be managed via a web
> interface, although it runs separately from the webmin tools.  I've used the 
> Areca controllers as well, and while they have some advantages, they're not 
> yet in the mainline kernel and their supporting utilities are not up to the 
> level of the 3ware utilities. 
> 
> Some of the requirements we've heard from computer support coordinators simply 
> may not map well onto such a system, but many of them are supported.  The 
> system obviously can be maintained remotely so it can be supported centrally 
> or locally as desired. One crucial point that came out of the cross-school 
> discussions was that local administrators wanted to have oversight on the 
> system to enable local users, change permissions, change filesystem quotas, 
> etc.  Certainly this can be done, but it will require an understanding of 
> shared responsibilities if NACS is involved with the system at all.
> 
> In terms of types of filesystems and protocols, Linux supports more than any 
> other OS, including at least 4 well-debugged journaling filesystems (ext3, 
> JFS, XFS, reiserfs).  It can export SMB shares as well as, or better than, 
> native Windows servers.  Similarly, it can make storage available via NFS, 
> AppleShare IP, DAVfs, Subversion & CVS version control systems, and even the 
> Andrew FS if required.  Extensive volume management can be done in a number 
> of ways, but IBM's open-sourced EVMS system seems to cover most of them 
> (http://evms.sourceforge.net).
>    Authentication to most services can be done by Kerberos (and so it can use 
> UCINETIDs), local login, LDAP, or combinations thereof, although mixing & 
> matching will involve more complexity.
> 
>    I have not done a serious examination of virus-scanning software for 
> Linux, except to note that clamav is fairly well-regarded and is included in 
> most distros.  This was a highly rated function for many of the Schools.
> 
> In terms of backups, I've only considered a single type so far - automated 
> disk-based backups that would span up to a few months.  The system is called 
> BackupPC (http://backuppc.sf.net) and after about a month of irregular 
> testing, reading, and posting to the fairly active list, it seems to have 
> most of what a good backup system should have.  I'll expand on this in 
> another posting in a bit; this kind of system requires a lot of detail.
> 
> [Please post your followup questions or write to me directly - we'd appreciate 
> some peer-review.  Gotchas, facts or experience in support or rebuttal 
> (preferably with pointers to data), utilities I omitted, challenges to 
> unsupported statements, other requirements, etc.]
> 
> Thanks
> Harry


