[UCCSC] Re: [UCI-Linux] RFC on RCS proposal for cheap, reliable storage

David Walker David.Walker at ucop.edu
Wed Jun 14 19:08:24 PDT 2006


Have you looked at SDSC's "grid bricks?"  I haven't heard much about it
personally, but I believe their goals are similar to yours.


On Wed, 2006-06-14 at 11:11 -0700, Dan Stromberg wrote:

> Hi Harry.
> What do the RAID's look like in your sizings below?  I could probably
> paper-and-pencil it based on the raw capacity and your figures, but it
> might be easier to just get it from you :)
> If it's a single RAID volume, that seems like a pretty aggressive ratio,
> even for RAID 6 with spares.
> If not, then there can be significant performance and/or storage
> penalties - EG, a RAID 5 of RAID 5's will slow down a good bit, and a
> RAID 10 will provide about half the raw storage.  A stripe or
> concatenation of RAID 5's might be pretty comfortable though...
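> To make that tradeoff concrete, here's a back-of-the-envelope sketch (my
> own figures, assuming 24 disks of 750GB each - illustrative only) of the
> usable capacity under the layouts mentioned:

```python
# Usable-capacity sketch for 24 x 750 GB disks under several RAID layouts.
# The disk count and size are assumptions for illustration only.
SIZE_GB = 750

def raid5(n):            # one disk's worth of parity per array
    return (n - 1) * SIZE_GB

def raid6(n):            # two disks' worth of parity per array
    return (n - 2) * SIZE_GB

def raid10(n):           # mirrored pairs, then striped: half the raw space
    return (n // 2) * SIZE_GB

# A stripe (RAID 0) over two 12-disk RAID 5s: capacities simply add.
print(2 * raid5(12))     # 16500 GB usable
# RAID 10 over all 24 disks: about half the raw storage, as noted above.
print(raid10(24))        # 9000 GB usable
# RAID 6 per 12-disk array costs one more disk than RAID 5.
print(2 * raid6(12))     # 15000 GB usable
```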
> I like the idea of doing this sort of stuff with commodity hardware.
> One needs to be especially careful though, in such a scenario, that
> things are as reliable as "needed" (each use has its own definition) if
> forgoing vendor support.
> Then there's the theoretical problem with RAID 5 (and I assume, RAID 6
> as well) that rewriting a data block + a suitable checksum isn't atomic
> at the block level.  Supposedly Sun's new "RAID Z" has such atomicity,
> but it seems to me it should be possible to journal the writes to get
> around this in RAID 5.  But I haven't heard of anyone actually doing it
> with RAID 5, and I suppose it's possible that a journaled filesystem
> overtop of the RAID might help with this.  Then again, it might not....
> In fact, it "feels" like it wouldn't.
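> To sketch what I mean by journaling around the write hole (a toy model of
> the idea only - not how any real RAID implementation does it): log the
> intended stripe update durably before touching data or parity, so a crash
> mid-write can be detected and the stripe repaired on replay:

```python
# Illustrative intent-logging around a RAID 5 stripe update (toy model).
from functools import reduce

def parity(blocks):
    """XOR parity over a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

journal = []                       # stand-in for a durable on-disk journal
stripe = [b"AAAA", b"BBBB", b"CCCC"]
stripe_parity = parity(stripe)

def write_block(idx, new_data):
    global stripe_parity
    # 1. Record the intent durably before modifying anything.
    journal.append(("begin", idx, new_data))
    # 2. Update data and parity (the non-atomic step in real hardware).
    stripe[idx] = new_data
    stripe_parity = parity(stripe)
    # 3. Mark the update complete; a replay past this point does nothing.
    journal.append(("commit", idx))

write_block(1, b"XXXX")
assert stripe_parity == parity(stripe)   # stripe consistent after commit
```

A crash between steps 2 and 3 leaves an uncommitted "begin" record, telling
the replay code exactly which stripe to recompute.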
> You might find this of interest - it's just something I bookmarked a
> while back, not something I've looked at much:
> http://slashdot.org/palm/21/06/05/30/1453254_1.shtml .  It's based on
> FreeBSD.
> Supposedly xfs isn't good at ensuring integrity of metadata - I believe
> it was Ted Ts'o who suggested only using xfs for valuable data with a
> battery backup.  Recent versions of reiserfs have a decidedly theoretical
> flavor to them, and IINM, it's proving difficult to get a newer Reiser
> into the mainline linux kernel.  However, I'm doing my home backups with
> an oldish Reiser, and it's been working well - I don't miss the fsck'ing
> with ext3 a bit.
> Supposedly EVMS and LVM both sit over top of the Linux Kernel's "Device
> Mapper" subsystem, and you can use EVMS to manage something configured
> with LVM, and vice versa, -but- supposedly EVMS may set some options
> that'll close the door to supporting LVM again on that set of volumes.
> I agree that BackupPC looks interesting.  I also see some promise in
> http://dcs.nac.uci.edu/~strombrg/plucker-urls/ .  Also unadorned rsync
> has an option for creating distinct trees with hardlinks, which creates
> something that feels like both a fullsave and an incremental, but has
> the storage space of a single fullsave and n incrementals, and when you
> remove "the fullsave", you could think of one of the incrementals as
> "becoming" the new fullsave - not sure how BackupPC relates to that, but
> rdiffweb appears to be more storage efficient than that.  IINM, rdiffweb
> stores full changes using rsync's binary diff'ing algorithm, while
> BackupPC lets rsync do the diff'ing for data in transit - but that's a
> pretty wild guess.  :)
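> The rsync option I mean is --link-dest.  The hardlink trick can be
> sketched directly (a toy Python model of the idea, not rsync itself):
> unchanged files in a new snapshot are hardlinks into the previous one, so
> n snapshots cost roughly one fullsave plus the deltas:

```python
import os, shutil, tempfile
from pathlib import Path

def snapshot(src, prev, dest):
    """Copy changed files from src into dest; hardlink unchanged files
    from the previous snapshot (the same idea as rsync --link-dest)."""
    os.makedirs(dest)
    for name in os.listdir(src):
        s, d = Path(src, name), Path(dest, name)
        p = Path(prev, name) if prev else None
        if p and p.exists() and p.read_bytes() == s.read_bytes():
            os.link(p, d)        # unchanged: share the inode, no new space
        else:
            shutil.copy2(s, d)   # changed or new: store a real copy

base = Path(tempfile.mkdtemp())
src = base / "src"; src.mkdir()
(src / "a").write_text("one")
(src / "b").write_text("stays")
snapshot(src, None, base / "snap1")
(src / "a").write_text("two")                 # only "a" changes
snapshot(src, base / "snap1", base / "snap2")
# "b" is one inode shared by both snapshots; "a" is stored twice.
```

Deleting snap1 then leaves snap2 as a complete tree on its own, which is
the "incremental becoming the new fullsave" behavior described above.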
> Good that you're staying below 16 terabytes on this - pretty sure you
> know that the rules change a bit at that point.
> IBRIX appears to be a very interesting way of aggregating a bunch of
> backend NAS's into a single, huge filesystem.
> Jacob Farmer's company, I've been told, allows him to give free lectures
> about large storage (he presented at Usenix, April 2005, about large
> storage systems).  Here are my notes from that Usenix presentation:
> http://dcs.nac.uci.edu/~strombrg/Usenix-talk-Next-Generation-Storage-Networking.html
> Interesting stuff - thanks for taking the discussion relatively public.
> On Wed, 2006-06-14 at 10:18 -0700, Harry Mangalam wrote:
> > Hi All, 
> > 
> > This is a long preview of one project we (the Research Computing Support group 
> > at NACS) are working on. We're hoping that this strawman proposal will result 
> > in comments that will help us refine and improve the project.  
> > 
> > The first part of the project is the specification of a high quality commodity 
> > linux box built with an eye to storage.  The second part concerns the 
> > software required to do various tasks.  The third part is what you can do with 
> > multiples of such a device and the fourth is how it might be applicable 
> > to other organizations in the UCs.  Only the first two are dealt with in any 
> > depth here.
> > 
> > From discussions with Computer Support personnel in other Schools at UCI, 
> > storage is more important than CPU cycles.  Not only is increasing storage 
> > required, but the kind of storage required is moving towards longer term, 
> > more robust storage.  Similarly, the minimum size of storage has moved beyond 
> > the GB range, well into the TB range.  As such, our interest is with a 
> > minimum 10TB device, which can be ganged into larger units.
> > 
> > In light of that and the cost of such storage from large commercial vendors 
> > such as Network Appliance, Sun, EMC, etc, I examined the base cost of devices 
> > that could provide this storage.  Rather than try to replicate their methods 
> > of providing such storage which is often expensive and proprietary - Fiber 
> > Channel controllers thru high-end switches, driving small, but very fast 
> > 10-15K disks, I looked at the sweet spot of such storage - direct-connect 
> > SATA RAID5, using generic 500GB-750GB disks in hot-swap trays with hot and 
> > cold spares and redundant power supplies (PS) to provide robustness. 
> > 
> > ========
> > A rackmount server (5U container with 2xOpterons, 4GB RAM, 3yrs onsite 
> > warranty, redundant PS, 2200VA UPS, mirrored system disks, 2x12port 3ware SATA 
> > RAID controllers, 24 hotswap trays, but no disks) is about $6500.  You can 
> > cram up to 15TB (usable) into one of these units (24x750GB disks configured in 
> > 2x(11 disks in RAID5 + 1 hotspare)) which costs an additional $12K at current 
> > consumer prices.  This would still bring in the entire 15 TB system in at 
> > ~$20K with tax and shipping.  
> > 
> > Using more economical 500GB disks ($300/per), you can get 10TB of usable space 
> > from such a device for $7200, for a total of less than $15K.  These prices 
> > are roughly 1/6 to 1/20 the cost of comparably sized devices from major 
> > vendors (though this does not address the software issue).  If you wanted to set up 2 
> > complete servers to provide failover backup, it would still cost far less 
> > than the Sun or NetApp price for a single system (but there are admitted 
> > problems with setting up such replicate systems).
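> > As a sanity check on the sizing and pricing arithmetic above (using only
> > the figures quoted in this message):

```python
# Sanity check on the capacity and cost figures quoted above.
def usable_tb(disk_gb, arrays=2, disks_per_raid5=11):
    # RAID 5 loses one disk to parity; the hotspare holds no data.
    data_disks = disks_per_raid5 - 1
    return arrays * data_disks * disk_gb / 1000

print(usable_tb(750))            # 15.0 TB for the 750 GB build
print(usable_tb(500))            # 10.0 TB for the 500 GB build

# Rough totals: $6500 chassis plus 24 disks at the quoted per-disk prices.
cost_750 = 6500 + 24 * 500       # $18,500 before tax and shipping
cost_500 = 6500 + 24 * 300       # $13,700 before tax and shipping
print(round(cost_750 / usable_tb(750)))   # ~$1233 per usable TB
print(round(cost_500 / usable_tb(500)))   # ~$1370 per usable TB
```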
> > 
> > This system easily supports in-box speeds of up to 100MB/s write and 200MB/s 
> > reads, far exceeding 1Gb ethernet backbone speeds.  The system does come with 
> > 2 Gb ethernet interfaces and more could be added quite cheaply.  If it were 
> > dual-homed to 2 Gb nets, the maximum bandwidth on both ports would start to 
> > saturate the disk IO, but typically, such bandwidth demands would rarely be 
> > seen.
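> > The saturation claim is easy to check on the back of an envelope (using
> > the in-box speeds quoted above):

```python
# When does dual-homed gigabit ethernet saturate the disk IO?
GBE_MBPS = 1000 / 8          # 1 Gb/s ~= 125 MB/s per port, before overhead
disk_write, disk_read = 100, 200      # MB/s, the in-box figures quoted above

two_ports = 2 * GBE_MBPS              # 250 MB/s aggregate
print(two_ports > disk_read)          # both ports flat out exceed read speed
print(GBE_MBPS > disk_write)          # even one port exceeds write speed
```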
> > 
> > I'm assuming that this is the type of physical storage that most people want.  
> > Correct me if this is wrong.
> > 
> > The above configuration places the controller PC with the disks.  Should you 
> > want the controller separate, this can be done as well for another ~$400.
> > 
> > Part of the problem depends on what level of failure the individual 
> > organizations are willing to risk.  Here are some scenarios.
> > 
> > Q: Can it suffer a disk failure without losing data?  
> > A: Yes. RAID5 + hotspare allow the RAID to keep working thru a disk failure, 
> > recruiting the hotspare to rebuild the RAID, although at a loss of some 
> > performance.
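> > To illustrate why the rebuild works: RAID5 parity is a simple XOR across
> > the stripe, so any one lost block can be recomputed from the survivors
> > (a toy example, not the controller's actual code):

```python
# Toy demonstration of RAID 5 parity reconstruction after one disk failure.
from functools import reduce

def xor(blocks):
    """XOR equal-length byte blocks column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"disk", b"fail", b"demo"]    # data blocks in one stripe
parity = xor(data)                    # parity block written alongside them

# Disk 1 fails: XOR-ing everything that survives recovers its block,
# which is what the rebuild onto the hotspare does stripe by stripe.
survivors = [data[0], data[2], parity]
rebuilt = xor(survivors)
print(rebuilt)                        # b"fail"
```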
> > 
> > Q: Can it suffer 2 simultaneous disk failures?  
> > A: Yes, in different RAID5s, but not in the same RAID5.  However, configured 
> > as RAID6, each array could survive up to 2 simultaneous disk failures 
> > (at the cost of 1 more disk's capacity per array).
> > 
> > Q: Can it suffer a Power Supply failure and maintain data ? 
> > A: Yes. It has redundant PSs.
> > 
> > Q: Can it suffer 2 simultaneous PS failures and maintain data? 
> > A: Yes. Data integrity would be maintained via battery-backed array 
> > controllers but the array would be offline until the PS was replaced.
> > 
> > Q: Could it suffer a PC or controller system failure?  
> > A: Yes, but see answer immediately above.
> > 
> > Q: Can it be set up for complete failover? (completely separate replicate 
> > machines set to mirror each other).  
> > A: Tentatively yes.  The failover mechanism is easy to set up, but the 
> > mirroring could be tricky.  Depending on the services that needed to be 
> > replicated and the speed at which they needed to be synchronized, there are 
> > some utilities that could do this (rsync and unison for normal files, but 
> > databases and special-purpose files would need to be handled separately).  
> > However, all such utilities have edge cases where they could fail.  If your 
> > systems need to be this robust, my solution is probably inappropriate for 
> > you until we have more data on the behavior of such utilities.
> > 
> > 
> > 
> > ---------
> > Software is always harder.  The cost is easiest - it's all free; there may be 
> > commercial best-of-breed solutions for some of these functions, but I haven't 
> > examined them yet.  I recommend using the Ubuntu base distro for fast 
> > and easy setup and admin.  Others may have a preference for RHEL.  In my 
> > experience, the Debian-based distros stay closer to the mainstream kernel and 
> > development lines, but RHEL is internally consistent over the long term and 
> > better supported for commercial software that has to be run.
> > 
> > The 'webmin' Web-based administration tool (now part of the Open Management
> > Consortium) can be used for many administrative functions (see
> > http://www.webmin.com).  The 3ware RAID system can also be managed via a web
> > interface, although it runs separately from the webmin tools.  I've used the 
> > Areca controllers as well and while they have some advantages, they're not 
> > yet in the mainline kernel and their supporting utilities are not up to the 
> > level of the 3ware utilities. 
> > 
> > Some of the requirements we've heard from computer support coordinators simply 
> > may not map well onto such a system, but many of them are supported.  The 
> > system obviously can be maintained remotely so it can be supported centrally 
> > or locally as desired. One crucial point that came out of the cross-school 
> > discussions was that local administrators wanted to have oversight on the 
> > system to enable local users, change permissions, change filesystem quotas, 
> > etc.  Certainly this can be done, but it will require an understanding of 
> > shared responsibilities if NACS is involved with the system at all.
> > 
> > In terms of types of filesystems and protocols, Linux supports more than any 
> > other OS, including at least 4 well-debugged journaling filesystems (ext3, 
> > JFS, XFS, reiserfs).  It can export SMB shares as well as, or better than, 
> > native Windows servers.  Similarly, it can make storage available as NFS, 
> > AppleshareIP, DAVfs, subversion & CVS version control systems, and even 
> > Andrew FS if required.  Extensive volume management can be done in a number 
> > of ways, but IBM's opensourced EVMS system seems to cover most of them 
> > (http://evms.sourceforge.net).
> >    Authentication to most services can be done by Kerberos (and so it can use 
> > UCINETIDs), local login, LDAP, or combinations thereof, although mixing & 
> > matching will involve more complexity.
> > 
> >    I have not done a serious examination of virus-scanning software for linux, 
> > except that clamav is fairly well-regarded and is included in most distros. 
> > This was a highly rated function for many of the Schools.
> > 
> > In terms of backups, I've only considered a single type so far - automated 
> > disk-based backups that would span up to a few months.  The system is called 
> > BackupPC (http://backuppc.sf.net) and after about a month of irregular 
> > testing, reading, and posting to the fairly active list, it seems to have 
> > most of what a good backup system should have.  I'll expand on this in 
> > another posting in a bit; this kind of system requires a lot of detail.
> > 
> > [Please post your followup questions or write to me directly - we'd appreciate 
> > some peer-review.  Gotchas, facts or experience in support or rebuttal 
> > (preferably with pointers to data), utilities I omitted, challenges to 
> > unsupported statements, other requirements, etc.]
> > 
> > Thanks
> > Harry
> _______________________________________________
> List-Info: https://maillists.uci.edu/mailman/listinfo/uccsc