[UCCSC] Re: [UCI-Linux] RFC on RCS proposal for cheap, reliable storage
David Walker
David.Walker at ucop.edu
Wed Jun 14 19:08:24 PDT 2006
Harry,
Have you looked at SDSC's "grid bricks"? I haven't heard much about them
personally, but I believe their goals are similar to yours.
David
On Wed, 2006-06-14 at 11:11 -0700, Dan Stromberg wrote:
> Hi Harry.
>
> What do the RAIDs look like in your sizings below? I could probably
> paper-and-pencil it based on the raw capacity and your figures, but it
> might be easier to just get it from you :)
>
> If it's a single RAID volume, that seems like a pretty aggressive ratio,
> even for RAID 6 with spares.
>
> If not, then there can be significant performance and/or storage
> penalties - e.g., a RAID 5 of RAID 5s will slow down a good bit, and a
> RAID 10 will provide only about half the raw storage. A stripe or
> concatenation of RAID 5s might be pretty comfortable, though...
>
> I like the idea of doing this sort of stuff with commodity hardware.
> One needs to be especially careful though, in such a scenario, that
> things are as reliable as "needed" (each use has its own definition) if
> forgoing vendor support.
>
> Then there's the theoretical problem with RAID 5 (and, I assume, RAID 6
> as well) that rewriting a data block + the corresponding parity isn't
> atomic at the block level. Supposedly Sun's new "RAID-Z" has such atomicity,
> but it seems to me it should be possible to journal the writes to get
> around this in RAID 5. But I haven't heard of anyone actually doing it
> with RAID 5, and I suppose it's possible that a journaled filesystem
> on top of the RAID might help with this. Then again, it might not...
> In fact, it "feels" like it wouldn't.
>
> You might find this of interest - it's just something I bookmarked a
> while back, not something I've looked at much:
> http://slashdot.org/palm/21/06/05/30/1453254_1.shtml . It's based on
> FreeBSD.
>
> Supposedly xfs isn't good at ensuring the integrity of metadata - I
> believe it was Ted Ts'o who suggested only using xfs for valuable data
> if you have a battery backup. Recent versions of reiserfs have a
> decidedly theoretical flavor to them, and IINM, it's proving difficult
> to get a newer Reiser into the mainline linux kernel. However, I'm
> doing my home backups with an oldish Reiser, and it's been working
> well - I don't miss the fsck'ing with ext3 a bit.
>
> Supposedly EVMS and LVM both sit on top of the Linux kernel's "Device
> Mapper" subsystem, and you can use EVMS to manage something configured
> with LVM, and vice versa - but supposedly EVMS may set some options
> that'll close the door to supporting LVM again on that set of volumes.
>
> I agree that BackupPC looks interesting. I also see some promise in
> http://dcs.nac.uci.edu/~strombrg/plucker-urls/ . Also unadorned rsync
> has an option for creating distinct trees with hardlinks, which creates
> something that feels like both a fullsave and an incremental, but has
> the storage space of a single fullsave and n incrementals, and when you
> remove "the fullsave", you could think of one of the incrementals as
> "becoming" the new fullsave - not sure how BackupPC relates to that, but
> rdiffweb appears to be more storage efficient than that. IINM, rdiffweb
> stores full changes using rsync's binary diff'ing algorithm, while
> BackupPC lets rsync do the diff'ing for data in transit - but that's a
> pretty wild guess. :)
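The hardlink-tree idea above (what rsync's --link-dest option produces) can be demonstrated with plain hardlinks. A minimal sketch, simulating two snapshot directories that share an unchanged file:

```python
# Sketch of the hardlink-snapshot idea behind rsync's --link-dest:
# unchanged files in a new snapshot are hardlinks into the previous one,
# so every snapshot looks like a full backup while unchanged data is
# stored once. Deleting the oldest snapshot never orphans a newer one.
import os, shutil, tempfile

root = tempfile.mkdtemp()
full = os.path.join(root, "snap.0")     # the original "fullsave"
incr = os.path.join(root, "snap.1")     # a later snapshot
os.makedirs(full)
os.makedirs(incr)

# One file, unchanged between snapshots: stored once, linked twice.
src = os.path.join(full, "data.txt")
with open(src, "w") as f:
    f.write("unchanged contents")
os.link(src, os.path.join(incr, "data.txt"))
assert os.stat(src).st_nlink == 2       # two names, one copy on disk

# Remove the old fullsave; the newer snapshot still has the data,
# effectively "becoming" the new fullsave.
shutil.rmtree(full)
with open(os.path.join(incr, "data.txt")) as f:
    assert f.read() == "unchanged contents"
shutil.rmtree(root)
```

This is why deleting "the fullsave" is safe: the filesystem only reclaims the blocks when the last link goes away.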
>
> Good that you're staying below 16 terabytes on this - pretty sure you
> know that the rules change a bit at that point.
>
> IBRIX appears to be a very interesting way of aggregating a bunch of
> backend NAS's into a single, huge filesystem.
>
> I've been told that Jacob Farmer's company allows him to give free
> lectures about large storage (he presented at Usenix, April 2005, about
> large storage systems). Here are my notes from the Usenix presentation:
> http://dcs.nac.uci.edu/~strombrg/Usenix-talk-Next-Generation-Storage-Networking.html
>
> Interesting stuff - thanks for taking the discussion relatively public.
>
> On Wed, 2006-06-14 at 10:18 -0700, Harry Mangalam wrote:
> > Hi All,
> >
> > This is a long preview of one project we (the Research Computing Support group
> > at NACS) are working on. We're hoping that this strawman proposal will result
> > in comments that will help us refine and improve the project.
> >
> > The first part of the project is the specification of a high quality commodity
> > linux box built with an eye to storage. The second part concerns the
> > software required to do various tasks. The third part is what you can do with
> > multiples of such a device and the fourth is how it might be applicable
> > to other organizations in the UCs. Only the first two are dealt with in any
> > depth here.
> >
> > From discussions with Computer Support personnel in other Schools at UCI,
> > storage is more important than CPU cycles. Not only is increasing storage
> > required, but the kind of storage required is moving towards longer term,
> > more robust storage. Similarly, the minimum size of storage has moved beyond
> > the GB range, well into the TB range. As such, our interest is with a
> > minimum 10TB device, which can be ganged into larger units.
> >
> > In light of that and the cost of such storage from large commercial vendors
> > such as Network Appliance, Sun, EMC, etc, I examined the base cost of devices
> > that could provide this storage. Rather than try to replicate their
> > methods of providing such storage, which are often expensive and
> > proprietary (Fibre Channel controllers thru high-end switches, driving
> > small but very fast 10-15K RPM disks), I looked at the sweet spot of
> > such storage: direct-connect SATA RAID5, using generic 500GB-750GB
> > disks in hot-swap trays, with hot and cold spares and redundant power
> > supplies (PSs) to provide robustness.
> >
> > HARDWARE
> > ========
> > A rackmount server (5U chassis with 2xOpterons, 4GB RAM, 3yrs onsite
> > warranty, redundant PSs, 2200VA UPS, mirrored system disks, 2x12-port
> > 3ware SATA RAID controllers, and 24 hotswap trays, but no disks) is
> > about $6500. You can cram up to 15TB (usable) into one of these units
> > (24x750GB disks configured as 2x(11 disks in RAID5 + 1 hotspare)),
> > which costs an additional $12K at current consumer prices. This would
> > still bring the entire 15TB system in at ~$20K with tax and shipping.
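The capacity and cost arithmetic above checks out; a quick sketch (the per-disk prices are inferred from the quoted totals, not vendor quotes):

```python
# Back-of-the-envelope check of the figures above: 24 bays split into
# two 12-disk sets, each 11 disks in RAID5 plus 1 hotspare, so each
# set yields 10 disks' worth of usable space.
disks_per_set, raid5_parity, hotspare = 12, 1, 1
usable_disks = 2 * (disks_per_set - raid5_parity - hotspare)

print(usable_disks)                # 20 usable disks
print(usable_disks * 0.75)         # 15.0 TB with 750GB disks
print(usable_disks * 0.50)         # 10.0 TB with 500GB disks

# Disk cost at the quoted consumer prices (~$500 per 750GB disk is
# inferred from the $12K total; $300 per 500GB disk is quoted).
print(24 * 500)                    # 12000 -> the ~$12K figure
print(24 * 300)                    # 7200  -> the $7200 figure
```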
> >
> > Using more economical 500GB disks ($300 each), you can get 10TB of
> > usable space from such a device for $7200, for a total of less than
> > $15K. These prices are roughly 1/6 to 1/20 the cost of comparably
> > sized devices from major vendors (though this does not address the
> > software issue). If you wanted to set up 2 complete servers to provide
> > failover backup, it would still cost far less than the Sun or NetApp
> > price for a single system (but there are admitted problems with
> > setting up such replicated systems).
> >
> > This system easily supports in-box speeds of up to 100MB/s write and 200MB/s
> > reads, far exceeding 1Gb ethernet backbone speeds. The system does come with
> > 2 Gb ethernet interfaces and more could be added quite cheaply. If it were
> > dual-homed to two Gb nets, the maximum bandwidth on both ports would
> > start to saturate the disk IO, though such bandwidth demands would
> > rarely be seen in practice.
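The saturation claim above is simple arithmetic; a rough sketch (ignoring ethernet and TCP overhead, which only lowers the network side):

```python
# Rough check of the dual-homed bandwidth claim: two fully loaded
# gigabit interfaces deliver on the order of the box's disk throughput.
gige_mb_s = 1000 / 8                 # ~125 MB/s per GigE port, pre-overhead
two_ports = 2 * gige_mb_s            # ~250 MB/s aggregate inbound/outbound
disk_read_mb_s = 200                 # in-box read speed quoted above
disk_write_mb_s = 100                # in-box write speed quoted above

print(two_ports)                     # 250.0 - enough to saturate reads,
print(disk_read_mb_s)                # and well past the write speed
```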
> >
> > I'm assuming that this is the type of physical storage that most people want.
> > Correct me if this is wrong.
> >
> > The above configuration places the controller PC with the disks. Should you
> > want the controller separate, this can be done as well for another ~$400.
> >
> > Part of the problem depends on what level of failure the individual
> > organizations are willing to risk. Here are some scenarios.
> >
> > Q: Can it suffer a disk failure without losing data?
> > A: Yes. RAID5 + hotspare allows the RAID to keep working thru a disk
> > failure, recruiting the hotspare to rebuild the RAID, although with
> > some loss of performance.
> >
> > Q: Can it suffer 2 simultaneous disk failures?
> > A: Yes, if they are in different RAID5s, but not in the same RAID5.
> > However, configured as RAID6 it could survive up to 2 simultaneous
> > disk failures in each of the arrays (at a cost of 1 more disk of
> > capacity per array).
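The capacity cost of the RAID6 option above, per 12-disk set (assuming the hotspare is kept in both layouts):

```python
# Usable-capacity tradeoff between the RAID5 and RAID6 layouts above,
# per 12-disk set (11 array disks + 1 hotspare in the RAID5 case).
disks = 12
raid5_usable = disks - 1 - 1   # 1 parity disk + 1 hotspare
raid6_usable = disks - 2 - 1   # 2 parity disks + 1 hotspare

print(raid5_usable)            # 10 usable disks per set
print(raid6_usable)            # 9: RAID6 costs one more disk per set
```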
> >
> > Q: Can it suffer a Power Supply failure and maintain data ?
> > A: Yes. It has redundant PSs.
> >
> > Q: Can it suffer 2 simultaneous PS failures and maintain data?
> > A: Yes. Data integrity would be maintained via battery-backed array
> > controllers but the array would be offline until the PS was replaced.
> >
> > Q: Could it suffer a PC or controller system failure?
> > A: Yes, but see answer immediately above.
> >
> > Q: Can it be set up for complete failover (completely separate
> > replicated machines set to mirror each other)?
> > A: Tentatively, yes. The failover mechanism is easy to set up, but the
> > mirroring could be tricky. Depending on the services that need to be
> > replicated and the speed at which they need to be synchronized, there
> > are some utilities that could do this (rsync and unison for normal
> > files; databases and special-purpose files would need to be handled
> > separately). However, all such utilities have edge cases where they
> > could fail. If your systems need to be this robust, my solution is
> > probably inappropriate for you until we have more data on the behavior
> > of such utilities.
> >
> >
> >
> > SOFTWARE
> > ========
> > Software is always harder. The cost is easiest - it's all free; there
> > may be commercial best-of-breed solutions for some of these functions,
> > but I haven't examined them yet. I recommend using the Ubuntu base
> > distro for fast and easy setup and admin. Others may have a preference
> > for RHEL. In my experience, the Debian-based distros stay closer to
> > the mainstream kernel and development lines, but RHEL is internally
> > consistent over the long term and better supported by commercial
> > software that has to be run.
> >
> > The 'webmin' Web-based administration tool (now part of the Open Management
> > Consortium) can be used for many administrative functions (see
> > http://www.webmin.com). The 3ware RAID system can also be managed via a web
> > interface, altho it runs separately from the webmin tools. I've used the
> > Areca controllers as well and while they have some advantages, they're not
> > yet in the mainline kernel and their supporting utilities are not up to the
> > level of the 3ware utilities.
> >
> > Some of the requirements we've heard from computer support coordinators simply
> > may not map well onto such a system, but many of them are supported. The
> > system obviously can be maintained remotely so it can be supported centrally
> > or locally as desired. One crucial point that came out of the cross-school
> > discussions was that local administrators wanted to have oversight on the
> > system to enable local users, change permissions, change filesystem quotas,
> > etc. Certainly this can be done, but it will require an understanding of
> > shared responsibilities if NACS is involved with the system at all.
> >
> > In terms of types of filesystems and protocols, Linux supports more than any
> > other OS, including at least 4 well-debugged journaling filesystems (ext3,
> > JFS, XFS, reiserfs). It can export SMB shares as well as, or better than,
> > native Windows servers. Similarly, it can make storage available as NFS,
> > AppleshareIP, DAVfs, subversion & CVS version control systems, and even
> > Andrew FS if required. Extensive volume management can be done in a number
> > of ways, but IBM's opensourced EVMS system seems to cover most of them
> > (http://evms.sourceforge.net).
> > Authentication to most services can be done by Kerberos (and so it can use
> > UCINETIDs), local login, LDAP, or combinations thereof, although mixing &
> > matching will involve more complexity.
> >
> > I have not done a serious examination of virus-scanning software for linux,
> > except that clamav is fairly well-regarded and is included in most distros.
> > This was a highly rated function for many of the Schools.
> >
> > In terms of backups, I've only considered a single type so far - automated
> > disk-based backups that would span up to a few months. The system is called
> > BackupPC (http://backuppc.sf.net) and after about a month of irregular
> > testing, reading, and posting to the fairly active list, it seems to have
> > most of what a good backup system should have. I'll expand on this in
> > another posting in a bit; this kind of system requires a lot of detail.
> >
> > [Please post your followup questions or write to me directly - we'd appreciate
> > some peer-review. Gotchas, facts or experience in support or rebuttal
> > (preferably with pointers to data), utilities I omitted, challenges to
> > unsupported statements, other requirements, etc.]
> >
> > Thanks
> > Harry
>
> _______________________________________________
> List-Info: https://maillists.uci.edu/mailman/listinfo/uccsc