[UCI-Linux] RFC on RCS proposal for cheap, reliable storage

Wed Jun 14 12:38:24 PDT 2006

On Wednesday 14 June 2006 11:11, Dan Stromberg wrote:
> Hi Harry.
>
> What do the RAIDs look like in your sizings below?  I could probably
> paper-and-pencil it based on the raw capacity and your figures, but it
> might be easier to just get it from you :)
>
> If it's a single RAID volume, that seems like a pretty aggressive ratio,
> even for RAID 6 with spares.

No, it's 2 RAID5s, each with 12 disks.  So each array holds 10 disks' worth 
of data, plus one disk's worth of parity and one spinning spare.  As you 
know, I've had plenty 
of opportunity to check whether the RAID recovery works with the 3wares, and 
despite other problems, it has worked pretty well.  You can buy 24-port Areca 
cards, but that's putting too many eggs in one basket for me as well as 
stressing the IO.

> If not, then there can be significant performance and/or storage
> penalties - e.g., a RAID 5 of RAID 5s will slow down a good bit, and a
> RAID 10 will provide about half the raw storage.  A stripe or
> concatenation of RAID 5s might be pretty comfortable though...

No - my base system is a single-level RAID5 that can be managed/sliced/diced 
with EVMS, but you could easily reconfigure it into RAID10 if you needed the 
speed.  My sense is that most people want /reasonably/ reliable storage for 
reasonable prices.
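
To make the sizing arithmetic concrete, here is a rough sketch comparing the 
layout described above (two 12-disk RAID5 sets, each with one hot spare) 
against the same 24 disks arranged as two RAID10 sets.  The function name 
usable_disks is made up for illustration, and the results are in units of 
"disks' worth" because no disk size is assumed here.

```python
def usable_disks(total_disks, level, hot_spares=0):
    """Return how many disks' worth of data one array can hold.

    Hypothetical helper for illustration only; it ignores filesystem
    and controller overhead.
    """
    active = total_disks - hot_spares  # spares hold no data until a rebuild
    if level == "raid5":
        if active < 3:
            raise ValueError("RAID5 needs at least 3 active disks")
        return active - 1              # one disk's worth of parity
    if level == "raid6":
        if active < 4:
            raise ValueError("RAID6 needs at least 4 active disks")
        return active - 2              # two disks' worth of parity
    if level == "raid10":
        if active < 2 or active % 2:
            raise ValueError("RAID10 needs an even number of active disks")
        return active // 2             # every block is mirrored
    raise ValueError(f"unknown RAID level: {level}")


ARRAYS = 2           # two independent controllers/arrays
DISKS_PER_ARRAY = 12

raid5_total = ARRAYS * usable_disks(DISKS_PER_ARRAY, "raid5", hot_spares=1)
raid10_total = ARRAYS * usable_disks(DISKS_PER_ARRAY, "raid10")
raw_total = ARRAYS * DISKS_PER_ARRAY

print(f"RAID5 + spare: {raid5_total} of {raw_total} disks usable")
print(f"RAID10:        {raid10_total} of {raw_total} disks usable")
```

The RAID10 case is computed without a spare, since 11 active disks can't be 
paired up; that is an assumption of this sketch, not a recommendation.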

> I like the idea of doing this sort of stuff with commodity hardware.
> One needs to be especially careful though, in such a scenario, that
> things are as reliable as "needed" (each use has its own definition) if
> forgoing vendor support.

yes - this is a key point - you'll have to decide what your comfort level is 
here.  One thing that started me on this tho was the problems we had with [2 
top-tier vendors whose names I will not utter].  Paying top prices does not 
always map to top support or performance.  But it does buy you someone to 
blame. :)

> Then there's the theoretical problem with RAID 5 (and I assume, RAID 6
> as well) that rewriting a data block + a suitable checksum isn't atomic
> at the block level.  Supposedly Sun's new "RAID Z" has such atomicity,
> but it seems to me it should be possible to journal the writes to get
> around this in RAID 5.  But I haven't heard of anyone actually doing it
> with RAID 5, and I suppose it's possible that a journaled filesystem
> overtop of the RAID might help with this.  Then again, it might not....
> In fact, it "feels" like it wouldn't.

But most RAID controllers have battery-backed caches, so if a write isn't 
completed, it stays queued in the cache and is finished once the array comes 
back up.
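
For anyone following along, the non-atomicity problem in the quoted text can 
be sketched with plain XOR parity.  This is a toy model, assuming a 3+1 
stripe of tiny byte strings; it doesn't reflect how any particular 
controller or md driver lays out or journals its writes.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# A consistent stripe: three data blocks and their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# If the disk holding block 1 dies, the stripe can rebuild it.
consistent_ok = reconstruct([data[0], data[2]], parity) == data[1]

# Now simulate an interrupted update: the new data for block 0 reaches
# its disk, but power is lost before the parity block is rewritten.
data[0] = b"ZZZZ"
stale_parity = parity

# If block 1's disk fails at this point, the rebuild silently returns
# garbage, because the parity no longer matches the data on disk.
recovered = reconstruct([data[0], data[2]], stale_parity)
write_hole = recovered != data[1]

print("rebuild correct before the torn write:", consistent_ok)
print("rebuild corrupted after the torn write:", write_hole)
```

A battery-backed cache narrows this window by remembering the pending parity 
write across the power loss, which is the point made above.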

> You might find this of interest - it's just something I bookmarked a
> while back, not something I've looked at much:
> http://slashdot.org/palm/21/06/05/30/1453254_1.shtml .  It's based on
> FreeBSD.

That's very much like what I'm aiming for, except with a little more emphasis 
on the bigger end.  

> Supposedly xfs isn't good at ensuring integrity of metadata - I believe
> it was Ted Ts'o who suggested only using xfs for valuable data with a
> battery backup.  Recent versions of reiserfs have a decidedly theoretical
> flavor to them, and IINM, it's proving difficult to get a newer Reiser
> into the mainline linux kernel.  However, I'm doing my home backups with
> an oldish Reiser, and it's been working well - I don't miss the fsck'ing
> with ext3 a bit.

I've heard this as well, but it's never bitten me (sand's raid is sitting on 
XFS and it is MUCH faster on large files than reiserfs or ext3, as well as 
being incredibly fast to mkfs).  I personally haven't had a problem with 
reiser but I have heard of others who have. 

> Supposedly EVMS and LVM both sit over top of the Linux Kernel's "Device
> Mapper" subsystem, and you can use EVMS to manage something configured
> with LVM, and vice versa, -but- supposedly EVMS may set some options
> that'll close the door to supporting LVM again on that set of volumes.

I'm not sure this is an issue, but I'll look into that in more detail - any 
URLs to this?

> I agree that BackupPC looks interesting.  I also see some promise in
> http://dcs.nac.uci.edu/~strombrg/plucker-urls/ . 

foo!?

> Also unadorned rsync 
> has an option for creating distinct trees with hardlinks, which creates
> something that feels like both a fullsave and an incremental, but has
> the storage space of a single fullsave and n incrementals, and when you
> remove "the fullsave", you could think of one of the incrementals as
> "becoming" the new fullsave - not sure how BackupPC relates to that, but
> rdiffweb appears to be more storage efficient than that.  IINM, rdiffweb
> stores full changes using rsync's binary diff'ing algorithm, while
> BackupPC lets rsync do the diff'ing for data in transit - but that's a
> pretty wild guess.  :)

I looked at rdiff-backup, but while technically sweet (some of the same 
advantages of the DataDomain server), the whole package isn't usable enough 
for a wide rollout.  Some backuppc users were talking about 
subsuming this technology into backuppc tho.
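
The hardlinked-trees idea quoted above (with rsync this is presumably the 
--link-dest option) can be sketched in a few lines.  This is only an 
illustration of the storage trick, not how rsync, BackupPC or rdiff-backup 
actually implement it; the snapshot function and the file names are made up, 
and it assumes a filesystem that supports hard links.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source, dest, previous=None):
    """Copy the source tree into dest, hard-linking any file whose
    contents are identical to the copy in the previous snapshot."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(dest, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.normpath(os.path.join(dest, rel, name))
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: costs no extra data blocks
            else:
                shutil.copy2(src, dst)  # new or changed: store a real copy


def write(path, text):
    with open(path, "w") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "src")
    full = os.path.join(work, "snap.0")
    incr = os.path.join(work, "snap.1")
    os.makedirs(src)
    write(os.path.join(src, "a.txt"), "unchanged")
    write(os.path.join(src, "b.txt"), "version 1")

    snapshot(src, full)                       # the "fullsave"
    write(os.path.join(src, "b.txt"), "version 2")
    snapshot(src, incr, previous=full)        # the "incremental"

    # a.txt is one inode shared by both trees; b.txt is stored twice.
    shared = os.path.samefile(os.path.join(full, "a.txt"),
                              os.path.join(incr, "a.txt"))
    changed_shared = os.path.samefile(os.path.join(full, "b.txt"),
                                      os.path.join(incr, "b.txt"))

    # Removing the "fullsave" leaves the later tree complete on its own.
    shutil.rmtree(full)
    with open(os.path.join(incr, "a.txt")) as handle:
        survives = handle.read() == "unchanged"

print(shared, changed_shared, survives)
```

Each snapshot directory looks like a full save, but a changed file is stored 
again in its entirety, which is where a binary-diff scheme like 
rdiff-backup's would save more space.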

> Jacob Farmer's company (presented at Usenix, April 2005, about large
> storage systems), I've been told, allows him to give free lectures about
> large storage.  Here are my notes from the Usenix presentation:
> http://dcs.nac.uci.edu/~strombrg/Usenix-talk-Next-Generation-Storage-Networking.html

Thanks - I'll look at this some more.

> Interesting stuff - thanks for taking the discussion relatively public.

Thanks for all the info and gotchas.  More to think about.

-- 
Harry Mangalam - Research Computing at NACS, E2148, Engineering Gateway, 
UC Irvine 92697  949 824 0084(o), 949 285 4487(c) harry.mangalam at uci.edu