Wednesday, February 11, 2009

How to REALLY back up like a pro

Over at Jay Lake's blog, he has written an entry on how to do backups like a pro. I cannot whitewash this; this is sheer poppycock. What Jay describes is absolutely the safest way to back up your files... circa 1990. The IT industry solved this problem years ago. It's over, done. And the software you need to do so is in the "cheap," "free," or "already paying for it" category.

That's an audacious claim, so let me start by debunking the notion that this baroque sequence of events is actually safe. There's one fundamental problem with it, and that is it relies on humans to not make errors. There are a lot of steps in there. And I don't know about other people, but after I've just spent a couple hours pounding out a few thousand words, I'm not at my most mentally keen. Should you really be willing to bank your security on habitually executing all those steps correctly under those circumstances?

In a word, no. Going through these motions is tantamount to thinking that the TSA guys who make us take off our shoes and restrict liquids to no more than 3.4 oz (so we can't make a very BIG bomb?) are actually protecting us from terrorists. They counter what Bruce Schneier calls "movie terrorist plots"--threats that seem large, but in fact are not very likely--while against the real issues, they protect us little, or not at all.

An example, you say? Well, have any of you ever done any of these?
  • Saved a backup file with the wrong filename, so you can't find it later, or you accidentally overwrote a version you wanted to keep?
  • Had your email account hacked? For example, by a spammer, who gets your Gmail account permanently shut down within a matter of hours?
  • Sent your backup DVDs to the wrong relative, or had the right relative not correctly file them, making them impossible to find should you need them?
Et cetera, et cetera. "Impossible!" you say, "because I really CARE about my data." Consider this article from 2007, which cites a researcher who discovered that human error is the most common cause of security breaches.

"So, Mr. Smarty," you say, "You are not really being part of the solution here." Fair enough. Safely backing up your stuff requires two systems to cooperate:
  • Revision control (also called version control or source control), and
  • Off-site backups.
Revision control is software that is designed for tracking changes to program source code. No reasonable development shop works without it these days. It works thus: when you make changes to a file, you push those changes over to a revision control server (this is called "committing" the file), which remembers ABSOLUTELY EVERYTHING YOU HAVE EVER DONE TO THAT FILE. In the blink of an eye, you can revert to an older version, without destroying the data you've subsequently stored in your revision control system (henceforth, RCS).
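To make the commit-and-revert idea concrete, here is a toy sketch in Python. It is not how Subversion or any real system is implemented, and every name in it (ToyRepository, commit, checkout, restore) is made up for illustration. The only point is that bringing back an old version adds a new revision instead of erasing the later ones.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    number: int
    comment: str
    content: str


@dataclass
class ToyRepository:
    """Hypothetical in-memory stand-in for a revision control server."""
    history: dict = field(default_factory=dict)  # filename -> list of Revision

    def commit(self, filename, content, comment=""):
        # Every commit is appended; earlier revisions are never modified.
        revisions = self.history.setdefault(filename, [])
        revision = Revision(len(revisions) + 1, comment, content)
        revisions.append(revision)
        return revision.number

    def checkout(self, filename, number=None):
        # With no number, return the newest revision; otherwise that exact one.
        revisions = self.history[filename]
        if number is None:
            return revisions[-1].content
        return revisions[number - 1].content

    def restore(self, filename, number):
        # Re-commit the old content as a NEW revision, so nothing is lost.
        old_content = self.checkout(filename, number)
        return self.commit(filename, old_content, f"Restore revision {number}")


# Made-up example data.
repo = ToyRepository()
repo.commit("RoadTrip.doc", "Bob hits the road.", "First draft")
repo.commit("RoadTrip.doc", "Alice hits the road.", "Changed protag from a man to a woman")
restored_as = repo.restore("RoadTrip.doc", 1)
```

After the restore, the newest version of RoadTrip.doc is the first draft again, yet revision 2 is still sitting in the history if you change your mind.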

Couple this with off-site backups. The best way to achieve this is to run your RCS on your internet hosting. When you save your changes, you tell the RCS to push the changes to your repository of files on your ISP. (Yes, there are some risks in doing this. There is no such thing as "no risk," only "manageable risk," and it's wise under these circumstances to get someone who knows about such things to advise you when first setting up your RCS, to mitigate this risk.) At any point in the future, you can restore every single file you've ever stored there, to any version you've ever committed.
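Here is a rough Python sketch of why the off-site copy matters. It assumes a single writer, so the hosted history is always a prefix of the local one; real systems handle far messier cases. The function names (push, restore_from_remote) and the sample drafts are hypothetical.

```python
def push(local_history, remote_history):
    """Send the remote any revisions it lacks; return how many were sent."""
    # Assumption: the remote history is a prefix of the local history.
    missing = local_history[len(remote_history):]
    remote_history.extend(missing)
    return len(missing)


def restore_from_remote(remote_history):
    """Rebuild a fresh local history from the off-site copy."""
    return list(remote_history)


# Hypothetical histories: each entry is the text of one committed revision.
laptop = ["Chapter 1, first draft", "Chapter 1, second draft"]
hosting = []

sent_first = push(laptop, hosting)    # both revisions go off-site
laptop.append("Chapter 1, third draft")
sent_second = push(laptop, hosting)   # only the new revision is sent

laptop = None                         # the laptop dies
recovered = restore_from_remote(hosting)
```

Once the changes have been pushed, losing the laptop costs you nothing that was committed: every revision comes back from the hosted copy.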

Meanwhile there are guys who work for your ISP who get paid to do nothing but think about how to keep data from being lost. They use RAID arrays, which protect systems from drive failure. They do regular tape backups. Some of THEM do multisite backups, automatically mirroring your data to another node in their network to protect against catastrophic failure.

If you can't see how that's better than gmailing yourself all your files, I have failed at this argument.


Other benefits
Not only is this a solid backup strategy that requires minimal manual intervention, there are several side benefits it gives you for free.
  • If you are working on a project collaboratively, how does your collaborator know they have the most current revision of the file? With revision control, you push changes up to your revision server, give your partner access and let them pull the most current changes using the same software. Unlike email, this works in real-time. You can even lock your files to show that you are working on them, so neither of you stomps on the other's changes.
  • Every checkin to revision control allows you to add a comment. So when you are looking for a particular past revision of a file, you can read "Road Trip: Changed protag from a man to a woman" instead of digging through a bunch of files called "RoadTrip_v132.doc," "RoadTrip_v133.doc" and so on. Additionally, the system automatically tracks commit times and revision numbers, so even if you don't add comments, it's no harder than looking through a pile of hand-versioned files.
  • If you work on multiple machines, like I do, it's a snap to keep them in sync: on your desktop, push changes up to the server, on your laptop, pull them down.
  • If you must have multiple backup sites, revision control makes it painless to keep them in sync, as well. You can even configure one revision control server to automatically push changes over to another (though this takes a little black magic; however, there are plenty of people who will gladly help you set this up for not very much money or free--including me).
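To illustrate the point about commit comments, here is a small Python sketch of searching a commit log. The LogEntry type, the find_revisions function, and the sample history are all invented for this example; real tools ship their own log commands.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    revision: int
    committed_at: datetime
    comment: str


def find_revisions(log, keyword):
    """Return log entries whose comment mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [entry for entry in log if needle in entry.comment.lower()]


# Made-up history for illustration only.
log = [
    LogEntry(131, datetime(2009, 2, 9, 21, 15), "Road Trip: tightened opening scene"),
    LogEntry(132, datetime(2009, 2, 10, 20, 40), "Road Trip: Changed protag from a man to a woman"),
    LogEntry(133, datetime(2009, 2, 11, 22, 5), ""),  # no comment, but still numbered and timestamped
]

matches = find_revisions(log, "protag")
```

Even the entry with no comment carries a revision number and a commit time, which is all a hand-named "RoadTrip_v133.doc" would have told you anyway.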
All right, enough of my ranting. If I have even cracked your resolve on this, I encourage you not to take my word for it, but to do more research. Talk to your programmer friends. Google for some of the terms I've thrown around in this post. Go look at the web sites of some of the systems I'm talking about; the one I personally recommend for people getting started with RCS is Subversion. It is free, it's widely adopted throughout the open-source community (lots of people to answer your questions), there are a number of easy-to-use clients for it (such as TortoiseSVN), and it's pretty easy to set up. (In fact, some ISPs that cater to developers, such as Joyent, the one I use, actually have a control panel that will greatly simplify the process.) I used to use Subversion, but if you're feeling ambitious, you might have a look at Bazaar, which is the RCS I use nowadays. (Word of caution: it's a more complex piece of software, so don't let that sour you on the whole RCS strategy.)

Lastly, if you're interested in hearing more about this, please comment. I will be happy to reply privately or answer people's questions here.

TimK
Saving the world from arcane backup strategies, one writer at a time.

5 Comments:

At 9:04 PM, Blogger jaylake said...

As it happens, I do use revision control and offsite backup. I just do them by crude, largely manual means rather than with sophisticated tools. I don't back up like an Engineering pro, not in the slightest, but I do back up like a writing pro... :P

 
At 5:50 AM, Blogger pauljessup said...

Actually, RCS was around before 1990, and was a big component of almost every UNIX system. So Jay, no offense but your method of backing up is older than UNIX :)


Anyway- why not mention the program RCS itself? I use it, since my server doesn't support things like CVS or Subversion. It's free, but a command line utility (not that hard to learn though) and can run on Mac, Linux or Windows with ease. Not just safe and secure, but cross platform :)

I even think there is an RCS plugin for use with OpenOffice.

 
At 1:45 PM, Blogger MrTact said...

@jaylake - Fair enough, I just wanted to spread the information that there may be a better way. It's pretty clear from reading your blog that you're a highly disciplined individual. I'm not, so I prefer to rely on computerized systems that just work, with minimal intervention on my part.

@pauljessup - RCS will certainly do the job, but I wouldn't consider it the best choice for the audience in question. I've been a developer for 15 years or so and *I've* never used it! (Though I have used CVS.)

 
At 7:05 AM, Blogger Luke said...

Actually, he is not totally wrong. Multiple layers of backups are a good thing.

When I was writing my MS Thesis I did use Subversion, but just to be safe I always kept redundant copies on my desktop, on my laptop, on a pocket flash drive and on my workstation in the lab at school. All these locations were synced via SVN.

Thanks to this I had the peace of mind to know that if my house burned down, my car exploded (taking the laptop with it), my school was vaporized by aliens and my host suddenly lost power to their data center I would still likely have a semi-recent copy of my thesis in my pocket.

So yeah, version control is great but having redundant backups gives you peace of mind.

But yeah, that whole "I'll email this to myself as backup" thing irritates me to no end. That's not what email is for! This is one case where "emergent" behaviors are irritating.

Also, for people who can't figure SVN out I'd recommend this service:

https://www.getdropbox.com/

You get 2GB of online storage. When you install their service you get to pick a folder on your machine. Anything you put in that folder will get automatically uploaded to your dropbox (seamlessly in the background).

Furthermore if you have the service running on more than one computer they will all sync their folders to the online repository. So you edit a file on your laptop, hit save and it gets automatically uploaded, and also synced over to your desktop. Very neat. I use it to share data between my machines all the time.

 
At 9:22 PM, Blogger Open said...

nice post!

If you are looking at free or open source revision control, then here is a post that has a list of such revision control tools that are very capable and nice.

Here is a list that you might want to check:
Best CVS like revision control softwares

 
