[Feb. 12th, 2017|10:05 pm]
My boss got back from AMS early the week before last, and decided that yes, she did want me to go all-out to try to get the new dataset fully bias-corrected by mid-March, in time for it to be included in a big assessment report.
This is both a simple and a complicated goal. On the one hand, I don't need to do any further method development: I have code that does the bias correction, I've tested it thoroughly on a good set of test cases, and I know it works. On the other hand, it takes about 18 seconds to do one location. When you multiply that out by the full spatial domain and all the different variables and simulations, it adds up to about 2500 CPU-hours to do the whole thing. Run on a single processor, that's about 104 days, which puts us considerably past mid-March.
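As a quick sanity check on those numbers, here is a minimal Python sketch. The variable names are invented for illustration, and the task count is only what the two quoted figures imply when divided out, not a number stated in the post.

```python
# Figures quoted in the post (both approximate).
SECONDS_PER_LOCATION = 18
TOTAL_CPU_HOURS = 2500

# Total compute expressed in seconds.
total_seconds = TOTAL_CPU_HOURS * 3600

# Number of 18-second jobs implied by the two figures
# (locations x variables x simulations).
implied_tasks = total_seconds / SECONDS_PER_LOCATION

# How long the whole thing takes on one processor, in days.
serial_days = TOTAL_CPU_HOURS / 24

print(f"implied tasks: {implied_tasks:,.0f}")
print(f"serial run time: {serial_days:.1f} days")
```

The two quoted figures work out to roughly half a million independent 18-second jobs, which is why a single processor is not an option.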
The good news is that the problem is embarrassingly parallel. (I might say it's even a step beyond "embarrassingly" parallel and into the realm of the ludicrously or stupidly parallel.) So all I have to do is get it set up to run in parallel on our supercomputer or maybe on the cloud and it'll be done in no time. (In principle, if I could get half a million processors, I could do the whole thing in under a minute. In practice, I gather that it always takes at least 5 minutes to get things spun up.)
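To illustrate what "embarrassingly parallel" buys, here is a minimal Python sketch, not the actual R code or the approach used on the real dataset. `correct_location` is a hypothetical placeholder for the real bias correction, `run_all` and `wall_clock_seconds` are made-up helper names, and the 500,000 task count is only what 2500 CPU-hours at 18 seconds per job implies. The estimate also ignores spin-up time and I/O.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def correct_location(loc_id):
    """Hypothetical stand-in for bias-correcting one location.

    Each call depends only on its own input, which is what makes
    the problem embarrassingly parallel.
    """
    return loc_id, loc_id ** 2  # placeholder result, not real science


def run_all(location_ids, workers):
    """Farm the independent jobs out to a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_location, location_ids, chunksize=64))


def wall_clock_seconds(n_tasks, n_workers, seconds_per_task=18):
    """Idealized elapsed time: each worker handles its share in sequence."""
    return math.ceil(n_tasks / n_workers) * seconds_per_task


if __name__ == "__main__":
    for workers in (1, 1_000, 500_000):
        secs = wall_clock_seconds(500_000, workers)
        print(f"{workers:>7} workers: {secs / 3600:.3f} hours")
```

Under these assumptions, a thousand workers would bring the job down to a few hours, and half a million workers would each do exactly one 18-second job. `run_all` is shown only for shape; on a real cluster the jobs would more likely be submitted through the scheduler than through a single Python process pool.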
But there are a lot of unknowns in getting it to run in parallel. I theoretically know how to get R to run in parallel using MPI -- I'm a coauthor on a tech note about it -- but I've never actually done it myself. So the first week was pretty stressful, because I was constantly aware of the clock ticking as I tried, with rather limited success, to Make It Go.
Happily, early this last week I got some help from one of our computing consultants, and although we didn't get the MPI approach to really work, he pointed me at another approach that was both simpler and better suited to the task at hand. I've replicated my test case and it works beautifully, so now it's just about restructuring the data to scale it up. Which means I no longer have to have all the time horizons and contingency plans and feasibility evaluation checkpoints floating around in my head; I can just focus on implementing the solution. Which is a huge relief.
(Hmm, that turned out longer than I planned. I'll save the non-work stuff for another post.)
So . . . Is it the same as having nine women working to have one baby in just one month?
2017-02-13 07:08 am (UTC)
That's an example of a non-parallelizable problem.
My operating systems teacher used the following analogy:
Suppose you're a farmer and you dig furrows by hand, which takes you all day.
Then you get twice as much land so you get an ox and a plow.
Then you double your land again, so you get a second ox for your plow.
Your land size doubles again, so you buy a tractor.
Your land size doubles, so you get a bigger tractor.
Another doubling, so you get a really big tractor with one of those really wide multi-furrow things on the back.
Your field grows again, but you're starting to reach the limits of what you can fit on one tractor. So you buy a second really big tractor and hire someone to help you.
More land growth and you need four really big tractors.
Your farm doubles again, but eight really big tractors is looking pretty expensive and hard to manage. So what if instead you hired 64 people with an ox-and-plow?
… or what if you hired a million people with spoons?
A supercomputer is a lot like a big fancy tractor. Cloud services provide something more like a giant unlimited supply of oxen, plows, and cheap labor. And with a couple billion smartphones on the planet, you can do a lot of digging if only you can figure out how to coordinate all those spoons.
But this story only works if you can split it up like plowing a field. If you're building a skyscraper or making a baby, you can't productively use all million spoon-wielding laborers.