Here's the thing: I'm creating some simulations in Python, a common programming language. Well, the simulation itself runs as highly optimized parallel C++ code (we use NEST; it's good, take a look), but we use Python to build the model, set up and run the simulation, and collect and save the resulting data.
We run quite a lot of simulations, so I want to collect the data from one simulation run into one catalog (a directory, in other words). So, towards the end of the simulation we create a new catalog:

    os.makedirs(catalog)
Now, the problem is that if the catalog already exists this will fail. Python will complain and stop. We need to check that the catalog doesn't already exist first:
    if not os.path.exists(catalog):
        os.makedirs(catalog)
Great! Problem solved. Job well done, beers for everyone, pat yourself on the back. Except...
These simulations are pretty heavy. They take a long time to run. To make things go faster we use many CPUs in parallel. On my desk I have an 8-CPU computer which helps a bit, and I have access to 256 CPUs in a cluster in Tokyo if I need it. Eventually, of course, this project (where this model is just one part) aims to use the new supercomputer in Kobe when it comes online next year, and that one has more CPUs than you can shake a very long stick at.
This means that the Python code above isn't run just once for a simulation, but once for each CPU we use (some of you probably realize where this is going already). Normally this is not a problem: the first process to reach this code will create the new directory. The other processes will find the directory is already there and skip the creation.
But here's the problem: there's a tiny bit of time in between checking if the catalog exists, and creating it if it doesn't. What happens if two processes try to do this at the same time? Both find that the catalog does not exist, one of them creates the catalog, then the other tries to create it too - it wasn't there a moment ago after all - and the whole simulation fails. That's what happens.
This is called a "race condition", because you have two (or more) processes 'racing' each other, and you get different results depending on which one happens to get there first.
The chance of this actually happening - that two processes manage to get the timing so exactly wrong - seems really slim of course. It may indeed be a really rare and unlucky event if you just run a simulation once and get hit by this. But if you're running lots of simulations, and using lots of CPUs, then you're bound to get hit by this bug sooner or later.
And I did, this weekend. I had started a long series of simulations to run over the weekend. As I came to work this morning I found that the series had failed about halfway through because of this bug. Bad programmer. Bad, bad programmer. No coffee for you.
So, instead of almost three days worth of simulation data to analyze, I now have to spend another day generating the missing stuff - a day that I can ill afford with deadlines looming like thunderclouds over my head.
What should I have done? What should you do? Something like this:
    import os
    import errno

    try:
        os.makedirs(catalog)
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass
        else:
            raise
This is Python's way of dealing with exceptions - run-time errors such as failing to create a catalog. We don't check beforehand if the catalog exists, but simply try to create it. If we fail, we don't just stop. Instead we catch the error (that's the "except" bit). If the error is that the catalog already exists we just ignore it and continue the simulation (the "pass" thing). If it was some other kind of error we send it on for the system to take care of ("raise", as in raise a flag to alert that something is wrong).
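As an aside: if you're on a recent Python (3.2 or newer), os.makedirs can do this whole dance for you with its exist_ok flag. A minimal sketch:

```python
import os
import tempfile

# A throwaway catalog path just for this demonstration.
catalog = os.path.join(tempfile.mkdtemp(), "results")

# exist_ok=True means "don't complain if the catalog is already there",
# so the check, the creation and the error handling collapse into one call:
os.makedirs(catalog, exist_ok=True)
os.makedirs(catalog, exist_ok=True)   # a second call is a harmless no-op
```

Every process can call this safely, no matter which one gets there first.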
I did not do this. Which made me lose a day's worth of data analysis. And makes me a blockhead. Don't be a blockhead like me. Take care of race conditions when you program. Do proper error checking and recovery. Think of the future. Think of the children.