Resolving a Heisenbug is tricky because by its nature it resists the debugger. So we went the unscientific route: throw a bunch of solutions at the problem and then, in true programmer fashion, not question which one fixed Wattsup.
Part one of the solution was to rewrite the packet-handling code within Wattsup to fix several line errors and change how Wattsup handles mis-sized packets (packets being what the watts data is sent inside). Originally the code would give up and kill the run if the packet size changed because the program had fallen out of sync. Now Wattsup recaptures the mis-sized packet and concatenates it with the following packet, which most of the time is also mis-sized and follows the first packet in data order.
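For the curious, here is a minimal sketch of that recapture-and-concatenate idea. It is not the actual Wattsup code: the function name `reassemble`, the byte-string packets and the fixed `expected_size` are all assumptions made up for illustration.

```python
def reassemble(packets, expected_size):
    """Yield packets, joining a mis-sized packet with the packet after it.

    Hypothetical helper: `packets` is any iterable of byte strings and
    `expected_size` is the packet length we assume a healthy run produces.
    """
    pending = b""
    for packet in packets:
        if pending:
            # The previous packet was mis-sized, so join the two in arrival order.
            yield pending + packet
            pending = b""
        elif len(packet) != expected_size:
            # Hold on to the mis-sized packet instead of killing the run.
            pending = packet
        else:
            yield packet
    if pending:
        # A mis-sized packet arrived last, with nothing left to join it to.
        yield pending


if __name__ == "__main__":
    stream = [b"abcd", b"ab", b"cd", b"efgh"]
    print(list(reassemble(stream, expected_size=4)))
```

The sketch simply trusts that the packet after a mis-sized one is its other half, which matches what we saw most of the time but is not guaranteed.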
Part two involved unplugging the meters for a while to clear their caches. While they were unplugged, we checked whether the physical meters were overheating. One meter did feel warm to the touch, which could be a sign of overheating or just normal heat left over from finishing a workload. A last-minute idea when we went back to reconnect all the cables was to switch which meter measured each computer. In addition, the work space was rearranged to provide more airflow around the meters.
Wattsup now runs without problems! What actually fixed the Heisenbug? Probably a combination of all three solutions (rewriting code, switching meters and increasing air circulation). The important work now moves toward analyzing all the data that we have finished collecting.
Here are a few graphs from our data that will inspire future analysis headaches:
Modeling the Power Consumption of Computer Systems
with Graphics Processing Units (GPUs)
Friday, June 24, 2011
Saturday, June 18, 2011
Heisenbug and Martha Stewart Cleaning

Bottlenecks are incredibly frustrating. They hamper progress because the research can go only as fast as the slowest component. If the slowest component has intermittent bugs, then that pace is a frustrating crawl. At long last the bottleneck on this research is software no one involved on the project wrote! That shouldn't be cause for celebration, but as childish as it sounds, it's nice for once to not have the blame be on the research team.
There's even a name for the type of bug causing our bottleneck:
Heisenbug - A bug that disappears or alters its behavior when one attempts to probe or isolate it ... the use of a debugger sometimes alters a program's operating environment significantly enough that buggy code ... behaves quite differently.
Turn the debugger on, the problem vanishes. Turn the debugger off, the problems resurface, but we can't recreate the error with the debugger back on. This is incredibly frustrating to discover, to try to debug, and to work around.
Sadly it is our data collection software, Wattsup, that is dealing with said Heisenbug. The research uses Wattsup to take a reading of the power (in watts) at a set time interval for the length of the benchmark run. The logs show a normal run. The error logs have only one error, which appears randomly during the course of a run. But look inside the .ac file (where a timestamp and a measure of watts at that time are expected) and lo, three to four minutes before the end of the run the data cuts off.
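Spotting that cutoff by eye gets old fast, so here is a rough sketch of how the check could be scripted. It assumes each line of the .ac file is a timestamp in seconds followed by a watts reading, separated by whitespace. The real file layout may differ, and the function name `seconds_missing` is made up for this example.

```python
def seconds_missing(ac_lines, run_end):
    """Return how many seconds before `run_end` the readings stop.

    Assumed line format (not confirmed): "<timestamp_seconds> <watts>".
    Lines that do not parse are skipped. Returns None if nothing parses.
    """
    last_timestamp = None
    for line in ac_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            timestamp, _watts = float(fields[0]), float(fields[1])
        except ValueError:
            continue
        last_timestamp = timestamp
    if last_timestamp is None:
        return None
    # Clamp at zero in case the meter kept logging past the end of the run.
    return max(0.0, run_end - last_timestamp)


if __name__ == "__main__":
    sample = ["100 85.2", "101 86.0", "not a reading", "102 84.9"]
    print(seconds_missing(sample, run_end=300))
```

With that assumed format, a result somewhere between 180 and 240 seconds would correspond to the three to four minutes of missing data described above.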
What does one do while isolating Heisenbugs? You have to let the run go and tail the logs, but these runs last from twenty minutes to hours. Can't sit there hitting ls -l on the logs. Instead it's the perfect time to go Martha Stewart spring cleaning on all the versions of research code between rickroll and lolcat. Basically, clean up the update logs, output and naming conventions of automatically generated files. No more should we need a nifty flow chart. Organization and version control are crucial! Actually executing organization and version control, however, involves hopping between multiple languages. Unix, perl, python, and some regular expressions made life easy, even as the brain began to feel the differences in syntax. Hopefully from this point onward (once Wattsup is "debugged" for good) the project can get some solid data from lolcat. Then onwards to find fresh headaches analyzing our model against actual data from rickroll and lolcat.