Friday, June 24, 2011

Resolving a Heisenbug

Resolving a Heisenbug can be tricky since, by its nature, it resists the debugger. Instead we took the unscientific approach of throwing a bunch of solutions at the problem and then, in true programmer fashion, not questioning which one fixed Wattsup.

Part one of the solution was to rewrite the packet handling code within Wattsup to fix several errors in how it handled mis-sized packets (packets being what the watts data is sent inside). Originally the code would give up and kill the run if a packet's size changed because the program had fallen out of sync. Now Wattsup recaptures the mis-sized packet and concatenates it with the following packet (which, the majority of the time, will also be mis-sized and follow the first packet in data order).
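The recapture-and-concatenate idea can be sketched in a few lines. This is a hypothetical illustration, not Wattsup's actual code: the function name, the comma-separated packet format, and the 18-field packet size are all assumptions made for the example.

```python
# Hypothetical sketch of recapturing mis-sized packets instead of killing
# the run. The comma-separated format and 18-field size are illustrative.
EXPECTED_FIELDS = 18  # assumed size of a complete data packet

def reassemble(packets):
    """Yield complete packets, merging mis-sized fragments with successors."""
    buffer = []
    for pkt in packets:
        fields = pkt.strip().split(',')
        if buffer:
            # A previous packet was short: concatenate before checking size.
            fields = buffer + fields
            buffer = []
        if len(fields) < EXPECTED_FIELDS:
            buffer = fields          # hold the fragment for the next packet
        elif len(fields) == EXPECTED_FIELDS:
            yield fields             # a complete reading
        else:
            buffer = []              # oversized and unrecoverable: drop it
```

The key design change is that a short packet is no longer fatal; it is held until the next packet arrives, on the assumption (stated in the post) that the two fragments usually follow each other in data order.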

Part two involved unplugging the meters for a while to clear their caches. While the meters were unplugged we checked whether they were overheating. One meter did feel warm to the touch, which could be a sign of overheating or just normal heat left over from finishing a workload. As a last-minute idea when reconnecting all the cables, we also switched which meter measured each computer, and rearranged the work space to provide more air flow around the meters.


Wattsup now runs without problems! What actually fixed the Heisenbug? Probably a combination of all three solutions (rewriting code, switching meters and increasing air circulation). The important questions now move toward analyzing all the data we have finished collecting.

Here are a few graphs from our data that will inspire future headaches during analysis:




Saturday, June 18, 2011

Heisenbug and Martha Stewart Cleaning


Bottlenecks are incredibly frustrating: they hamper progress because the research can go only as fast as the slowest component, and if the slowest component has intermittent bugs then that pace is a frustrating crawl. At long last the bottleneck on this research is software no one on the project wrote! That shouldn't be cause for celebration, but as childish as it sounds, it's nice for once that the blame doesn't fall on the research team.

There's even a name for the type of bug causing our bottleneck:

Heisenbug - A bug that disappears or alters its behavior when one attempts to probe or isolate it ... the use of a debugger sometimes alters a program's operating environment significantly enough that buggy code ... behaves quite differently.

Turn the debugger on and the problem vanishes. Turn the debugger off and the problem resurfaces, but we can't recreate the error with the debugger back on. This is incredibly frustrating to realize, try to debug, and work around.

Sadly it is our data collection software, Wattsup, that is dealing with said Heisenbug. The research uses Wattsup to take a reading of the power (in watts) at a set time interval over the length of a benchmark run. The logs show a normal run. The error logs contain only one error that appears at random during the course of a run. But look inside the .ac file (where a timestamp and a measure of watts at that time are expected) and lo, three to four minutes before the end of the run the data cuts off.

What does one do while isolating Heisenbugs? You let the run go and tail the logs, but these runs last anywhere from twenty minutes to hours; you can't sit there hitting ls -l on the logs. Instead it's the perfect time to go Martha Stewart spring cleaning on all the versions of research code between rickroll and lolcat: cleaning up the update logs, output and naming conventions of automatically generated files. No more should we need a nifty flow chart.

Organization and version control are crucial! Actually executing organization and version control, however, involves hopping between multiple languages. Unix shell, Perl, Python and some regular expressions made life easy, even as the brain began to feel the differences in syntax. Hopefully from this point onward (once Wattsup is "debugged" for good) the project can get some solid data from lolcat. Then onward to fresh headaches analyzing our model against actual data from rickroll and lolcat.

Tuesday, June 7, 2011

The Wonder of More Time

Not even a week back at the research and the progress is unbelievable! What could possibly be the reason behind so much happening in only two days? Was it the short break away from the data? The new eight hours the research is getting each day? Who knows! I just hope the progress keeps running full tilt. Let's recap what's happened since coming back from the brief hiatus at the end of last semester and the parting of Vince.

Day 1 
Getting back to the code actually turned into spring cleaning. In the mad dash of running ahead with the progress last semester, things got a little hairy in the directories: test runs everywhere, and scripts written in haste that proved they needed to be rewritten. So the first day back was spent cleaning up directories, sorting and rerunning data runs, and patching mischievous scripts that direly needed documentation. It's amazing to feel the difference of having the time to not only spot errors but correct them right away. Compared to fighting for time between research, homework and projects during the semester? I'll happily take my eight-hour work day this summer over the last two crazy ten-hour-a-week semesters.

Day 2
Analyzing data has never been so interesting! The day began with filling out a very fancy Excel sheet and utilizing functions in R. This was all done on data from our calibration, CPU and GPU benchmarks for Rickroll. With more initial data to go on than previously, two things jumped out:

1) Our GPU numbers for mean squared error and dynamic range error are curiously opposite of what we initially predicted (for certain frequencies the numbers are inverted: we expected higher error at lower frequencies and lower error at higher frequencies).

2) Due to how SPECjbb stresses the CPU and what happens with the power consumption, there is serious discussion of adding another CPU benchmark to stress the CPU even further.
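For point 1, the two metrics themselves are only a few lines each. The sketch below uses made-up numbers and a plain max-minus-min definition of dynamic range error; the project's exact definition may differ, so treat this as an illustration rather than our actual analysis code.

```python
# Illustrative versions of the two error metrics compared per frequency.
# The "spread" definition of dynamic range error is an assumption.
def mean_squared_error(predicted, measured):
    """Average squared difference between model and measured watts."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)

def dynamic_range_error(predicted, measured):
    """How far off the model's spread (max - min watts) is from reality."""
    return abs((max(predicted) - min(predicted)) -
               (max(measured) - min(measured)))
```

Computing both metrics at each frequency and plotting them against frequency is what surfaced the inversion: the curves slope the opposite way from our prediction.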

I can only imagine how much further we'll be at the end of this week.