Sonoma State University CREU: intermittent bugs


Friday, June 24, 2011

Resolving a Heisenbug

Resolving a Heisenbug can be tricky, since by its nature it resists the debugger. Instead, we went the unscientific route: throw a bunch of solutions at the problem and then, in true programmer fashion, not question what fixed Wattsup.

Part one of the solution was to rewrite the packet handling code within Wattsup to fix several line errors and to change how Wattsup handled mis-sized packets. Originally the code would give up and kill the run if the packet size changed because the program had fallen out of sync (packets being what the watts data is sent inside). Now Wattsup recaptures the mis-sized packet and concatenates it with the following packet (which most of the time will also be mis-sized and will follow the first packet in data order).
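To make the idea concrete, here is a minimal sketch of that kind of reassembly. It is not the actual Wattsup code (which this post does not show): the function name, the fixed expected_size and the byte-string packets are all assumptions made up for illustration.

```python
def reassemble(chunks, expected_size):
    """Yield full-size packets, joining a mis-sized chunk with what follows it.

    Hypothetical sketch: assumes every complete packet is exactly
    expected_size bytes and that fragments arrive in data order.
    """
    pending = b""  # bytes held over from a mis-sized chunk
    for chunk in chunks:
        candidate = pending + chunk
        if len(candidate) == expected_size:
            yield candidate          # a whole packet, possibly stitched together
            pending = b""
        elif len(candidate) < expected_size:
            pending = candidate      # still short: wait for the next chunk
        else:
            pending = b""            # overshoot: drop it and resync on the next chunk


if __name__ == "__main__":
    stream = [b"ABCDEFGH", b"ABC", b"DEFGH", b"ABCDEFGH"]
    for packet in reassemble(stream, expected_size=8):
        print(packet)
```

The point of the design is that a short read no longer kills the run; it is simply held until the rest of the packet shows up.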

Part two involved unplugging the meters for a while to clear their caches. While they were unplugged, we checked whether the physical meters were overheating. One meter did feel warm to the touch, which could be a result of overheating or just normal heat left over from finishing a workload. A last-minute idea when we went back to reconnect all the cables was to switch which meter measured each computer. In addition, the workspace was rearranged to provide more airflow around the meters, since they had felt warm to the touch.

Wattsup now runs without problems! What actually fixed the Heisenbug? Probably a combination of all three solutions (rewriting code, switching meters and increasing air circulation). The important work now moves toward analyzing all the data that we have finished collecting.

Here are a few graphs from our data that will inspire future analysis headaches:

Tuesday, May 3, 2011

Dude Where's My Data?

Coming down from the high of industry talks, vacationing and socializing with the family, it was back to the binary trenches. Because errors and code bugs never rest!

Part of using scripts to automate data retrieval from the benchmarks, so we can run them in succession across controlled frequencies, is that when things go wrong... they go missing. No more non-compiling code or segfaults, just missing records that aren't apparent even when stalking the logs in real time! Recently our error logs were deceptively empty because the data logs were empty as well. The power meter (delightfully, from a brand called WattsUp) was fine. Maybe the cable was shorting out? After going back to baseline tests, we concluded that the meter had hiccuped, and we settled on running the few troublesome frequencies one at a time instead of as a set. This resolved the problem and began a new one.
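Since an empty error log proves nothing when the data log is empty too, one cheap safeguard is to check after each run that every expected data log exists and actually contains something. The sketch below is only an illustration: the function name and the file names are hypothetical, not the ones our scripts use.

```python
from pathlib import Path


def check_logs(directory, expected_names):
    """Report which expected log files are missing and which are empty.

    Hypothetical helper: expected_names is whatever list of file names
    the run was supposed to produce inside directory.
    """
    report = {"missing": [], "empty": []}
    for name in expected_names:
        path = Path(directory) / name
        if not path.exists():
            report["missing"].append(name)
        elif path.stat().st_size == 0:
            report["empty"].append(name)
    return report


if __name__ == "__main__":
    # Made-up file names, one data log per frequency.
    print(check_logs(".", ["freq_low.log", "freq_mid.log", "freq_high.log"]))
```

Run right after each frequency finishes, a check like this would flag the silent failure at once instead of at the end of the whole set.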

Reading is not hard, but for some reason it's deceptive when sleep deprived. When the time came to run the benchmarks related to the GPU (as opposed to the troublesome CPU), well, that didn't run so smoothly either, as with any plan once it makes contact with the enemy on the field. Forgetting to rename the tests surprisingly did not harm anything, because the test failed after one frequency and only destroyed an otherwise very small dataset that can easily be recovered.

Sunday, February 27, 2011

Intermittent Bugs

Like a bad earworm that won't go away, intermittent bugs are resurfacing to bring progress to a standstill. After running the last set of logs there was a strange conundrum: according to screen (a Unix utility that keeps a terminal session running after your remote session logs off, so you can check on it later) the full run went through smoothly. But all logs after the baseline were MIA. A head-scratching conference with Dr. Rivoire followed. We decided to run the test again; it could have just been a fluke. Open up a new instance of screen, begin the test again, check in later on the logs... we have the same problem, only this time there is just one set of logs for one frequency.

What's a lowly research assistant to do? This all worked fine not more than a week ago, and nothing has changed. Except for the bout of bad weather during the first fluke, and that does not excuse the weirdness with the second run (which had fine weather). Time to stalk the logs in real time! Sadly there will be no British narrator doing a voice-over as if this were a National Geographic documentary. Research is far from error-proof, which will be a hard lesson for the young research assistant to learn.

Stalking the logs in real time is easier than combing through a huge batch of them once a run is finished. The reason for doing it in real time is that when the bug occurs it is highly visible. The moment something pops up in the error log you can begin investigating while letting the run continue. For example, if during the baseline this gem pops up in the error log:

wattsup: [error] Reading final time stamp: Bad address

We can begin by checking the date simultaneously on both rickroll and lolcat to see if maybe they've gotten out of sync. That potential cause is ruled out when both show the correct time. Here begins another round of head scratching, because in the error logs following baseline there are no more errors to be found (using a shell wildcard as a quick way to look at all the error logs at once: tail testSpockCat*.err). Too bad the idea of naming the run Spock didn't spark some fear into the computers. Now if you'll excuse me, I've got to resume stalking the logs for that most elusive and shy of prey, the pesky error.
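For anyone who would rather do the same check from a script, here is a rough Python equivalent of tail testSpockCat*.err. It is just a sketch: the function name is made up, and the default of ten lines mirrors what tail prints when given no options.

```python
import glob
from collections import deque


def tail_logs(pattern, lines=10):
    """Return {filename: last few lines} for every file matching pattern."""
    tails = {}
    for name in sorted(glob.glob(pattern)):
        with open(name, errors="replace") as handle:
            # A bounded deque keeps only the final `lines` lines of the file.
            last = deque(handle, maxlen=lines)
        tails[name] = [line.rstrip("\n") for line in last]
    return tails


if __name__ == "__main__":
    for name, last_lines in tail_logs("testSpockCat*.err").items():
        print(f"==> {name} <==")
        for line in last_lines:
            print(line)
```

An error log with nothing in it shows up as an empty list, which makes the "no errors after baseline" situation easy to spot at a glance.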