Showing posts with label WattsUp. Show all posts

Thursday, July 21, 2011

Cable Drama (Not of the Soap Variety)

Let me begin with a sincere thank you to the folks at Microsoft Research in Mountain View for donating a WattsUp-meter-friendly USB cable. Anyone who has to depend on these meters knows their cables can't be replaced with a quick trip to the store, because the port on the meter itself is recessed.

The story behind this donation revolves around two WattsUp meters [see left in the red circles].

One of these meters began to return too many errors and bad packets. Not a problem: we avoided it by changing the setup so the Machine Under Test (MUT) always used the "good" meter. Sharing this one "good" meter between the two computers did cause a minor headache because it left room for errors to happen. For each workload we wanted to run on both boxes (meaning all benchmarks) we would have to swap meters. This meant shutting down the computers, unplugging a great many cables and hoping things didn't get messed up when the cables were plugged into their new configuration.

How does a new cable play into this meter drama? The last time we reconfigured the wiring, our "good" meter began to exhibit the same behavior as the "bad" meter. This time the only things changed were the USB cables relaying the data from the MUT's meter to the Data Acquisition Machine. Turns out both meters work fine; it was a dud cable.

Dr. Rivoire, our faculty adviser, had recently replaced the cables on the WattsUp meters used in her research at Microsoft because of a similar problem. QED, let's change out the cable. In short, the bottleneck on progress has been resolved thanks to a coworker of Dr. Rivoire's, who cut and refitted the casing on a USB cable to fit WattsUp's unique recessed port.

Workloads ahoy! I'll finally have a chance to use my new parsing script, written in Perl and abusing some tricks of regular expressions & unique logs.

Friday, June 24, 2011

Resolving a Heisenbug

Resolving a Heisenbug can be tricky since by its nature it resists the debugger. Instead, we decided to go the unscientific way of throwing a bunch of solutions at the problem and then, in true programmer fashion, not questioning what fixed Wattsup.

Part one of the solution was to rewrite the packet-handling code within wattsup to fix several line errors and to change how wattsup handled mis-sized packets. Originally the code would give up and kill the run if the packet size changed because the program had fallen out of sync (packets being what the watts data is sent inside). Now Wattsup recaptures the mis-sized packet and concatenates it with the following packet (which, most of the time, will also be mis-sized and will follow the first packet in data order).
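For the curious, here is a minimal sketch of that recapture-and-concatenate idea. It is not the actual wattsup code: the function name `reassemble`, the fixed `expected_size`, and the choice to drop anything that ends up too long are all assumptions made for illustration.

```python
def reassemble(chunks, expected_size):
    """Yield full-size packets, joining a short chunk with the chunk after it.

    Hypothetical helper: 'chunks' is any iterable of bytes objects as they
    arrive from the meter, 'expected_size' is the assumed packet length.
    """
    pending = b""
    for chunk in chunks:
        candidate = pending + chunk
        if len(candidate) == expected_size:
            # Either a normal packet or two mis-sized halves that fit together.
            yield candidate
            pending = b""
        elif len(candidate) < expected_size:
            # Still short: hold on to it and wait for the next chunk.
            pending = candidate
        else:
            # Too long to be one packet: drop it and resynchronize
            # instead of killing the run.
            pending = b""


if __name__ == "__main__":
    stream = [b"AAAA", b"BB", b"BB", b"CCCC"]
    for packet in reassemble(stream, 4):
        print(packet)
```

The point of the design is simply that a size mismatch becomes a recoverable event rather than a fatal one.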

Part two involved unplugging the meters for a while to clear their caches. While they were unplugged, we checked whether the physical meters were overheating. One meter did feel warm to the touch, which could be a result of overheating or just normal heat left over from finishing a workload. A last-minute idea when we went back to reconnect all the cables was to switch which meter measured each computer. In addition, the work space was rearranged to provide more air flow around the meters, since they had felt warm to the touch.


Wattsup now runs without problems! What actually fixed the Heisenbug? Probably a combination of all three solutions (rewriting code, switching meters and increasing air circulation). The important question now moves toward analyzing all the data that we have finished collecting.

Here are a few graphs from our data that will inspire future headaches during analysis:




Tuesday, May 3, 2011

Dude Where's My Data?

Coming down from the high of industry talks, vacationing and socializing with the family, it was back to the binary trenches. Because errors and code bugs never rest!

Part of using scripts to automate data retrieval from the benchmarks, so we can run them in successive order across controlled frequencies, is that when things go wrong...they go missing. No more non-compiling code or segfaults, just missing records that aren't apparent even when stalking the logs in real time! Recently our error logs were deceptively empty because the data logs were empty as well. The power meter (delightfully from a brand called WattsUp) was fine. Maybe the cable was shorting out? After going back to running baseline tests, we concluded that the meter had hiccuped and settled on running the few troublesome frequencies one at a time instead of as a set. This resolved the problem and began a new one.
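In hindsight, a tiny sanity check after each batch would have caught the empty data logs sooner. The sketch below is only an illustration, not part of our scripts: the function name `find_empty_logs` and the example file names are made up.

```python
import os


def find_empty_logs(log_dir, expected_names):
    """Return the expected log names that are missing or zero bytes long."""
    problems = []
    for name in expected_names:
        path = os.path.join(log_dir, name)
        # A log that was never written is as bad as one that is empty.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Hypothetical file names, one per frequency in a run.
    expected = ["freq_low.log", "freq_mid.log", "freq_high.log"]
    for name in find_empty_logs(".", expected):
        print("missing or empty:", name)
```

An empty error log only means something when the data log next to it is not empty too.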

Reading is not hard, but for some reason it's deceptive when sleep deprived. When the time came to run the benchmarks related to the GPU (as opposed to the troublesome CPU), well, that didn't run so smoothly either, as with any plan once it makes contact with the enemy on the field. Forgetting to rename the tests surprisingly did not harm anything, because the test failed after one frequency and only destroyed an otherwise very small dataset that can easily be recovered.