Friday, April 22, 2011

The Importance of Profiling

Ah, profiling. A tool that I've neglected to use for far too long until forced to do so recently at work. No I don't mean that! No not that either!! Profiling is a tool which is used to examine how long each part of a given program executes to find out why it's slower than molasses in Antarctica.

A major tool that I support is a test automation framework that supports remotely exercising APIs under test via a scripting language. I had just finished the initial work of porting this system to support testing of some new software when I discovered a problem.

Under normal circumstances the program can run over one hundred commands per second depending on the time taken by the command itself. An early test however revealed that I was barely managing one or two commands per second. Something was wrong!

Worse, examining the code provided no obvious answers. There were no poorly made algorithms, no pointless sleeps, no redundant work, nothing. Embarrassingly, this was the first time I've actually used a profiling tool to solve a problem. All other times I've simply muddled around with the code until I found some obvious problem like the ones I mentioned. Whelp, no time like the present, right?

The tool I was using provided two ways to profile a program's internals. Both involve what's called instrumentation. What that means is the tool generates some extra code in the program for the purpose of measuring your code's behavior. The first profiling method, and by far the simplest, is to count the number of times every function is called. This is helpful in finding candidates for improvement or potential bugs that cause more calls to some code than necessary. The second measures the amount of time spent in each function in addition to the number of times it's called. This is better since it can tell you where your program is getting stuck longest.

In my case a function which waited for a command stood out as the long runner. Great! Except all it does is call a function to receive data. Digging through the function I found some informational prints, and a call to recv. Not much to go on. Prints are used in a lot of different places in the program and none of those calls took very long at all, so that's out. Time to put on my google Shinobi Shozoku and find out if there are situations that make recv slow.

The result? Yes! However it's not recv that's the problem. It's a little something called Nagle's Algorithm, which is part of the TCP/IP protocol stack. It's designed to reduce network traffic by attempting to combining multiple small packets and sending them at once. It does this by holding on to a small packet and waiting for a period of time before sending it, in case more data is coming. Since the tool was designed to minimize bandwidth and be very lightweight in design it would send encoded commands in the form of very small packets, typically a few dozen bytes being large for the tool. Disabling Nagle's Algorithm did the trick.

Without profiling I might have found the issue in question, but I would have wasted significantly more time either in trial and error. Or worse, not realizing I've made an error, and wasting hours trying to fix something that isn't the problem with negligible improvement.

No comments:

Post a Comment