• 26Dec
    Author: trent Categories: Infrastructure Comments: 3

    If you’ve ever been to the doctor and had an EKG test done, you know that they get super-excited about every little spike on the EKG trace.  The cardiologist stands there and “oohs” and “ahhs” over the slightest deviation from what they know to be normal.  To me, it just looks like a squiggly line.

    In the network performance world, packet traces are the EKG equivalent for examining network health.  Anyone who has worked with me over the last 20 years knows that the first thing I want to see when someone says they have a network performance problem is a packet trace.  The really good news is that most network engineers (and even some system administrators) are able and willing to use a tool like Wireshark (formerly Ethereal) or tcpdump to capture a trace.  Sadly, my experience is that once they have the trace, most folks don’t know how to “read” it: it’s the same squiggly line problem, just in the network space.

    At some level, extracting useful data from a packet trace is something that comes with experience, and perhaps is a bit of an art.  There are (literally) hundreds of interesting conditions that a packet trace can indicate, prove, or disprove.  But success with packet trace analysis usually boils down to three things:

    1. Do you have a trace that actually contains the condition of interest?
    2. Do you understand the protocols in use well enough to identify what is correct vs. abnormal behavior?
    3. Do you examine the trace with adequate time, attention to detail, and a keen eye?

    Obviously, #1 above is essential, but I’m often surprised at how many times I get a trace that is missing this key element.  Hint: collect packet traces yourself when possible — that eliminates a lot of variables right from the beginning.

    The other two items require study and patience.  My hope is to document some of the more common conditions in this blog over time, as a reference for myself and others.  Let’s get started!

    I spend the vast majority of my trace analysis time looking at TCP/IP traces.  Almost without exception, when examining a TCP trace the first thing I look at is the TCP Time-Sequence graph (in Wireshark, this is under Statistics->TCP Stream Graph->Time-Sequence Graph).  Just like it sounds, this graph plots time on the x axis and the TCP connection sequence number on the y axis.  The theoretical ideal is a perfectly straight line whose slope equals the maximum bandwidth of the end-to-end link (the first derivative of this graph is throughput).  Of course, real world != theoretical world, except in the Cisco documentation set.
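
As a rough illustration of "slope equals throughput," here is a minimal Python sketch that computes the average slope between the first and last points of a Time-Sequence graph.  The function name and the sample values are made up for illustration; real input would be (time, sequence number) pairs exported from a capture.

```python
def throughput_bps(samples):
    """Average throughput in bits per second from (time_s, seq_bytes) samples.

    This is the slope of the straight line joining the first and last
    points of a Time-Sequence graph, converted from bytes to bits.
    """
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (s1 - s0) * 8 / (t1 - t0)


# Hypothetical samples: (seconds since start, relative sequence number in bytes)
samples = [(0.0, 0), (1.0, 125_000), (2.0, 250_000), (4.0, 500_000)]
print(throughput_bps(samples))  # 1000000.0, i.e. 1 Mbit/s
```

This ignores sequence number wraparound and assumes relative sequence numbers, which is how Wireshark displays them by default.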

    Figure 1 - Time-Sequence Graph

    Figure 1 is a somewhat typical example of such a Time-Sequence Graph from a real-world environment.  For the most part, this fits the description of a straight line, and the slope is reasonable for the connection media.  The average network guy would say “looks fine – no problem here.”

    Ah, the devil is in the details.  Look again.

    Figure 2 - Highlighted Time-Sequence Graph

    In fact, as shown in Figure 2, there are a fair number of “outlying” data points in this graph.  Due to the volume of data, the default graph does a poor job of drawing our eye to these outliers.  Look at this graph zoomed in around the 21-second mark:

    Figure 2A - Time-Sequence zoomed into interesting period

    In Figure 2A, you can see that the sequence right around the 21-second mark of the conversation is significantly off the line.  (You can also see a nasty banding issue in this example, illustrated by the groups of 2 to 4 vertically aligned dots, but in this case that’s caused by clocking on the capture/measurement side.  Someday I’ll write about other, more concerning phenomena that cause banding like this.)  But even looking at the graph at this level of detail, it still can’t be that big of a deal, right?  Wrong.  Let’s look at the frame-by-frame trace data.  There are a number of similar points in this graph; I’ve just chosen the one around the 21-second mark as a random example to dig into.
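
When the graph is too dense to eyeball, points like this can be flagged mechanically.  The sketch below assumes the off-the-line points are segments whose sequence number is lower than one already seen (a retransmission is the usual case); that is a simplification of my own, not something Wireshark does for you in this form, and the sample values are invented.

```python
def below_the_line(samples):
    """Return (time, seq) samples whose seq is lower than the highest seen so far."""
    highest = None
    flagged = []
    for t, seq in samples:
        if highest is not None and seq < highest:
            # This segment starts earlier in the byte stream than data
            # already sent, so it plots below the trend line.
            flagged.append((t, seq))
        else:
            highest = seq
    return flagged


# Hypothetical samples around the 21-second mark
samples = [(20.8, 1000), (20.9, 2000), (21.0, 3000), (21.1, 1500), (21.2, 4000)]
print(below_the_line(samples))  # [(21.1, 1500)]
```

Each flagged timestamp is a place to zoom in and then read the trace frame by frame.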

    Figure 3 - Detailed per-packet info

    Wow.  Right away we can see that things get really ugly at this level around the 21-second mark.  I’ve highlighted frame 2402 in blue.  This is the first ACK for the data sequence transmitted before frame 2386 (not shown).  Beginning at frame 2405, the receiver begins sending duplicate ACKs (identical to the ACK sent in 2402).  In fact, by the time this mess is over, it has sent 14 duplicate ACKs.
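
Counting duplicate ACKs by hand gets tedious, so here is a minimal sketch that does it over a list of acknowledgment numbers sent by one side of the connection.  It only compares ACK numbers; Wireshark's own duplicate ACK analysis looks at more than that, so treat this as an approximation.  The ACK values below are invented to mirror the 14 duplicates in this trace.

```python
def duplicate_ack_runs(acks):
    """Return (ack_number, duplicate_count) for each run of repeated ACK numbers."""
    runs = []
    previous = None
    duplicates = 0
    for ack in acks:
        if ack == previous:
            duplicates += 1
        else:
            if duplicates:
                runs.append((previous, duplicates))
            previous = ack
            duplicates = 0
    if duplicates:
        runs.append((previous, duplicates))
    return runs


# Hypothetical ACK numbers: one original ACK of 2000 followed by 14 duplicates
acks = [1000, 2000] + [2000] * 14 + [9000]
print(duplicate_ack_runs(acks))  # [(2000, 14)]
```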

    What’s happening here?  Frame 2386, a payload frame headed toward the receiver, must have been dropped.  Upon receiving subsequent payload packets, the receiver begins (intentionally) sending duplicate ACKs as an indication that it has received packets out of sequence and can only acknowledge the stream through the last in-sequence frame it received.  The sender takes the hint at frame 2410 and retransmits what was originally in 2386.  This is known as TCP fast retransmit, an algorithmic enhancement to TCP introduced by Van Jacobson and later documented by W. Richard Stevens (see RFC 2001).  As a random side note, Rich was a cherished friend, and the cool story he told me about Wayne’s World eventually made it to salon.com.  Prior to fast retransmit, the lost packet wouldn’t have been detected and retransmitted by the sender until the TCP RTO timer had expired, wasting more time and decreasing performance significantly.  Because there is a transmission window between the sender and the receiver involving many hops, it’s another 21 frames before the receiver acknowledges the retransmitted frame.  OUCH.
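
To make the mechanism concrete, here is a toy simulation of a receiver that sends cumulative ACKs and a sender-side check for the fast retransmit trigger.  The threshold of three duplicate ACKs is the one described in RFC 2001; everything else (function names, fixed segment size, the arrival order) is a simplifying assumption for illustration and not a model of a real TCP stack.

```python
DUP_ACK_THRESHOLD = 3  # duplicate ACKs that trigger fast retransmit (RFC 2001)


def receiver_acks(arrivals, segment_size):
    """Return the cumulative ACK a simplified receiver sends for each arrival."""
    received = set()
    next_expected = 0
    acks = []
    for seq in arrivals:
        received.add(seq)
        # Advance past every segment that is now contiguous with the stream.
        while next_expected in received:
            next_expected += segment_size
        acks.append(next_expected)
    return acks


def fast_retransmit_index(acks):
    """Return the index of the ACK that is the third duplicate in a row, or None."""
    duplicates = 0
    for i in range(1, len(acks)):
        if acks[i] == acks[i - 1]:
            duplicates += 1
            if duplicates == DUP_ACK_THRESHOLD:
                return i
        else:
            duplicates = 0
    return None


# Hypothetical arrivals: the segment at 2000 is lost, then retransmitted last
arrivals = [0, 1000, 3000, 4000, 5000, 6000, 2000]
acks = receiver_acks(arrivals, segment_size=1000)
print(acks)                         # [1000, 2000, 2000, 2000, 2000, 2000, 7000]
print(fast_retransmit_index(acks))  # 4
```

Note how the ACK jumps from 2000 straight to 7000 once the retransmitted segment arrives: the receiver had been holding the later segments all along.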

    Clearly, the underlying problem is the lost packet, which in this case is due to a lossy microwave link at hop 3.  What’s important to take away from this is 1) how much havoc that single lost packet created, including 14 duplicate ACKs, and 2) how “small” this issue was in the original Time-Sequence graph shown in Figure 1 (if you’ve read through to here, I urge you now to go back and look at Figure 1).

    Bottom line: look at the Time-Sequence Graph very carefully, and investigate any anomalies frame-by-frame.  But this is just one of many tools and analysis methods to consider when interpreting a packet trace… stay tuned for discussion of more of these in later posts.


3 Responses

  • wireshark_user Says:

    Very good tutorial. I learned a lot from you, thank you. Please do continue to make more of these mini tutorials. They are, no doubt, beneficial. ;-)

  • New Wireshark User Says:

    This is indeed very good tutorial!

    One question: if the slope of the time/sequence graph is too steep, does it mean the throughput is too high such that the intervening hops (routers, switches) may not be able to handle the traffic? Can that explain dropped packets?

    Plus, to fix it, should the sending side reduce the speed of sending data?

