• 28Jul
    Author: trent Categories: Infrastructure Comments: 3

    TCP header, Lego (tm) style

    The older I get, the more lessons I seem to learn (or, not learn) over and over.  Have you ever seen TCP offload work correctly?  Of course not!  I’ve been bitten by a TCP offload (aka TCP Offload Engine or TOE) problem in just about every environment I’ve touched in the last 20 years, and sadly this week was no exception.

    To make a long story short, we have a production VMware ESXi 4.1 host with both Linux (CentOS) and Windows Server 2008 guests.  No problems were reported (or measured) with the Linux guests, but the Windows 2008 guests suffered from extremely choppy network connections (including lost connections) for common services like Remote Desktop and backups.  As you probably know, I’m big into actually investigating the underlying cause of a problem rather than randomly throwing darts at it, so I grabbed some packet traces with Wireshark.  Check this out:

    Read more »

  • 06Jul

    The fine folks at Twitter Engineering recently posted about the performance issues they have had over the holiday weekend. Since Saturday, the site has been slow for users and API calls. While AppliedTrust hasn’t (yet) made the leap to Twitter, we recognize how important it is for delivering World Cup news. I give Twitter Engineering tons of credit for being so transparent about the details of the problem – they say:

    In brief, we made three mistakes:
    * We put two critical, fast-growing, high-bandwidth components on the same segment of our internal network.
    * Our internal network wasn’t appropriately being monitored.
    * Our internal network was temporarily misconfigured.

    Twitter is well known for great application-layer monitoring and instrumentation, so this gap in monitoring is a surprise. It exposes a common misconception among social software companies: that their server and network infrastructure is “covered” by their hosting provider.  As web applications scale to even 1/1000 the size of Twitter, software becomes critically dependent on the underlying network. Infrastructure should be instrumented and monitored at least as closely as the software that depends on it.

  • 27Jul
    Author: trent Categories: Ramblings Comments: 1

    I’m honored to know Todd Vernon, CEO of Lijit.  His blog is not only entertaining, but it’s always “dead on” right, in an eerie sort of way. As a security guy I was super-entertained by Todd’s analysis of the Clear shutdown a few weeks ago.  Today, Todd has an excellent post titled “Does Startup Location Matter?”

    This is a topic that’s near and dear to my heart, as those of you who have visited our Boulder offices know.  In 2003, we purchased the 3rd floor of the Columbine building in downtown Boulder (above Amante Coffee).  It was a bold and expensive move for us at the time, and possibly one of the best decisions we’ve ever made.  We’re very proud of our offices.  They’re in a great location, and are definitely a bonus when it comes to recruiting the highly talented engineers that we look for.  I couldn’t agree more with Todd’s point that being close to the action makes a big difference.

    Read more »

  • 12Apr
    Author: ben Categories: Infrastructure Comments: 4

    At AppliedTrust we run Nagios, the excellent open source monitoring system, to ensure the availability of our internal system infrastructure and that of our clients. The system monitors roughly 2,740 services on 468 hosts. It has performed flawlessly for the past six or seven years, but over the last few months we began experiencing performance problems (see the 12 month CPU usage graph below). In particular, we were being alerted regularly about excessive load on the monitoring system and, as a consequence of the high load, false-positive timeouts when connecting to services.

    12 month CPU usage
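
    To put the size of that installation in perspective, here is a rough back-of-the-envelope sketch in Python. The service and host counts come from the post; the 5-minute check interval is purely an assumption for illustration (the post does not say what interval is used), and `checks_per_second` is a hypothetical helper, not part of Nagios.

```python
def checks_per_second(services, interval_seconds):
    """Average number of service checks that must start each second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return services / interval_seconds


SERVICES = 2740         # from the post
HOSTS = 468             # from the post
ASSUMED_INTERVAL = 300  # assumption: every service checked once per 5 minutes

rate = checks_per_second(SERVICES, ASSUMED_INTERVAL)
print(f"{SERVICES / HOSTS:.1f} services per host on average")
print(f"{rate:.1f} checks per second at a {ASSUMED_INTERVAL}-second interval")
```

    Under that assumed interval the scheduler has to start roughly nine checks every second, around the clock, which gives a feel for why load on the monitoring host itself can become the bottleneck.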


    Read more »

  • 15Mar
    Author: ned Categories: Infrastructure Comments: 2

    One of the questions we often help clients answer is: which EC2 instance size provides the best performance-per-cost for a given application? I recently did some load testing with a few different sample web configurations, including a “stock” Drupal installation… here are the results:

    Read more »

  • 26Dec
    Author: trent Categories: Infrastructure Comments: 3

    If you’ve ever been to the doctor and had an EKG test done, you know that they get super-excited about every little spike on the EKG trace.  The cardiologist stands there and “oohs” and “ahhs” over the slightest deviation from what they know to be normal.  To me, it just looks like a squiggly line.

    In the network performance world, packet traces are the EKG equivalent for examining network health.  Anyone who’s worked with me over the last 20 years knows that the first thing I want to see when someone says they have a network performance problem is a packet trace.  The really good news is that most network engineers (and even some system administrators) are able and willing to use a tool like Wireshark (formerly Ethereal) or tcpdump to capture a trace.  Sadly, my experience is that once they have the trace, most folks don’t know how to “read” it: it’s the same squiggly line problem, just in the network space.
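
    As one small illustration of what “reading” a trace can mean, the Python sketch below pulls the timestamps out of a capture file and flags long pauses between consecutive packets. It is only a sketch under stated assumptions: it handles just the classic little-endian pcap format with microsecond timestamps (not pcapng), the 0.5-second threshold is arbitrary, and `build_sample_capture` is a hypothetical helper that fabricates a tiny capture so the example runs on its own. Real analysis would normally be done in Wireshark or tcpdump.

```python
import struct

# Classic pcap layout (little-endian, microsecond timestamps).
GLOBAL_HEADER = struct.Struct("<IHHiIII")  # magic, major, minor, thiszone, sigfigs, snaplen, linktype
RECORD_HEADER = struct.Struct("<IIII")     # ts_sec, ts_usec, captured length, original length
MAGIC_USEC = 0xA1B2C3D4


def packet_timestamps(data):
    """Return the capture timestamp (seconds, as float) of every record."""
    if len(data) < GLOBAL_HEADER.size:
        raise ValueError("too short to be a pcap file")
    if GLOBAL_HEADER.unpack_from(data, 0)[0] != MAGIC_USEC:
        raise ValueError("not a little-endian, microsecond-resolution pcap file")
    stamps = []
    offset = GLOBAL_HEADER.size
    while offset + RECORD_HEADER.size <= len(data):
        ts_sec, ts_usec, captured_len, _ = RECORD_HEADER.unpack_from(data, offset)
        stamps.append(ts_sec + ts_usec / 1_000_000)
        offset += RECORD_HEADER.size + captured_len  # skip the packet bytes
    return stamps


def find_stalls(stamps, threshold=0.5):
    """List (packet index, gap in seconds) for packets that arrived late."""
    return [
        (i, later - earlier)
        for i, (earlier, later) in enumerate(zip(stamps, stamps[1:]), start=1)
        if later - earlier > threshold
    ]


def build_sample_capture(times):
    """Hypothetical helper: fabricate a capture of 4-byte dummy packets."""
    out = GLOBAL_HEADER.pack(MAGIC_USEC, 2, 4, 0, 0, 65535, 1)
    for sec, usec in times:
        out += RECORD_HEADER.pack(sec, usec, 4, 4) + b"\x00" * 4
    return out


sample = build_sample_capture([(100, 0), (100, 10_000), (101, 500_000), (101, 520_000)])
for index, gap in find_stalls(packet_timestamps(sample)):
    print(f"packet {index} arrived after a {gap:.2f}s pause")
```

    On the fabricated sample this flags a single pause of about a second and a half before the third packet (index 2). A gap like that is only a starting point: the interesting part is looking at the packets on either side of it to see who stopped talking and why.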

    At some level, extracting useful data from a packet trace is something that comes with experience, and perhaps is a bit of an art.  There are (literally) hundreds of interesting conditions that a packet trace can indicate, prove, or disprove.  But success with packet trace analysis usually boils down to three things:

    Read more »

  • 19Dec
    Author: ben Categories: Infrastructure Comments: 0

    Every time I complete a large or complicated project, I try to consider a few lessons learned. Having recently been involved in a detailed technical performance assessment, I thought I’d share some generic thoughts here in the hope that they help someone else. Read on for tips that might make your next performance assessment a success.

    Read more »