Troubleshooting SDN routing highlights need for new tools

Troubleshooting SDN routing issues when routes change quickly in real time requires a new kind of network monitoring tool that can virtually roll back routing changes.

Two years ago, I wrote that troubleshooting for software-defined networking would require a packet time machine to untangle the complex, transient topology of rapidly changing dynamic networks. In addition, enterprises are accelerating the cloud transition with hybrid networks that make us even more dependent on routing we can't easily control inside service provider networks.

As the network fabric regularly reconfigures itself on both sides of the firewall, how do network engineers troubleshoot transient, route-flapping or recent-past issues? New tools for SDN routing are emerging to tackle the problem, but they're not like anything we've used before.

Spooky network action at a distance

A key principle of science is reproducibility, with independent researchers following the same procedure to obtain the same results under similar conditions. And if anything in networking is scientific, it's the command line. Its functional limitations enforce strict usage and, while not entirely deterministic, repeated commands produce consistent results. The command line is also relatively expensive to use, which enforces the stability we rely on when trying to understand why a network misbehaved at a certain point in time. The question, "Why would Mike change that firewall rule?" is reasonable and helpful when reverse-engineering Mike's changes during the overnight shift.

And Mike's, Kirstin's and my expensiveness comes from the inefficiency of skilled administrators hunched over laptops at 3 a.m. for maintenance windows. Network configuration by command line is laborious, error-prone and effectively caps maximum network change rate.

As a side effect, low cardinality lets our brains build serviceable topology models. We remember the links and nodes in our critical routes because we built them. When service goes sideways, we remember the likely missed feature that caused it and, more importantly, the node where the change was made. We connect to that machine through the command-line interface (CLI), repair it and close the ticket as fixed.

Troubleshooting unfathomable routes

The promise of SDN is a double-edged sword, in that it has essentially no barrier to change. In SDN routing, when adding a preferred next-hop route to one router takes the same effort as adding it to 100, and when GUI admins create multihomed connections in seconds, the change-cost barrier disappears.

Let's not forget that IT loves making changes when it's trivial. How often did you reconfigure physical servers before VMware? Now, how many virtual machine (VM) changes do you make before lunch? SDN brings the same ability to networks.

And like troubleshooting a VM OS issue four hours after the guest is vMotioned to another host, we're now tracing network routes that may change every few hours. It's not enough to look at the network as it is -- our network troubleshooting tools will need to let us virtually roll back routing changes in time and troubleshoot routes that might have only existed for a few minutes. It's a problem carrier networks have had for a decade; now we get to enjoy it, too.
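One way to picture "virtually rolling back" routing changes is as replaying a log of route changes up to a chosen moment. The Python sketch below is only an illustration under that assumption, not how any particular product works; `RouteChange`, `table_at`, the prefixes and the router names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteChange:
    timestamp: float          # seconds since an arbitrary start (assumed unit)
    prefix: str               # destination prefix, e.g. "10.1.0.0/16"
    next_hop: Optional[str]   # None means the route was withdrawn


def table_at(log, when):
    """Rebuild the prefix -> next-hop table as it stood at time `when`."""
    table = {}
    for change in sorted(log, key=lambda c: c.timestamp):
        if change.timestamp > when:
            break  # ignore everything that happened after the chosen moment
        if change.next_hop is None:
            table.pop(change.prefix, None)
        else:
            table[change.prefix] = change.next_hop
    return table


# Hypothetical change log: a route that existed for only four minutes.
log = [
    RouteChange(100.0, "10.1.0.0/16", "r1"),
    RouteChange(160.0, "10.1.0.0/16", "r7"),   # short-lived change
    RouteChange(400.0, "10.1.0.0/16", "r1"),   # reverted
]

print(table_at(log, 200.0))   # the table while the short-lived route was active
print(table_at(log, 500.0))   # the table as it looks "now"
```

Looking only at the current table would show `r1` and hide the fact that traffic briefly went through `r7`; replaying the log to the time of the complaint is what exposes it.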

A visual approach to SDN routing tools

Emerging network tools focus on discovering and monitoring paths. Paths aren't routes in the traditional sense because they're four-dimensional. A path is a pair of traffic endpoints and all of the possible routes that packets can reasonably be expected to traverse, but captured and suspended in time. Because of path complexity, especially for internet routing, these network troubleshooting tools aren't the typical aggregating, drill-down-to-detail dashboards we can drive in our sleep.
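As a rough mental model of such a path, consider a pair of endpoints plus a time-ordered series of snapshots, each holding the set of routes seen at that moment. The sketch below assumes that model; `PathHistory`, its methods and the endpoint and hop names are invented for illustration and do not come from any specific tool.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class PathHistory:
    src: str
    dst: str
    _times: list = field(default_factory=list)    # snapshot timestamps, ascending
    _routes: list = field(default_factory=list)   # parallel list of route sets

    def record(self, timestamp, routes):
        """Store a snapshot of every route observed between src and dst."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("snapshots must be recorded in time order")
        self._times.append(timestamp)
        self._routes.append(frozenset(tuple(r) for r in routes))

    def at(self, timestamp):
        """Return the route set in effect at `timestamp` (empty if none yet)."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._routes[i - 1] if i else frozenset()


# Hypothetical path with two candidate routes, one of which later disappears.
path = PathHistory("branch-office", "saas-app")
path.record(0, [["a", "b", "d"], ["a", "c", "d"]])
path.record(300, [["a", "b", "d"]])

print(path.at(100))   # both routes were possible at t=100
print(path.at(301))   # only one route remains after the second snapshot
```

The time index is the "fourth dimension": the same endpoint pair answers differently depending on when you ask.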

These new SDN routing tools are interactive, with browsing and contextual traversal front and center. By rolling connection visualizations backward and forward to compare snapshots in time, they reveal subtle SDN routing performance nuances as the network is reconfigured. They identify the cause of packet loss from single misconfigured links in a list of hundreds, even if SDN only instantiated a virtual link for a few minutes. They know the difference between normal overall path latency and the normal behavior of the path's intermediate hops. That's important because it surfaces issues in complex networks with large light-time delays that stretch overall latency.
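Two of these ideas can be sketched in a few lines: diffing two snapshots of a route to see which links changed, and judging each hop against its own baseline rather than against end-to-end latency. The example below is an assumption-laden illustration; the function names, the three-standard-deviation threshold and the latency figures are made up, and real tools likely use more robust statistics.

```python
from statistics import mean, stdev


def links(route):
    """Break a route (a sequence of hops) into its individual links."""
    return set(zip(route, route[1:]))


def changed_links(before, after):
    """Return (added, removed) links between two snapshots of a route."""
    old, new = links(before), links(after)
    return new - old, old - new


def slow_hops(baseline, current, threshold=3.0):
    """Flag hops whose latency exceeds their own baseline by `threshold` std devs.

    `baseline` maps hop -> list of past latency samples in ms (at least two each);
    `current` maps hop -> latest latency in ms.
    """
    flagged = []
    for hop, samples in baseline.items():
        mu, sigma = mean(samples), stdev(samples)
        if hop in current and current[hop] > mu + threshold * sigma:
            flagged.append(hop)
    return flagged


added, removed = changed_links(["a", "b", "d"], ["a", "c", "d"])

# Hypothetical path with one long-haul hop that dominates total latency.
baseline = {
    "edge": [2.0, 2.2, 1.9, 2.1],
    "satellite": [270.0, 268.0, 272.0, 271.0],
}
current = {"edge": 9.0, "satellite": 271.0}

print(added, removed)
print(slow_hops(baseline, current))
```

In this made-up data, end-to-end latency rises only from roughly 272 ms to 280 ms, which is easy to miss, while the `edge` hop has more than quadrupled against its own history. That is the case per-hop baselines are meant to catch.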

Two years ago, I wondered how vendors hoped to monitor real-world performance and topology of SDN as it programmatically modifies itself, perhaps hundreds of times a day. Moreover, I've worried we won't get the application point of view that goes beyond the vRoutes and vLinks we craft in our SDN controllers. Finally, we're seeing something new out of labs, maybe even a bit revolutionary (at least for networking).

Perhaps we've reached a practical operations limit where additional automation is impossible without monitoring tools for visualizing the complexity SDN creates. This year may turn out to be a great year for routing wonks -- whether software actuated or CLI configured, or in our data centers, in the cloud or both.
