Network performance monitoring has become more complex now that companies have more workloads in the cloud, and network teams are finding visibility into the cloud isn’t on par with what they have into their on-prem resources.
Migration to the cloud introduced infrastructure that isn’t owned by the organization, and a pandemic-driven surge in remote work is accelerating the shift to the cloud and an associated increase in off-premises environments. Container-based applications deployed on cloud-native architectures further complicate network visibility. For these reasons and more, enterprises need tools that can monitor not only the data center and WAN but also the internet, SaaS applications and multiple providers’ public cloud operations.
“Only 36% of network operations professionals believe that their network management tools are as good at managing cloud networks as they are at managing on-prem networks,” says Shamus McGillicuddy, a vice president of research at Enterprise Management Associates (EMA). “At the same time, the average enterprise can attribute about 40% of its network traffic to the cloud at this point. So that’s a huge disadvantage.”
How did these visibility gaps arise? Network teams were often on the sidelines as enterprises began deploying workloads to the cloud.
“One of the problems is that the network infrastructure team doesn’t always have the same authority over the cloud environment as it does over the on-prem network,” McGillicuddy says. “A lot of times the cloud adoption was led by an application team or a line of business, and they looked at the cloud as an alternative to IT, not necessarily an extension of it.”
“The teams that do have more authority in the cloud don’t always think it’s important to have network monitoring. They’re more interested in application performance monitoring,” McGillicuddy says. “They don’t see the point in devoting their budget to stuff that they consider to be like old-world infrastructure monitoring.”
How companies view the role of network engineers in the cloud makes all the difference, says Dan Rohan, product manager at network visibility and performance management vendor Kentik.
“When we started talking about monitoring the cloud two or three years ago, I don’t think very many network engineers cared,” Rohan says. As cloud deployments started to mature, and companies took a hard look at cloud costs, performance, and controls, they realized they needed to put some structure back into place, Rohan says, “and then suddenly, network engineers had a role to play again.”
What today’s network performance management tools can do
Typical cloud vendor networks are incredibly complex. “It’s not uncommon today to have 15 hops between you and the cloud provider across your ISP, maybe a local carrier, and then maybe a Tier 1 carrier. And then you’ll go through another 30 hops inside the cloud provider,” says Matt Stevens, president and CEO of AppNeta. “So the days of 10 to 20 hops total have now exploded to 40 or 50 Layer 3 network hops. Each one does its own thing to your performance.”
As network complexity goes up, so does the potential for problems, Stevens says. “When you have multiple employees running multiple applications, and those applications are hosted from multiple sources, whether it be your private data center, a virtual data center that your organization is trying to run as a cloud, a fully public cloud, or something in between—the very definition of hybrid IT—every time you add one more variable, the complexity goes up [exponentially].”
Network teams are turning to vendors for help. According to EMA, 57% of network teams have acquired specialized tools to close gaps in cloud-networking visibility. The research firm expects network-performance management tools to provide cloud monitoring through some combination of:
Collecting metrics from virtual network elements deployed in the cloud
Collecting flow logs and other telemetry offered by cloud providers
Collecting network traffic data in the cloud, such as packet flows
Analyzing synthetic traffic directed at SaaS services
Traditional network-management tools were designed to monitor the health of routers and switches in a data center or an on-prem network, but the cloud poses different challenges, Rohan says. “Network engineers don’t have a picture of [the cloud infrastructure] in their head because it’s growing fast, and it wasn’t built by them, and it’s changing all the time, because it’s the cloud. So they’re starting off with that kind of handicap,” he says.
They need different tools to solve the problems that crop up—an application team that can’t get its new cloud application to talk to an on-prem database, or integrate with another cloud app, for instance.
“Network teams would turn to these tools that were just pulling data out of AWS’s API, or any one of the cloud provider’s APIs. But that doesn’t tell me about connectivity failures. It doesn’t tell me why things aren’t working. And so we started there,” Rohan says. “We think the thing that really helps network people in the cloud today is helping to answer those connectivity questions across complex topologies.”
Kentik’s tool can provide network engineers with a picture of the current network, “the thing that they inherited,” Rohan says. “That helps them visualize the flows – the good and the bad. And they can say, ‘Okay, if we install a transit gateway here, and a peering connection here…’ and use their network skills, and they can actually use our tool to wrest control of their networks.”
Network metrics for cloud visibility
Telemetry data that can reveal the state of hybrid-cloud networks comes from all kinds of networks—data center, WAN, internet, cloud, mobile, edge—and from all types of network elements, including physical and virtual appliances, and dedicated or cloud-native devices.
Data is pulled from data-center components, cloud infrastructure (such as service meshes, transit and ingress gateways), internet infrastructure, campus devices, traditional WAN routers and switches, SD-WAN gateways, and IoT endpoints, to name a few. Telemetry types can include flow data exported from network devices (flow collection standards such as NetFlow, J-Flow, sFlow, the IETF’s IPFIX); cloud providers’ virtual private cloud flow logs; SNMP-based device telemetry; and event notifications sent via syslog or SNMP trap.
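To make one of these telemetry types concrete, here is a minimal Python sketch that parses records in the default (version 2) format AWS documents for VPC flow logs and counts rejected flows per destination. It is not drawn from any vendor's product. The function names and sample records are illustrative assumptions, and other cloud providers use different log formats.

```python
from collections import Counter

# Field order of the default (version 2) AWS VPC flow log record.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_record(line):
    """Split one space-separated record into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def count_rejects(lines):
    """Count REJECT records per (destination address, destination port)."""
    rejects = Counter()
    for line in lines:
        record = parse_flow_record(line)
        # NODATA and SKIPDATA records carry "-" placeholders, so skip them.
        if record["log_status"] != "OK":
            continue
        if record["action"] == "REJECT":
            rejects[(record["dstaddr"], int(record["dstport"]))] += 1
    return dict(rejects)


# Made-up sample records, for illustration only.
SAMPLE_LINES = [
    "2 123456789010 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789010 eni-0a1b2c3d 10.0.1.5 10.0.3.7 49154 443 6 10 5000 1700000000 1700000060 ACCEPT OK",
]
print(count_rejects(SAMPLE_LINES))
```

A tally like this only hints at where traffic is being blocked. Explaining why still requires the topology context that the tools described in this article aim to provide.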
Along with passive monitoring data, such as network flows and packets, network teams are increasingly turning to active monitoring techniques, such as basic ping tests and Layer 7 synthetic monitoring, to augment traditional infrastructure and traffic monitoring metrics, according to EMA. The research firm finds that 21% of network teams are using synthetic traffic tools for sustained network-availability and performance monitoring.
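As a rough illustration of active monitoring, the Python sketch below times repeated probes against a target and records which attempts failed. It uses a plain TCP connection attempt as a stand-in for a ping or Layer 7 check. The function names, timeout and attempt count are assumptions chosen for the example, not how any commercial synthetic-monitoring product works.

```python
import socket
import time


def tcp_connect_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def run_synthetic_check(probe, attempts=5):
    """Call probe() repeatedly and time each call.

    Returns one entry per attempt: elapsed milliseconds when the probe
    succeeded, or None when it failed.
    """
    results = []
    for _ in range(attempts):
        started = time.perf_counter()
        ok = probe()
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        results.append(elapsed_ms if ok else None)
    return results
```

A check against a SaaS endpoint might then look like `run_synthetic_check(lambda: tcp_connect_probe("example.com", 443))`, with the hostname swapped for the service being watched. Passing the probe in as a callable keeps the timing loop independent of the protocol being tested.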
It’s not that enterprises haven’t monitored these networks and devices before; rather, the goal is to provide coordinated monitoring across a variety of networks, a unified view of the results, and the ability to integrate analytic findings with automated workflows. Tooling is going beyond the core infrastructure monitoring to provide more application-level views and insight into the application performance that end users are experiencing.
Who’s selling network-performance management tools?
The product landscape for network performance management is crowded. Vendors include Accedian, AppNeta, Cisco-ThousandEyes, cPacket Networks, Kentik, LogicMonitor, ManageEngine, Riverbed and SolarWinds. There’s no one vendor that covers all the bases, and many of the tools are complementary rather than competitive—a typical IT organization uses between four and 10 tools to monitor and troubleshoot its network, EMA finds.
Research firm Gartner, in its Market Guide for Network Performance Monitoring, says tools that are ideal for on-premises environments become less effective as organizations become increasingly hybrid. While some vendors can provide visibility across both on-premises and cloud environments, that is challenging due to data-transport requirements and differing networks, which can’t always be viewed through the same lens, Gartner says.
Among its recommendations for enterprises seeking network performance-management tools, Gartner advises companies to “resist the desire to use the same monitoring approach in the cloud as your on-premises environment, especially when it comes to packet capture and analysis. Focus on vendors that provide support for cloud-native functions, such as APIs or true network-flow data.”
Adding AI to network troubleshooting
There’s no lack of telemetry data to analyze. What distinguishes modern network monitoring tools is their ability to measure performance and put the findings in a context that answers the questions network teams are being asked.
“This move to hybrid cloud, it’s not really about ‘is it working or not working? Is it up or is it down?’ It’s the idea that ‘slow’ is the new ‘down,’” says AppNeta’s Stevens. Users aren’t calling to say they can’t connect to Salesforce, for example. They’re complaining that a script in Salesforce is running slow and impacting their ability to do their job, he says.
“Regardless of the architecture being deployed, we’re going to give the business the visibility to understand, ‘Here’s the performance I need. Here’s the performance I’m getting. Is that gap so big that I need to take action, or can I set it to the side and go work on another problem?’” he says.
That’s where artificial intelligence comes into play. Tools increasingly support AI-based diagnostics that are designed to find patterns in network data and draw conclusions from them based on historical anomaly detection and root-cause analysis.
“We don’t just tell you there’s a problem, we tell you where it is. We tell you why it is, we give a remediation suggestion, and we also give you a confidence score” that quantifies how likely it is that the proposed remediation will work, Stevens says.
Having tooling that can give network teams the confidence to understand the issues and prioritize remediation gives IT credibility at a time when companies are undertaking major business transformation projects, Stevens says. “These are big projects that touch a lot of people, and IT is being asked to be a business partner.”
Scott Bulger, a senior systems network engineer who spent more than 30 years of his career working with network vendors and corporate IT networks, has spent the last three years working with AppNeta’s technology at two large enterprises.
“Visibility into the cloud infrastructure is minimal, and so the ability to track end-to-end packet loss, jitter and latency, into the service provider cloud and back, gives you the autonomy and the validity to say to the cloud provider, ‘we have packet loss.’ You have hard, substantial evidence, and it’s irrefutable,” Bulger says.
For Bulger, the metric he’s most concerned about is packet loss. While TCP/IP-based networks were designed to accommodate loss, “there’s a point—above 4% or 5%, depending on your topology—at which loss starts to become noticeable and impactful to the end users. So some loss is acceptable, but a significant loss, or loss for extended periods of time, is impactful,” he says.
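The three metrics Bulger mentions can be derived from a series of probe results. The Python sketch below is a simplified illustration: it treats a failed probe as a lost packet, computes jitter as the mean absolute difference between consecutive latencies (simpler than the smoothed estimator defined in RFC 3550), and uses a 5% loss threshold taken from the upper end of the range Bulger cites. The names and the threshold default are assumptions, and the right threshold depends on the topology.

```python
from statistics import mean

# Assumption: upper end of the 4% to 5% range quoted above.
LOSS_THRESHOLD_PCT = 5.0


def summarize(samples):
    """Summarize probe results.

    samples: list of round-trip times in milliseconds, with None marking
    a probe that got no response (counted here as a lost packet).
    """
    if not samples:
        raise ValueError("no samples to summarize")
    received = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(received)) / len(samples)
    avg_latency_ms = mean(received) if received else float("nan")
    # Simplified jitter: average change between consecutive received probes.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0
    return {
        "loss_pct": loss_pct,
        "avg_latency_ms": avg_latency_ms,
        "jitter_ms": jitter_ms,
    }


def loss_is_impactful(summary, threshold_pct=LOSS_THRESHOLD_PCT):
    """Return True when measured loss exceeds the chosen threshold."""
    return summary["loss_pct"] > threshold_pct
```

Tracking these figures over time, rather than from a single run, is what allows a team to tell a brief blip apart from the sustained loss Bulger describes.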
In the big picture, network visibility tools can not only help identify problems but also help avert performance problems altogether. “These platforms give you visibility to problems before they impact your customer,” Bulger says.
However, moving from a reactive to proactive posture isn’t easy. “If your DevOps or help-desk model is saturated supporting immediate problems, you don’t get much bandwidth for people saying, ‘here’s something that’s a little bit broken, but it could be a lot broken if we don’t do something about it,’” Bulger says.
“We need a culture that prioritizes proactive remediation,” he says. “The managers who get it are completely on board and never hesitate to fund it.”