It has been interesting to tell people why network analysis is important. We go through some examples, but they often get hung up on thinking about the problems we describe (and that NetMRI detects). For a business person, the problems often don’t mean much – what’s the business impact?
For the network engineer, the problems are interesting, but need to be related to the business in order to communicate the importance to the business people.
While each problem is numbered, the numbers themselves don’t indicate relative ranking. They are simply a means by which we can reference them.
1. Configuration not saved:
Reboot will cause the new configuration to be lost. If a network device reboots after a power outage, for example, the operation of the network changes because the old saved configuration replaces the new, unsaved one.
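As a rough illustration of how such a check could work, the sketch below compares the text of a running configuration against the saved one and reports the lines that differ. It assumes both configurations have already been retrieved as plain text and that lines starting with "!" are comments (an IOS-style convention); the function name is hypothetical and this is not a description of how NetMRI does it.

```python
import difflib

def unsaved_changes(running, startup):
    """Return the lines that differ between the running and saved configs.

    Lines prefixed '+ ' exist only in the running config (they would be
    lost on reboot); lines prefixed '- ' exist only in the saved config.
    """
    def normalize(text):
        # Assumption: blank lines and '!' comment lines carry no meaning.
        return [ln.rstrip() for ln in text.splitlines()
                if ln.strip() and not ln.lstrip().startswith("!")]

    diff = difflib.ndiff(normalize(startup), normalize(running))
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [ln for ln in diff if ln.startswith(("+ ", "- "))]

running = "hostname r1\n!\ninterface Gi0/1\n description uplink\n"
startup = "hostname r1\ninterface Gi0/1\n"
print(unsaved_changes(running, startup))
```

An empty result means the two configurations match, so a reboot would not change the device's behavior.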
2. Saved configurations don’t meet corporate policy:
Source of many problems, from performance to reliability to security. Corporate policy may be due to regulatory policies (PCI, HIPAA, SOX), or may be based on accepted best practices. Checking that they are consistently applied across hundreds of routers and switches is nearly impossible to do with manual processes.
3. Bloated firewall rule set, unused ACL entries:
Poor firewall performance; Open, unused rules that create potential security problems. Identifying unused firewall rules makes rule sets much easier to understand and maintain, and shows which rules can be safely removed, resulting in improved network security.
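One simple way to find removal candidates is to look at per-rule hit counters. The sketch below assumes the hit counts have already been collected into records; the AclEntry type and its field names are hypothetical. Counters usually reset on reboot or when cleared, so a zero count only means something if the observation window is long enough.

```python
from dataclasses import dataclass

@dataclass
class AclEntry:
    acl: str    # name of the access list or rule set
    line: int   # position of the entry within the list
    rule: str   # rule text, kept for reporting
    hits: int   # match counter observed over the collection window

def unused_entries(entries, min_hits=1):
    """Return entries whose hit count is below min_hits."""
    return [e for e in entries if e.hits < min_hits]

entries = [
    AclEntry("OUTSIDE_IN", 10, "permit tcp any host 192.0.2.10 eq 443", 18234),
    AclEntry("OUTSIDE_IN", 20, "permit tcp any host 192.0.2.11 eq 21", 0),
]
for e in unused_entries(entries):
    print(f"{e.acl} line {e.line}: {e.rule}")
```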
4. Firewall connection count exceeded:
New connections via the firewall fail; Business applications exhibit intermittent failure at high firewall loads; VPNs begin to fail. When the connection count of a busy firewall is exceeded, new connections are refused. The applications experience intermittent network connectivity as the connection count is exceeded and then drops, making the problem difficult to troubleshoot.
5. Link hog – downloading music or videos:
Slower application response, impacting user productivity. When one application or user is consuming most of the bandwidth on a link, it impacts the other applications and users of that link. NetMRI uses Getflow to immediately collect NetFlow data on a link that’s suddenly running at high utilization to identify applications and users of the link, allowing the network engineer to quickly understand the cause of the slowdown to other applications and take action if necessary.
6. Interface traffic congestion:
Unpredictable application performance, impacting user productivity. When a router interface is congested, it starts discarding packets, so monitoring packet discards is an early indicator that the applications using the link need more bandwidth, or that a rogue application is now consuming bandwidth that’s needed by business applications.
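Discards are normally exposed as ever-increasing counters, so the useful number is the change between two polls. A minimal sketch, assuming each sample is a pair of (packets offered to the interface, packets discarded) and that a decreasing counter means a wrap or reset; the function name and sample layout are illustrative only.

```python
def discard_ratio(prev, curr):
    """Fraction of offered packets discarded between two counter samples.

    prev and curr are (offered_packets, discarded_packets) tuples.
    Returns None when a counter went backwards (wrap or device reset).
    """
    d_offered = curr[0] - prev[0]
    d_discards = curr[1] - prev[1]
    if d_offered < 0 or d_discards < 0:
        return None
    if d_offered == 0:
        return 0.0
    return d_discards / d_offered

# 1000 packets offered between polls, 50 of them discarded.
print(discard_ratio((1000, 10), (2000, 60)))
```

Tracking this ratio per polling interval, rather than the raw counter, makes it easier to see when congestion starts.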
7. Link problems & stability:
Physical or DataLink errors cause slow or intermittent application performance; Link or interface stability can impact routing and spanning tree (see 13, 14, 15, 16, 20). Whenever a link has high errors or is unstable, applications will have problems making effective use of the link. When routing or spanning-tree protocols are impacted, the effects may spread to other parts of the network, depending on the network’s design.
8. Environmental limits exceeded:
Fan failure, power supply problems, and high temperatures are indicators of problems that will likely cause a network device to reboot, affecting any applications relying on the device. Identifying and correcting environmental problems will make the network, and the applications that depend on it, more reliable.
9. Memory utilization increasing:
A bug in the device’s operating system is consuming more memory and when no free memory exists, the device will reboot, disrupting applications that are transiting the device. Imagine troubleshooting a network problem that occurs every two weeks as the device runs out of memory and reboots. We’ve seen this happen in production networks. The business impact depends on how often it occurs and what applications are affected.
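A slow leak like this can be spotted by fitting a straight line to free-memory samples and projecting when it reaches zero. The sketch below uses an ordinary least-squares fit; the sample format (hours, free bytes) and the function name are assumptions, and real memory usage is rarely perfectly linear, so treat the result as a rough estimate.

```python
def hours_until_exhaustion(samples):
    """Estimate hours from the last sample until free memory reaches zero.

    samples is a list of (hour, free_bytes) pairs. Returns None when there
    are too few samples or free memory is not trending downward.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: change in free bytes per hour.
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope  # hour at which the fitted line hits zero
    return t_zero - samples[-1][0]

print(hours_until_exhaustion([(0, 1000), (1, 900), (2, 800)]))
```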
10. Incorrect serial bandwidth setting:
Causes routing protocols to make non-optimum routing decisions. If the bandwidth is too low, it can affect the operation of the routing protocol itself, making routes unstable. Remote branches will experience unreliable application operation, which will be difficult to troubleshoot because you’ll have to catch it when it is happening. As applications begin using more link bandwidth, the routing protocol can become unstable.
If you need to alter network traffic paths, use policy based routing mechanisms instead of changing link bandwidth parameters. Also make sure tunnels have accurate bandwidth settings.
11. No QoS:
Important business applications are not prioritized, yielding unpredictable or poor performance during times of interface congestion. Applications like VoIP or SAP are susceptible to high jitter and packet loss when QoS is not used. Configurations that match corporate policy for QoS deployment are important (see 2).
12. QoS Queue Drops:
Important business applications are slow; Business needs have changed since the queue definitions were created. A network design for four concurrent VoIP calls will not perform well when more people are hired and the number of concurrent calls increases. Similar conditions exist for other applications. Queue drops are an early indicator of potential problems that require a network change.
13. Route flaps:
Poor application performance as packets take the wrong or inefficient paths in the network. It may be caused by unstable links or improperly configured routing protocol timers (see 2, 7). Packets may also arrive out of order, which some applications cannot tolerate. Varying paths will also cause high jitter, which affects time sensitive applications like VoIP and SAP. Studies have shown that people can deal with relatively high delay as long as the delay is consistent. But high variance in application response will drive people crazy.
14. OSPF recalculations high:
Routing protocol unstable; poor and inconsistent application performance. Link stability, link errors, or spanning tree stability can cause an OSPF topology to be unstable (see 7, 20). The routing protocol may intermittently select non-optimum paths (see 13). Applications experience high jitter or loss of connectivity if routes are flapping as a result.
15. Poor VoIP quality:
Due to high jitter, delay, or packet loss; Choppy voice calls; Calls mysteriously disconnect. The root cause of poor VoIP quality can be many other problems. By monitoring delay, jitter, and packet loss, you can reduce the set of possible problems to examine. By identifying the range of phones that are reporting poor statistics, you can better identify the potential source of the problem.
16. Routing Neighbor changes high:
Access via this router is negatively affected by a high number of neighbor changes (BGP, OSPF, EIGRP). Similar to problems 13 and 14, something is causing the neighbor relationships to change regularly, which affects the stability and reliability of the routing protocol. As a result, applications can experience high jitter or packets arriving out of order. Finding and fixing the cause of the neighbor changes will result in a more stable and efficient network.
17. OSPF area not connected to backbone:
The disconnected OSPF area will not be reachable from other OSPF areas, impacting applications that need to communicate between areas. OSPF inter-area routing relies on connectivity through the backbone area (area 0). When an area is disconnected from the backbone, communication within the area works, but communication between systems in that area and systems in other areas will not work (the inter-area routes don’t exist). Users and systems within the area will report what seems to be intermittent connectivity, which is based on whether the destination is located within the area or in another area.
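The check itself is simple once you know which areas each router participates in: an area is attached to the backbone only if at least one router (an area border router) is in both that area and area 0. The sketch below assumes that mapping has already been collected; the data layout and function name are hypothetical, and it deliberately ignores virtual links, which can legitimately attach an area through a transit area.

```python
def areas_without_backbone(router_areas, backbone=0):
    """Return the sorted list of areas with no router that is also in the backbone.

    router_areas maps a router name to the set of OSPF areas it belongs to.
    Assumption: virtual links are not modeled.
    """
    all_areas = set()
    attached = set()
    for areas in router_areas.values():
        all_areas |= set(areas)
        if backbone in areas:
            # This router touches the backbone, so every area it is in is attached.
            attached |= set(areas)
    return sorted(all_areas - attached - {backbone})

topology = {
    "core1": {0},
    "abr1": {0, 1},
    "branch1": {1},
    "branch2": {2},  # no router joins area 2 to area 0
}
print(areas_without_backbone(topology))
```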
18. Unidirectional traffic flow:
Typically the result of misconfigured routing, application traffic will be using non-optimum paths, increasing delay and potentially overloading other links and affecting other applications. Sometimes asymmetric routing is desired; however, it increases network complexity and complicates troubleshooting. Servers are often configured with incoming and outgoing interfaces, which may cause unicast flooding, a condition in which frames are sent to all ports in a VLAN. High traffic levels result, impacting the operation of all devices in the VLAN. In routed networks, a measure of zero packets in one direction on a link for long time periods indicates a potential routing misconfiguration.
19. Router interface down:
Any router interface that is administratively up but operationally down is likely to be a redundant connection; if the other connection also fails, the result is an outage that affects all applications that use it. Redundant networks hide first failures, so it is important to identify those failures before a second failure causes an outage. Best practice is to administratively shut down router interfaces that are not supposed to be active, so that any interface in the up/down state indicates something that has failed.
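If unused interfaces are administratively shut down as described, finding hidden first failures reduces to a filter on the two status values. A minimal sketch, assuming interface status has already been collected into dictionaries with the hypothetical keys "name", "admin", and "oper":

```python
def hidden_failures(interfaces):
    """Return names of interfaces that are administratively up but operationally down."""
    return [i["name"] for i in interfaces
            if i["admin"] == "up" and i["oper"] == "down"]

interfaces = [
    {"name": "Gi0/0", "admin": "up", "oper": "up"},
    {"name": "Gi0/1", "admin": "up", "oper": "down"},    # likely a failed redundant link
    {"name": "Gi0/2", "admin": "down", "oper": "down"},  # intentionally shut down
]
print(hidden_failures(interfaces))
```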
20. Unstable root bridge:
Bridge priority not set; applications quit working over unstable VLANs. An inexpensive switch that has the same bridge priority but lower MAC address as the desired root bridge in a spanning tree will try to become the root bridge. But in a busy VLAN, it may not have the backplane bandwidth or CPU to handle the task and may not send BPDUs as frequently as it should (every 2 seconds by default). When several BPDUs are missed, the other switches elect another switch as the root. The STP re-convergence will affect application connectivity. The change is difficult to troubleshoot because the network is working again by the time a network engineer looks at it. Application connectivity seems to be intermittent.
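The election rule behind this problem is that the switch with the numerically lowest bridge ID wins, where the bridge ID is the priority followed by the MAC address. The sketch below models that comparison; the switch names and MAC addresses are made up for illustration, and 32768 is the standard default priority.

```python
def mac_to_int(mac):
    """Convert a MAC address string (':', '-' or '.' separated) to an integer."""
    return int(mac.replace(":", "").replace("-", "").replace(".", ""), 16)

def elect_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) pair.

    bridges maps a switch name to a (priority, mac_string) tuple.
    """
    return min(bridges, key=lambda name: (bridges[name][0],
                                          mac_to_int(bridges[name][1])))

# Both switches left at the default priority: the lower MAC address wins.
defaults = {
    "core-switch": (32768, "00:1b:54:aa:00:01"),
    "cheap-switch": (32768, "00:0c:29:00:00:01"),
}
print(elect_root(defaults))

# Setting a lower priority on the intended root makes the outcome deterministic.
tuned = dict(defaults, **{"core-switch": (4096, "00:1b:54:aa:00:01")})
print(elect_root(tuned))
```

This is why explicitly setting the bridge priority on the intended root (and a backup root) removes the dependence on whichever switch happens to have the lowest MAC address.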
21. Duplex mismatch:
Increasing link errors; Applications get slower as traffic volume increases. CRC errors, late collisions, and FCS errors are indicators of duplex mismatch. A server is installed and ping works, so it is declared functional, but as the traffic to it builds, errors increase. Finger pointing between the network, server, and application teams often results until the duplex mismatch is discovered. Vendor recommendations (Microsoft: fixed full duplex; Cisco: auto-negotiate) exacerbate the problem.
22. Downstream hub or switch:
Unauthorized devices added to the network; Compromise to network integrity and security; See 20. Wireless routers, switches, hubs, and other network devices should be under a common administration in order to provide the best network security. Another switch could have a lower priority, making it the root bridge of a VLAN and causing stability problems (see 20). Rogue DHCP servers in wireless routers can cause intermittent connectivity problems within a subnet, unless specific configurations protect against it.
23. Port in ErrDisable state:
The set of end stations connected via this port are disconnected from the network until the port is enabled (either automatically or by user control). A variety of configuration options allow switch ports to be disabled when certain conditions occur, such as receiving BPDUs or DHCP responses (see 20, 22). Some vendors will disable a port if it experiences too many errors. Automatically identifying these ports can avoid a trouble call from a user or server administrator who is having connectivity problems as a result of a port being disabled.
24. Unbalanced & unused ether-channels:
Increased latency & jitter affecting sensitive applications like VoIP; Compromised redundancy. Packet distribution across an ether-channel may be unbalanced if a non-optimum packet distribution algorithm is selected. By changing the algorithm, the ether-channel packet distribution is more balanced and overall throughput increases. An unbalanced ether-channel will be more easily congested, resulting in application performance that’s less than expected.
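To see why the choice of algorithm matters, consider many clients talking to a single server across a four-link ether-channel. The sketch below uses a toy hash (XOR of the selected address bytes, modulo the number of links); this is an assumption for illustration only, since real switches use vendor-specific hashes, but the effect is the same in kind: hashing on a field that is identical for every flow puts all traffic on one link.

```python
from collections import Counter
from ipaddress import ip_address

def link_for(key, n_links):
    """Toy hash: XOR all key bytes together, then pick a member link."""
    h = 0
    for b in key:
        h ^= b
    return h % n_links

def distribution(flows, n_links, fields):
    """Count flows per member link when hashing on the given fields."""
    counts = Counter()
    for flow in flows:
        key = b"".join(flow[f] for f in fields)
        counts[link_for(key, n_links)] += 1
    return [counts[i] for i in range(n_links)]

# Eight clients (10.0.0.1 to 10.0.0.8) all talking to one server.
server = ip_address("10.0.1.10").packed
flows = [{"src": ip_address(f"10.0.0.{i}").packed, "dst": server}
         for i in range(1, 9)]

print(distribution(flows, 4, ["dst"]))         # destination-only hashing
print(distribution(flows, 4, ["src", "dst"]))  # source and destination hashing
```

With destination-only hashing every flow lands on the same member link, while including the source address spreads the same flows across all four.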
25. HSRP or VRRP peer not found:
Redundancy configured and not operating correctly; Outage when a second failure occurs. A connectivity or application outage may not yet have occurred, because one device in the redundant pair is still running. But the backup device is not known. The cause may be a broken link between the devices, a redundant device that has not yet been installed, or a failure of the redundant device or its interface.
When the second failure in the redundant configuration occurs, a network outage occurs, impacting applications. Knowing that a redundant configuration is not operational allows it to be corrected before important applications are affected.
Identifying and correcting these problems will allow your network to better service your business’ network requirements.