GNS3 and VRRP Timers

June 30, 2015 Brent routing, switching, training gns3, vrrp

While testing out a VRRP solution, I noticed that it was not performing as expected. The VRRP address was unresponsive so I started to investigate. Turning on console logging, I saw a large amount of flapping between Backup and Master states.

...
*Mar  1 02:37:23.739: VRRP: Grp 1 Event - Master down timer expired
*Mar  1 02:37:23.739: %VRRP-6-STATECHANGE: Vl20 Grp 1 state Backup -> Master
*Mar  1 02:37:25.095: %VRRP-6-STATECHANGE: Vl20 Grp 1 state Master -> Backup
...

It turns out that my laptop was a slightly under-powered platform for running 8 routers in GNS3, resulting in a maximum response time of over 2 seconds from a VRRP peer.

Sending 8000, 100-byte ICMP Echos to 10.10.20.1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!..!!!..........................
......................................................!!!!!!!!!!!!!!!!
!!....................................................................
.....................!!!......................!!!!!!!!!!!.!!!!!!!!!!!!
..!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!.!!!!!!!!!!.!!!!!!!!!!.!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.
Success rate is 74 percent (611/818), round-trip min/avg/max = 4/705/1996 ms
Server-A#

After adjusting the advertise timers, everything started to perform as expected.

R1#
interface Vlan20
 ip address 10.10.20.2 255.255.255.0
 vrrp 1 ip 10.10.20.1
 vrrp 1 timers advertise 10
 vrrp 1 priority 110
 
R2#
interface Vlan20
 ip address 10.10.20.3 255.255.255.0
 vrrp 1 ip 10.10.20.1
 vrrp 1 timers advertise 10
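
For context on why a longer advertise timer calms the flapping: per RFC 3768, a backup declares the master down after three advertisement intervals plus a skew time of (256 - priority) / 256 seconds. The sketch below only works through that arithmetic. It assumes R2 runs the default priority of 100 and that the timers were at the 1-second default before the change, neither of which is shown in the configs above.

```python
def master_down_interval(advertise_s, priority):
    """Seconds a VRRP backup waits before taking over (RFC 3768 formula)."""
    skew_time = (256 - priority) / 256
    return 3 * advertise_s + skew_time

# Assumed values: priority 100 (default) on the backup router,
# 1 second as the original advertise timer, 10 seconds after the change.
for advertise in (1, 10):
    interval = master_down_interval(advertise, priority=100)
    print(f"advertise {advertise:>2}s -> master down after {interval:.3f}s")
```

With a 10-second timer the backup tolerates roughly 30 seconds of missing advertisements, which leaves far more headroom for a slow emulated peer.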

CCNP Achieved

February 22, 2015 (updated June 14, 2016) Brent routing, switching ccnp

I passed CCNP Route 642-902 in January, before the exam changed, thus completing all three exams. Route was the most challenging of the three exams for me because I am now taking the lead on projects that involve routing, which is part of why I wanted to pursue the certification. Exciting times, and I’ve started to take a peek at the CCIE 5.0 exam.

Advanced Light Source User Meeting

October 6, 2014 (updated June 14, 2016) Brent internet2, switching

I was at the Advanced Light Source User Meeting today as a representative of LBLnet, talking about the architecture of the Science DMZ to enable big data transfers across the WAN. We had an elegant poster that showed how the DMZ architecture fits into the enterprise design. There are still groups that save large data sets to hard drives and ship them to the destination rather than attempting to use the network, and we want to help change that paradigm.

Credit for the majority of the design goes to my co-worker Michael at smitasin.com.

TP-LINK Powerline Adapter Performance

August 14, 2014 Brent switching

For the longest time I would never advocate Powerline Ethernet as a viable solution for getting connectivity into a troublesome area. I felt that the technology was prone to interference and therefore an unreliable solution that could never deliver a consistent connection.

After a few failed attempts to trace Cat5 cable into the garage in my San Francisco apartment in order to connect two pairs to get 100 Mbps connectivity between the front and back of the apartment, I decided to try out a Powerline Ethernet solution. I picked up a pair of TP-LINK 200Mbps units and was surprised at the setup procedure. It was fairly easy and I had the units connected in under five minutes.

Of course I wanted to test performance so I started to graph latency to the wireless router on the other side of the apartment.

An iperf test also showed a solid 28.9 Mbits/second transfer rate.

C:\tools>iperf -c 10.10.1.27 -p 200 -t 120
------------------------------------------------------------
Client connecting to 10.10.1.27, TCP port 200
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[156] local 10.10.1.60 port 58742 connected with 10.10.1.27 port 200
[ ID] Interval       Transfer     Bandwidth
[156]  0.0-120.0 sec   413 MBytes  28.9 Mbits/sec

C:\tools>
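
As a sanity check on the reported rate, the Transfer and Bandwidth columns can be reconciled by hand. This is a small sketch that assumes iperf's usual unit convention (MBytes counted in units of 1024 x 1024 bytes, Mbits in units of 1,000,000 bits); the function name is made up for illustration.

```python
MBYTE = 1024 * 1024   # assumed: iperf reports MBytes in binary units
MBIT = 1_000_000      # assumed: iperf reports Mbits in decimal units

def mbits_per_sec(transfer_mbytes, seconds):
    """Convert an iperf 'Transfer' figure and interval into Mbits/sec."""
    return transfer_mbytes * MBYTE * 8 / seconds / MBIT

rate = mbits_per_sec(413, 120)
print(f"{rate:.1f} Mbits/sec")                 # matches the 28.9 in the output above
print(f"{rate / 200:.0%} of the 200 Mbps on the box")
```

In other words, real TCP throughput came in at around 14% of the adapters' nominal rating.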

Overall I’ve been impressed by the performance and now have Ethernet extended to the other half of the apartment. Not bad for a $30 connectivity solution.

Reserved IP Addresses in prefix-list Format

April 16, 2014 (updated June 14, 2016) Brent juniper, routing, switching ipv4, ipv6, localhost, martians, multicast, rfc1918

Use these with the load merge terminal command for easy cut-and-pasting in Junos.

policy-options {
    prefix-list localhost {
        127.0.0.1/32;
    }
    prefix-list martians-IPv4 {
        0.0.0.0/8;
        10.0.0.0/8;
        127.0.0.0/8;
        169.254.0.0/16;
        172.16.0.0/12;
        192.168.0.0/16;
    }
    prefix-list multicast {
        224.0.0.0/4;
    }
    prefix-list multicast-all-systems {
        224.0.0.1/32;
    }
    prefix-list rfc1918 {
        10.0.0.0/8;
        172.16.0.0/12;
        192.168.0.0/16;
    }
    prefix-list martians-IPv6 {
        ::/96;
        ::1/128;
        fe80::/10;
        fec0::/10;
        ff00::/8;
        ff02::/16;
    }
    prefix-list other-bad-src-addrs-IPv6 {
        ::/128;
        ::ffff:0.0.0.0/96;
        ::ffff:10.0.0.0/104;
        ::ffff:127.0.0.0/104;
        ::ffff:172.16.0.0/108;
        ::ffff:192.168.0.0/112;
        ::ffff:224.0.0.0/100;
        ::ffff:240.0.0.0/100;
        ::ffff:255.0.0.0/104;
        2001:db8::/32;
        2002:0000::/24;
        2002:0a00::/24;
        2002:7f00::/24;
        2002:ac10::/28;
        2002:c0a8::/32;
        2002:e000::/20;
        2002:ff00::/24;
        3ffe::/16;
        fc00::/7;
    }
}
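
If you want to check an address against these lists before pasting them onto a router, something like the following works offline. It is only a sketch using Python's standard ipaddress module; it copies the martians-IPv4 entries from above, and the function name is made up for illustration.

```python
import ipaddress

# Prefixes copied from the martians-IPv4 prefix-list above.
MARTIANS_V4 = [ipaddress.ip_network(p) for p in (
    "0.0.0.0/8",
    "10.0.0.0/8",
    "127.0.0.0/8",
    "169.254.0.0/16",
    "172.16.0.0/12",
    "192.168.0.0/16",
)]

def is_martian(address):
    """Return True if the IPv4 address falls inside any listed prefix."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in MARTIANS_V4)

for addr in ("172.20.1.1", "172.32.0.1", "169.254.10.1", "8.8.8.8"):
    print(addr, is_martian(addr))
```

Note that 172.20.1.1 matches because 172.16.0.0/12 runs through 172.31.255.255, while 172.32.0.1 sits just outside it.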

ARP Cache Poisoning

April 11, 2014 (updated September 24, 2015) Brent cisco, switching arp, cache, mac, ping, poisoning

Overview

We received reports from end-users that a few client workstations on a specific subnet were experiencing around 70% packet loss when attempting to communicate between a few hosts. Since the initial report seemed to be rather isolated, we started with some basic ping tests, but as time went on and more hosts became affected, we escalated the issue and increased our troubleshooting efforts.

Ping Tests

Take a look at the following ping samples from two different sources to different targets that were passing traffic on the switch in question:

-bash-4.1$ ping host1
PING host1 (1.2.3.4) 56(84) bytes of data.
64 bytes from host1 (1.2.3.4): icmp_seq=1 ttl=128 time=1.51 ms
64 bytes from host1 (1.2.3.4): icmp_seq=35 ttl=128 time=0.291 ms
64 bytes from host1 (1.2.3.4): icmp_seq=36 ttl=128 time=0.361 ms
64 bytes from host1 (1.2.3.4): icmp_seq=37 ttl=128 time=0.400 ms
64 bytes from host1 (1.2.3.4): icmp_seq=38 ttl=128 time=0.264 ms
64 bytes from host1 (1.2.3.4): icmp_seq=39 ttl=128 time=0.356 ms
64 bytes from host1 (1.2.3.4): icmp_seq=40 ttl=128 time=0.419 ms
64 bytes from host1 (1.2.3.4): icmp_seq=41 ttl=128 time=0.260 ms
64 bytes from host1 (1.2.3.4): icmp_seq=42 ttl=128 time=0.349 ms
64 bytes from host1 (1.2.3.4): icmp_seq=43 ttl=128 time=0.416 ms
64 bytes from host1 (1.2.3.4): icmp_seq=44 ttl=128 time=0.429 ms
64 bytes from host1 (1.2.3.4): icmp_seq=45 ttl=128 time=0.314 ms
64 bytes from host1 (1.2.3.4): icmp_seq=46 ttl=128 time=0.359 ms
64 bytes from host1 (1.2.3.4): icmp_seq=47 ttl=128 time=0.447 ms
64 bytes from host1 (1.2.3.4): icmp_seq=48 ttl=128 time=0.287 ms
64 bytes from host1 (1.2.3.4): icmp_seq=49 ttl=128 time=0.405 ms
64 bytes from host1 (1.2.3.4): icmp_seq=50 ttl=128 time=0.416 ms
^C

-bash-4.1$ ping host2
PING host2 (2.3.4.5) 56(84) bytes of data.
64 bytes from host2 (2.3.4.5): icmp_seq=1 ttl=128 time=1.21 ms
64 bytes from host2 (2.3.4.5): icmp_seq=30 ttl=128 time=0.484 ms
64 bytes from host2 (2.3.4.5): icmp_seq=59 ttl=128 time=0.467 ms
64 bytes from host2 (2.3.4.5): icmp_seq=83 ttl=128 time=0.197 ms
64 bytes from host2 (2.3.4.5): icmp_seq=84 ttl=128 time=0.241 ms
64 bytes from host2 (2.3.4.5): icmp_seq=85 ttl=128 time=0.210 ms
64 bytes from host2 (2.3.4.5): icmp_seq=86 ttl=128 time=0.240 ms
64 bytes from host2 (2.3.4.5): icmp_seq=87 ttl=128 time=0.171 ms
64 bytes from host2 (2.3.4.5): icmp_seq=88 ttl=128 time=0.216 ms
64 bytes from host2 (2.3.4.5): icmp_seq=89 ttl=128 time=0.194 ms
64 bytes from host2 (2.3.4.5): icmp_seq=90 ttl=128 time=0.392 ms
64 bytes from host2 (2.3.4.5): icmp_seq=91 ttl=128 time=0.240 ms
64 bytes from host2 (2.3.4.5): icmp_seq=92 ttl=128 time=0.235 ms
64 bytes from host2 (2.3.4.5): icmp_seq=93 ttl=128 time=0.222 ms
^C

The first response in both samples takes a little longer than we expected for two hosts connected to a local Gigabit switch, but falls within the profiled delay for ARP resolution. After the longer initial ping response time, all subsequent pings fall within the expected value for the local network.

The TTL values all look normal so we know the traffic isn’t leaving the local network and hitting additional hops, but take a look at the sequence numbers; they are not sequential and show a sign of a larger problem.

In the first sample, sequence numbers 2-34 were lost, as were 2-29, 31-58, and 60-82 in the second sample. With ping sending one request per second by default, we know that around 30 seconds of traffic was being completely lost at a time in both test cases.
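
Spotting those gaps by eye gets tedious with longer captures. A minimal sketch of pulling the missing ranges out of pasted ping output follows; the function name is made up, and the sample lines are taken from the host2 output above.

```python
import re

def missing_sequences(ping_output):
    """Return (first, last) ranges of icmp_seq numbers absent from the output."""
    seen = [int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)]
    gaps = []
    for prev, cur in zip(seen, seen[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

sample = """\
64 bytes from host2 (2.3.4.5): icmp_seq=1 ttl=128 time=1.21 ms
64 bytes from host2 (2.3.4.5): icmp_seq=30 ttl=128 time=0.484 ms
64 bytes from host2 (2.3.4.5): icmp_seq=59 ttl=128 time=0.467 ms
64 bytes from host2 (2.3.4.5): icmp_seq=83 ttl=128 time=0.197 ms
64 bytes from host2 (2.3.4.5): icmp_seq=84 ttl=128 time=0.241 ms
"""
print(missing_sequences(sample))  # [(2, 29), (31, 58), (60, 82)]
```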

Find the Layer

At this point we could rule out a Layer 1 issue, so we knew the problem was not dirty optics or the input/output queues on any of the switching equipment.

There had to be a Layer 2 issue, which could include bridging, MAC address learning occurring on different pieces of equipment at different times, competition for an IP, or ARP poisoning.

As more time passed, more workstations started to notice more loss of traffic on the network. All signs pointed to something wrong with the ARP table. A show mac address-table command on one of the switches showed that a number of hosts were associated with a MAC address of 00:00:00:00:00:00 and that number was increasing over time.

Wireshark

After getting a SPAN port on the switch and looking at traffic with Wireshark, we found a large number of responses to ARP requests with a value of 00:00:00:00:00:00 for hosts in the local subnet that were all sourced from one machine. This one computer was poisoning the ARP cache on the network with all zeros, causing the location for every host to eventually become a blackhole for traffic.

As more ARP caches on end-user machines and switching equipment were timing out, they were sending out new ARP requests and getting poisoned information. Once the offending machine’s port was shut down, we started to notice traffic return to normal.
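
The pattern we were looking for in the capture can be sketched in a few lines. This is an illustration only: it assumes the ARP replies have already been exported from the capture as (frame source MAC, target IP, advertised MAC) tuples, and the addresses below are hypothetical, not values from the incident.

```python
from collections import Counter

ZERO_MAC = "00:00:00:00:00:00"

def zero_mac_senders(arp_replies):
    """Count all-zero ARP replies per sending machine.

    arp_replies: iterable of (frame_source_mac, target_ip, advertised_mac).
    """
    return Counter(
        source for source, _target_ip, advertised in arp_replies
        if advertised.lower() == ZERO_MAC
    )

# Hypothetical sample data for illustration.
replies = [
    ("a4:ba:db:00:00:17", "10.0.0.5", "00:00:00:00:00:00"),
    ("a4:ba:db:00:00:17", "10.0.0.9", "00:00:00:00:00:00"),
    ("a4:ba:db:00:00:03", "10.0.0.3", "a4:ba:db:00:00:03"),
]
print(zero_mac_senders(replies).most_common(1))
```

A single source dominating that count is the machine whose port you want to shut down.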

Intel NIC Broadcast Storm

April 8, 2014 (updated June 14, 2016) Brent switching dell, intel, latitude, optiplex

As part of a standardization project, we have been enabling new port-security options on our Access switches that provide connectivity for end-users. When we made this change for a switch that serves around 240 users, we started to receive alerts for port security violations from three hosts at very inconsistent hours. Below is a small sample of one of the broadcast storms.

2014-04-08_syslog

Given the large number of MAC addresses that were broadcast in a short amount of time, the switchport port-security maximum 50 was being triggered after the switch saw the 51st MAC address.

interface GigabitEthernet1/1
 description Access Port
 switchport access vlan 200
 switchport mode access
 switchport port-security maximum 50
 switchport port-security
 switchport port-security aging time 1
 switchport port-security violation restrict
 no logging event link-status
 storm-control broadcast level 3.40
 storm-control action trap
 spanning-tree portfast
 ip dhcp snooping limit rate 50

I consolidated all the MAC addresses seen into a table and was not able to find any duplicates. A search of an OUI database also showed that they were unregistered, so they appeared to be randomly generated.
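
The duplicate check is easy to script if the addresses are pulled out of syslog first. The sketch below is an assumption about how one might do it, not the exact process used here: it normalizes the common MAC notations, reports duplicates, and extracts the OUI (first three octets) for lookup. The sample addresses are hypothetical.

```python
from collections import Counter

def normalize(mac):
    """Convert 0011.2233.4455, 00-11-22-33-44-55, etc. to 00:11:22:33:44:55."""
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

def duplicates(macs):
    """Return the normalized addresses that appear more than once."""
    counts = Counter(normalize(m) for m in macs)
    return sorted(m for m, n in counts.items() if n > 1)

def oui(mac):
    """Return the vendor prefix (first three octets) of a MAC address."""
    return normalize(mac)[:8]

# Hypothetical sample data for illustration.
seen = ["0011.2233.4455", "00:11:22:33:44:55", "A4-BA-DB-01-02-03"]
print(duplicates(seen))   # ['00:11:22:33:44:55']
print(oui(seen[2]))       # a4:ba:db
```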

Looking at the MAC address table for each port after the storm incident, I discovered that each port served only a single Dell computer with an Intel 82579M Gigabit NIC. Some research led me to a case of OptiPlex 790, 7010, 9010 and Latitude E6520/E6530 systems generating a network broadcast storm after coming out of sleep mode (2), which requires a driver update on the Intel NIC to fix.

References

Catalyst Spring Cleaning

April 4, 2014 (updated June 14, 2016) Brent cisco, switching 4500

Don’t forget to remove dust from the inlet ports on your Catalyst 4500 chassis on a routine basis if they exist in locations where they are exposed to large amounts of particulates. You don’t want to be woken up at 6:19AM by a downed device.

Mar 18 06:19:40 1.2.3.4 : %C4K_IOSMODPORTMAN-2-MODULESHUTDOWNTEMP: Module 1 Sensor air outlet temperature is at or over shutdown threshold - current temp: 86C, shutdown threshold: 86C
Mar 18 08:25:23 1.2.3.4 : %C4K_IOSMODPORTMAN-2-MODULESHUTDOWNTEMP: Module 1 Sensor air outlet temperature is at or over shutdown threshold - current temp: 86C, shutdown threshold: 86C.
Mar 18 09:34:49 1.2.3.4 : %C4K_IOSMODPORTMAN-2-MODULESHUTDOWNTEMP: Module 1 Sensor air outlet temperature is at or over shutdown threshold - current temp: 86C, shutdown threshold: 86C

Sadly, the syslog events never made it to our central repository. Our solution for the first automated shutdown was simply to power the device back on to restore connectivity for the building. After it shut itself off two more times, we reviewed the logs on the local device and discovered that the events had never reached our syslog server.

A simple wipe of accumulated dust saw our temperature take a drastic drop.

Dropbox LAN sync Protocol Noise

April 2, 2014 (updated June 14, 2016) Brent switching dropbox, udp, wireshark

An end-user recently submitted a ticket that the connection sharing switch in his office was blinking more than usual and wanted confirmation that nothing malicious was occurring on the network. I took a small sample of the traffic that was on the end-user’s switch with Wireshark and found a surprising amount of Dropbox noise being broadcast on the network.

In our /22 subnet, I found 88 hosts that were advertising UDP/17500 Dropbox LanSync Protocol (db-lsp) traffic. The Dropbox advertisements were accounting for 67% of our total UDP traffic and were occurring on average every 0.14 seconds using 13.3 kbps of bandwidth. Hardly an impact on a Gigabit connection, but I can see from an end-user perspective that when they plug a quiet switch into the network there is a change in LED behavior.
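
A few derived numbers put that in perspective. This is back-of-the-envelope arithmetic from the figures above, assuming 1 kbps = 1000 bits per second and that all 88 hosts advertise at roughly the same rate.

```python
hosts = 88
interval_s = 0.14        # average time between Dropbox advertisements on the wire
bandwidth_bps = 13_300   # 13.3 kbps, assuming decimal kilobits

packets_per_sec = 1 / interval_s
bytes_per_packet = bandwidth_bps / 8 / packets_per_sec
per_host_interval_s = hosts * interval_s
gigabit_share_pct = bandwidth_bps / 1_000_000_000 * 100

print(f"{packets_per_sec:.1f} packets/sec on the segment")
print(f"~{bytes_per_packet:.0f} bytes per advertisement")
print(f"each host advertises about every {per_host_interval_s:.1f} s")
print(f"{gigabit_share_pct:.5f}% of a Gigabit link")
```

So each advertisement is a couple hundred bytes, each host speaks up every dozen seconds or so, and the total is a tiny fraction of a percent of the link.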

802.3az Energy-Efficient Ethernet on Juniper EX3300 Switches

March 27, 2014 (updated April 14, 2014) Brent juniper, switching 802.3az, ex3300

Unlike on Cisco 2960-X switches, 802.3az is not enabled by default on Juniper EX3300 models. Use ether-options ieee-802-3az-eee under the interface-range tree to enable this energy-saving protocol.

interfaces {
    interface-range WIRED_PORTS {
        member-range ge-0/0/0 to ge-0/0/47;
        ether-options {
            ieee-802-3az-eee;
        }
    }
}

When committing this option, we noticed around 8 seconds of connectivity loss for 96 wired hosts. Be careful when enabling this in a production setting.

Data Engineering

stories from a network engineer in San Francisco

switching