Tuesday, December 17, 2013

Inter-VLAN traffic over OTV

This is one of the most interesting things I have tried while learning OTV. More often than not, after hearing about OTV and what it does, people end up asking: does it do inter-VLAN as well? At first my answer was, eh...well...I don't know; then it became, let me get back to you on this; and now it is finally: yes, it does [with a few ARP considerations, which I will talk about at the end of this post based on what I understood].

The topology will play a major role here, since we will be introducing a new internal Layer-3 link in addition to the already available Layer-2 link between the end hosts and the ED.



If you have seen my previous posts, you will be familiar with my topology. The change here is that I have added two switches [SW1 and SW2], plus additional L3 connections from the EDs to these switches.

I could have simply shown inter-VLAN traffic in this post, but to be more realistic I have made use of HSRP [as most data centers use HSRP for host mobility]. The HSRP configuration sits on the internal Layer-3 interfaces of the EDs.
If you are not familiar with HSRP, I request you to read some write-ups on it first; that will help you follow the HSRP configurations in this post.

To demonstrate the inter-VLAN functionality, I have used OTV unicast-core mode. [For the OTV unicast-core configuration, please visit this link: http://stayinginit.blogspot.in/2013/12/basic-unicast-overlay-transport.html]

The objective with the above topology is to be able to ping from VM1 to VM2 and vice versa. Keep in mind that VM1 is in VLAN 15 with IP 172.16.15.10/24 and VM2 is in VLAN 14 with IP 172.16.14.10/24.
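Just to make the addressing concrete, here is a quick check in plain Python using the standard ipaddress module [nothing OTV-specific, just illustrating why a router is needed]: the two VMs sit in different /24 subnets, so traffic between them cannot be bridged directly and must go via the HSRP gateway.

```python
import ipaddress

vm1 = ipaddress.ip_interface("172.16.15.10/24")   # VM1, VLAN 15
vm2 = ipaddress.ip_interface("172.16.14.10/24")   # VM2, VLAN 14

# Different subnets, so VM1 must send via its gateway (172.16.15.100)
# rather than ARPing for VM2 directly:
assert vm2.ip not in vm1.network
assert vm1.ip not in vm2.network
```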

Some details about the setup before we proceed:
  1. The vSwitch VLAN setting on the ESXi hosts is set to 'None'
  2. The switch port to which each VM connects is an access port
    1. VM1 is connected to an access port in VLAN 15 on SW1
    2. VM2 is connected to an access port in VLAN 14 on SW2
  3. The connections from the EDs to the switches are all trunk connections, allowing all VLANs [default]
Keeping the above in mind, let us move ahead with the configurations.

VM1 - 172.16.15.10/24 ; GW: 172.16.15.100

VM2 - 172.16.14.10/24 ; GW: 172.16.14.100

SW1:

spanning-tree mode rapid-pvst
spanning-tree extend system-id
!
interface GigabitEthernet6/1
 description "Connected to VM1"
 switchport
 no shutdown ! ADDED BY ME
 switchport access vlan 15
 switchport mode access
!
interface GigabitEthernet6/2
 description "Layer-2 port to which ED1 is connected"
 switchport
 no shutdown ! ADDED BY ME
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
interface GigabitEthernet6/3
 description "Layer-3 port to which ED1 is connected"
 switchport
 no shutdown ! ADDED BY ME
 switchport trunk encapsulation dot1q
 switchport mode trunk

ED1:

otv site bridge-domain 150
!
spanning-tree mode rapid-pvst
spanning-tree extend system-id
!
otv site-identifier 0000.0000.0002
!
interface Overlay150
 no ip address
 no shutdown ! ADDED BY ME
 otv join-interface GigabitEthernet0/0/1.15
 otv use-adjacency-server 10.1.14.2 unicast-only
 otv adjacency-server unicast-only
 service instance 14 ethernet
  encapsulation dot1q 14
  bridge-domain 14
 !
 service instance 15 ethernet
  encapsulation dot1q 15
  bridge-domain 15
 !
!
interface GigabitEthernet0/0/0
 description "Internal / Access / Layer-2 interface of ED1"
 no ip address
 no shutdown ! ADDED BY ME
 negotiation auto
 service instance 14 ethernet
  encapsulation dot1q 14
  bridge-domain 14
 !
 service instance 15 ethernet
  encapsulation dot1q 15
  bridge-domain 15
 !
 service instance 150 ethernet
  encapsulation dot1q 150
  bridge-domain 150
 !
!
interface GigabitEthernet0/0/1
 no ip address
 no shutdown ! ADDED BY ME
 negotiation auto
!
interface GigabitEthernet0/0/1.15
 description "Join-Interface of ED1, connected to CORE"
 encapsulation dot1Q 15
 ip address 10.1.15.2 255.255.255.0
 ip ospf 15 area 15
!
interface GigabitEthernet0/0/2
 no ip address
 no shutdown ! ADDED BY ME
 standby use-bia
 negotiation auto
!
interface GigabitEthernet0/0/2.14
 description "Layer-3 Interface, GW for 172.16.14.0/24"
 encapsulation dot1Q 14
 ip address 172.16.14.1 255.255.255.0
 standby 14 ip 172.16.14.100
 standby 14 preempt
 arp timeout 1200 ! Configured based on ARP Consideration
!
interface GigabitEthernet0/0/2.15
 description "Layer-3 Interface, GW for 172.16.15.0/24"
 encapsulation dot1Q 15
 ip address 172.16.15.1 255.255.255.0
 standby 15 ip 172.16.15.100
 standby 15 preempt
 arp timeout 1200 ! Configured based on ARP Consideration
!
router ospf 15
 router-id 15.15.15.1
!

CORE:

interface GigabitEthernet0/0/0
 no ip address
 no shutdown ! ADDED BY ME
 negotiation auto
!         
interface GigabitEthernet0/0/0.14
 description "Connected to Join-interface of ED2"
 encapsulation dot1Q 14
 ip address 10.1.14.1 255.255.255.0
 ip ospf 15 area 15
!
interface GigabitEthernet0/0/1
 no ip address
 no shutdown ! ADDED BY ME
 negotiation auto
!
interface GigabitEthernet0/0/1.15
 description "Connected to Join-interface of ED1"
 encapsulation dot1Q 15
 ip address 10.1.15.1 255.255.255.0
 ip ospf 15 area 15
!
router ospf 15
 router-id 15.15.15.2
!

ED2:

otv site bridge-domain 151
!
spanning-tree mode rapid-pvst
spanning-tree extend system-id
!
otv site-identifier 0000.0000.0003
!
interface Overlay150
 no ip address
 no shutdown ! ADDED BY ME
 otv join-interface GigabitEthernet0/0/0.14
 otv use-adjacency-server 10.1.15.2 unicast-only
 otv adjacency-server unicast-only
 service instance 14 ethernet
  encapsulation dot1q 14
  bridge-domain 14
 !
 service instance 15 ethernet
  encapsulation dot1q 15
  bridge-domain 15
 !
!
interface GigabitEthernet0/0/0
 no ip address
 no shutdown ! ADDED BY ME
 negotiation auto
!
interface GigabitEthernet0/0/0.14
 description "Join-Interface of ED2, connected to CORE" 
 encapsulation dot1Q 14
 ip address 10.1.14.2 255.255.255.0
 ip ospf 15 area 15
!
interface GigabitEthernet0/0/1
 description "Internal / Access / Layer-2 interface of ED2"
 no ip address
 no shutdown ! ADDED BY ME
 negotiation auto
 service instance 14 ethernet
  encapsulation dot1q 14
  bridge-domain 14
 !
 service instance 15 ethernet
  encapsulation dot1q 15
  bridge-domain 15
 !
 service instance 151 ethernet
  encapsulation dot1q 151
  bridge-domain 151
 !
!
interface GigabitEthernet0/0/2
 no ip address
 no shutdown ! ADDED BY ME
 standby use-bia
 negotiation auto
!
interface GigabitEthernet0/0/2.14
 description "Layer-3 Interface, GW for 172.16.14.0/24"
 encapsulation dot1Q 14
 ip address 172.16.14.2 255.255.255.0
 standby 14 ip 172.16.14.100
 standby 14 preempt
 arp timeout 1200
!
interface GigabitEthernet0/0/2.15
 description "Layer-3 Interface, GW for 172.16.15.0/24"
 encapsulation dot1Q 15
 ip address 172.16.15.2 255.255.255.0
 standby 15 ip 172.16.15.100
 standby 15 preempt
 arp timeout 1200
!
router ospf 15
 router-id 15.15.15.3
!

SW2:

spanning-tree mode rapid-pvst
spanning-tree extend system-id
!
interface GigabitEthernet5/1
 description "Connected to VM2"
 switchport
 no shutdown ! ADDED BY ME
 switchport access vlan 14
 switchport mode access
!
interface GigabitEthernet5/2
 description "Layer-2 port to which ED2 is connected"
 switchport
 no shutdown ! ADDED BY ME
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
interface GigabitEthernet5/3
 description "Layer-3 port to which ED2 is connected"
 switchport
 no shutdown ! ADDED BY ME
 switchport trunk encapsulation dot1q
 switchport mode trunk
!

Verification:


The objective, as mentioned, is a successful ping, so let's check that:

Case-1:
VM1 --> VM2

[root@vm-aries-cel ~]# ping 172.16.14.10 -c 10
PING 172.16.14.10 (172.16.14.10) 56(84) bytes of data.
64 bytes from 172.16.14.10: icmp_seq=0 ttl=63 time=0.639 ms
64 bytes from 172.16.14.10: icmp_seq=1 ttl=63 time=0.675 ms
64 bytes from 172.16.14.10: icmp_seq=2 ttl=63 time=0.722 ms
64 bytes from 172.16.14.10: icmp_seq=3 ttl=63 time=0.682 ms
64 bytes from 172.16.14.10: icmp_seq=4 ttl=63 time=0.707 ms
64 bytes from 172.16.14.10: icmp_seq=5 ttl=63 time=0.535 ms
64 bytes from 172.16.14.10: icmp_seq=6 ttl=63 time=0.670 ms
64 bytes from 172.16.14.10: icmp_seq=7 ttl=63 time=0.698 ms
64 bytes from 172.16.14.10: icmp_seq=8 ttl=63 time=0.681 ms
64 bytes from 172.16.14.10: icmp_seq=9 ttl=63 time=0.673 ms

--- 172.16.14.10 ping statistics ---

10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.535/0.668/0.722/0.051 ms, pipe 2

[root@vm-aries-cel ~]#

Case-2:
VM2 --> VM1

[root@localhost devices]# ping 172.16.15.10
PING 172.16.15.10 (172.16.15.10) 56(84) bytes of data.
64 bytes from 172.16.15.10: icmp_seq=1 ttl=63 time=0.597 ms
64 bytes from 172.16.15.10: icmp_seq=2 ttl=63 time=0.633 ms
64 bytes from 172.16.15.10: icmp_seq=3 ttl=63 time=0.702 ms
64 bytes from 172.16.15.10: icmp_seq=4 ttl=63 time=0.677 ms
64 bytes from 172.16.15.10: icmp_seq=5 ttl=63 time=0.624 ms
64 bytes from 172.16.15.10: icmp_seq=6 ttl=63 time=0.642 ms
64 bytes from 172.16.15.10: icmp_seq=7 ttl=63 time=0.643 ms
64 bytes from 172.16.15.10: icmp_seq=8 ttl=63 time=0.771 ms
64 bytes from 172.16.15.10: icmp_seq=9 ttl=63 time=0.665 ms
64 bytes from 172.16.15.10: icmp_seq=10 ttl=63 time=0.616 ms

--- 172.16.15.10 ping statistics ---

10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.597/0.657/0.771/0.047 ms

[root@localhost devices]# 
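One small thing worth noting in the output above: the replies come back with ttl=63. Linux sends ICMP with an initial TTL of 64, and the OTV overlay is a Layer-2 extension that does not touch TTL, so the single decrement confirms the packet crossed exactly one routed hop: the HSRP gateway on the ED. A trivial sanity check:

```python
# ttl=63 in the ping replies: initial TTL 64 minus one routed hop (the HSRP
# gateway on the ED); the OTV overlay is Layer 2 and does not decrement TTL.
LINUX_DEFAULT_TTL = 64
observed_ttl = 63

routed_hops = LINUX_DEFAULT_TTL - observed_ttl
assert routed_hops == 1
```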


ARP Considerations:



First, idle MAC addresses are held in the EDs for a default of 30 minutes [the maximum is 60 minutes as of today].
The problem you can face when doing inter-VLAN routing over OTV is that ping, or any other communication, can fail at or after the 30-minute mark.
This is mainly because the VM ARP entries are cached by their gateways.

Let me briefly try to explain using the ping from VM1 to VM2:
  1. VM1 in VLAN 15 wants to ping VM2 in VLAN 14 [172.16.15.10 --> 172.16.14.10]
  2. The first packet from VM1 is an ARP request to identify the MAC address of its gateway [since the traffic is destined for a different subnet]
  3. The ARP packet reaches SW1, and SW1 promptly floods it out both of its trunk links [6/2 and 6/3]
  4. Since the packet came in over 6/2, the access interface of ED1 learns VM1's MAC address [generic switch behavior of learning the source MAC] and sends the update to ED2 [over OTV IS-IS]
  5. Also, since the ARP packet went over 6/3, the internal Layer-3 interface of ED1 learns the MAC address of VM1 and responds to the ARP request with its own MAC address
    1. This is where the ARP caching occurs [for the MAC of VM1]
  6. Equipped with the MAC of the gateway, VM1 now sends the data to its gateway
  7. The packet then reaches the internal Layer-3 interface of ED1, GigabitEthernet 0/0/2.15
    1. The router, realizing that the destination subnet is local to it [172.16.14.0/24, as configured on GigabitEthernet 0/0/2.14], sends out an ARP request to resolve the MAC address of the destination
    2. This again goes out SW1 port 6/3 and comes back in on port 6/2
  8. From there, the packet is forwarded over OTV and reaches the other ED [ED2], where VM2 responds to the ARP
  9. From the ARP response, the MAC of VM2 is learned and cached by ED2 and advertised to ED1 over OTV
  10. The ARP response itself is sent back to ED1 and out over port 6/3 of SW1
    1. This too is cached by the router
  11. Subsequent pings from VM1 to VM2 now flow in a much more streamlined manner:
    1. The ping data goes directly over port 6/3 of SW1
    2. The MAC addresses are rewritten, and the packet goes back over port 6/3 of SW1 to port 6/2, then over OTV to the other end; thus the ping is successful
  12. However, notice that neither VM1's nor VM2's MAC address is being refreshed from the OTV point of view, so both eventually time out
  13. A ping after the timeout becomes 'an unknown unicast' for the ED and thus gets dropped
    1. This is what results in the failures
    2. As far as my understanding goes, this is more of a design problem [how switches and ARP caches work] than an issue with OTV
  14. To overcome this design problem, we could either set the MAC idle timeout to infinity on OTV [which is currently not available] or change the ARP timeout value to force ARPs to time out after a certain time
  15. So we go for the latter: changing the ARP timeout value
  16. Knowing that the default idle MAC timeout is 30 minutes, we set the ARP timeout to something less; in this case, 1200 seconds
  17. With this configured on the routers' internal Layer-3 interfaces, ARPs are periodically re-flooded across the EDs, re-learning the hosts' MAC addresses before they age out
Kindly note that the above is just the way I have understood ARP to work here, after discussing with my peers. I am still in the process of capturing all of this in Wireshark to re-confirm my understanding. If anyone finds that I am wrong, do post a comment so that I can re-evaluate my understanding.
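The timer interplay in steps 12-17 can be sketched in plain Python [a hypothetical model of my understanding, not anything OTV actually runs]: as long as the ARP timeout on the gateway is shorter than the OTV MAC aging time, each periodic ARP re-flood re-learns the remote MAC and restarts its aging timer before the entry can expire.

```python
# Hypothetical model of the timer interaction described above; the numbers
# mirror this post (MAC aging 1800 s = 30 min, "arp timeout 1200"), not any
# real OTV internals.

MAC_AGING = 1800   # OTV idle MAC timeout in seconds (30-minute default)


def mac_still_known(t, arp_timeout):
    """True if the remote VM's MAC is still in the OTV table at time t.

    Assumes the gateway re-ARPs every arp_timeout seconds; each ARP request
    is flooded across the overlay, re-learning the MAC and restarting the
    MAC aging timer.
    """
    last_relearn = (t // arp_timeout) * arp_timeout   # most recent ARP flood
    return (t - last_relearn) < MAC_AGING


# With a long ARP timeout (e.g. a 4-hour default of 14400 s) the MAC entry
# silently ages out, and later pings become unknown unicast and are dropped:
assert mac_still_known(2000, arp_timeout=14400) is False

# With "arp timeout 1200" (1200 < 1800) the MAC is always refreshed in time:
assert all(mac_still_known(t, arp_timeout=1200) for t in range(0, 14400))
```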

This concludes the post. It has been an especially exciting one for me, and I hope you find it informative.