Tuesday, June 14, 2005

How an IP Tunnel Interface Dynamically Adjusts its Link MTU

With the launch of OpenSolaris comes the opportunity to discuss the implementation details behind existing Solaris features. I'd like to share some of the details behind one of my contributions to Solaris 10: the implementation of dynamic MTU calculation for IP tunnel interfaces.

Solaris 8 was the first version of Solaris to implement the IP in IP tunneling mechanism described in RFC1853. It did not, however, implement the "Tunnel MTU Discovery" section of that RFC. Tunneling over IPv6 (RFC2473) was implemented very early in Solaris 10 (and backported to Solaris 9 in Update 1) along with a Tunnel MTU Discovery mechanism that worked for IPv6 tunnel interfaces only. One drawback of that implementation was that there was no observability into the Tunnel MTU (ifconfig's output always showed a static MTU value that was unrelated to the tunnel interface's actual MTU). A mechanism was needed that worked for both IPv4 and IPv6 tunnels, and that was visible to the administrator.

This work became more important when customers (internal and external to Sun) started using Solaris' IPsec tunneling to implement VPN solutions. Without proper Tunnel MTU Discovery, things like TCP MSS calculations can take longer to converge to usable values, and protocols that don't have any insight into Path MTU (UDP, for example) cause unnecessary IP fragmentation. For more on the benefits of Tunnel MTU Discovery, see the two aforementioned RFCs on IP tunneling.

Without going into too much detail about the inner workings of the ip and tun modules or every line of code that was changed to implement this feature, I'd like to focus on two aspects of the implementation. The first is the mechanism used by the tun module to obtain path MTU information about the tunnel destination from ip, and the second is the mechanism by which the ip interface's MTU is dynamically changed when the tun module detects a change in the tunnel's link MTU.

IRE_DB_REQ_TYPE

In order for the tun module to be able to calculate a useful tunnel MTU, it needs to know the Path MTU of the tunnel destination. The tunnel destination is the IP node we'll send encapsulated packets to when sending them through the tunnel interface. In ifconfig output, it is the "tunnel dst":

# ifconfig ip.tun0
ip.tun0: flags=10008d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST,IPv4> mtu 1480 index 4
        inet tunnel src 11.0.0.1 tunnel dst 11.0.0.2
        tunnel hop limit 60
        inet 10.0.0.1 --> 10.0.0.2 netmask ff000000

In the above example, IP packets forwarded into ip.tun0 are encapsulated into an outer IP header with a source of 11.0.0.1 and a destination of 11.0.0.2. 11.0.0.2 is the "tunnel destination".

The Path MTU to the destination is the size of the largest IP packet that can be sent to the destination without being fragmented or triggering an ICMP "fragmentation needed" message. The tunnel MTU of a given tunnel is the Path MTU of the tunnel destination minus any tunneling overhead (the encapsulating IP header, plus IPsec headers if IPsec tunneling is being used).
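To make the arithmetic concrete, here is a minimal standalone sketch of that calculation. It is not the Solaris code: the function name and constants are made up for illustration, and it assumes an outer IPv4 header without options (20 bytes) or a fixed IPv6 header (40 bytes), with the result clamped to the protocol's minimum MTU.

```c
#include <stdint.h>

/* Illustrative constants; these names are hypothetical, not Solaris ones. */
#define SKETCH_IPV4_HDR_LEN	20u	/* IPv4 header without options */
#define SKETCH_IPV6_HDR_LEN	40u	/* fixed IPv6 header */
#define SKETCH_IPV4_MIN_MTU	68u	/* minimum IPv4 MTU (RFC 791) */
#define SKETCH_IPV6_MIN_MTU	1280u	/* minimum IPv6 MTU (RFC 2460) */

/*
 * Subtract the encapsulation overhead from the Path MTU to the tunnel
 * destination, never returning less than the minimum MTU for the
 * encapsulating protocol.
 */
static uint32_t
sketch_tunnel_mtu(uint32_t pmtu, int outer_is_v6, uint32_t ipsec_overhead)
{
	uint32_t hdr = outer_is_v6 ? SKETCH_IPV6_HDR_LEN : SKETCH_IPV4_HDR_LEN;
	uint32_t floor = outer_is_v6 ? SKETCH_IPV6_MIN_MTU : SKETCH_IPV4_MIN_MTU;
	uint32_t overhead = hdr + ipsec_overhead;

	/* Check before subtracting so the unsigned value cannot wrap. */
	if (pmtu <= overhead || pmtu - overhead < floor)
		return (floor);
	return (pmtu - overhead);
}
```

With an Ethernet-sized Path MTU of 1500 and IPv4 encapsulation without IPsec, this gives 1500 - 20 = 1480, which matches the mtu shown in the ifconfig output above.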

The ip module keeps this Path MTU information in a per-destination cache (aka IRE cache) table. The protocol used to keep track of this Path MTU information is described in RFC1191. The ip module provides a number of methods of accessing this per-destination cache. One of them is the ire_ctable_lookup() functional interface, but because tun and ip are separate STREAMS modules and this functional interface was previously only safe to use within the ip module's STREAMS perimeter[1], tun could not use it.

Another method ip provides is the IRE_DB_REQ_TYPE STREAMS message. An upstream module can send such a message down to ip, and ip will reply with an IRE_DB_TYPE message and append a copy of the requested IRE[2] to the message (assuming the requested IRE is found). This is the method used by the tun module. Periodically, tun sends down this message to get the current Path MTU for its tunnel destination. For example, it does this in tun_wdata_v4() when sending a packet down to ip if the Path MTU information it holds has expired:

/*
 * Request the destination ire regularly in case Path MTU has
 * increased.
 */
if (TUN_IRE_TOO_OLD(atp))
       tun_send_ire_req(q);
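The age check itself can be pictured with a small sketch like the one below. This is only an illustration of the idea of refreshing cached information on a timer; the structure, field names, and the 600 second interval are all assumptions of mine, not the actual definition of TUN_IRE_TOO_OLD or tun_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-tunnel state; the real tun_t looks different. */
typedef struct sketch_tun {
	uint64_t last_ire_req;	/* time of the last IRE request, in seconds */
	uint32_t pmtu;		/* cached Path MTU to the tunnel destination */
} sketch_tun_t;

#define SKETCH_IRE_MAX_AGE	600	/* assumed refresh interval (seconds) */

/* True when the cached Path MTU is older than the refresh interval. */
static bool
sketch_ire_too_old(const sketch_tun_t *tp, uint64_t now)
{
	return (now - tp->last_ire_req > SKETCH_IRE_MAX_AGE);
}

/*
 * Called on the transmit path.  Returns true when the caller should send
 * a new request, and records the time so that at most one request is
 * issued per refresh interval.
 */
static bool
sketch_maybe_request_ire(sketch_tun_t *tp, uint64_t now)
{
	if (!sketch_ire_too_old(tp, now))
		return (false);
	tp->last_ire_req = now;
	return (true);
}
```

Doing the check on the transmit path means an idle tunnel generates no refresh traffic at all; the cached value is only re-validated when there is a packet to send.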

DL_NOTIFY_REQ/IND and DL_NOTE_SDU_SIZE

Once the tun module has obtained the Path MTU information of the destination, it needs to recalculate the link MTU of the tunnel interface and notify the upper instance of ip if the MTU has changed. The ip module can then update the IP interface's MTU accordingly. The MTU calculation is done by the tun_update_link_mtu() function, which in turn calls tun_sendsdusize() to notify the ip module of the new MTU if it has changed:

/*
 * Given the path MTU to the tunnel destination, calculate tunnel's link
 * mtu.  For configured tunnels, we update the tunnel's link MTU and notify
 * the upper instance of IP of the change so that the IP interface's MTU
 * can be updated.  If the tunnel is a 6to4 or automatic tunnel, just
 * return the effective MTU of the tunnel without updating it.  We don't
 * update the link MTU of 6to4 or automatic tunnels because they tunnel to
 * multiple destinations all with potentially differing path MTU's.
 */
static uint32_t
tun_update_link_mtu(queue_t *q, uint32_t pmtu, boolean_t icmp)
{
        tun_t *atp = (tun_t *)q->q_ptr;
        uint32_t newmtu = pmtu;
        boolean_t sendsdusize = B_FALSE;

        /*
         * If the pmtu provided came from an ICMP error being passed up
         * from below, then the pmtu argument has already been adjusted
         * by the IPsec overhead.
         */
        if (!icmp && (atp->tun_flags & TUN_SECURITY))
                newmtu -= atp->tun_ipsec_overhead;

        if (atp->tun_flags & TUN_L_V4) {
                newmtu -= sizeof (ipha_t);
                if (newmtu < IP_MIN_MTU)
                        newmtu = IP_MIN_MTU;
        } else {
                ASSERT(atp->tun_flags & TUN_L_V6);
                newmtu -= sizeof (ip6_t);
                if (atp->tun_encap_lim > 0)
                        newmtu -= IPV6_TUN_ENCAP_OPT_LEN;
                if (newmtu < IPV6_MIN_MTU)
                        newmtu = IPV6_MIN_MTU;
        }

        if (!(atp->tun_flags & (TUN_6TO4 | TUN_AUTOMATIC))) {
                if (newmtu != atp->tun_mtu) {
                        atp->tun_mtu = newmtu;
                        sendsdusize = B_TRUE;
                }

                if (sendsdusize)
                        tun_sendsdusize(q);
        }
        return (newmtu);
}

Note, there is a cosmetic bug in the above code. The fix would be a good starter fix for anyone wishing to be introduced to the OpenSolaris development process. :-) The sendsdusize variable is obviously not needed and the last if statement can be reduced to:

if (newmtu != atp->tun_mtu &&
    !(atp->tun_flags & (TUN_6TO4 | TUN_AUTOMATIC))) {
        atp->tun_mtu = newmtu;
        tun_sendsdusize(q);
}

How does the notification between tun and ip work? It's done via a DLPI notification mechanism that is Solaris-specific. The dlpi(7P) man page describes the mechanism as "Notification Support", and it includes support for the asynchronous notification of link status (up or down), SDU (send data unit, or MTU) size, link speed, and other information. The tun module uses the SDU notification.

The mechanism works as follows:
  1. When an IP interface is plumbed, the ip module sends the underlying driver a DL_NOTIFY_REQ DLPI message. The message contains a bitfield representing the notifications that ip is interested in. This is done by ill_dl_phys():

    /*
     * Allocate a DL_NOTIFY_REQ and set the notifications we want.
     */
    notify_mp = ip_dlpi_alloc(sizeof (dl_notify_req_t) + sizeof (long),
        DL_NOTIFY_REQ);
    if (notify_mp == NULL)
            goto bad;
    ((dl_notify_req_t *)notify_mp->b_rptr)->dl_notifications =
        (DL_NOTE_PHYS_ADDR | DL_NOTE_SDU_SIZE | DL_NOTE_FASTPATH_FLUSH |
        DL_NOTE_LINK_UP | DL_NOTE_LINK_DOWN | DL_NOTE_CAPAB_RENEG);
    ...
    ill_dlpi_send(ill, notify_mp);
    
  2. The underlying driver (tun in this case) replies with a DL_NOTIFY_ACK containing the subset of notifications that it supports. The tun module only supports DL_NOTE_SDU_SIZE.
  3. When an event that triggers a change in MTU occurs, the driver (tun) sends up a DL_NOTIFY_IND message to those DLPI users that were interested in DL_NOTE_SDU_SIZE notifications. The tun module does this in the tun_sendsdusize() function.
  4. When ip receives the DL_NOTIFY_IND message containing a DL_NOTE_SDU_SIZE notification, it updates the IP tunnel interface's MTU accordingly, and ifconfig shows the new dynamically updated MTU!
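The four steps above can be modeled in a few lines of plain C. This is a toy model rather than DLPI code: the bit values are placeholders (the real ones live in <sys/dlpi.h>), every name starting with sk_ or SK_ is made up for this sketch, and a direct callback stands in for the STREAMS messages that tun and ip actually exchange.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder notification bits; not the real DL_NOTE_* values. */
#define SK_NOTE_LINK_DOWN	0x1u
#define SK_NOTE_LINK_UP		0x2u
#define SK_NOTE_SDU_SIZE	0x4u

typedef void (*sk_sdu_cb_t)(void *arg, uint32_t new_mtu);

typedef struct sk_driver {
	uint32_t supported;	/* notifications the driver can generate */
	uint32_t enabled;	/* subset the upper layer asked for */
	uint32_t mtu;		/* current link MTU */
	sk_sdu_cb_t cb;		/* stands in for sending DL_NOTIFY_IND */
	void *cb_arg;
} sk_driver_t;

/* Steps 1 and 2: the request carries a bitfield, the ack is the subset. */
static uint32_t
sk_notify_req(sk_driver_t *drv, uint32_t wanted, sk_sdu_cb_t cb, void *arg)
{
	drv->enabled = wanted & drv->supported;
	drv->cb = cb;
	drv->cb_arg = arg;
	return (drv->enabled);
}

/* Step 3: on an MTU change, notify the upper layer if it subscribed. */
static void
sk_set_mtu(sk_driver_t *drv, uint32_t new_mtu)
{
	if (new_mtu == drv->mtu)
		return;
	drv->mtu = new_mtu;
	if ((drv->enabled & SK_NOTE_SDU_SIZE) && drv->cb != NULL)
		drv->cb(drv->cb_arg, new_mtu);
}

/* Step 4: the upper layer records the new MTU for its interface. */
static void
sk_ip_sdu_update(void *arg, uint32_t new_mtu)
{
	*(uint32_t *)arg = new_mtu;
}
```

A driver that only sets SK_NOTE_SDU_SIZE in its supported field acknowledges just that bit, no matter how many notifications were requested, which mirrors what tun does in step 2.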


[1] The IP Multithreading feature of the FireEngine project now makes it possible for other modules to use this functional interface. Some modules such as ipf (IP Filter) and nattymod (IPsec NAT traversal) already use it. The tun module can now use it as well, which is something we plan on doing.

[2] An IRE, or internet routing entry, is a data structure internal to Solaris' IP implementation used to represent forwarding table entries _and_ per-destination cache entries. Creation and maintenance of IRE tables is by far one of the most complex (some would say overly complex) parts of the ip module. The subject of IREs would make for a very lengthy blog entry on its own.


