seblog

IP Tunneling and Amazon VPC

2009-12-02T01:05:00.000-08:00

Glenn Brunette has successfully accessed the Amazon Virtual Private Cloud using OpenSolaris as a customer gateway. He describes the general concept in his blog along with pointers to a tool he's developed to automate the configuration. This OpenSolaris customer gateway uses core technologies such as IP tunneling, IPsec, and BGP to provide redundant secure links to the cloud (see the functional diagram that Glenn provides for a depiction of this).

The IP tunnel configuration using OpenSolaris allowed Glenn and his team (including Dan McDonald and others) to troubleshoot the operation of BGP over these tunnels using the observability provided by the Clearview project (integrated in OpenSolaris build 125), which would not have been possible before.

Thanks for doing this fantastic work, Glenn!

Fluendo DVD player initial reaction

2009-11-12T23:18:00.000-08:00

I bought the Fluendo DVD player for OpenSolaris last night, and there seem to be some very rough edges. For one, the /usr/bin/fluendo-dvd script doesn't work out of the box and spews shell syntax errors. It assumes that "/bin/sh" is actually bash, which isn't the case on OpenSolaris. Changing the first line of the script to "#!/bin/bash" fixes the problem and the binary launches.
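
A minimal sketch of that fix, assuming the launcher's first line is exactly "#!/bin/sh". The fix_shebang helper name is made up for illustration and is not part of the Fluendo package. It prints the corrected script instead of editing the file in place, so the result can be inspected before it replaces the original:

```shell
# Hypothetical helper: print the script named by $1 with a "#!/bin/sh"
# interpreter line rewritten to "#!/bin/bash". Only line 1 is touched.
fix_shebang() {
    sed '1s|^#!/bin/sh|#!/bin/bash|' "$1"
}

# Example use (review the output before installing it over the original):
#   fix_shebang /usr/bin/fluendo-dvd > /tmp/fluendo-dvd.fixed
```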

I've installed the player on two systems, both running development build 126 of OpenSolaris. One is my Sun Ultra 40 desktop, and the other is my Toshiba Portege r500 laptop. One common gripe in general is that the player has no control buttons at all (e.g. play, stop, pause, etc.). To control the player, one needs to go through the "DVD Player" menu, which is very odd.

There is also no evidence of the capability to navigate forward or backwards through a movie at higher or lower rates.

OpenSolaris itself is not contributing to a positive user experience: after watching a movie for more than ten minutes, the audio stream becomes corrupted by what sounds like static clicks and hisses. Stopping and restarting the player makes the audio issue go away, but it comes back after a short time. This exact same issue occurs with other gstreamer applications, so this is not a fluendo-dvd player problem. There is likely a bug in the audio framework.

Aside from these common issues, the player is unable to play movies on my Toshiba Portege laptop. The first time I attempted to play a movie (after having applied the above fix to the launcher script), the application crashed with a segmentation fault. I have the core dump if anyone from Fluendo wishes to debug the issue (the segfault occurs in fluendo_css_descrambler_descramble()). From that point on, any attempt to play a movie results in the following popup.

I'm not sure what to make of that. Perhaps there are some file permission issues on this system, but the error is cryptic enough that there's no hope of diagnosing what the problem is.

On the positive side, on the one system I'm able to get it working, the video quality is excellent.

The laptop is the main platform from which I'd like to use this, so the current situation is disappointing. I'm hoping that these are simple bugs that can be expediently fixed, and that the mail I sent to the support channel at Fluendo last night will be answered (I'd expect so since the 20 Euros one pays for this includes 1 year of support).

OpenSolaris DVD player from Fluendo

2009-11-12T03:49:00.000-08:00

Fluendo released their DVD player for OpenSolaris today!

http://www.fluendo.com/shop/product/fluendo-dvd-player/

IPv6 in Shared-Stack Zones

2009-10-08T02:04:00.000-07:00

I was recently at an OpenSolaris user-group meeting where a question was asked regarding how IPv6 could be used from a shared-stack zone. For the benefit of anyone who has a similar question, here is an example of a working configuration:

bash-3.2# zoneadm list -iv
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   - test             installed  /export/home/test              native   excl  
   - test2            installed  /export/home/test2             native   shared

The exclusive-stack zone "test" has all of its own networking configured within it, so IPv6 inherently just works there. The question, however, was about shared-stack, and so I set up the "test2" zone to demonstrate this.

bash-3.2# zonecfg -z test2
zonecfg:test2> add net
zonecfg:test2:net> set physical=e1000g0
zonecfg:test2:net> set address=fe80::1234/10
zonecfg:test2:net> end
zonecfg:test2> add net
zonecfg:test2:net> set physical=e1000g0
zonecfg:test2:net> set address=2002:a08:39f0:1::1234/64
zonecfg:test2:net> end
zonecfg:test2> verify
zonecfg:test2> commit
zonecfg:test2> exit
bash-3.2# zonecfg -z test2 info
zonename: test2
zonepath: /export/home/test2
brand: native
...
net:
address: 10.8.57.111/24
physical: e1000g0
defrouter not specified
net:
address: fe80::1234/10
physical: e1000g0
defrouter not specified
net:
address: 2002:a08:39f0:1::1234/64
physical: e1000g0
defrouter not specified

Here I configured a link-local address fe80::1234/10, and a global address 2002:a08:39f0:1::1234/64. Each interface within each zone requires a link-local address for use with neighbor-discovery, and the global address is the address used for actual IPv6 communication by applications and services. The global address' prefix is one that is configured on the link to which the interface is connected. In the zone, we end up with:

bash-3.2# zlogin test2 ifconfig -a6
lo0:1: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128
e1000g0:2: flags=2000841<UP,RUNNING,MULTICAST,IPv6> mtu 1500 index 2
        inet6 fe80::1234/10
e1000g0:3: flags=2000841<UP,RUNNING,MULTICAST,IPv6> mtu 1500 index 2
        inet6 2002:a08:39f0:1::1234/64

The global zone has IPv6 connectivity using this same prefix as well as a default IPv6 route:

bash-3.2# netstat -f inet6 -rn

Routing Table: IPv6
Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
2002:a08:39f0:1::/64        2002:a08:39f0:1:214:4fff:fe1e:1e72 U       1       0 e1000g0:1
fe80::/10                   fe80::214:4fff:fe1e:1e72    U       1       0 e1000g0
default                     fe80::1                     UG      1       0 e1000g0

From the non-global zone, we have IPv6 connectivity:

bash-3.2# zlogin test2 ping -sn 2002:8194:aeaa:1:214:4fff:fe70:5530
PING 2002:8194:aeaa:1:214:4fff:fe70:5530 (2002:8194:aeaa:1:214:4fff:fe70:5530): 56 data bytes
64 bytes from 2002:8194:aeaa:1:214:4fff:fe70:5530: icmp_seq=0. time=4.654 ms
64 bytes from 2002:8194:aeaa:1:214:4fff:fe70:5530: icmp_seq=1. time=2.632 ms
64 bytes from 2002:8194:aeaa:1:214:4fff:fe70:5530: icmp_seq=2. time=2.501 ms
64 bytes from 2002:8194:aeaa:1:214:4fff:fe70:5530: icmp_seq=3. time=2.571 ms
^C
----2002:8194:aeaa:1:214:4fff:fe70:5530 PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 2.501/3.090/4.654/1.044

The zone can then be configured to use DNS or local hosts to resolve names to IPv6 addresses in order to utilize IPv6 more effectively.

Clearview IP Tunneling in OpenSolaris

2009-09-25T01:03:00.000-07:00

I integrated Clearview IP Tunneling (the final component of the Clearview project) into the ON consolidation this week. It will be included in OpenSolaris build 125 which will make its way to the dev repository in due time. Thanks to all who participated including the Clearview project team (past and present), and members of various OpenSolaris communities who contributed by doing design and code reviews. This brings a close to a project that Meem and I conceived years ago while doodling network interface requirements on his whiteboard. We've now delivered every component that we initially identified as the solutions to meet our requirements. That's something to be proud of.

With this integration, IP tunnel links can be created using dladm, be given meaningful names using link vanity naming, observed using traditional network observability tools such as snoop and wireshark, assigned to exclusive stack non-global zones, and created from within non-global zones.

This integration also enables the use of dladm in general from within exclusive stack non-global zones. Aside from the IP tunnel subcommands which are supported from such zones, all of the show-* subcommands now work in such zones, allowing administrators to view datalink configuration pertinent to the zone. This is a first step towards gradually expanding the set of datalink features available in zones.

Enjoy, and feel free to communicate with us regarding this project at clearview-discuss@opensolaris.org.

Observe Loopback and Inter-Zone IP Packets With OpenSolaris

2008-11-17T22:47:00.000-08:00

I'm happy to announce that the IP Observability Devices component of the Clearview project has integrated into OpenSolaris build 103 (also see Phil Kirk's announcement to the ON community). This adds the following new capabilities to OpenSolaris:

Network observability at the IP layer for traditional DLPI-based tools such as snoop
Observability of loopback IP packets
Observability of inter-zone IP packets
Tools such as snoop can be run from within a non-global zone to observe packets associated with that zone
Snoop filtering based on zone id

The snoop command has grown a new "-I <interface-name>" option to access this feature. Its semantics are to snoop the IP interface named <interface-name> at the IP layer. When observing a particular IP interface with this facility, packets that have a source or destination IP address assigned to that interface can be observed, as well as packets that are forwarded to or from that IP interface, and broadcast and multicast packets received by that interface. Additional internal filtering is performed to ensure that an observer from a non-global zone can only see packets that belong to that zone, with the exception of the global zone, from which packets to or from any zone that shares its stack can be observed. Any IP interface visible through "ifconfig -a" can be observed using this feature.

We are also working towards integrating support for these IP Observability Devices into Wireshark and tcpdump in the near future.

Here are some examples using snoop:

Example 1: Observing the Loopback Interface

bash-3.2# snoop -I lo0
Using device ipnet/lo0 (promiscuous mode)
localhost -> localhost    ICMP Echo request (ID: 37110 Sequence number: 0)
localhost -> localhost    ICMP Echo reply (ID: 37110 Sequence number: 0)

The lo0 interface has the 127.0.0.1 address assigned to it, and so any communication using the address 127.0.0.1 is seen above (in this case, I was simply doing "ping 127.0.0.1"). Snoop's verbose output mode displays a new "ipnet" header that precedes all IP packets observed:

bash-3.2# snoop -v -I lo0
Using device ipnet/lo0 (promiscuous mode)
IPNET:  ----- IPNET Header -----
IPNET: 
IPNET:  Packet 1 arrived at 10:40:33.68506
IPNET:  Packet size = 108 bytes
IPNET:  dli_version = 1
IPNET:  dli_type = 4
IPNET:  dli_srczone = 0
IPNET:  dli_dstzone = 0
IPNET: 
...

Note above that the source and destination zone ids are displayed. In this case, I was running "ping 127.0.0.1" in the global zone, and so both the source and destination zone ids are "0".

Example 2: Running Snoop From a Non-Global Zone

bash-3.2# zoneadm list -v
ID NAME             STATUS     PATH                           BRAND    IP
0 global           running    /                              native   shared
4 test             running    /zones/test                    native   shared
bash-3.2# zlogin test
[Connected to zone 'test' pts/2]
...
bash-3.2# ifconfig -a
lo0:1: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
bge0:1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
        inet 10.8.57.34 netmask ffffff00 broadcast 10.8.57.255
lo0:1: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
        inet6 ::1/128
bge0:2: flags=202000841<UP,RUNNING,MULTICAST,IPv6,CoS> mtu 1500 index 2
        inet6 2002:a08:39f0:1::f/64
bash-3.2# snoop -I bge0
Using device ipnet/bge0 (promiscuous mode)
whitestar1-2.East.Sun.COM -> mf-ubur-01.East.Sun.COM DNS C 253.57.8.10.in-addr.arpa. Internet PTR ?
mf-ubur-01.East.Sun.COM -> whitestar1-2.East.Sun.COM DNS R 2.0.0.224.in-addr.arpa. Internet PTR ALL-ROUTERS.MCAST.NET.
whitestar1-6.East.Sun.COM -> whitestar1-2.East.Sun.COM TCP D=22 S=62117 Syn Seq=195630514 Len=0 Win=49152 Options=<mss
whitestar1-2.East.Sun.COM -> whitestar1-6.East.Sun.COM TCP D=62117 S=22 Syn Ack=195630515 Seq=195794440 Len=0 Win=49152
whitestar1-6.East.Sun.COM -> whitestar1-2.East.Sun.COM TCP D=22 S=62117 Ack=195794441 Seq=195630515 Len=0 Win=49152
whitestar1-2.East.Sun.COM -> whitestar1-6.East.Sun.COM TCP D=62117 S=22 Push Ack=195630515 Seq=195794441 Len=20 Win=491

Although not evident from the snoop output above, whitestar1-2 is 10.8.57.34 (the bge0:1 IP address in this non-global zone), and whitestar1-6 is actually an IP address in another zone on the same system. By snooping the bge0 interface, the user sees all packets associated with the bge0 IP addresses in the zone, even those that are locally delivered to other zones. Using snoop's verbose output mode allows us to see which zones these packets are flowing between:

bash-3.2# snoop -v -I bge0 whitestar1-6
Using device ipnet/bge0 (promiscuous mode)
IPNET:  ----- IPNET Header -----
IPNET: 
IPNET:  Packet 1 arrived at 10:44:10.86739
IPNET:  Packet size = 76 bytes
IPNET:  dli_version = 1
IPNET:  dli_type = 4
IPNET:  dli_srczone = 0
IPNET:  dli_dstzone = 4
IPNET: 
...

We can see above that the packet was from the global zone to the test zone.

Example 3: Filtering by Zone ID

Filtering by zone id can be useful on a system that has multiple zones. In this example, an administrator in the global zone observes packets being sent to or from IP addresses in the "test" zone.

bash-3.2# zoneadm list -v
ID NAME             STATUS     PATH                           BRAND    IP
0 global           running    /                              native   shared
4 test             running    /zones/test                    native   shared
bash-3.2# snoop -I bge0 zone 4
Using device ipnet/bge0 (promiscuous mode)
whitestar1-6.East.Sun.COM -> whitestar1-2.East.Sun.COM TCP D=22 S=61658 Syn Seq=374055417 Len=0 Win=49152 Options=<mss
whitestar1-2.East.Sun.COM -> whitestar1-6.East.Sun.COM TCP D=61658 S=22 Syn Ack=374055418 Seq=374124525 Len=0 Win=49152
whitestar1-6.East.Sun.COM -> whitestar1-2.East.Sun.COM TCP D=22 S=61658 Ack=374124526 Seq=374055418 Len=0 Win=49152

This can be particularly useful with the loopback interface, as the 127.0.0.1 address is shared among all shared-stack zones, and it can be difficult to associate a loopback packet to an application in a zone.

Note that there is a pending RFE to also be able to enter a zone name as well as a zone id as the argument to the snoop "zone" filtering primitive. For now, the zone id is the only allowable argument.
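
Until then, a zone name can be mapped to its id by hand. The following is a minimal sketch that pulls the id out of the "zoneadm list -v" table shown above. The zone_id helper is hypothetical, and it assumes the id is the first column and the zone name the second, as in that output:

```shell
# Hypothetical helper: read `zoneadm list -v` output on stdin and print the
# id of the zone named by $1 (column 1 is the id, column 2 is the name).
# Assumes an awk that accepts -v; on Solaris that may mean nawk or
# /usr/xpg4/bin/awk rather than the legacy /usr/bin/awk.
zone_id() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Example use on a live system:
#   snoop -I bge0 zone "$(zoneadm list -v | zone_id test)"
```

A zone that is not running is listed with "-" in the id column, so the helper is only useful for running zones.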

Clearview Vanity Naming BigAdmin Article

2008-06-05T04:47:00.000-07:00

I expanded upon one of my previous blog entries on network datalink vanity naming in OpenSolaris into a more thorough article with more examples. The result is the following BigAdmin article:

http://www.sun.com/bigadmin/sundocs/articles/vnamingsol.jsp

Enjoy.

Maybe Some Ice Cream With That OpenSolaris

2008-05-30T04:28:00.000-07:00

Well, the pickles and beer in the refrigerator were not enough to bribe my Ferrari into installing OpenSolaris. Maybe some Ice Cream will coax it into behaving better. Luckily, the cleaning people empty out the freezer on the last Friday of the month at 2:00pm (which is today!), leaving plenty of room for...

Props to Will Young for claiming to have done something like this first. ;-)

Not Too Much Mustard on That Ferrari Please

2008-05-30T03:59:00.000-07:00

My Acer Ferrari 3400 cannot go through an OpenSolaris installation without overheating and powering itself down. Because OpenSolaris has no power management for this laptop, the CPU runs at full clock rate all of the time. Other OSs don't have this problem.

Luckily, the Ferrari has no problems sharing a cramped space with mustard, pickles, left-over Chinese food, and a beer. Ferrari 3400, meet OpenSolaris 2008.05:

Configuring an OpenSolaris 6to4 router

2008-03-28T03:01:00.000-07:00

A common problem in enterprise networks is that many IT departments have not begun to deploy IPv6 within their supported infrastructure, but developers need IPv6 networking in order to develop and test products which support IPv6. 6to4 (defined in RFC 3056) can be a quick way to obtain IPv6 connectivity between IPv6 nodes separated by IPv4 networks such as this. The general idea is that each 6to4 site has a 6to4 router which is responsible for automatically tunneling IPv6 packets from its site to other 6to4 routers in other 6to4 sites (or native IPv6 networks with the use of relay routers^[1]) over IPv4. 6to4, then, can often be the answer for such developers, where configuring a 6to4 router in a lab environment or in a small subnet within an enterprise network is very easy and addresses their basic IPv6 connectivity requirements.

OpenSolaris^[2] can be used as a 6to4 router, and I've received so many requests for basic instructions on how to configure a 6to4 router with OpenSolaris, that I've decided to write a short blog entry on the subject. Note that while this blog may come in handy, there is in fact official Sun documentation on 6to4 routing^[3] which may be even more useful.

The following instructions configure a persistent configuration which will be enabled after a reboot of the system. All of this can also be configured similarly on the running system, but it is simpler to give one set of instructions. Experienced administrators will surely know how to interpret these instructions to apply configuration to the running system, and that's left as an exercise to the reader.

1. Enable IPv6 on one of the physical interfaces of the 6to4 router:
```
touch /etc/hostname6.<intf>
```
Where <intf> is the interface in question (e.g., e1000g0).
2. Configure a 6to4 tunneling interface on the 6to4 router:
```
echo "tsrc <v4addr> up" > /etc/hostname6.ip.6to4tun0
```
Where <v4addr> is the IPv4 address of the 6to4 router.
3. Enable IPv6 forwarding on the 6to4 router:
```
routeadm -e ipv6-forwarding
```
4. Reboot the system. When the system comes back up, it will have an IPv6 interface named ip.6to4tun0 which will have an address like 2002:<hex-v4addr>::1 ^[4]. The "2002:<hex-v4addr>::" part is the 48-bit 6to4 site-prefix for your 6to4 site. All IPv6 nodes in the site that use this 6to4 router must share this common prefix, although it needs to be further sub-divided within each IPv6 subnet in the site in order to be useful (that's what the remaining 16 bits of the /64 prefix are for). For example, if the site consists of a single IPv6 subnet, then it's easy enough to create a single "2002:<hex-v4addr>:1::/64" prefix by following the remaining steps.
5. Enable IPv6 router advertisements on the 6to4 router so that IPv6 hosts on the subnet automatically configure their IPv6 addresses and use this router as their default router:
```
cat << EOF > /etc/inet/ndpd.conf
ifdefault AdvSendAdvertisements 1
prefix 2002:<hex-v4addr>:1::/64 <intf>
EOF
```
Where <hex-v4addr> is the same as the <hex-v4addr> displayed in step 4, and <intf> is the physical interface attached to the IPv6 subnet in question. The ":1" following <hex-v4addr> is important, as this is the 16-bit subnet-id for the prefix being advertised. It uniquely identifies this /64 prefix from other prefixes in the site, which all share a common /48. The subnet-id must be non-zero (because the 0 subnet-id was allocated to the 6to4 router's ip.6to4tun0 interface) and unique within the site, so it doesn't necessarily need to be "1".

If the 6to4 router is attached to more than one subnet, then there would be additional "prefix" entries in the ndpd.conf file above, one for each interface. Each prefix would then have its own unique 16-bit subnet id.
6. Restart the neighbor discovery daemon for the changes to take effect:
```
svcadm restart routing/ndp
```
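
For a router attached to more than one subnet, the "prefix" lines can be generated rather than typed. This is a rough sketch only: the ndpd_conf helper, the example site prefix 2002:0a01:0203 (which would correspond to a router address of 10.1.2.3), and the interface names are all assumptions for illustration. It numbers subnet ids from 1 in the order the interfaces are given and writes to standard output, so nothing under /etc is touched until the result is redirected there.

```shell
# Hypothetical helper: print an ndpd.conf that advertises one /64 per interface.
# $1 is the 48-bit 6to4 site prefix without a trailing colon; the remaining
# arguments are interface names, which get subnet ids 1, 2, 3, ... in order.
ndpd_conf() {
    site_prefix=$1
    shift
    subnet_id=1
    echo "ifdefault AdvSendAdvertisements 1"
    for intf in "$@"; do
        # Subnet ids are printed in hex, as they appear in an IPv6 address.
        printf 'prefix %s:%x::/64 %s\n' "$site_prefix" "$subnet_id" "$intf"
        subnet_id=$((subnet_id + 1))
    done
}

# Assumed example: a site prefix for 10.1.2.3 and two attached subnets.
ndpd_conf 2002:0a01:0203 e1000g0 e1000g1
```

Starting at 1 keeps subnet id 0 free for the ip.6to4tun0 interface.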

At this point, hosts which have IPv6 enabled in the link connected to the 6to4 router's <intf> interface will automatically configure IPv6 addresses based on the advertised prefix, and will have a default route to the 6to4 router. All packets destined off-link to other 6to4 sites will be tunneled to the remote 6to4 routers.
<shameless plug>Of course, when the Clearview IP Tunneling Device Driver component delivers to Nevada, one will be able to use dladm(1M) to create a 6to4 tunnel with a meaningful name, and to observe packets in the 6to4 tunnel using snoop(1M), wireshark, or other such tools.</shameless plug>

[1] I'm skipping discussing relay routers for various reasons which I won't go into here.
[2] In fact, Solaris starting with Solaris 9.
[3] Look for 6to4. Within this documentation, there are also instructions on how to configure 6to4 on Solaris, similar to this blog entry.

[4] The 2002::/16 prefix is the "magic" 6to4 prefix that allows 6to4 routers to tunnel to one another. The 32 bits that follow these initial 16 bits is an IPv4 address. It is the IPv4 address of the 6to4 router which is responsible for the automatic IPv6 tunneling of packets for its 6to4 site. For example, when a 6to4 router needs to tunnel an IPv6 packet with a destination of 2002:0a01:0203:1::1, it will know to automatically encapsulate this IPv6 packet in an IPv4 header with a destination of 10.1.2.3 (the IPv4 address of the remote 6to4 router).
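
The mapping from IPv4 address to 6to4 prefix is mechanical: each octet becomes two hex digits. A minimal sketch, assuming a well-formed dotted-quad argument with no leading zeros in the octets (the sixtofour_prefix helper name is made up for illustration):

```shell
# Hypothetical helper: print the 48-bit 6to4 site prefix for the dotted-quad
# IPv4 address in $1.
sixtofour_prefix() {
    echo "$1" | {
        # Split the address into its four octets.
        IFS=. read -r a b c d
        # Each pair of octets becomes one 16-bit hex group after 2002.
        printf '2002:%02x%02x:%02x%02x::/48\n' "$a" "$b" "$c" "$d"
    }
}

sixtofour_prefix 10.1.2.3   # prints 2002:0a01:0203::/48
```

Tools such as ifconfig usually drop the leading zeros in each group, so the same prefix may be displayed as 2002:a01:203::/48.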

Using New Networking Features in OpenSolaris

2008-01-29T11:42:00.000-08:00

The Nemo Unification and Vanity Naming component of project Clearview has integrated into OpenSolaris build 83, which (among other things) allows administrators to give meaningful names to network datalink interfaces, including VLAN interfaces. I thought I'd share how I used this feature on one of our lab routers here at Sun.

The system has four Ethernet NICs, but needs to be the router for 8 separate lab subnets. The aggregate bandwidth of four Gigabit pipes is plenty for all of the lab subnets combined, so it wasn't really worthwhile to go and add four more NICs to the system (plus, that's not really scalable). Instead, I created a single link aggregation (802.3ad) including all four Ethernet links, and created individual tagged VLAN interfaces (one for each of the 8 subnets) on top of this aggregation.

Step by step, here's what I did. Keep in mind that this is done using a nightly build of OpenSolaris from after January 24th 2008. Here was the list of datalinks on the system before I started changing things (bonus points for anyone who can tell me what kind of system I'm doing this on based on the devices listed below) :-) :

bash-3.2# dladm show-link
LINK        CLASS      MTU  STATE    OVER
nge0        phys      1500  up       --
nge1        phys      1500  up       --
e1000g0     phys      1500  up       --
e1000g1     phys      1500  up       --
bash-3.2# dladm show-phys
LINK        MEDIA               STATE      SPEED  DUPLEX   DEVICE
nge0        Ethernet            up        1000Mb  full     nge0
nge1        Ethernet            up        1000Mb  full     nge1
e1000g0     Ethernet            up        1000Mb  full     e1000g0
e1000g1     Ethernet            up        1000Mb  full     e1000g1

First, I unplumbed all IP interfaces on each of these links by issuing appropriate "ifconfig <intf> unplumb" commands. This was necessary since renaming datalinks requires that no IP interfaces be plumbed above them. I then gave each of these interfaces more generic names. The benefit of doing this is that if we replace the Ethernet cards in the future with cards of a different chip set, we won't have to change the interface names associated with that card (one of the big benefits of Clearview UV vanity naming).

bash-3.2# dladm rename-link nge0 eth0
bash-3.2# dladm rename-link nge1 eth1
bash-3.2# dladm rename-link e1000g0 eth2
bash-3.2# dladm rename-link e1000g1 eth3
bash-3.2# dladm show-link
LINK        CLASS      MTU  STATE    OVER
eth0        phys      1500  up       --
eth1        phys      1500  up       --
eth2        phys      1500  up       --
eth3        phys      1500  up       --
bash-3.2# dladm show-phys
LINK        MEDIA               STATE      SPEED  DUPLEX   DEVICE
eth0        Ethernet            up        1000Mb  full     nge0
eth1        Ethernet            up        1000Mb  full     nge1
eth2        Ethernet            up        1000Mb  full     e1000g0
eth3        Ethernet            up        1000Mb  full     e1000g1

Then I created a link aggregation using these four Ethernet links:

bash-3.2# dladm create-aggr -P L2,L3 -l eth0 -l eth1 -l eth2 -l eth3 default0

I named the link "default0" because this is the main untagged subnet for the lab network, and the network to which the default route points. Now the set of links looks like:

bash-3.2# dladm show-link
LINK        CLASS      MTU  STATE    OVER
eth0        phys      1500  up       --
eth1        phys      1500  up       --
eth2        phys      1500  up       --
eth3        phys      1500  up       --
default0    aggr      1500  up       eth0 eth1 eth2 eth3

The next step was to create the VLAN links on top of this aggregation. Our lab subnets have a color-coded naming scheme, which I used when naming the VLAN links. This is convenient when diagnosing network problems with particular systems, as our DNS naming uses a parallel scheme. For example, if a system's hostname is blue-98, I know to do my network snooping on the "blue" link. Creating the VLAN links was as simple as:

bash-3.2# dladm create-vlan -v 2 -l default0 orange0
bash-3.2# dladm create-vlan -v 3 -l default0 green0
bash-3.2# dladm create-vlan -v 4 -l default0 blue0
bash-3.2# dladm create-vlan -v 5 -l default0 white0
bash-3.2# dladm create-vlan -v 6 -l default0 yellow0
bash-3.2# dladm create-vlan -v 7 -l default0 red0
bash-3.2# dladm create-vlan -v 8 -l default0 cyan0

There is now one link for each subnet in the lab (one untagged link, and seven tagged VLAN links).

bash-3.2# dladm show-link
LINK        CLASS      MTU  STATE    OVER
eth0        phys      1500  up       --
eth1        phys      1500  up       --
eth2        phys      1500  up       --
eth3        phys      1500  up       --
default0    aggr      1500  up       eth0 eth1 eth2 eth3
orange0     vlan      1500  up       default0
green0      vlan      1500  up       default0
blue0       vlan      1500  up       default0
white0      vlan      1500  up       default0
yellow0     vlan      1500  up       default0
red0        vlan      1500  up       default0
cyan0       vlan      1500  up       default0
bash-3.2# dladm show-vlan
LINK          VID   OVER        FLAGS
orange0         2   default0    -----
green0          3   default0    -----
blue0           4   default0    -----
white0          5   default0    -----
yellow0         6   default0    -----
red0            7   default0    -----
cyan0           8   default0    -----

I then plumbed IP interfaces in each subnet. For example:

bash-3.2# ifconfig orange0 plumb ...
bash-3.2# ifconfig green0 plumb ...
...

Configuring this router also involved configuring IPv4 dynamic routing and forwarding, IPv6 dynamic routing and forwarding, etc. All of these latter steps involved placing the network interface names in some sort of persistent configuration (like /etc/hostname.<intf>, /etc/inet/ndpd.conf, and IP filter rules to name a few). This is where giving meaningful names to network interfaces has the most value. With all of these interface names in various configuration files, we don't want to ever have to go and reconfigure all of those things if the underlying hardware of the system were to change from under them. Before Clearview UV's vanity naming feature, a VLAN interface above the e1000g1 interface would look something like e1000g8001 (for VLAN tag 8), thanks to the moldy "VLAN PPA-hack". This is ridiculous enough as an interface name, but what happens when I replace my e1000g1 card with a Broadcom card which has a device name of bge0? I need to go fetch every piece of configuration on the system that made reference to e1000g1 and e1000g8001, and change everything to bge0 and bge8000.

With Clearview UV's vanity naming feature I could have named the link something meaningful like "private1", and assigned the newly added bge0 card that same name (using the dladm rename-link command I showcased above) to keep all of my network configuration intact.
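
For reference, the legacy VLAN names mentioned above follow from simple arithmetic: the number appended to the driver name is the VLAN ID times 1000 plus the device instance. A small sketch of that convention (the vlan_ppa_name helper is made up for illustration):

```shell
# Hypothetical helper: print the legacy "VLAN PPA-hack" interface name.
# $1 = driver name, $2 = device instance, $3 = VLAN ID.
vlan_ppa_name() {
    printf '%s%d\n' "$1" $(( $3 * 1000 + $2 ))
}

vlan_ppa_name e1000g 1 8   # prints e1000g8001
vlan_ppa_name bge 0 8      # prints bge8000
```

The hardware-specific driver name and instance are baked into every such name, which is exactly what vanity naming avoids.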

Early Access to Clearview IP Tunneling

2007-09-25T10:07:00.000-07:00

Earlier today, early access build 74 of Project Clearview was announced to networking-discuss@opensolaris.org and clearview-discuss@opensolaris.org. This build introduces the new GLDv3-based IP tunneling driver to users. With this work, the 6000 or so lines of kernel code that comprised the "tun" STREAMS module are replaced with a GLDv3 driver which is half of that size and has more features.

With this driver, IP tunnels in Solaris are now fully observable using snoop:

seb# snoop -d ip.tun0
Using device ip.tun0 (promiscuous mode)
         seb -> my-desktop   TCP D=60722 S=22 Push Ack=624936085 Seq=693788605 Len=80 Win=49644  (1 encap)
  my-desktop -> seb          TCP D=22 S=60722 Ack=693788685 Seq=624936085 Len=0 Win=49644  (1 encap)
         seb -> dns-server   DNS C 3.1.168.192.in-addr.arpa. Internet PTR ?  (1 encap)

IP tunnels can be given meaningful names (thanks to Clearview vanity naming):

seb# dladm create-iptun -T 6to4 -s 10.8.57.44 ipv6gateway0
IP tunnel created: ipv6gateway0
seb# dladm show-iptun
LINK         TYPE  SOURCE              DESTINATION
ipv6gateway0 6to4  10.8.57.44          N/A               
seb# ifconfig ipv6gateway0 inet6 plumb up
seb# ifconfig ipv6gateway0 inet6
ipv6gateway0: flags=202200041<UP,RUNNING,NONUD,IPv6,CoS> mtu 65515 index 3
        inet tunnel src 10.8.57.44
        tunnel hop limit 64
        inet6 2002:a08:392c::1/1

seb# dladm create-iptun -T ipv4 -s seb -d vpngateway vpn0
IP tunnel created: vpn0
seb# ipsecconf -l -i vpn0
#INDEX vpn0,1
{ tunnel vpn0 negotiate tunnel laddr seb/32 dir out } ipsec { encr_algs aes-cbc(128..256) encr_auth_algs hmac-md5(128) sa shared }
#INDEX vpn0,2
{ tunnel vpn0 negotiate tunnel laddr seb/32 dir in } ipsec { encr_algs aes-cbc(128..256) encr_auth_algs hmac-md5(128) sa shared }
seb# ifconfig vpn0 plumb 10.0.0.1 10.0.0.2 up

IP tunnel links are administered using dladm (although pre-existing ifconfig syntax is still supported for backward compatibility):

seb# dladm create-iptun -T ipv6 -s me -d you trans0
IP tunnel created: trans0
seb# dladm show-linkprop trans0
LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE            
trans0       autopush        --             --             --                  
trans0       zone            --             --             --                  
trans0       hoplimit        64             64             --                  
trans0       encaplimit      4              4              --                  
seb# dladm set-linkprop -p encaplimit=2 trans0
seb# dladm show-linkprop trans0
LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE            
trans0       autopush        --             --             --                  
trans0       zone            --             --             --                  
trans0       hoplimit        64             64             --                  
trans0       encaplimit      2              4              --

We welcome users to bfu these bits and try out the new features. Click here for download instructions and release notes, and let us know what you think by sending us feedback at clearview-discuss@opensolaris.org.

How an IP Tunnel Interface Dynamically Adjusts its Link MTU

2005-06-14T01:17:00.001-07:00

With the launch of OpenSolaris comes the opportunity to discuss the implementation details behind existing Solaris features. I'd like to share some of the details behind one of my contributions to Solaris 10: the implementation of dynamic MTU calculation for IP tunnel interfaces.

Solaris 8 was the first version of Solaris to implement the IP in IP tunneling mechanism described in RFC1853. It did not, however, implement the "Tunnel MTU Discovery" section of that RFC. Tunneling over IPv6 (RFC2473) was implemented very early in Solaris 10 (and backported to Solaris 9 in Update 1) along with a Tunnel MTU Discovery mechanism that worked for IPv6 tunnel interfaces only. One drawback of that implementation was that there was no observability into the tunnel MTU (ifconfig's output always showed a static MTU value that was unrelated to the tunnel interface's actual MTU). What was needed was a mechanism that worked for both IPv4 and IPv6 tunnels and that was visible to the administrator.

This work became more important when customers (internal and external to Sun) started using Solaris' IPsec tunneling to implement VPN solutions. Without proper Tunnel MTU Discovery, things like TCP MSS calculations can take longer to converge to usable values, and protocols that don't have any insight into Path MTU (UDP, for example) cause unnecessary amounts of IP fragmentation. For more on the benefits of Tunnel MTU Discovery, see the two aforementioned RFCs on IP tunneling.

Without going into too much detail about the inner workings of the ip and tun modules or every line of code that was changed to implement this feature, I'd like to focus on two aspects of the implementation. The first is the mechanism used by the tun module to obtain path MTU information about the tunnel destination from ip, and the second is the mechanism by which the ip interface's MTU is dynamically changed when the tun module detects a change in the tunnel's link MTU.

IRE_DB_REQ_TYPE

In order for the tun module to be able to calculate a useful tunnel MTU, it needs to know the Path MTU of the tunnel destination. The tunnel destination is the IP node we'll send encapsulated packets to when sending them through the tunnel interface. In ifconfig output, it is the "tunnel dst":

# ifconfig ip.tun0
ip.tun0: flags=10008d1<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST,IPv4> mtu 1480 index 4
        inet tunnel src 11.0.0.1 tunnel dst 11.0.0.2
        tunnel hop limit 60
        inet 10.0.0.1 --> 10.0.0.2 netmask ff000000

In the above example, IP packets forwarded into ip.tun0 are encapsulated into an outer IP header with a source of 11.0.0.1 and a destination of 11.0.0.2. 11.0.0.2 is the "tunnel destination".

The Path MTU to the destination is the size of the largest IP packet that can be sent to the destination without being fragmented or resulting in an ICMP "fragmentation needed" message. The tunnel MTU of a given tunnel is the Path MTU of the tunnel destination minus any tunneling overhead (the encapsulating IP header and, if IPsec tunneling is being used, the IPsec headers).
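As a rough illustration of that arithmetic, here is a minimal user-space sketch. The function name tunnel_mtu(), the header-length macros, and the floor values passed in are assumptions made for this example (68 and 1280 are the IPv4 and IPv6 protocol minimum MTUs, not values read from the Solaris headers); the kernel code shown later in this entry is the real thing.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sizes; the kernel uses sizeof (ipha_t) and sizeof (ip6_t). */
#define OUTER_IPV4_HDR_LEN 20u  /* IPv4 header without options */
#define OUTER_IPV6_HDR_LEN 40u  /* fixed IPv6 header */

/*
 * Hypothetical helper: subtract the encapsulation overhead from the path
 * MTU to the tunnel destination, never returning less than min_mtu.
 * Unlike a plain subtraction, this also guards against unsigned underflow.
 */
uint32_t
tunnel_mtu(uint32_t path_mtu, uint32_t outer_hdr_len,
    uint32_t ipsec_overhead, uint32_t min_mtu)
{
        uint32_t overhead = outer_hdr_len + ipsec_overhead;

        assert(outer_hdr_len == OUTER_IPV4_HDR_LEN ||
            outer_hdr_len == OUTER_IPV6_HDR_LEN);

        if (path_mtu < overhead || path_mtu - overhead < min_mtu)
                return (min_mtu);
        return (path_mtu - overhead);
}
```

Assuming a path MTU of 1500 to the tunnel destination and no IPsec, tunnel_mtu(1500, OUTER_IPV4_HDR_LEN, 0, 68) yields 1480, which is the mtu value in the ifconfig output above.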

The ip module keeps this Path MTU information in a per-destination cache (aka IRE cache) table. The protocol used to keep track of this Path MTU information is described in RFC1191. The ip module provides a number of methods of accessing this per-destination cache. One of them is the ire_ctable_lookup() functional interface, but because tun and ip are separate STREAMS modules and this functional interface was previously only safe to use within the ip module's STREAMS perimeter^[1], tun could not use it.

Another method ip provides is the IRE_DB_REQ_TYPE STREAMS message. An upstream module can send such a message down to ip, and ip will reply with an IRE_DB_TYPE message and append a copy of the requested IRE^[2] to the message (assuming the requested IRE is found). This is the method used by the tun module. Periodically, tun sends down this message to get the current Path MTU for its tunnel destination. For example, in tun_wdata_v4(), it does this when it is sending a packet down to ip and the Path MTU information it holds has expired:

/*
 * Request the destination ire regularly in case Path MTU has
 * increased.
 */
if (TUN_IRE_TOO_OLD(atp))
       tun_send_ire_req(q);
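The definition of TUN_IRE_TOO_OLD() isn't shown here, but the general idea of an age-based refresh can be sketched in user space as follows. Everything in this sketch (the struct, the function names, and the 30-second interval) is hypothetical and chosen only for illustration; it is not the tun module's actual state or threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical refresh interval; the real threshold lives in the tun module. */
#define PMTU_REFRESH_INTERVAL_MS 30000u

/* Hypothetical stand-in for the per-tunnel state that the check consults. */
struct pmtu_cache {
        uint64_t last_request_ms;       /* when the path MTU was last requested */
        uint32_t path_mtu;              /* last path MTU learned from below */
};

/* Returns true when the cached path MTU is old enough to be re-requested. */
bool
pmtu_cache_too_old(const struct pmtu_cache *c, uint64_t now_ms)
{
        assert(c != NULL);
        return (now_ms - c->last_request_ms >= PMTU_REFRESH_INTERVAL_MS);
}

/*
 * Called on the transmit path. If the cache is stale, record the request
 * time so that the request is not repeated for every packet, and return
 * true to tell the caller to issue the request.
 */
bool
pmtu_cache_check(struct pmtu_cache *c, uint64_t now_ms)
{
        if (!pmtu_cache_too_old(c, now_ms))
                return (false);
        c->last_request_ms = now_ms;    /* a real module would send the request here */
        return (true);
}
```

The sketch assumes now_ms comes from a monotonic clock, so the unsigned subtraction never sees time going backwards.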

DL_NOTIFY_REQ/IND and DL_NOTE_SDU_SIZE

Once the tun module has obtained the Path MTU information of the destination, it needs to recalculate the link MTU of the tunnel interface and notify the upper instance of ip if the MTU has changed. The ip module can then update the IP interface's MTU accordingly. The MTU calculation is done by the tun_update_link_mtu() function, which in turn calls tun_sendsdusize() to notify the ip module of the new MTU if it has changed:

/*
 * Given the path MTU to the tunnel destination, calculate tunnel's link
 * mtu.  For configured tunnels, we update the tunnel's link MTU and notify
 * the upper instance of IP of the change so that the IP interface's MTU
 * can be updated.  If the tunnel is a 6to4 or automatic tunnel, just
 * return the effective MTU of the tunnel without updating it.  We don't
 * update the link MTU of 6to4 or automatic tunnels because they tunnel to
 * multiple destinations all with potentially differing path MTU's.
 */
static uint32_t
tun_update_link_mtu(queue_t *q, uint32_t pmtu, boolean_t icmp)
{
        tun_t *atp = (tun_t *)q->q_ptr;
        uint32_t newmtu = pmtu;
        boolean_t sendsdusize = B_FALSE;

        /*
         * If the pmtu provided came from an ICMP error being passed up
         * from below, then the pmtu argument has already been adjusted
         * by the IPsec overhead.
         */
        if (!icmp && (atp->tun_flags & TUN_SECURITY))
                newmtu -= atp->tun_ipsec_overhead;

        if (atp->tun_flags & TUN_L_V4) {
                newmtu -= sizeof (ipha_t);
                if (newmtu < IP_MIN_MTU)
                        newmtu = IP_MIN_MTU;
        } else {
                ASSERT(atp->tun_flags & TUN_L_V6);
                newmtu -= sizeof (ip6_t);
                if (atp->tun_encap_lim > 0)
                        newmtu -= IPV6_TUN_ENCAP_OPT_LEN;
                if (newmtu < IPV6_MIN_MTU)
                        newmtu = IPV6_MIN_MTU;
        }

        if (!(atp->tun_flags & (TUN_6TO4 | TUN_AUTOMATIC))) {
                if (newmtu != atp->tun_mtu) {
                        atp->tun_mtu = newmtu;
                        sendsdusize = B_TRUE;
                }

                if (sendsdusize)
                        tun_sendsdusize(q);
        }
        return (newmtu);
}

Note, there is a cosmetic bug in the above code. The fix would be a good starter fix for anyone wishing to be introduced to the OpenSolaris development process. :-) The sendsdusize variable is obviously not needed and the last if statement can be reduced to:

if (newmtu != atp->tun_mtu &&
    !(atp->tun_flags & (TUN_6TO4 | TUN_AUTOMATIC))) {
        atp->tun_mtu = newmtu;
        tun_sendsdusize(q);
}
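For anyone who wants to convince themselves that the two forms behave the same, here is a small user-space model that compares them. The struct, the flag values, and the function names are stand-ins invented for this sketch (they are not the kernel's tun_t or its TUN_* constants), and returning true takes the place of calling tun_sendsdusize().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values standing in for TUN_6TO4 and TUN_AUTOMATIC. */
#define F_6TO4          0x1u
#define F_AUTOMATIC     0x2u

/* Hypothetical stand-in for the two tun_t fields the logic touches. */
struct tun_model {
        uint32_t flags;
        uint32_t mtu;
};

/* Original shape: nested ifs with a temporary flag. */
bool
update_original(struct tun_model *t, uint32_t newmtu)
{
        bool sendsdusize = false;

        assert(t != NULL);
        if (!(t->flags & (F_6TO4 | F_AUTOMATIC))) {
                if (newmtu != t->mtu) {
                        t->mtu = newmtu;
                        sendsdusize = true;
                }
                if (sendsdusize)
                        return (true);  /* would call tun_sendsdusize() */
        }
        return (false);
}

/* Simplified shape: one combined condition, no temporary. */
bool
update_simplified(struct tun_model *t, uint32_t newmtu)
{
        assert(t != NULL);
        if (newmtu != t->mtu && !(t->flags & (F_6TO4 | F_AUTOMATIC))) {
                t->mtu = newmtu;
                return (true);          /* would call tun_sendsdusize() */
        }
        return (false);
}

/*
 * Compare both shapes for every flag combination and for a new MTU that is
 * smaller than, equal to, and larger than the current one.
 */
bool
forms_agree(void)
{
        for (uint32_t flags = 0; flags < 4; flags++) {
                for (uint32_t newmtu = 1479; newmtu <= 1481; newmtu++) {
                        struct tun_model a = { flags, 1480 };
                        struct tun_model b = { flags, 1480 };
                        bool na = update_original(&a, newmtu);
                        bool nb = update_simplified(&b, newmtu);

                        if (na != nb || a.mtu != b.mtu)
                                return (false);
                }
        }
        return (true);
}
```

In both shapes the stored MTU is left alone for 6to4 and automatic tunnels, and a notification is only generated when the value actually changes.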

How does the notification between tun and ip work? It's done via a DLPI notification mechanism that is Solaris specific. The dlpi(7P) man page describes the mechanism as "Notification Support", and it includes support for the asynchronous notification of link status (up or down), SDU (service data unit, or MTU) size, link speed, and other information. The tun module uses the SDU notification.

The mechanism works as follows:

When an IP interface is plumbed, the ip module sends the underlying driver a DL_NOTIFY_REQ DLPI message. The message contains a bitfield representing the notifications that ip is interested in. This is done by ill_dl_phys():

/*
 * Allocate a DL_NOTIFY_REQ and set the notifications we want.
 */
notify_mp = ip_dlpi_alloc(sizeof (dl_notify_req_t) + sizeof (long),
    DL_NOTIFY_REQ);
if (notify_mp == NULL)
        goto bad;
((dl_notify_req_t *)notify_mp->b_rptr)->dl_notifications =
    (DL_NOTE_PHYS_ADDR | DL_NOTE_SDU_SIZE | DL_NOTE_FASTPATH_FLUSH |
    DL_NOTE_LINK_UP | DL_NOTE_LINK_DOWN | DL_NOTE_CAPAB_RENEG);
...
ill_dlpi_send(ill, notify_mp);

The underlying driver (tun in this case) replies with a DL_NOTIFY_ACK containing the subset of notifications that it supports. The tun module only supports DL_NOTE_SDU_SIZE.
When an event that triggers a change in MTU occurs, the driver (tun) sends up a DL_NOTIFY_IND message to those DLPI users that were interested in DL_NOTE_SDU_SIZE notifications. The tun module does this in the tun_sendsdusize() function.
When ip receives the DL_NOTIFY_IND message containing a DL_NOTE_SDU_SIZE notification, it updates the IP tunnel interface's MTU accordingly, and ifconfig shows the new dynamically updated MTU!
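The request, acknowledgement, and indication exchange described above can be modeled in a few lines of user-space C. This is only a sketch of the negotiation logic: the NOTE_* bit values, the structs, and the function names are hypothetical (the real DL_NOTE_* constants and message layouts are defined in <sys/dlpi.h>), and no STREAMS messages are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; the real DL_NOTE_* constants are in <sys/dlpi.h>. */
#define NOTE_LINK_UP    0x1u
#define NOTE_LINK_DOWN  0x2u
#define NOTE_SDU_SIZE   0x4u

/* Hypothetical model of the driver side (tun in this entry). */
struct driver_model {
        uint32_t supported;     /* notifications the driver can generate */
        uint32_t enabled;       /* subset negotiated with the upper layer */
        uint32_t sdu_size;      /* current link MTU */
};

/* Hypothetical model of the IP interface sitting on top of the driver. */
struct ip_if_model {
        uint32_t mtu;
};

/* Request/ack: keep only the requested bits the driver supports. */
uint32_t
driver_notify_req(struct driver_model *d, uint32_t requested)
{
        assert(d != NULL);
        d->enabled = requested & d->supported;
        return (d->enabled);    /* this is what the ack would carry */
}

/*
 * MTU change in the driver. Returns true, and fills in *ind_sdu, only if
 * the size really changed and SDU size notifications were negotiated.
 */
bool
driver_set_sdu(struct driver_model *d, uint32_t new_sdu, uint32_t *ind_sdu)
{
        assert(d != NULL && ind_sdu != NULL);
        if (new_sdu == d->sdu_size)
                return (false);
        d->sdu_size = new_sdu;
        if (!(d->enabled & NOTE_SDU_SIZE))
                return (false);
        *ind_sdu = new_sdu;
        return (true);
}

/* Indication received by the upper layer: update the interface MTU. */
void
ip_notify_ind(struct ip_if_model *ifp, uint32_t sdu)
{
        assert(ifp != NULL);
        ifp->mtu = sdu;
}
```

A driver that only supports NOTE_SDU_SIZE acknowledges just that bit even when the upper layer asks for link up and link down as well, which mirrors what tun does with ip's request.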

[1] The IP Multithreading feature of the FireEngine project now makes it possible for other modules to use this functional interface. Some modules such as ipf (IP Filter) and nattymod (IPsec NAT traversal) already use it. The tun module can now use it as well, which is something we plan on doing.

[2] An IRE, or internet routing entry, is a data structure internal to Solaris' IP implementation used to represent forwarding table entries _and_ per-destination cache entries. Creation and maintenance of IRE tables is by far the most complex (some would say overly complex) part of the ip module. The subject of IREs would make for a very lengthy blog entry on its own.
