It's a bird, it's a plane, it's

Everything here is my opinion. I do not speak for your employer.

2007-01-13 »

NiR: NetIntelligence 2.0 and second-system effect

After hiring people at NITI, we didn't really suffer from the second-system effect (with the possible exception of UniConf). That's probably because I got it out of my system in the early days when it was just me and dcoombs.

Even in version 1.0, Weaver had two related features called NetMap and NetIntelligence. NetMap was a passive packet-sniffing program that monitored the network and tracked which IP addresses were where; NetIntelligence (before its name was co-opted to include other features) analyzed the data in NetMap and used it to draw conclusions about the local network layout.

The first versions of Weaver couldn't even act as an ethernet-to-ethernet router; they were designed for dialup, so you routed to them on your ethernet, used them as your default gateway, and the Weaver's default gateway would either be nonexistent, or a PPP connection, or a demand-dial interface that would bring up a PPP connection if you tried to access it. In those days, NetIntelligence's job was easy: it just had to detect the local IP subnet number and netmask, and pick an address for itself on that network. (Weavers were often installed on networks without a DHCP server so they could become the DHCP server; requesting an address from DHCP usually didn't help.)
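To make that "easy" job concrete, here is a minimal sketch of one way to guess a subnet from passively observed addresses and then pick an unused one. It is not the original NetIntelligence code; the function names, the `min_host_bits` parameter, and the "smallest network covering everything seen" heuristic are all assumptions for illustration. A real network may well use a wider netmask than the observed addresses suggest.

```python
import ipaddress

def guess_subnet(observed, min_host_bits=2):
    """Return the smallest IPv4 network that contains every observed address.

    Hypothetical heuristic: any bit that differs between the lowest and
    highest observed address must be a host bit, so the prefix can be no
    longer than 32 minus that many bits.
    """
    addrs = [ipaddress.IPv4Address(a) for a in observed]
    lo, hi = int(min(addrs)), int(max(addrs))
    host_bits = max((lo ^ hi).bit_length(), min_host_bits)
    prefix = 32 - host_bits
    # strict=False masks off the host bits of `lo` for us.
    return ipaddress.IPv4Network((lo, prefix), strict=False)

def pick_free_address(network, observed):
    """Return the first usable host address nobody has been seen using."""
    used = {ipaddress.IPv4Address(a) for a in observed}
    for host in network.hosts():
        if host not in used:
            return host
    return None  # every usable address appears to be taken

seen = ["192.168.1.10", "192.168.1.37", "192.168.1.200"]
net = guess_subnet(seen)
print(net, pick_free_address(net, seen))
```

A quiet address is not necessarily a free one, so anything built like this would still want to probe the candidate (for example with ARP) before claiming it.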

In later 1.x versions of Weaver, we added ethernet-to-ethernet routing in order to support cable modems, T1 routers, and so on. We extended NetIntelligence to do three other relatively easy tasks: figure out which network interface was the "Internet" one, figure out which device on that interface was the default gateway, and set up the firewall automatically so that the "Internet" would never be allowed to route to your local network, even for a moment. This code was very successful and worked great; it was the origin of the "trusted vs. untrusted" network concept in Weaver, and it's pretty easy to find out which node should be your default gateway when you know you can't lose. (That is, it's always better to have a default gateway than no default gateway, so even picking the wrong one is okay as long as the user can fix it.)

That was version 1. NetMap/NetIntelligence 2.0 was where things started going wrong. I decided that this concept was so cool that we should extend it one more level: what if we install Weaver on a more complex network, with multiple subnets connected by routers scattered about? What can we do to eliminate "false" data produced by misconfigured nodes? (Trust me, there are always misconfigured nodes.) What if there's more than one Internet connection, and sometimes one of them goes down? Wouldn't it be great if Weaver could find all the subnets automatically, configure the firewall appropriately, and allow any node on any connected subnet to find any other node using Weaver? It seemed like a great timesaver.

Except that it wasn't. First of all, it took a long time to write the code to handle all these special cases, and it never did really work correctly. We had some very angry customers when we put them through our 2.0 beta cycle and Weaver regularly went completely bonkers, auto-misconfiguring its routes and firewall so badly that you couldn't even reach WebConfig anymore. Or sometimes you'd end up with 100 different unrelated routes to individual subnets, because Weaver wasn't sure that 100 routes through the same router really meant that router was your default gateway. Those messes were the origin of the "NetScan" front panel command, which made NetIntelligence forget everything it knew and start over. To this day, I consider this a terrible hack. But it's sure better than 2.0beta1, which didn't have a NetScan and had to have a developer (i.e., me) come on-site to debug any network problems.
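For illustration only, the "100 routes through the same router" case can be sketched as a simple collapsing rule: if one gateway carries enough of the discovered routes, replace them all with a single default route. This is not what Weaver did (the whole problem was that it wasn't confident in that inference); the `collapse_routes` name, the tuple format, and the `threshold` value are hypothetical.

```python
from collections import Counter

DEFAULT_ROUTE = "0.0.0.0/0"

def collapse_routes(routes, threshold=3):
    """Collapse many routes via one gateway into a single default route.

    routes: list of (destination_cidr, gateway) string pairs.
    threshold: assumed minimum number of routes sharing a gateway before
    we treat that gateway as the default.
    """
    counts = Counter(gw for _, gw in routes)
    if not counts:
        return []
    gateway, n = counts.most_common(1)[0]
    if n < threshold:
        return list(routes)  # not confident enough; leave the table alone
    # Routes via other gateways are more specific than the default route,
    # so longest-prefix matching still sends that traffic the right way.
    kept = [(net, gw) for net, gw in routes if gw != gateway]
    return [(DEFAULT_ROUTE, gateway)] + kept

table = [
    ("10.1.0.0/16", "192.168.1.1"),
    ("10.2.0.0/16", "192.168.1.1"),
    ("10.3.0.0/16", "192.168.1.1"),
    ("172.16.0.0/12", "192.168.1.254"),
]
for route in collapse_routes(table):
    print(route)
```

The catch is the one described above: a fixed threshold is a guess, and guessing wrong about the default gateway on a complex network is exactly how you end up unable to reach WebConfig.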

NetIntelligence 2.0 was a perfect example of the second-system effect: we chose to add a lot of cool but not-really-necessary features all at once, we had a non-working product until the whole thing was done, it was an order of magnitude more work than the 1.0 version, and bugs in the new features caused old, 100% reliable features (like the ability to reach WebConfig!) to fail randomly. It was a disaster.

In retrospect, the mistake is easy to see. Not long after, the proliferation of DHCP made auto-discovering subnets much less important. But more importantly, Weaver's network discovery feature was supposed to make Weaver easy to configure on simple networks. Any IT administrator who managed to set up a network with multiple subnets already knows what those subnets are and how he wants to route between them, so auto-discovery isn't worth anything to him. The existence of a complex network implies the ability to configure a router for it. We sacrificed sanity on simple networks, where people didn't have that ability, all in the name of providing a useless feature on complex networks that didn't need it. Oops.

By the time we were actually hiring developers back in 2000 and 2001, we had already been through all this mess. Nowadays in Weaver (now Nitix) 3.x and 4.x, we've wrangled NetIntelligence under control, and all those broken-but-cool features from 2.0 actually work and do cool stuff. But to this day, once in a while, it still produces a huge, insane list of correct-but-pointless subnet routes that you have to delete by hand.

So yes, I know a thing or two about the second-system effect.

As applied to business

As I continue to lay the groundwork for a new company, it's important to keep this sort of thing in mind. Just because a few "cool" things were missing the first time around, don't lose sight of the basics in round 2.

I'm CEO at Tailscale, where we make network problems disappear.


apenwarr on gmail.com