Regarding automation exceptions

It can get quite exciting when you start to think about network automation and what it can do for you and your network. Once you’ve automated everything, you can instead focus on deep work to evolve your business. However, this daydream can soon fade away as you start to think about the things you can’t automate, or at least don’t know how to automate. Ivan Pepelnjak wrote a piece about automating the exceptions. The post is based on a discussion he had with Rok Papež and Rok’s ideas about handling exceptions in an automated way.

While the strategy presented is great, I think it overlooks some of the exceptions that can arise. The post also doesn’t highlight how the limitations of the configuration management tools were solved.

The problem

The underlying question is: when implementing some form of network automation, how should you handle exceptions? In this case we are talking about exceptions in the configuration. When looking at a new solution for network automation, you might find something which allows you to automate almost everything you want to do. However, there might also be some odd cases or legacy things which don’t completely fit into the new workflow you will be using. In Rok’s case the solution was to “store the non-automated part of the device configuration (configuring the pilot services) in an extra field in their configuration management database, and append it to the configuration generated from device/service data model through standardized templates”.
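To make the idea concrete, here is a minimal sketch of "render the standard part from templates, then append the stored exception text". It is not Rok's implementation. The template, the record layout and the `extra_config` field name are all assumptions made up for illustration, and the config syntax is only IOS-like.

```python
from string import Template

# Hypothetical standardized template; real templates would cover far more.
PORT_TEMPLATE = Template(
    "interface $name\n"
    " description $description\n"
    " switchport access vlan $vlan\n"
)


def render_device_config(device: dict) -> str:
    """Render the modelled ports, then append the stored exception text verbatim."""
    sections = [PORT_TEMPLATE.substitute(port) for port in device["ports"]]
    extra = device.get("extra_config", "").strip()
    if extra:
        sections.append(extra + "\n")
    return "".join(sections)


# Hypothetical database record: modelled data plus one free-form field.
device = {
    "ports": [
        {"name": "GigabitEthernet0/1", "description": "customer-a", "vlan": 210},
    ],
    "extra_config": "interface GigabitEthernet0/24\n description pilot-service\n",
}

print(render_device_config(device))
```

The free-form field is never parsed or validated here, which is exactly why it is an exception: the tool carries it along without understanding it.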

As Ivan also pointed out, in the best of worlds there should be no exceptions. You can look at this from a few different points of view. The most convenient would be that you gain so much from automating the standard parts that you don’t care about the few manual steps you need to take. You could also argue that you don’t have to automate everything at once.

Taking a step back, the underlying problem might be that the automation solution doesn’t allow for exceptions, or that it takes too much time to implement those exceptions in the config management tool. It can also be that engineers are happy enough implementing new standard services using the config management solution, but don’t have the skills to add these new exceptions to the config management workflow.

Regardless of which tool you choose, there will always be things it can’t do. The tools I would stay away from are those that don’t allow you to handle those issues outside of the tool. A simple example is the standard Ansible inventory, which is just a flat text file. It works well enough for testing but isn’t very flexible. In this case Ansible allows you to use a dynamic inventory, which you can create yourself, have total control over, and connect to your real inventory tool.
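As a rough illustration, a dynamic inventory can be as small as an executable script that prints JSON when Ansible calls it with `--list` (or the variables of one host when called with `--host`). In the sketch below, `fetch_devices()` is a hypothetical stand-in for a query against your real inventory tool, and the device names, addresses and roles are invented.

```python
#!/usr/bin/env python3
import argparse
import json


def fetch_devices() -> list[dict]:
    """Hypothetical stand-in for a query against your real inventory tool."""
    return [
        {"name": "sw-access-01", "ip": "192.0.2.11", "role": "access"},
        {"name": "sw-access-02", "ip": "192.0.2.12", "role": "access"},
        {"name": "sw-core-01", "ip": "192.0.2.1", "role": "core"},
    ]


def build_inventory(devices: list[dict]) -> dict:
    """Group hosts by role and collect per-host variables under _meta."""
    inventory = {"_meta": {"hostvars": {}}}
    for dev in devices:
        group = inventory.setdefault(dev["role"], {"hosts": []})
        group["hosts"].append(dev["name"])
        inventory["_meta"]["hostvars"][dev["name"]] = {"ansible_host": dev["ip"]}
    return inventory


def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--host")
    args = parser.parse_args(argv)
    inventory = build_inventory(fetch_devices())
    if args.host:
        # Variables for a single host; empty if the host is unknown.
        print(json.dumps(inventory["_meta"]["hostvars"].get(args.host, {})))
    else:
        print(json.dumps(inventory))


if __name__ == "__main__":
    main()
```

Saved as an executable file, a script like this can be passed to `ansible-playbook` with `-i`. The point is that the grouping logic is yours, so odd cases can be handled in code you control.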

Another thing to consider in terms of network automation is that even if you have a solution to handle all of the configuration exceptions, there will always be other kinds of exceptions. Some of these might be impossible to plan for in advance, and it might not be possible to solve them with your current automation tool.

A real world scenario

So it was the evening before a bank holiday and I had just finished the barbecue dinner when the phone rang.

“Hi, we have an issue. Do you have a moment?”

It was a client, and their issue was something which they had previously referred to as a “code purple”. A lot of their customers had lost their internet connection. At this point I eyed the wine glass in front of me; it turned out to be half empty.

This customer has a config management solution which provisions and configures all of their switches. It owns the config and makes sure that every switch is in sync with the desired configuration template. If someone were to log in and change some lines, the config management tool would revert the change.

What happened, though, was that there was a bug in the config management tool: it constantly removed a specific VLAN from each switch. As that VLAN wasn’t used, the bug in itself shouldn’t have mattered. As it turned out, it really did matter. For a lot of the switches this wasn’t a problem; however, on some switch models it triggered a bug in the switch OS.

The bug in the switches made it look like ARP had stopped working on all the customer-facing ports. In reality the switch cut four bytes from each packet, but the end result was that nothing worked. Earlier in the day the switch vendor had confirmed this as a bug and promised to deliver a patch quickly. In the meantime they had come up with a workaround: the bug condition was cleared if you removed the access VLAN from the customer ports and reapplied the same configuration (this wasn’t the same VLAN as the one which was removed earlier).

Now the problem for my client was that their config management solution couldn’t do this. That tool only ensured that the config was correct; it couldn’t make changes in the way that was needed. So they wanted me to write a script to do this instead. The configuration needed to be removed and reapplied on somewhere between 30,000 and 35,000 ports. They knew which switches were impacted and had a list of these. However, they didn’t have a list of all the impacted ports, since not all ports were configured the same way. Some of the ports were exceptions with non-standard config. The configuration management tool could handle those exceptions, but the information about which ports were configured that way was locked within the tool and not easily accessible.

The script needed to do two things: first log in to each switch and figure out which ports needed to be reconfigured, and then reconfigure only those ports.
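Those two steps can be sketched roughly like this. This is not the script from that evening. The config syntax is assumed to be IOS-like, the sample config is invented, and logging in to the switches to fetch and push config is deliberately left out, so the sketch only covers finding the access ports in a config dump and generating the remove-and-reapply commands.

```python
import re

# Invented sample of a running config in an assumed IOS-like syntax.
SAMPLE_CONFIG = """\
interface GigabitEthernet0/1
 description customer-a
 switchport mode access
 switchport access vlan 210
!
interface GigabitEthernet0/2
 description uplink
 switchport mode trunk
!
interface GigabitEthernet0/3
 description customer-b (non-standard)
 switchport mode access
 switchport access vlan 345
!
"""

INTERFACE_RE = re.compile(r"^interface (\S+)$")
ACCESS_VLAN_RE = re.compile(r"^\s+switchport access vlan (\d+)$")


def find_access_ports(config: str) -> dict[str, int]:
    """Step 1: map each port that has an access VLAN to that VLAN id."""
    ports = {}
    current = None
    for line in config.splitlines():
        match = INTERFACE_RE.match(line)
        if match:
            current = match.group(1)
            continue
        if not line.startswith(" "):
            current = None  # left the interface block
            continue
        match = ACCESS_VLAN_RE.match(line)
        if match and current:
            ports[current] = int(match.group(1))
    return ports


def build_workaround(ports: dict[str, int]) -> list[str]:
    """Step 2: remove and reapply the same access VLAN, only on those ports."""
    commands = []
    for name, vlan in ports.items():
        commands += [
            f"interface {name}",
            f" no switchport access vlan {vlan}",
            f" switchport access vlan {vlan}",
        ]
    return commands


for command in build_workaround(find_access_ports(SAMPLE_CONFIG)):
    print(command)
```

Reading the VLAN from the switch itself, instead of from the config management tool, is what makes the non-standard ports a non-issue here: whatever is configured is what gets reapplied.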

Unfortunately I didn’t drink any more wine that evening, but I fixed the client’s issue.

Bottom line

Granted, the above scenario is extreme and you will probably never face it. However, that’s the point: the client didn’t think they would either. It wasn’t something which they could have planned for. When purchasing the network config tool they never asked if it would be possible to do something like this. Why would they have?

It was a one-off fluke incident which will never happen again. Still, other strange issues will happen. Hey, it’s networking.

The important point is that you shouldn’t rely only on your automation tool to help you. Just like Ivan and Rok discussed, you will need to look at what’s possible and then have someone who has the creativity and skill to work around the limitations of your tools.

So the point I was missing in Ivan’s article was that the real solution to many situations like these is the people who can think of how to work around issues and have the skills to solve them. Extra points to your organization if that is in fact people as opposed to a single person.