How to kill your network with Ansible

  • by Patrick Ogenstad
  • October 03, 2017

Aside from being a user, I write about Ansible and try to help others understand how it works. A few days ago I was answering questions from other Ansible users. Someone was having trouble figuring out why the ios_config module didn’t apply his template correctly. I explained what was wrong with the template. Afterwards I thought about the issue some more and realized that the error could potentially be really dangerous. As in game-over-for-your-employment dangerous.

A scenario

Let’s imagine you are a network administrator who has just been given the task of rolling out a new IP network to all your branch offices. You are running Ansible 2.4 (this might work differently in later versions once they are released). The new addressing details have already been prepopulated in the IPAM system, and the dynamic inventory script will pull that information for you. What you need to do is expand the template by adding the new networks and push out the change. Also, the guy who used to manage the network had entered “wormhole” as the description for all of the interfaces pointing to the wide area network. While you’re adding those IP networks, can you please change the description to “WAN”? The added network will be used by a new digital signage solution.

You start by reviewing the current template which looks like this:

interface FastEthernet0
 description Wormhole exit
 ip address {{ wan_ip }} {{ wan_mask }}

interface FastEthernet1
 description POS
 ip address {{ pos_ip }} {{ pos_mask }}

interface FastEthernet2
 description OFFICE
 ip address {{ office_ip }} {{ office_mask }}

Logging into the router in Wisconsin you see that it is configured as follows:

interface FastEthernet0
 description Wormhole exit
 ip address 172.29.58.161 255.255.255.224

interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0

interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0

interface FastEthernet3
 no ip address

This seems simple enough: the new network will be connected to FastEthernet3. You fire up your editor and update the template. The new file ends up like this:

interface FastEthernet0
 description WAN
 ip address {{ wan_ip }} {{ wan_mask }}

interface FastEthernet1
 description POS
 ip address {{ pos_ip }} {{ pos_mask }}

interface FastEthernet2
 description OFFICE
 ip address {{ office_ip }} {{ office_mask }}

interface FastEthernet3
description SIGNAGE
ip address {{ signage_ip }} {{ signage_mask }}

Sweet, an updated template ready to use. You commit the new template to your git repo and fire up the terminal.

ansible-playbook network-baseline.yml

Welcome to trouble

After giving yourself a mental highfive you stop to savor the moment. The joy lasts up until about the time when someone asks:

“Why did Wisconsin just drop off the map?”

That’s strange, you only added a new network. Right? This is when you get that nagging feeling in your stomach. The one that isn’t related to the Wisconsin office. Instead it’s the question of whether you limited the run to a single office or just killed all of your branch offices.

[Image: Ping timeout]

What just happened?

Before going into the details of what went wrong, it’s important to realize that these things can happen. You can make errors, or there could be bugs in the software. What’s really important is that you test what you do in a safe environment first. A good idea in this case would have been to use the check mode in Ansible (-C) along with the verbose flag (-v). That way you would be able to see what configuration would have been sent to the device without actually changing anything. Another vital point is that you should never run something like this across your entire network if you aren’t certain what will happen. Use the --limit option and start with a few devices.

Using the verbose option we can gain some insight as to what went wrong.

[Image: Verbose output]

So it looks like the actual config that gets sent to the poor device is:

interface FastEthernet0
description WAN
description SIGNAGE
ip address 10.17.82.1 255.255.255.0

That’s nice. The playbook reconfigured the WAN interface and gave it the IP address which should have been assigned to FastEthernet3. From this point you could call your therapist, Red Hat support or perhaps your lawyer. Or you could read on to figure out why this happened.

How Ansible parses templates for network devices

Like the rest of Ansible, the networking modules use the Jinja2 templating engine. However, it works a bit differently with networking than when you are templating a configuration for Nginx or some other service. Ansible parses the running configuration of the device and, based on this, decides what needs to be applied to the device. For example, nothing is changed on FastEthernet1 and FastEthernet2, so Ansible doesn’t try to change anything on those interfaces.

Given a template, Ansible will only apply the configuration which isn’t already on the device. The thing is that Ansible doesn’t actually understand the configuration or what it does. Instead it tries to parse the configuration based on a set of predefined rules. If we start with the description on the WAN interface, that is obviously a thing that needs to be changed. However, we can’t only add the description command as it needs to be configured under the interface. Since the description line is indented under interface FastEthernet0, Ansible treats the interface line as a parent for that section. Even though the interface line already exists in the configuration, Ansible will send that command before the description command. This is why the updates sent out by Ansible start with:

interface FastEthernet0
 description WAN

The rendering of the later part of the template would look like this:

interface FastEthernet3
description SIGNAGE
ip address 10.17.82.1 255.255.255.0

Since there’s no indentation, Ansible won’t realize that interface FastEthernet3 is a parent command to description and ip address. Instead it will just assume that those commands are global config snippets, and since the running configuration has no description or ip address line directly under global config, they will be included in the list of commands sent to the device.
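To make the parsing rule above concrete, here is a minimal Python sketch of an indentation-based diff. It is a simplified model for illustration only and not the code Ansible actually runs; the function names parse_blocks and commands_to_send are made up for this example. The two configurations are the rendered Wisconsin template and the running configuration from the scenario.

```python
# Simplified model of an indentation-based config diff.
# NOT Ansible's implementation; all names here are hypothetical.

RENDERED = """\
interface FastEthernet0
 description WAN
 ip address 172.29.58.161 255.255.255.224

interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0

interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0

interface FastEthernet3
description SIGNAGE
ip address 10.17.82.1 255.255.255.0
"""

RUNNING = """\
interface FastEthernet0
 description Wormhole exit
 ip address 172.29.58.161 255.255.255.224

interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0

interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0

interface FastEthernet3
 no ip address
"""


def parse_blocks(config_text):
    """Return (parents, line) pairs, using indentation as the only hierarchy hint."""
    entries = []
    stack = []  # (indent, line) for each ancestor of the current line
    for raw in config_text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        # Drop ancestors that are not indented less than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents = tuple(text for _, text in stack)
        entries.append((parents, raw.strip()))
        stack.append((indent, raw.strip()))
    return entries


def commands_to_send(template_text, running_text):
    """List the template lines missing from the running config, preceded by their parents."""
    running = set(parse_blocks(running_text))
    commands = []
    last_parents = None
    for parents, line in parse_blocks(template_text):
        if (parents, line) in running:
            continue  # already on the device, nothing to send
        if parents != last_parents:
            commands.extend(parents)  # re-enter the parent section first
            last_parents = parents
        commands.append(line)
    return commands


for command in commands_to_send(RENDERED, RUNNING):
    print(command)
```

In this model the two unindented lines have no parents, so they are compared against the top level of the running configuration, found missing, and sent right after the FastEthernet0 change. The printed list is interface FastEthernet0, description WAN, description SIGNAGE and ip address 10.17.82.1 255.255.255.0, with no interface FastEthernet3 line in between.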

Normally we’d get an error if we entered a description and ip address directly under global config.

WIS-RTR-01(config)#description SIGNAGE
                   ^
% Invalid input detected at '^' marker.

WIS-RTR-01(config)#ip address 10.17.82.1 255.255.255.0
                           ^
% Invalid input detected at '^' marker.

WIS-RTR-01(config)#

However, because we also changed the description of FastEthernet0, the SSH session will still be in the config-if context. Since we’re not sending an exit command to return to global config (i.e. go from (config-if)# to (config)#), the wrong IP address is applied to the FastEthernet0 interface. The final configuration will be:

interface FastEthernet0
 description SIGNAGE
 ip address 10.17.82.1 255.255.255.0

interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0

interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0

interface FastEthernet3
 no ip address

Whoops.
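The role of the CLI context can be illustrated with a toy Python model of a configuration session. This is only a sketch of the behavior described above, not a real IOS parser: it assumes just three kinds of commands (interface, exit, and the two interface-level commands from the example), and apply_commands is a hypothetical name.

```python
# Toy model of CLI context tracking. Hypothetical, heavily simplified.

def apply_commands(commands):
    """Apply commands in order, tracking which interface (if any) is in context."""
    interfaces = {}
    rejected = []
    current = None  # None means global config mode
    for cmd in commands:
        if cmd.startswith("interface "):
            current = cmd[len("interface "):]
            interfaces.setdefault(current, {})
        elif cmd == "exit":
            current = None  # back to global config
        elif current is None:
            rejected.append(cmd)  # interface-level command in global config
        elif cmd.startswith("description "):
            interfaces[current]["description"] = cmd[len("description "):]
        elif cmd.startswith("ip address "):
            interfaces[current]["ip address"] = cmd[len("ip address "):]
        else:
            rejected.append(cmd)  # anything this toy model does not know
    return interfaces, rejected


# The commands from the verbose output, in the order they were sent.
SENT = [
    "interface FastEthernet0",
    "description WAN",
    "description SIGNAGE",
    "ip address 10.17.82.1 255.255.255.0",
]

# The same commands with an explicit exit after the FastEthernet0 change.
SENT_WITH_EXIT = SENT[:2] + ["exit"] + SENT[2:]

print(apply_commands(SENT))
print(apply_commands(SENT_WITH_EXIT))
```

In the first run both stray commands land on FastEthernet0, because nothing ever left the interface context. In the second run the exit puts the session back in global config, so the model rejects them, which corresponds to the "Invalid input" errors shown earlier.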

What to keep in mind

I think it’s safe to say that for the above scenario to play out like it did a bit of unfortunate luck would have to be involved. The point I would like to make is that it can be easy for something similar to happen. In this case it was two missing space characters along with another unrelated description change. Even if it’s not as disastrous as this you might end up applying configuration where you didn’t want it.

To reiterate: make sure you test and validate what you are doing! Use dry runs and look at the result so that you see what’s going on.
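As one possible extra safety net, a rendered template could be checked for lines that sit at column zero even though they only make sense under an interface. The Python sketch below is a hypothetical pre-flight check, not an Ansible feature; the prefix list is an assumption covering only the two commands from this article and would need to be extended for real use.

```python
# Hypothetical pre-flight check for a rendered template. Not part of Ansible.

# Assumption: these commands are only valid below an interface line.
INTERFACE_ONLY = ("description ", "ip address ")

SAMPLE = """\
interface FastEthernet0
 description WAN
 ip address 172.29.58.161 255.255.255.224

interface FastEthernet3
description SIGNAGE
ip address 10.17.82.1 255.255.255.0
"""


def find_misplaced_lines(rendered, prefixes=INTERFACE_ONLY):
    """Return (line number, text) for interface-only commands that start at column zero."""
    problems = []
    for number, line in enumerate(rendered.splitlines(), start=1):
        # Correctly indented lines start with a space and never match.
        if line.startswith(prefixes):
            problems.append((number, line))
    return problems


for number, line in find_misplaced_lines(SAMPLE):
    print(f"line {number}: '{line}' is not indented under an interface")
```

For the sample above this reports lines 6 and 7. A check like this complements check mode and --limit rather than replacing them, since it only catches the specific mistake it was written for.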

A different approach

If the above example sounds scary to you, keep in mind that you don’t have to use templates in this way. You can also use the lines and parents parameters with ios_config. You might also want to take a look at the NAPALM library, specifically the napalm_install_config module for Ansible. As mentioned earlier, the core Ansible networking modules parse the running configuration and try to figure out what configuration is missing from the device in order to decide which commands to send. NAPALM, on the other hand, is completely oblivious to the configuration on the device and instead leaves the decision of what to apply up to the device. For an IOS device, NAPALM would copy the entire rendered template to the file system of the device, evaluate whether a change is needed and then merge it into the running configuration (or replace the entire configuration if you wanted to).

Conclusion

To close this off, I hope that this article helps you understand how Ansible works with templates when applied to network devices, and why it’s really important to use the correct indentation. More importantly, the real takeaway: make sure you always test what you do before you push out config changes.

Finally, I hope you don’t kill any networks in Wisconsin or anywhere else for that matter. :)