Backends are not removed from VIP table when killed



Environment: DC/OS 1.8.7 (Azure, installed using ACS Engine-modified ARM templates; Ubuntu 16.04 with kernel 4.4.0-28-generic).

After adding a Docker containerizer-based app set to use the default `dcos` overlay network, with a VIP assigned via a label in `portMappings` (see the attached app config in my-group.json; it's the /my-app/backend app), VIP assignment works and traffic gets routed correctly to the overlay-network IPs of its tasks (using both the `$VIP_PORT` and the `IP:$VIP_PORT` forms). The issue appears when tasks are removed (properly killed), whether by scaling up and then down, scaling down the initial 3 instances, or removing the app entirely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, and `total_failures` continues to be tracked for each stale backend.
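For readers without the attachment, the relevant shape of such an app definition is sketched below (image, ports, resources, and the VIP name `my-vip` are illustrative placeholders, not the actual contents of my-group.json):

```json
{
  "id": "/my-app/backend",
  "instances": 3,
  "cpus": 0.1,
  "mem": 128,
  "ipAddress": { "networkName": "dcos" },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "nginx",
      "network": "USER",
      "portMappings": [
        { "containerPort": 8080, "labels": { "VIP_0": "my-vip:8080" } }
      ]
    }
  }
}
```

The `VIP_0` label on the port mapping is what creates the named VIP; `ipAddress.networkName` puts the tasks on the `dcos` overlay.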

Steps to reproduce

1. Before launching the app, check the VIPs on any cluster node (results are the same everywhere; I've tested on different nodes and node types):
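(The output block is not reproduced in this report.) On DC/OS 1.8 the Minuteman VIP table can be inspected from any node via its local endpoint, along these lines (port and path as documented for Minuteman; this requires a live cluster node):

```shell
# Dump Minuteman's local VIP table, including backends and
# per-backend counters such as total_failures
curl -s http://localhost:61421/vips | python -m json.tool
```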

2. Launching the app (`dcos marathon group add my-group.json`) will create the VIP, and digging its DNS name will give (only the relevant info shown here):
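The lookup in question, assuming the VIP label names the VIP `my-vip` (the actual name is in the attached my-group.json), would be something like:

```shell
# Named VIPs resolve cluster-internally under l4lb.thisdcos.directory
dig +short my-vip.marathon.l4lb.thisdcos.directory
```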

The VIPs table will show the backends. After a few requests:

The status of the group (all apps have their health checks in place) will be:

3. Scale up, and all continues well...

4. Scale down, and the issue appears.
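The scaling in steps 3 and 4 was done with the standard Marathon CLI (instance counts here are illustrative):

```shell
# Scale up (step 3), then back down (step 4)
dcos marathon app update /my-app/backend instances=5
dcos marathon app update /my-app/backend instances=3
```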

Though all tasks appear healthy, querying our backend endpoint returns a number of failures like:

(see failed_requests.txt for more)


Expected result

VIPs table showing just what we had in step 2, prior to scaling up.

Current result

VIPs table like this:

^^ Notice the `total_failures` on the first one. What's more, failures become more common the more backends have ever been registered and then removed (for instance during a zero-downtime deployment), or when IP allocation hands another app's task an overlay IP that is still registered as a backend, in which case requests are directed to a different task entirely.

Unfortunately, removing the group/app does not reset the VIP's backends either: the vips endpoint output above is unchanged after complete group removal.
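For completeness, the removal in question is a plain group removal (the group id here is a placeholder matching my-group.json):

```shell
# Remove the whole group; the stale VIP backends survive this
dcos marathon group remove /my-group
```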


Deepak Goel
January 27, 2017, 6:19 AM

Fixed via commit 52487c79e794d1d92e9c9c2b8481cb8cd54da241 dcos/minuteman

Marco Reni
January 25, 2017, 9:08 AM

We're having the same issue: some services have 34 entries while only 4 instances are running; others are correctly aligned.

DC/OS 1.8.7 with 1x master and 2x slaves running CentOS 7.2.1511.

Apparently, the instances are not aligned if the service is using Virtual Network and the port does NOT have the "Expose endpoints on host network" option enabled.

We did a couple of tests, creating new instances and restarting them a couple of times.

a) NOT exposing the port on the host network: each time a deployment requires a new IP from the virtual network, a new entry is added for the VIP:

b) Exposing the port on the host network: even if the service receives different IPs on the virtual network, the VIP stays correctly aligned (the IP registered is that of the physical agent):

I've attached the corresponding outputs.


Hope this helps,

Deepak Goel
January 25, 2017, 8:16 AM

I couldn't reproduce this scenario. I tried scaling up and down both an individual service and a group of services; in both cases, backends were added and removed successfully. This was on DC/OS 1.8.7 with 1 master and 3 agents running CoreOS.

Albert Strasheim
January 20, 2017, 12:24 AM

Hey, thanks for the report. We're looking into this.



Deepak Goel