Backends are not removed from VIP table when killed



Environment: DCOS 1.8.7 (Azure, installed using ACS Engine modified ARM templates - Ubuntu 16.04 with 4.4.0-28-generic).

After adding a Docker containerizer-based app set to use the default `dcos` overlay network and assigned a VIP via a label in `portMappings` (see the attached app config in my-group.json; it is the /my-app/backend app), VIP assignment works and traffic is routed correctly to the overlay network IPs of its tasks (using both the `$VIP_PORT` and the `IP:$VIP_PORT`). The issue appears when tasks are removed (properly killed), whether by scaling up and then down, scaling down the initial 3 instances, or removing the app completely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, although `total_failures` continues to be tracked for each backend.

Steps to reproduce

1. Before launching the app, check the VIPs on any cluster node (results are the same across nodes; I've tested on different nodes and node types):

2. Launching the app (`dcos marathon group add my-group.json`) creates the VIP, and digging the DNS name gives (only relevant info here)

The VIPs table will show the backends. After a few requests:

The status of the group (all apps have their healthcheck in place) will be:

3. Scaling up, and all continues to work well...

4. Scale down and the issue appears.

Though all tasks appear healthy, on querying our backend endpoint we get a few of

(see failed_requests.txt for more)


Expected: the VIPs table is just what we had prior to scaling up, as in step 2 above.

Current result

VIPs tables like this:

^^ Notice the `total_failures` on the first one. What's more, failures become more common the more backends have ever been registered and since removed (for instance in a zero-downtime deployment, zdd), or when IP allocation gives another app an overlay IP that is still registered as a backend, in which case requests are directed to a different task.

Unfortunately removing a group/app does not reset the backends (of the VIP) either. The above output for the vips endpoint is the same after complete group removal.
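
To make the symptom easier to check, here is a minimal Python sketch of how one could compare the backends listed for a VIP against the tasks that are actually running. It is not part of DC/OS. The function names, the `IP:port` string format, and the sample addresses are all hypothetical, and the failure estimate assumes the load balancer picks backends uniformly at random, which may not match the real selection policy.

```python
from typing import Iterable, Set


def stale_backends(vip_backends: Iterable[str], live_tasks: Iterable[str]) -> Set[str]:
    """Return backends still listed for a VIP that match no running task."""
    return set(vip_backends) - set(live_tasks)


def expected_failure_fraction(vip_backends: Iterable[str], live_tasks: Iterable[str]) -> float:
    """Estimate the share of requests hitting a stale backend.

    Assumption: every listed backend is equally likely to be chosen.
    """
    backends = set(vip_backends)
    if not backends:
        return 0.0
    return len(backends - set(live_tasks)) / len(backends)


if __name__ == "__main__":
    # Hypothetical data: four backends in the VIP table, three tasks still running.
    vip_table = ["9.0.1.2:80", "9.0.1.3:80", "9.0.2.2:80", "9.0.2.3:80"]
    running = ["9.0.1.2:80", "9.0.2.2:80", "9.0.2.3:80"]

    print(sorted(stale_backends(vip_table, running)))
    print(expected_failure_fraction(vip_table, running))
```

Under that uniform-selection assumption, each extra removed backend left in the table raises the failure share, which is consistent with failures becoming more frequent after repeated scale-downs or deployments.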


Deepak Goel