Backends are not removed from VIP table when killed

Description

Environment: DC/OS 1.8.7 (Azure, installed using ACS Engine-modified ARM templates; Ubuntu 16.04 with kernel 4.4.0-28-generic).

After adding a Docker containeriser based app configured to use the default `dcos` overlay network, with a VIP assigned via a label in `portMappings` (see the attached app config in my-group.json; it's the /my-app/backend app), VIP assignment works and traffic is routed correctly to the overlay-network IPs of its tasks (via both `SERVICE_LABEL.l4lb.thisdcos.directory:$VIP_PORT` and `IP:$VIP_PORT`). The issue appears when tasks are removed (properly killed), whether by scaling up and then down, scaling down the initial 3 instances, or removing the app entirely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, although `total_failures` continues to be tracked for each backend.

Steps to reproduce

1. Before launching the app, check the VIPs on any cluster node (results are the same across the cluster; I've tested on different nodes and node types):
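For reference, a minimal sketch of how the local VIP table can be inspected (the port and path are assumptions for the 1.8-era minuteman, not confirmed by this report):

```shell
# On a cluster node, minuteman's local VIP table can be dumped as JSON
# (port 61421 and path /vips are assumptions for DC/OS 1.8):
#   curl -s http://localhost:61421/vips | python3 -m json.tool
# Before the app is launched, an empty table would look like:
echo '{"vips": {}}' | python3 -m json.tool
```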

2. Launching the app (`dcos marathon group add my-group.json`) will create the VIP, and digging the DNS name will give (only relevant info shown):
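For concreteness, a hedged sketch of the lookup; the service label `my-app-backend` and the port are illustrative, as the real values live in the attached my-group.json:

```shell
# Resolve the l4lb name (label is illustrative, not from the attachment):
#   dig +short my-app-backend.marathon.l4lb.thisdcos.directory
# Named VIPs are allocated from 11.0.0.0/8, so the answer is a single
# A record such as:
echo "11.136.231.0"
```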

The VIPs output will show the backends. After a few requests:

The status of the group (all apps have their health checks in place) will be:

3. Scaling up, and all continues well...

4. Scale down, and the issue appears.

Though all tasks appear healthy, querying our backend endpoint yields a number of failures like:

(see failed_requests.txt for more)
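A quick way to quantify the failures is to hammer the VIP and tally curl's status codes; this is a sketch with an illustrative name and port:

```shell
# Repeatedly hit the VIP and count status codes (name/port illustrative):
#   for i in $(seq 1 50); do
#     curl -s -m 2 -o /dev/null -w '%{http_code}\n' \
#       http://my-app-backend.marathon.l4lb.thisdcos.directory:8080/
#   done | sort | uniq -c
# With stale backends still in the table, the tally mixes successes with
# curl's 000 (connection failure/timeout), along the lines of:
printf '%s\n' '  12 000' '  38 200'
```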

Expected

The VIPs table should contain just what we had prior to scaling up in step 2.

Current result

The VIP tables look like this:

^^ Notice the `total_failures` on the first one. What's more, failures become more common the more backends have ever been registered and subsequently removed (for instance during a zero-downtime deployment), or when IP allocation hands another app an overlay IP that is still registered in the backend list, in which case requests are directed to a different task.

Unfortunately, removing the group/app does not reset the VIP's backends either. The vips endpoint output above is unchanged after complete group removal.

Activity

Albert Strasheim
January 20, 2017, 12:24 AM

Hey, thanks for the report. We're looking into this.

Deepak Goel
January 25, 2017, 8:16 AM

I couldn't reproduce this scenario. I tried scaling both an individual service and group services up and down. In both cases, backends were added and removed successfully. I tried this on DC/OS 1.8.7 with 1 master and 3 agents running CoreOS.

Marco Reni
January 25, 2017, 9:08 AM

We're having the same issue. Some services have 34 entries while only 4 instances are running; others are correctly aligned.

DC/OS 1.8.7 with 1x master and 2x slaves running CentOS 7.2.1511.

Apparently, the instances are not aligned if the service is using Virtual Network and the port does NOT have the "Expose endpoints on host network" option enabled.

We did a couple of tests, creating new instances and restarting them a couple of times.

a- NOT exposing the port on the host network: each time the deployment requires a new IP from the virtual network, a new entry is added to the VIP:

b- exposing the port on the host network: even if the service receives different IPs on the virtual network, the VIP stays correctly aligned (the IP 10.xxx.yyy.zzz is the IP of the physical agent):
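The a/b difference above plausibly comes down to whether the portMapping requests a host port. A minimal sketch of the two variants, using standard Marathon fields with illustrative values (the exact mapping of the UI checkbox to JSON is an assumption):

```shell
# Case (a): no hostPort requested -> the task is only reachable on its
# overlay IP, which is what gets registered for the VIP:
cat <<'EOF'
{ "containerPort": 8080, "protocol": "tcp",
  "labels": { "VIP_0": "/my-app/backend:8080" } }
EOF
# Case (b): "Expose endpoints on host network" adds a hostPort (0 = any
# free port), and the agent's host IP is registered instead:
cat <<'EOF'
{ "containerPort": 8080, "hostPort": 0, "protocol": "tcp",
  "labels": { "VIP_0": "/my-app/backend:8080" } }
EOF
```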

I've attached the corresponding outputs.

Hope this helps,
Marco

Deepak Goel
January 27, 2017, 6:19 AM

Fixed via commit 52487c79e794d1d92e9c9c2b8481cb8cd54da241 in dcos/minuteman.

Assignee

Deepak Goel
