Backends are not removed from VIP table when killed

Description

Environment: DC/OS 1.8.7 (Azure, installed using ACS Engine-modified ARM templates; Ubuntu 16.04 with kernel 4.4.0-28-generic).

After adding a Docker containeriser based app attached to the default `dcos` overlay network and assigned a VIP via a label in `portMappings` (see the attached my-group.json for the app config; the app in question is /my-app/backend), VIP assignment works and traffic is routed correctly to the overlay-network IPs of its tasks (using both `SERVICE_LABEL.l4lb.thisdcos.directory:$VIP_PORT` and `IP:$VIP_PORT`). The issue appears when tasks are removed (properly killed), whether by scaling up and back down, scaling down the initial 3 instances, or removing the app entirely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, though `total_failures` continues to be tracked for each backend.
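For reference, here is a minimal sketch of the relevant part of the app definition; the real config is in the attached my-group.json, so the image name and the exact `VIP_0` label value below are illustrative, assuming the usual Marathon 1.3 / DC/OS 1.8 layout:

```json
{
  "id": "/my-app/backend",
  "cpus": 0.1,
  "mem": 100,
  "instances": 3,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "example/backend",
      "network": "USER",
      "portMappings": [
        {
          "containerPort": 3000,
          "labels": { "VIP_0": "/ctrl1:3000" }
        }
      ]
    }
  },
  "ipAddress": { "networkName": "dcos" }
}
```

A named VIP label like this is what produces the `ctrl1.marathon.l4lb.thisdcos.directory:3000` address used below.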

Steps to reproduce

1. Before launching the app, check the VIPs on any cluster node (results are the same across the cluster; I've tested on different nodes and node types):

```
$ curl -s http://localhost:61421/vips | python -m json.tool
{
    "vips": {}
}
```

2. Launching the app (`dcos marathon group add my-group.json`) creates the VIP, and digging the DNS name gives (only the relevant sections shown):

```
;; QUESTION SECTION:
;ctrl1.marathon.l4lb.thisdcos.directory. IN A

;; ANSWER SECTION:
ctrl1.marathon.l4lb.thisdcos.directory. 5 IN A 11.157.241.22
```

The vips endpoint now shows the backends. After a few requests:

{ "vips": { "11.157.241.22:3000": { "9.0.5.133:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 3 }, "9.0.6.133:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 2 }, "9.0.7.134:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 1 } } } }

The status of the group (all apps have health checks in place) is:

```
$ dcos marathon app list
ID                MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  CONTAINER  CMD
/my-app/frontend  100  0.1   3/3    3/3     ---         DOCKER     None
/my-app/backend   100  0.1   3/3    3/3     ---         DOCKER     None
```

3. Scaling up, and all continues well:

```
/my-app/backend   100  0.1   4/4    4/4     ---         DOCKER     None
```

```
{
  "vips": {
    "11.157.241.22:3000": {
      "9.0.5.133:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 4 },
      "9.0.5.135:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 2 },
      "9.0.6.133:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 5 },
      "9.0.7.134:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 2 }
    }
  }
}
```

4. Scale down, and the issue appears.

Though all tasks appear healthy, querying our backend endpoint intermittently fails with:

```
user@dcos-master-17841738-0:~$ curl -I http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up
curl: (7) Failed to connect to ctrl1.marathon.l4lb.thisdcos.directory port 3000: No route to host
```

(see the attached failed_requests.txt for more examples)
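A quick loop makes the failure rate visible (illustrative; `/api/up` is the health endpoint of the backend app above). With one stale backend out of four registered, roughly a quarter of requests should fail:

```sh
# Hit the VIP repeatedly and print one HTTP status code per request;
# curl prints 000 when the connection itself fails (e.g. no route to host).
for i in $(seq 1 20); do
  curl -s -o /dev/null -m 2 -w "%{http_code}\n" \
    http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up
done | sort | uniq -c
```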

Expected result

The VIP table should match the currently running tasks, i.e. return to the three backends shown in step 2, before the scale-up.

Current result

Instead, the VIP table looks like this:

{ "vips": { "11.157.241.22:3000": { "9.0.5.133:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 3, "total_sucesses": 4 }, "9.0.5.135:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 4 }, "9.0.6.133:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 8 }, "9.0.7.134:3000": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 6 } } } }

Notice the `total_failures` on the first backend. What's more, failures become more frequent the more backends have ever been registered and since removed (for instance during a zero-downtime deployment), or when IP allocation hands an overlay IP that is still registered as a backend to another app, in which case requests are directed to a different task entirely.

Unfortunately, removing the group/app does not reset the VIP's backends either: the vips endpoint returns the same output as above even after complete group removal.
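A minimal way to confirm this, assuming the group ID `/my-app` from the app list above:

```sh
# Remove the whole group, give the deployment time to settle,
# then query the local l4lb API again: the stale backends are still listed.
dcos marathon group remove /my-app
sleep 30
curl -s http://localhost:61421/vips | python -m json.tool
```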


Activity

Deepak Goel January 27, 2017 at 6:19 AM

Fixed via commit 52487c79e794d1d92e9c9c2b8481cb8cd54da241 in dcos/minuteman.

Marco Reni January 25, 2017 at 9:08 AM

We're having the same issue. Some services have 34 backend entries while only 4 instances are running; others are correctly aligned.

DC/OS 1.8.7 with 1 master and 2 agents running CentOS 7.2.1511.
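To compare registered backends against running instances, something like this works (assumes `jq` is available on the node):

```sh
# Count backend entries per VIP; compare with `dcos marathon app list`.
curl -s http://localhost:61421/vips | jq '.vips | map_values(length)'
```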

Apparently, the backends are not aligned if the service uses a virtual network and the port does NOT have the "Expose endpoints on host network" option enabled.

We ran a few tests, creating new instances and restarting them a couple of times.

a. Without exposing the port on the host network: each time a deployment gets a new IP from the virtual network, a new entry is added to the VIP:

"11.33.146.52:10102": { "9.0.1.132:10102": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 0 }, "9.0.1.138:10102": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 0 }, "9.0.2.136:10102": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 0 } },

b. Exposing the port on the host network: even if the service receives different IPs on the virtual network, the VIP stays correctly aligned (the IP 10.xxx.yyy.zzz below is the physical agent's IP):

"11.204.225.228:10103": { "10.xxx.yyy.zzz:5281": { "is_healthy": true, "latency_last_60s": {}, "pending_connections": 0, "total_failures": 0, "total_sucesses": 0 } },

I've attached the corresponding output.

Hope this helps,
Marco

Deepak Goel January 25, 2017 at 8:16 AM

I couldn't reproduce this scenario. I tried scaling both an individual service and a group of services up and down; in both cases, backends were added and removed successfully. This was on DC/OS 1.8.7 with 1 master and 3 agents running CoreOS.

Albert Strasheim January 20, 2017 at 12:24 AM

Hey, thanks for the report. We're looking into this.

Details

Status: Done
Created: January 19, 2017 at 11:05 PM
Updated: January 27, 2017 at 6:21 AM
Resolved: January 27, 2017 at 6:20 AM