Backends are not removed from VIP table when killed

Description

Environment: DCOS 1.8.7 (Azure, installed using ACS Engine modified ARM templates - Ubuntu 16.04 with 4.4.0-28-generic).

After adding a Docker-containeriser-based app set to use the default `dcos` overlay network, with a VIP assigned via a label in `portMappings` (see the attached app config in my-group.json; it's the /my-app/backend app), VIP assignment works and traffic gets routed correctly to the overlay-network IPs of its tasks (using both `SERVICE_LABEL.l4lb.thisdcos.directory:$VIP_PORT` and `IP:$VIP_PORT`). The issue appears when removing (properly killing) tasks, whether by scaling up and down, scaling down the initial 3 instances, or removing the app entirely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, while `total_failures` continues to be tracked for each backend.
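
For reference, the relevant part of the backend's `portMappings` looks roughly like the sketch below; the label name and ports are assumptions inferred from the DNS name and VIP port shown later, and the authoritative config is the attached my-group.json:

```
# Hypothetical excerpt of the /my-app/backend container definition (not the real attachment);
# a named VIP label of the form "/ctrl1:3000" is what yields
# ctrl1.marathon.l4lb.thisdcos.directory:3000.
cat <<'EOF'
"portMappings": [
  {
    "containerPort": 3000,
    "hostPort": 0,
    "protocol": "tcp",
    "labels": { "VIP_0": "/ctrl1:3000" }
  }
]
EOF
```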

Steps to reproduce

1. Before launching the app, check the vips endpoint on any cluster node (results are the same across nodes; I've tested on different nodes and node types):

```
curl -s http://localhost:61421/vips | python -m json.tool
{
    "vips": {}
}
```

2. Launching the app (`dcos marathon group add my-group.json`) creates the VIP, and digging the DNS name gives (only the relevant parts shown):

```
;; QUESTION SECTION:
;ctrl1.marathon.l4lb.thisdcos.directory. IN A

;; ANSWER SECTION:
ctrl1.marathon.l4lb.thisdcos.directory. 5 IN A 11.157.241.22
```
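
(The exact invocation isn't in the report; a query along these lines, run from any node with its default resolver, reproduces the lookup:)

```
# Resolve the named VIP through the cluster's DNS (sketch).
dig ctrl1.marathon.l4lb.thisdcos.directory
```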

The vips endpoint will show the backends. After a few requests:

```
{
    "vips": {
        "11.157.241.22:3000": {
            "9.0.5.133:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 3
            },
            "9.0.6.133:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 2
            },
            "9.0.7.134:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 1
            }
        }
    }
}
```
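
(The "few requests" above were just repeated hits against the VIP; something like the loop below works, assuming the `/api/up` endpoint used later in this report:)

```
# Send a handful of requests through the VIP so the per-backend counters move (sketch).
for i in $(seq 1 6); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up
done
```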

The status of the group (all apps have their healthcheck in place) will be:

```
$ dcos marathon app list
ID                MEM  CPUS  TASKS  HEALTH  DEPLOYMENT  CONTAINER  CMD
/my-app/frontend  100  0.1   3/3    3/3     ---         DOCKER     None
/my-app/backend   100  0.1   3/3    3/3     ---         DOCKER     None
```

3. Scaling up, and all continues well:
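
(Scaling was done through Marathon; a CLI equivalent would be something like the following, though the exact mechanism isn't essential to the bug:)

```
# Scale the backend from 3 to 4 instances (sketch; the DC/OS UI works equally well).
dcos marathon app update /my-app/backend instances=4
```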

```
/my-app/backend  100  0.1   4/4    4/4     ---         DOCKER     None
```

```
{
    "vips": {
        "11.157.241.22:3000": {
            "9.0.5.133:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 4
            },
            "9.0.5.135:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 2
            },
            "9.0.6.133:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 5
            },
            "9.0.7.134:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 2
            }
        }
    }
}
```

4. Scale down, and the issue appears.
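
(Again via Marathon, e.g.:)

```
# Scale the backend back down to 3 instances (sketch).
dcos marathon app update /my-app/backend instances=3
```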

Though all tasks appear healthy, querying our backend endpoint intermittently returns a few of these:

```
user@dcos-master-17841738-0:~$ curl -I http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up
curl: (7) Failed to connect to ctrl1.marathon.l4lb.thisdcos.directory port 3000: No route to host
```

(see failed_requests.txt for more)
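
(A simple loop like the one below is enough to collect these; failures show up intermittently because only some requests get balanced onto the stale backend:)

```
# Hammer the VIP repeatedly; requests routed to the removed backend fail (sketch).
for i in $(seq 1 20); do
  curl -sS -I --max-time 5 \
    http://ctrl1.marathon.l4lb.thisdcos.directory:3000/api/up 2>&1 | head -n 1
done
```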

Expected

The VIP table should contain only the backends of the currently running tasks, i.e. just what we had prior to scaling up, in step 2 above.

Current result

A VIP table like this:

```
{
    "vips": {
        "11.157.241.22:3000": {
            "9.0.5.133:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 3,
                "total_sucesses": 4
            },
            "9.0.5.135:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 4
            },
            "9.0.6.133:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 8
            },
            "9.0.7.134:3000": {
                "is_healthy": true,
                "latency_last_60s": {},
                "pending_connections": 0,
                "total_failures": 0,
                "total_sucesses": 6
            }
        }
    }
}
```

Notice the `total_failures` on the first backend. What's more, failures become more common the more backends have ever been registered and since been removed (for instance during a zero-downtime deployment), or when IP allocation hands another app an overlay IP that is still registered as a backend, in which case requests are directed to a different task altogether.

Unfortunately, removing the group/app does not clear the VIP's backends either: the output of the vips endpoint above is unchanged after complete group removal.
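
(One way to see the staleness directly is to compare the VIP table against the tasks Marathon actually knows about; the commands below are a sketch and assume the dcos CLI is attached to the cluster:)

```
# Backends the load balancer still advertises for the VIP:
curl -s http://localhost:61421/vips | python -m json.tool

# Tasks Marathon currently runs for the app (empty once the group is removed):
dcos marathon task list /my-app/backend
```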

Assignee

Deepak Goel
