Environment: DC/OS 1.8.7 on Azure, installed using ACS-Engine-modified ARM templates (Ubuntu 16.04, kernel 4.4.0-28-generic).
After adding a Docker-containerizer-based app that uses the default `dcos` overlay network and is assigned a VIP via a label in `portMappings` (see the app config in the attached my-group.json; it's the /my-app/backend app), VIP assignment works and traffic is routed correctly to the overlay-network IPs of its tasks (via both `SERVICE_LABEL.l4lb.thisdcos.directory:$VIP_PORT` and `IP:$VIP_PORT`). The issue appears when removing (properly killing) tasks, whether by scaling up and down, scaling down the initial 3 instances, or removing the app entirely: backends are never removed from the VIP, so requests directed to the no-longer-existing backends fail, and `total_failures` continues to be tracked for each backend.
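For context, the relevant shape of the app definition looks roughly like the following (a minimal sketch; the image, ports, and health-check path here are illustrative stand-ins, not copied from my-group.json):

```json
{
  "id": "/my-app/backend",
  "instances": 3,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "my-backend-image",
      "network": "USER",
      "portMappings": [
        {
          "containerPort": 8080,
          "labels": { "VIP_0": "/my-app/backend:80" }
        }
      ]
    }
  },
  "ipAddress": { "networkName": "dcos" },
  "healthChecks": [
    { "protocol": "HTTP", "path": "/health", "portIndex": 0 }
  ]
}
```

The `VIP_0` label is what causes minuteman to create the VIP and register each task's overlay IP as a backend, reachable under the `.l4lb.thisdcos.directory` name used above.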
The VIPs endpoint will show the backends. After a few requests:
The status of the group (all apps have their health checks in place) will be:
Though all tasks appear healthy, querying our backend endpoint yields a few errors like:
(see failed_requests.txt for more)
VIPs table: just what we had before scaling up in step 2 above.
The VIPs tables now look like this:
^^ Notice the `total_failures` on the first one. What's more, failures become more common the more backends have ever been registered and since been removed (for instance during a zero-downtime deployment), or when IP allocation gives another app an overlay IP that is still registered as a backend, in which case requests are directed to an entirely different task.
Unfortunately, removing a group/app does not reset the VIP's backends either: the VIPs endpoint output above is identical after complete group removal.
Hey. Thanks for the report. We're looking into this.
I couldn't reproduce this scenario. I tried scaling up and down, both for an individual service and for group services; in both cases, backends were added and removed successfully. This was on DC/OS 1.8.7 with 1 master and 3 agents running CoreOS.
We're having the same issue. Some services have 34 backend entries while only 4 instances are running, while others are correctly aligned.
DC/OS 1.8.7 with 1x master and 2x slaves running CentOS 7.2.1511.
Apparently, the instances are not aligned when the service uses a virtual network and the port does NOT have the "Expose endpoints on host network" option enabled.
We ran a couple of tests, creating new instances and restarting them a few times.
a- NOT exposing the port on the host network: each time the deployment gets a new IP from the virtual network, a new entry is added to the VIP:
b- exposing the port on the host network: even though the service receives different IPs on the virtual network, the VIP stays correctly aligned (the IP 10.xxx.yyy.zzz is the IP of the physical agent):
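If it helps pin this down: as far as we can tell, the only difference between the two cases in the resulting app definition is the presence of a `hostPort` in the port mapping, which is what the UI toggle sets. A sketch with illustrative values (service id and ports are made up):

```json
{
  "portMappings": [
    {
      "containerPort": 8080,
      "hostPort": 0,
      "labels": { "VIP_0": "/my-service:80" }
    }
  ]
}
```

With `"hostPort": 0` present, a host port is allocated on the agent and the VIP backend ends up registered against the agent's host IP, matching case b above; omitting the `hostPort` key entirely reproduces case a, where stale overlay IPs accumulate as backends.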
I've attached the corresponding
Hope this helps,
Fixed via dcos/minuteman commit 52487c79e794d1d92e9c9c2b8481cb8cd54da241.