Adminrouter 502 status for arangodb3 service after agent failure

Description

Bug report from user:

I started a new cluster on AWS using this template: https://dcos.io/docs/1.8/administration/installing/cloud/aws/
I then wanted to launch an arangodb3 database (latest version, 1.0.4, with default settings).
That triggered the launch of the arangodb3 framework.
However, for some reason the underlying agent seems to have had a problem. The task never left "STAGING" (I was checking it on the /mesos subsite).
I could not access the task's sandbox at any time, and after a while the agent was declared dead. DC/OS operated as expected: after declaring the agent dead, it rescheduled the framework task on another node, and arangodb3 came up just fine.
However, I could not access the ArangoDB web interface via the /service/arangodb3 link. The page showed a 502.
After SSHing into the cluster I checked the Mesos state:

From reading the adminrouter source (https://github.com/dcos/adminrouter/blob/master/master/service.lua#L24), this seems to be the info the adminrouter relies on.
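
To make that check repeatable, here is a minimal Python sketch of looking up a framework's webui_url in a Mesos state document. It assumes the state JSON has a top-level "frameworks" list whose entries carry "name" and "webui_url" fields. The helper name find_webui_url and the sample addresses are hypothetical and are not taken from the cluster in this report.

```python
import json


def find_webui_url(state, framework_name):
    """Return the webui_url of the first framework with the given name, or None."""
    for framework in state.get("frameworks", []):
        if framework.get("name") == framework_name:
            # An empty string means the framework did not register a web UI.
            return framework.get("webui_url") or None
    return None


# Hypothetical, heavily trimmed state document; the addresses are made up.
sample_state = json.loads("""
{"frameworks": [
  {"name": "marathon", "webui_url": "http://10.0.0.1:8080"},
  {"name": "arangodb3", "webui_url": "http://10.0.0.2:15000"}
]}
""")

print(find_webui_url(sample_state, "arangodb3"))
```

Comparing the value returned here with the address that /service/arangodb3 actually proxies to would show whether the adminrouter is using outdated data.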

Then I curled the arangodb webui_url:

So the web interface was in fact there.

After restarting the framework in the service UI, everything worked as expected again.

My guess is that during the initial startup, which did not work because of the failing agent, the adminrouter cached the IP and port assigned on that agent.
Removing that agent and rescheduling the task on a new agent does not seem to have invalidated that cache.
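
The suspected failure mode can be illustrated with a toy Python model. This is not the adminrouter's code, and it is not confirmed that the adminrouter caches this way. The ServiceUrlCache class, its methods, and the addresses are all hypothetical and exist only to show how a cache that is never invalidated keeps returning the address from the failed agent.

```python
class ServiceUrlCache:
    """Toy cache: remembers the first URL it resolves for each service."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable: service name -> current URL
        self._cache = {}

    def resolve(self, name):
        # Only consult the source of truth on a cache miss.
        if name not in self._cache:
            self._cache[name] = self._lookup(name)
        return self._cache[name]

    def invalidate(self, name):
        # Drop the cached entry so the next resolve() looks it up again.
        self._cache.pop(name, None)


# Hypothetical addresses: the task first lands on the failing agent.
current = {"arangodb3": "http://10.0.0.5:15000"}
cache = ServiceUrlCache(lambda name: current[name])
first = cache.resolve("arangodb3")

# The task is rescheduled on another agent, but the cache is not told.
current["arangodb3"] = "http://10.0.0.6:15000"
stale = cache.resolve("arangodb3")

# Explicit invalidation, or a restart that empties the cache, fixes it.
cache.invalidate("arangodb3")
fresh = cache.resolve("arangodb3")

print(stale, fresh)
```

In this model the second lookup still points at the dead agent, which would surface as a 502 from a proxy, and only the invalidation step restores the correct address. That matches the observation that restarting the framework fixed the problem, but it remains a guess.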

Assignee

Jörg Schad
