ready.spartan unavailable during (marathon) deployment

Description

We ran into a strange problem with spartan/minuteman/VIPs.
We run DCOS 1.8.1 on AWS across 2 AZs. The problem happens on all agents, and I have suspended all services for debugging.

The thing I observe is:

On an agent (10.0.10.168) I run:

while true; do ping ready.spartan -c1; sleep 1; done

It can always resolve and ping ready.spartan.
Then I deploy an app via Marathon. (The constraints are set to the host above just for debugging; the problem happens on any node.)

As soon as I press the deploy button in Marathon, the ping to ready.spartan returns "ping: unknown host ready.spartan", and all VIP and Mesos DNS queries fail as well. (This is quite dramatic for us, because we use a VIP to download some files before containers start, so the container keeps failing and redeploying.)
About 30 seconds after I cancel the deployment, ready.spartan becomes available again.
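
To pin down exactly when resolution breaks and recovers, the ping loop above could be swapped for a variant that timestamps every attempt, so the failures can be lined up against the journalctl output. This is only a minimal sketch: check_resolve is a hypothetical helper name, and it assumes getent is available on the agent.

```shell
# Hypothetical helper: print one timestamped OK/FAIL line per resolution attempt.
# Assumes getent is installed; if it is missing, every attempt is reported as FAIL.
check_resolve() {
    host="$1"
    ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"   # UTC timestamp, easy to match against logs
    if getent hosts "$host" >/dev/null 2>&1; then
        echo "$ts OK $host"
    else
        echo "$ts FAIL $host"
    fi
}
```

It would be run the same way as the ping loop, for example `while true; do check_resolve ready.spartan; sleep 1; done`, started before pressing the deploy button.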

I have no clue how to debug this any further.

journalctl -u dcos-spartan does not show anything unusual.
journalctl -u dcos-minuteman does report some errors (which I don't understand).

I have attached the output of dcos-minuteman, starting a few seconds before ready.spartan became unavailable.

I can provide more logs if necessary; I just don't know which ones would help.

Any help would be very much appreciated; this issue is keeping us from moving forward.

I think we faced the same issue on DCOS 1.8.0, but I am not sure and would need to test that.

Assignee

Cody Maloney

Labels

None

Components

Configure