We ran into a strange problem with spartan/minuteman/VIPs.
We run DCOS 1.8.1 on AWS across 2 AZs. The problem happens on all agents. I have suspended all services while debugging.
The thing I observe is:
On an agent (10.0.10.168) I run:
while true; do ping ready.spartan -c1; sleep 1; done
It always resolves the name and the ping succeeds.
Then I deploy an app via Marathon. (The constraints are set to the host above, just for debugging, but it happens on any node.)
As soon as I press the deploy button in Marathon, the ping to ready.spartan returns "ping: unknown host ready.spartan". And of course all VIP and Mesos DNS queries fail as well. (This hits us hard, because we use a VIP to download some files before containers start, so the container keeps failing and re-deploying.)
About 30 seconds after I cancel the deployment, ready.spartan is available again.
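To pin down exactly when resolution drops and recovers, a timestamped variant of the loop above might help. This is only a sketch: the function names `probe` and `watch_name` are made up for illustration and are not part of DC/OS, and it uses `getent hosts` instead of `ping` so that only name resolution is tested, on the assumption that `getent` is installed on the agent.

```shell
# Hypothetical helper (not part of DC/OS): print a UTC timestamp and
# whether the name given as $1 currently resolves.
probe() {
  probe_ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "$probe_ts OK $1"
  else
    echo "$probe_ts FAIL $1"
  fi
}

# Probe the name in $1 once per second, $2 times (default 60).
watch_name() {
  watch_left="${2:-60}"
  while [ "$watch_left" -gt 0 ]; do
    probe "$1"
    watch_left=$((watch_left - 1))
    sleep 1
  done
}
```

Running `watch_name ready.spartan 300` while starting and cancelling the deployment gives a list of OK/FAIL lines whose timestamps can be compared with the spartan and minuteman journal entries. Note that the timestamps here are in UTC.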
I have no clue how to debug this any further.
journalctl -u dcos-spartan does not show anything unusual.
journalctl -u dcos-minuteman does report some errors (which I don't understand).
I have attached the output of dcos-minuteman, starting a few seconds before ready.spartan became unavailable.
I can provide more logs if necessary. I just don't know which ones would help.
Any help would be very much appreciated. This issue is keeping us from moving forward.
I think we faced the same issue on DCOS 1.8.0, but I am not sure and would need to test that.