We ran into a strange problem with spartan/minuteman/VIPs.
We run DC/OS 1.8.1 on AWS across 2 AZs. The problem happens on all agents. I suspended all services for debugging.
The thing I observe is:
On an agent (10.0.10.168) I run:
while true; do ping ready.spartan -c1; sleep 1; done
It can always resolve and ping it.
Then I deploy an app via marathon. (Constraints set to above host, just for debugging, but it happens on any node.)
As soon as I press the deploy button in marathon, the ping to ready.spartan returns "ping: unknown host ready.spartan". And of course all VIPs and Mesos DNS queries will fail. (Which is quite dramatic, because we use a VIP to download some files before containers start. So the container constantly keeps failing and re-deploying.)
About 30 seconds after I cancel the deployment, ready.spartan is available again.
I have no clue how to debug this any further.
journalctl -u dcos-spartan does not show anything unusual.
journalctl -u dcos-minuteman does report some errors (which I don't understand).
I have attached the output of dcos-minuteman, starting a few seconds before ready.spartan became unavailable.
I can provide more logs if necessary. Just don't know which.
Any help would very much be appreciated, that issue is keeping us from moving forward.
I think we faced the same issue on DCOS 1.8.0 but not sure, would need to test that.
I have rebooted the host. As soon as I had an SSH connection, I ran
while true; do ping -c1 ready.spartan; date; sleep 1; done
Then I started some Marathon deployments to show the problem and noted what happened:
Wed Aug 17 08:31:59 UTC 2016 - node ready
monitor a few minutes -> ready.spartan constantly available
Wed Aug 17 08:34:34 UTC 2016 - spartan unavailable (marathon deployment)
Wed Aug 17 08:35:25 UTC 2016 - rollback marathon deployment
Wed Aug 17 08:36:19 UTC 2016 - spartan back available
leave it a few minutes -> ready.spartan constantly available
Wed Aug 17 08:38:32 UTC 2016 - spartan unavailable (marathon deployment)
Wed Aug 17 08:39:20 UTC 2016 - spartan back available (deployment finished - task ready)
Attached as journal.log is the output of journalctl -b.
Are you running DC/OS 1.8? Can you ping, cat /etc/resolv.conf, and run dig ready.spartan @198.51.100.2?
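The checks requested above come down to comparing a lookup through the system resolver with a lookup sent straight to 198.51.100.2. A rough sketch follows, assuming getent, awk and dig are installed on the agent; the function names resolve_system, resolve_direct and diagnose are made up for illustration and are not part of DC/OS.

```shell
#!/bin/sh
# Resolve a name through the system resolver, which honours /etc/resolv.conf.
resolve_system() {
    getent hosts "$1" | awk '{print $1; exit}'
}

# Resolve a name by querying one DNS server directly, bypassing /etc/resolv.conf.
# $1 = name, $2 = server address
resolve_direct() {
    dig +short +time=1 +tries=1 "$1" "@$2" | head -n1
}

# Summarise the two results. $1 = system answer, $2 = direct answer
# (an empty string means the lookup returned nothing).
diagnose() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        echo "both resolve"
    elif [ -n "$2" ]; then
        echo "server answers, system resolver path is broken"
    elif [ -n "$1" ]; then
        echo "system resolver answers, direct query fails"
    else
        echo "neither resolves"
    fi
}
```

For example: diagnose "$(resolve_system ready.spartan)" "$(resolve_direct ready.spartan 198.51.100.2)". If the server answers but the system resolver path does not, /etc/resolv.conf is the next thing to look at.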
OK, I see what is happening now.
dig ready.spartan @198.51.100.2 always works, even when ping ready.spartan does not. This is because /etc/resolv.conf changes during a deployment (which I think is strange).
It switches between these two versions, one written by gen_resolvconf.py and one by systemd-resolved:
# Generated by gen_resolvconf.py. Do not edit.
# Change configuration options by changing DC/OS cluster configuration.
# This file must be overwritten regularly for proper cluster operation around

# This file is managed by systemd-resolved(8). Do not edit.
# Third party programs must not access this file directly, but
# only through the symlink at /etc/resolv.conf. To manage
# resolv.conf(5) in a different way, replace the symlink by a
# static file or a different symlink.
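To catch the moment the file flips, one option is to watch which of the two headers is present and print a timestamp on each change. This is only a sketch: the marker strings are taken from the headers quoted above, and resolv_owner and watch_resolv_owner are hypothetical helper names, not DC/OS tools.

```shell
#!/bin/sh
# Classify a resolv.conf file by the header comment its generator writes.
# $1 = path to inspect (defaults to /etc/resolv.conf)
resolv_owner() {
    file="${1:-/etc/resolv.conf}"
    if grep -q 'Generated by gen_resolvconf.py' "$file" 2>/dev/null; then
        echo "dcos"
    elif grep -q 'managed by systemd-resolved' "$file" 2>/dev/null; then
        echo "systemd-resolved"
    else
        echo "unknown"
    fi
}

# Poll once per second and print a UTC-timestamped line whenever the
# owner changes. Runs until interrupted with Ctrl-C.
watch_resolv_owner() {
    file="${1:-/etc/resolv.conf}"
    last=""
    while true; do
        now="$(resolv_owner "$file")"
        if [ "$now" != "$last" ]; then
            echo "$(date -u) resolv.conf owner: $now"
            last="$now"
        fi
        sleep 1
    done
}
```

Running watch_resolv_owner in a second terminal next to the ping loop should show whether each outage lines up with the file switching to the systemd-resolved version.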
Why is this? Who is updating the file? And what the hell is 10.0.0.2? There is no such IP in our setup.
10.0.0.2 would be the AWS-internal DNS server for your VPC (AWS reserves the VPC network base address plus two for it).
If you do an ls -alh /etc/resolv.conf, I'm guessing it'll give something like:
If you do an rm /etc/resolv.conf on the hosts, the problem will go away the next time gen_resolvconf.py runs.
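Before removing anything, it may help to confirm that /etc/resolv.conf really is a symlink and to see where it points. The sketch below assumes that is the situation, as the systemd-resolved header suggests; describe_resolv_conf is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a path is a symlink (and its target), a regular file, or absent.
# $1 = path to inspect (defaults to /etc/resolv.conf)
describe_resolv_conf() {
    file="${1:-/etc/resolv.conf}"
    if [ -L "$file" ]; then
        echo "symlink -> $(readlink "$file")"
    elif [ -f "$file" ]; then
        echo "regular file"
    else
        echo "missing"
    fi
}
```

If it reports a symlink, rm /etc/resolv.conf removes only the link and leaves the target untouched.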
I submitted: https://github.com/dcos/dcos/pull/552 to help
You, sir, are a hero!
Thank you and props to the team, this is exceptionally good support.