ready.spartan unavailable during (marathon) deployment


We ran into a strange problem with spartan/minuteman/VIPs.
We run DCOS 1.8.1 on AWS in 2 AZs. The problem happens on all agents. I have suspended all services for debugging.

The thing I observe is:

On an agent I run:

while true; do ping ready.spartan -c1; sleep 1; done

It can always resolve and ping the name.
Then I deploy an app via Marathon. (The constraints are set to the host above, just for debugging, but it happens on any node.)

As soon as I press the deploy button in Marathon, the ping to ready.spartan returns "ping: unknown host ready.spartan". And of course all VIPs and Mesos DNS queries fail. (Which is quite dramatic, because we use a VIP to download some files before containers start, so the container keeps failing and re-deploying.)
About 30 seconds after I cancel the deployment, ready.spartan is available again.

I have no clue how to debug this any further.

journalctl -u dcos-spartan does not show anything unusual.
journalctl -u dcos-minuteman does report some errors (which I don't understand).

I have attached the output of dcos-minuteman, starting a few seconds before ready.spartan became unavailable.

I can provide more logs if necessary. Just don't know which.

Any help would very much be appreciated, that issue is keeping us from moving forward.

I think we faced the same issue on DCOS 1.8.0 but not sure, would need to test that.


August 17, 2016, 8:46 AM

I have rebooted the host. As soon as I had an SSH connection, I ran
while true; do ping -c1 ready.spartan; date; sleep 1; done

Then I started some Marathon deployments to show the problem
and noted what happened.

Wed Aug 17 08:31:59 UTC 2016 - node ready
monitor a few minutes -> ready.spartan constantly available

Wed Aug 17 08:34:34 UTC 2016 - spartan unavailable (marathon deployment)
Wed Aug 17 08:35:25 UTC 2016 - rollback marathon deployment
Wed Aug 17 08:36:19 UTC 2016 - spartan back available
leave it a few minutes -> ready.spartan constantly available

Wed Aug 17 08:38:32 UTC 2016 - spartan unavailable (marathon deployment)
Wed Aug 17 08:39:20 UTC 2016 - spartan back available (deployment finished - task ready)

Attached as journal.log is the output of journalctl -b.


Sargun Dhillon
August 17, 2016, 8:49 AM

Are you running DC/OS 1.8? can you ping + cat /etc/resolv.conf? and dig ready.spartan @

August 17, 2016, 9:10 AM

OK, I see what is happening now.

dig ready.spartan @ is always working even when ping ready.spartan is not. This is because /etc/resolv.conf changes during a deployment. (I think that is strange.)

It switches between those two:


Working:

# Generated by Do not edit.
# Change configuration options by changing DC/OS cluster configuration.
# This file must be overwritten regularly for proper cluster operation around
# master failure.

options timeout:1
options attempts:3


Not Working:

# This file is managed by systemd-resolved(8). Do not edit.
#
# Third party programs must not access this file directly, but
# only through the symlink at /etc/resolv.conf. To manage
# resolv.conf(5) in a different way, replace the symlink by a
# static file or a different symlink.

search eu-central-1.compute.internal

Why is this? Who is doing the update of the file? And what the hell is, there is no such ip in our setup.
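
To see exactly when the file changes, one option is to poll it and log a timestamp whenever its contents differ from the previous reading. The following is only a minimal sketch, not part of DC/OS. The function names resolv_checksum and watch_resolv are made up for illustration, and it assumes a POSIX shell with cksum available.

```shell
#!/bin/sh
# Hypothetical helper (not a DC/OS tool): print a checksum of a file's contents.
# Reading via stdin keeps the file name out of the cksum output.
resolv_checksum() {
    cksum < "$1" | cut -d' ' -f1
}

# Hypothetical watcher: poll once per second and print a UTC timestamp
# plus the new checksum every time the contents change.
watch_resolv() {
    file="${1:-/etc/resolv.conf}"
    last=""
    while true; do
        now=$(resolv_checksum "$file")
        if [ "$now" != "$last" ]; then
            printf '%s checksum=%s\n' "$(date -u)" "$now"
            last="$now"
        fi
        sleep 1
    done
}
```

Running watch_resolv in a second terminal during a Marathon deployment should print one line per switch, which can then be lined up against the timestamps from the ping loop.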

Cody Maloney
August 17, 2016, 2:40 PM

That address would be an AWS internal DNS server within a specific VPC subnet.
If you do a ls -alh /etc/resolv.conf I'm guessing it'll give something like:

If you do a rm /etc/resolv.conf on the hosts, the problem will go away the next time gen_resolvconf runs.

I submitted: to help
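
The symlink check mentioned in this comment can be wrapped in a small helper so it is easy to repeat on every agent. This is only a sketch under the assumption of a POSIX shell with readlink available. describe_resolv is a hypothetical name, not a DC/OS command.

```shell
#!/bin/sh
# Hypothetical helper (not a DC/OS tool): report whether a resolv.conf path
# is a symlink, a regular file, or missing. Defaults to /etc/resolv.conf.
describe_resolv() {
    file="${1:-/etc/resolv.conf}"
    if [ -L "$file" ]; then
        # Check for a symlink first, since -f follows symlinks.
        printf 'symlink -> %s\n' "$(readlink "$file")"
    elif [ -f "$file" ]; then
        printf 'regular file\n'
    else
        printf 'missing\n'
    fi
}
```

If describe_resolv reports a symlink on an agent, that matches the situation described here, and removing the link as suggested lets the file be written again as a regular file.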

August 17, 2016, 2:49 PM

You, sir, are a hero!
Thank you and props to the team, this is exceptionally good support.



Cody Maloney