ready.spartan unavailable during (marathon) deployment

Description

We ran into a strange problem with spartan/minuteman/VIPs.
We run DCOS 1.8.1 on AWS across 2 AZs. The problem happens on all agents, and I have suspended all services for debugging.

The thing I observe is:

On an agent (10.0.10.168) I run:

while true; do ping ready.spartan -c1; sleep 1; done

It can always resolve and ping it.
Then I deploy an app via Marathon. (The constraints pin it to the host above, just for debugging, but the problem happens on any node.)

As soon as I press the deploy button in Marathon, the ping to ready.spartan returns "ping: unknown host ready.spartan". And of course all VIPs and Mesos DNS queries fail. (This is quite dramatic, because we use a VIP to download some files before containers start, so the container keeps failing and re-deploying.)
About 30 seconds after I cancel the deployment, ready.spartan is available again.

I have no clue how to debug this any further.

journalctl -u dcos-spartan does not show anything unusual.
journalctl -u dcos-minuteman does report some errors (which I don't understand).

I have attached the output of dcos-minuteman, starting a few seconds before ready.spartan became unavailable.

I can provide more logs if necessary. I just don't know which ones.

Any help would be very much appreciated; this issue is keeping us from moving forward.

I think we faced the same issue on DCOS 1.8.0, but I am not sure and would need to test that.

Activity

f
August 17, 2016, 8:46 AM
Edited

I have rebooted the host. As soon as I had an SSH connection, I ran
while true; do ping -c1 ready.spartan; date; sleep 1; done

Then I started some Marathon deployments to show the problem
and noted what happened.

Wed Aug 17 08:31:59 UTC 2016 - node ready
monitor a few minutes -> ready.spartan constantly available

Wed Aug 17 08:34:34 UTC 2016 - spartan unavailable (marathon deployment)
Wed Aug 17 08:35:25 UTC 2016 - rollback marathon deployment
Wed Aug 17 08:36:19 UTC 2016 - spartan back available
leave it a few minutes -> ready.spartan constantly available

Wed Aug 17 08:38:32 UTC 2016 - spartan unavailable (marathon deployment)
Wed Aug 17 08:39:20 UTC 2016 - spartan back available (deployment finished - task ready)

Attached as journal.log is the output of journalctl -b.

Sargun Dhillon
August 17, 2016, 8:49 AM

Are you running DC/OS 1.8? Can you ping, cat /etc/resolv.conf, and run dig ready.spartan @198.51.100.2?

f
August 17, 2016, 9:10 AM
Edited

OK, I see what is happening now.

dig ready.spartan @198.51.100.2 always works, even when ping ready.spartan does not. This is because /etc/resolv.conf changes during a deployment (which I think is strange).

It switches between those two:

Working:

# Generated by gen_resolvconf.py. Do not edit.
# Change configuration options by changing DC/OS cluster configuration.
# This file must be overwritten regularly for proper cluster operation around
# master failure.

options timeout:1
options attempts:3

nameserver 198.51.100.1
nameserver 198.51.100.2
nameserver 198.51.100.3

Not Working:

# This file is managed by systemd-resolved(8). Do not edit.
#
# Third party programs must not access this file directly, but
# only through the symlink at /etc/resolv.conf. To manage
# resolv.conf(5) in a different way, replace the symlink by a
# static file or a different symlink.

nameserver 10.0.0.2
search eu-central-1.compute.internal

Why is this? Who is updating the file? And what the hell is 10.0.0.2? There is no such IP in our setup.
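
A minimal sketch for catching the switch as it happens, assuming a POSIX shell with awk and grep on the agent. The function names (list_nameservers, uses_spartan, watch_resolvconf) are hypothetical helpers, not part of DC/OS; the addresses 198.51.100.1 to 198.51.100.3 are taken from the working file above.

```shell
#!/bin/sh
# Hypothetical helpers for watching which resolvers are configured.

# Print the nameserver addresses listed in a resolv.conf-style file.
# The path defaults to /etc/resolv.conf.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Succeed if the file lists at least one of the spartan addresses
# seen in the "Working" file (198.51.100.1 to 198.51.100.3).
uses_spartan() {
    list_nameservers "$1" | grep -Eq '^198\.51\.100\.[123]$'
}

# Log, once per second, whether the spartan resolvers are configured.
# Runs until interrupted with Ctrl-C.
watch_resolvconf() {
    file="${1:-/etc/resolv.conf}"
    while true; do
        if uses_spartan "$file"; then
            state="spartan"
        else
            state="other: $(list_nameservers "$file" | tr '\n' ' ')"
        fi
        echo "$(date -u) $state"
        sleep 1
    done
}
```

Sourcing the block and running watch_resolvconf while a deployment starts should show the moment the nameserver list changes, which may help correlate the switch with other log entries.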

Cody Maloney
August 17, 2016, 2:40 PM

10.0.0.2 would be an AWS Internal server for DNS within a specific VPC subnet.
If you do a ls -alh /etc/resolv.conf I'm guessing it'll give something like:

If you do a rm /etc/resolv.conf on the hosts the problem will go away next time gen resolvconf runs.
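
Before removing the file, it may help to confirm what /etc/resolv.conf actually is. The systemd-resolved header quoted earlier mentions a symlink at /etc/resolv.conf, so a symlink is a plausible cause here, but that is an assumption. A minimal sketch, where describe_resolvconf is a hypothetical helper name:

```shell
#!/bin/sh
# Report whether a path is a symlink (and its target), a regular file,
# or missing. The path defaults to /etc/resolv.conf.
describe_resolvconf() {
    path="${1:-/etc/resolv.conf}"
    if [ -L "$path" ]; then
        # -L is checked first because -f follows symlinks.
        echo "symlink -> $(readlink "$path")"
    elif [ -f "$path" ]; then
        echo "regular file"
    else
        echo "missing"
    fi
}
```

If describe_resolvconf reports a symlink on an affected agent, that would be consistent with the rm suggestion above; if it reports a regular file, something else is probably rewriting the file.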

I submitted: https://github.com/dcos/dcos/pull/552 to help

f
August 17, 2016, 2:49 PM

You, sir, are a hero!
Thank you and props to the team, this is exceptionally good support.

Done

Assignee

Cody Maloney

Labels

None

Components