OAuth broken after master failover
We have a dev cluster on AWS with CentOS AMIs, created by CloudFormation per the docs.
The leader, .72, failed because it ran out of disk space and took itself down in ZooKeeper; so far so good, as expected:
The bad part came when operators attempted to log in via OAuth. They would click the login/OAuth selector, get a brief flash of the DC/OS dashboard, then be returned to the login/OAuth selector.
I did notice that during this time, dcos-adminrouter on the "down" master was still receiving auth traffic, which I did not expect (though I see similar traffic on the leader as well):
After we fixed .72's OS problem, it put itself back into the "serving" state and auth worked again.
Please let us know what logs we can provide before they get rotated away.
Okay, thanks for your thoughts. We'll keep an eye on this ticket. Let me know if you need any experiments or logs.
I think the ELB does a reasonably basic health check against Mesos, which would probably have been green throughout.
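If the ELB's health check only probes Mesos, a master with a broken auth stack stays in rotation. A stricter check would require the login path to respond as well. An ELB health check is configured, not coded, so the sketch below just illustrates the decision rule; the endpoint paths are illustrative assumptions, not confirmed from this cluster.

```python
# Sketch: a stricter "is this master serving?" rule for the ELB.
# A master counts as healthy only if every probed endpoint returns 200,
# so a node whose auth path is failing gets drained from rotation.
# Endpoint paths below are assumptions for illustration.

def master_healthy(probe_results):
    """probe_results: dict mapping endpoint path -> HTTP status code."""
    required = [
        "/mesos/master/health",    # the basic Mesos probe (assumed path)
        "/acs/api/v1/auth/login",  # the auth/login path (assumed path)
    ]
    return all(probe_results.get(path) == 200 for path in required)

# The failure mode seen here: Mesos green, auth broken -> unhealthy.
print(master_healthy({"/mesos/master/health": 200,
                      "/acs/api/v1/auth/login": 503}))  # False
print(master_healthy({"/mesos/master/health": 200,
                      "/acs/api/v1/auth/login": 200}))  # True
```

With a rule like this, the broken master would have been marked unhealthy and removed from the ELB automatically instead of continuing to receive auth traffic.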
There are a couple of things we can fix to make the system more reliably available in this particular failure mode, but it needs some careful thought.
We'll do a more detailed writeup soon.
The simple manual workaround would have been to log in to a specific master IP or manually remove the affected master from the ELB.
Yes, 3 masters were in an ELB. While one was in the "down" state, it did not occur to me to check what the ELB thought of the situation, but I'm happy to experiment to find out if you want. Is there a good way to command a master into the down state while leaving the OS mostly up?
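For the experiment, one plausible approach is to drain the node from the ELB and/or stop the master-side DC/OS services while leaving the OS running. A sketch, with placeholder names to fill in; the systemd unit names are assumptions from a standard DC/OS install and may vary by version:

```shell
# Drain the affected master from a classic ELB (leaves the OS running):
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name <masters-elb-name> \
    --instances <affected-master-instance-id>

# Or simulate a "down" master on the node itself by stopping the
# master-side DC/OS units (assumed unit names):
sudo systemctl stop dcos-mesos-master dcos-adminrouter

# Restore afterwards:
sudo systemctl start dcos-mesos-master dcos-adminrouter
aws elb register-instances-with-load-balancer \
    --load-balancer-name <masters-elb-name> \
    --instances <affected-master-instance-id>
```

Stopping only dcos-mesos-master (leaving Adminrouter up) would more closely reproduce the failure seen here, where the "down" master was still receiving auth traffic.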
To make sure we have the full picture: you were accessing the cluster with the broken master via an ELB, correct?