The dcos-spartan-watchdog.service is failing regularly with our integration tests on Azure. It appears that the test is sensitive to system startup time, and we may need to adjust the watchdog service so that it only starts once the spartan service is fully initialized.
Failure snippet:
Example of the integration test failure:
https://teamcity.mesosphere.io/viewLog.html?buildId=382623&buildTypeId=ClosedSource_Dcos_IntegrationTests_CloudIntegrationTests_DcosOssAzureIntegration&tab=buildLog
This was discovered as part our new Azure integration tests here: https://github.com/dcos/dcos/pull/591
We've muted this particular failure in our CI job until 9/2/2016.
Any updates ?
what are we using to signal to azure that a host is up? If we could make that wait until 3dt reports all units on the host are healthy, would solve this.
See discussion here: https://github.com/dcos/dcos/pull/657