test_systemd_units_health for dcos-spartan-watchdog.service failing regularly on Azure

Description

The dcos-spartan-watchdog.service unit is failing regularly in our integration tests on Azure. The test appears to be sensitive to system startup time, so we may need to adjust the watchdog service so that it only starts once the spartan service is fully initialized.

Failure snippet:

Example of the integration test failure:
https://teamcity.mesosphere.io/viewLog.html?buildId=382623&buildTypeId=ClosedSource_Dcos_IntegrationTests_CloudIntegrationTests_DcosOssAzureIntegration&tab=buildLog

This was discovered as part of our new Azure integration tests here: https://github.com/dcos/dcos/pull/591

Activity

Jeremy Lingmann
August 26, 2016, 6:41 PM
Edited

We've muted this particular failure in our CI job until 9/2/2016.

Jeremy Lingmann
August 30, 2016, 7:36 PM

Any updates?

Cody Maloney
September 5, 2016, 8:03 AM

What are we using to signal to Azure that a host is up? If we could make that signal wait until 3dt reports that all units on the host are healthy, that would solve this.
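
A rough sketch of that idea, for discussion only: poll a health source until every unit reports healthy, and only then let the "host is up" signal go out. Everything here is an assumption rather than 3dt's actual interface: `fetch_units` is a hypothetical callable standing in for however the unit list is read, the `id` and `health` field names are made up for illustration, and treating `health == 0` as "healthy" is a guess at the convention.

```python
import time
from typing import Callable, Iterable, List, Mapping


def unhealthy_units(units: Iterable[Mapping]) -> List[str]:
    """Return the ids of units that are not healthy.

    Assumption: each unit is a mapping with hypothetical "id" and "health"
    keys, and a health value of 0 means healthy.
    """
    return [u["id"] for u in units if u.get("health") != 0]


def wait_until_healthy(
    fetch_units: Callable[[], Iterable[Mapping]],
    timeout: float = 600.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll fetch_units() until no unit is unhealthy or the timeout expires.

    Returns True if all units became healthy, False on timeout.
    fetch_units is a hypothetical hook; it is not a real 3dt API.
    """
    deadline = clock() + timeout
    while True:
        try:
            bad = unhealthy_units(fetch_units())
        except OSError:
            # The health source may not be reachable yet during early boot;
            # treat that the same as "not healthy yet" and keep polling.
            bad = ["<health source unreachable>"]
        if not bad:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The host-up signal would then be sent only if `wait_until_healthy(...)` returns True. Slow-starting units such as dcos-spartan-watchdog.service would delay the signal instead of failing the health test.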

Sargun Dhillon
September 12, 2016, 3:22 PM

Assignee

Sargun Dhillon
