Navstar crash after installing marathon-lb (DC/OS 1.8.7)

Description

Hi,

We are using DC/OS 1.8.7 on CentOS 7.3.1611, with 3x master and 1x slave.

Everything works fine, but after adding Marathon-lb we start receiving crash reports from navstar-env (attached as journalctl-navstar-marathonlb.txt). The log file was obtained with journalctl -flu dcos-navstar.service.

The errors keep appearing for a variable amount of time (30 seconds to 5 minutes), at a rate of one every 30 seconds, then stop. If we start/stop another service while marathon-lb is running, the crashes keep appearing for the new instance (log attached as journalctl-navstar-api.txt).

We have had this issue on a 1x master, 2x slave configuration too.

If we stop marathon-lb and wait for the corresponding crashes to end, then we are able to start/stop other services without errors.

It seems that the DNS update is delayed until the errors stop: continuing the previous example, if we dig api.marathon.autoip.dcos.thisdcos.directory after stopping the service while the errors are still appearing, the "old" IP keeps showing up in the ANSWER SECTION.
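For reference, the check is simply the following, run from a node inside the cluster (these names only resolve through the DC/OS resolvers):

    dig +short api.marathon.autoip.dcos.thisdcos.directory

While the errors are appearing, this keeps returning the IP of the already-stopped task.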

I've also attached the marathon.json file for marathon-lb.

What could be the issue?
We're available to provide further information if needed.

Thanks in advance,
Marco

EDIT: I've also attached the relevant crash.log from /opt/mesosphere/packages/navstar--...../navstar/log/
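For anyone looking for the same file, it can be located with something like the following (the exact package directory name varies per build, hence the wildcard):

    sudo find /opt/mesosphere/packages -type f -name crash.log -path '*navstar*'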

Activity

Nicholas Sun
January 26, 2017, 8:02 PM

Also, did Marathon-lb ever stabilize? Or was it flapping (going up, crashing, relaunched, etc.)?

Marco Reni
January 27, 2017, 11:57 AM

This is an Open DC/OS cluster, and the marathon-lb instance never crashes after launching.

We noticed that in the 3x master cluster the navstar errors appear only on the leader.

We did another test to provide further information. The starting configuration is:

  • Open DC/OS 1.8.7 on CentOS 7.3.1611, with 3x master and 1x slave.

  • 1x Marathon-lb instance up for 20 hours

  • 1x "api" instance up for 20 hours

  • No errors appearing in /var/log/messages on the leader, nor in the stdout/stderr of the marathon-lb instance.

We shut down the "api" service at 11:10 (by scaling it down from 1 to 0 instances; see the command sketch after the list below) and the following happens:

  • errors (attached) appear in the leader's messages log

  • no relevant logs on the other masters

  • no problems on the agent (log of the shutdown handling attached)

  • marathon-lb does NOT crash, but prints some errors on the instance stdout (attached - note, we have other services running on the cluster, marked as running-service-a through running-service-e)
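For reference (the command sketch mentioned above), the scale-down from the DC/OS CLI would be something like the following, assuming the Marathon app ID is /api:

    dcos marathon app update /api instances=0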

Here are the IP correspondences for the various roles:
10.249.3.241 = LEADER
10.249.3.240 = MASTER
10.249.3.251 = MASTER
10.249.3.252 = AGENT

Hope this helps,
Marco

Nicholas Sun
January 28, 2017, 1:17 AM

Just to confirm: the errors that occur on the leader eventually go away? I have reproduced this error, but it appears to go away after a couple of messages. If this is also the case for you, then this was likely something we fixed in DC/OS 1.9.

Marco Reni
January 28, 2017, 9:37 AM

Yes, the errors eventually go away. Yesterday they persisted for more than an hour before going away, while creating a 3x marathon-lb service, and in that time frame the DNS was not up to date. Is there a way to force the update or to work around this issue while on 1.8?
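For example, would restarting the navstar unit on the leader be a safe way to force the update?

    sudo systemctl restart dcos-navstar.service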

And also, is there an expected release date for DC/OS 1.9?

Thanks,
Marco

AdamB
January 31, 2017, 5:59 PM

We should have a public EA of 1.9 sometime in February, and GA likely in March. Exact dates will of course depend on stability and test results.

Assignee

Nicholas Sun

Labels

None

Components