VIPs network failure

Description

Hi,
We are using DCOS 1.8.7 on CoreOS Stable 1235.5.0 and we have been encountering stability issues with VIPs.

The VIPs seem to work, but we get a lot of connection errors in our application logs. Sometimes they stop working completely until we reboot the affected nodes.

In our Python web app we get a lot of these errors about the connection to the PostgreSQL server:

In our GitLab instance we see the same kind of error about the connection to the external PostgreSQL server:

In our nginx frontend, which tries to connect to our Python app, we get these errors:

We are using the curl command to check whether the VIPs are working.

It works when I try to access a Python web app from a public node.

But if I try the same command from inside our nginx container:

The curl command fails 9 out of 10 times from inside the nginx container.
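A failure rate like this could be measured with a small loop. The following is only a minimal sketch, assuming a POSIX shell inside the container; `success_rate` is a hypothetical helper, and the VIP name and port in the usage comment are placeholders rather than the values from our cluster.

```shell
#!/bin/sh
# success_rate N CMD...: run CMD N times and print "<successes>/<N>".
# Hypothetical helper for illustration; not part of DC/OS.
success_rate() {
    total=$1
    shift
    ok=0
    i=0
    while [ "$i" -lt "$total" ]; do
        # Count the attempt as a success only if CMD exits with status 0.
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$total"
}

# Example usage (placeholder VIP name and port):
# success_rate 10 curl -sf --max-time 2 http://myapp.marathon.l4lb.thisdcos.directory:8080/
```

Running the same call from a public node and from inside the container makes the two success counts directly comparable.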

The connection usually improves when we reboot the involved nodes, but it starts failing again after a while.

We also found this topic: https://groups.google.com/a/dcos.io/forum/#!searchin/users/vips/users/bKv9mucQBi0/QxgwmczmAAAJ which looks similar to the problem we have, so we added the file /etc/sysctl.d/netfilter.conf with the following content on every node:

However, it doesn't solve our network issue.

Do you have any idea where the problem could be?
We can provide more information about our configuration and environment if necessary.

Activity

Albert Strasheim
January 20, 2017, 12:33 AM

Think we've answered this one. Please reopen if not. Thanks for the report.

Guilhem
January 16, 2017, 3:59 PM
Edited

After quite a few server restarts I was able to find out why nf_conntrack_tcp_be_liberal wouldn't configure properly at reboot.
When we tried to cat the parameter right after a server reboot, it said the file didn't exist. The same error appeared in the minuteman logs.

The nf_conntrack module was not yet loaded when minuteman or our own config file tried to set the parameter.
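This situation can be told apart from a parameter that is simply set to 0 by checking whether the sysctl file exists at all. A minimal sketch, assuming a POSIX shell; `check_sysctl` is a hypothetical helper name, while the path in the usage comment is the standard procfs location of this parameter.

```shell
#!/bin/sh
# check_sysctl FILE: print the current value if the sysctl file exists,
# otherwise report that it is missing (typically because the kernel
# module that provides it has not been loaded yet).
check_sysctl() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "missing: $1 (is the conntrack module loaded?)"
    fi
}

# Example usage:
# check_sysctl /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal
```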

The link you gave me here https://coreos.com/os/docs/latest/other-settings.html#tuning-sysctl-parameters helped a lot.
As they say in the first paragraph, there is a way to force the module to load early. Their example looked like what we needed, but it did not use the right module. The fix was to load the nf_conntrack_ipv4 module early.
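For anyone hitting the same problem, here is a rough sketch of that kind of setup, following the pattern from the CoreOS page (one file in modules-load.d, one in sysctl.d). The two file names, the value 1 for the parameter, and the `write_conntrack_config` helper are assumptions made for illustration, not a copy of our actual configuration. The helper takes a root prefix so it can be tried in a scratch directory first.

```shell
#!/bin/sh
# write_conntrack_config ROOT: write both config files under ROOT.
# Pass an empty string to target the real /etc (requires root privileges).
# Hypothetical helper; the file names below are illustrative choices.
write_conntrack_config() {
    root=$1
    mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d"
    # Ask systemd-modules-load to load the module early at boot, so the
    # sysctl key exists by the time the sysctl settings are applied.
    echo "nf_conntrack_ipv4" > "$root/etc/modules-load.d/nf_conntrack.conf"
    # Assumed desired value: 1 (the thread reports 0 after reboot as the problem).
    echo "net.netfilter.nf_conntrack_tcp_be_liberal = 1" \
        > "$root/etc/sysctl.d/90-conntrack-liberal.conf"
}

# Example usage against a scratch directory:
# write_conntrack_config /tmp/conntrack-test
```

On CoreOS the same two files could instead be provisioned through cloud-config `write_files`, as the linked documentation does for its own example.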

We will test the VIPs again and report if we still encounter issues but hopefully it should be alright.

Senthil Kumaran
January 13, 2017, 8:43 PM

The information looks exactly like what we want, and we use this information to set the settings for the service (Ref: https://github.com/dcos/dcos/blob/1.8.7/pkgpanda/actions.py#L295)

You can verify in your journald logs that the services try to set these values. For example, I ran:

I can see the bootstrap process attempting to set these values before running the service.

The "cannot stat" messages are for inapplicable settings; the setting that is applicable is indeed set. This can be verified right after the service is started.


1) Is there any other custom process that is overwriting those values? Can you try removing the Ansible-managed block in your netfilter.conf?
2) For manual tuning of the kernel parameters, I followed the advice given here: https://coreos.com/os/docs/latest/other-settings.html#tuning-sysctl-parameters and found that my tuning parameters persisted across reboots.

So, one way or the other (DCOS bootstrap or sysctl.d), we should be covered. Perhaps you could try the above on your system, and if you find any oddity, let us know.

Guilhem
January 13, 2017, 3:18 PM
Edited

1)

2) The netfilter.conf file already contains:

But no, it doesn't seem to work, as the parameter is set to 0 when I check after a reboot.

Senthil Kumaran
January 13, 2017, 2:31 PM

Thanks for the above information. I'll need two additional details about your system setup.

1. Can you log in to any node and give the output of

The information in that is used by packages while configuring the services and it should happen at restart too.

2. Do your /etc/sysctl.d/netfilter.conf changes persist across reboots? (If yes, you could try a similar entry for nf_conntrack_tcp_be_liberal too.) Although, as I mentioned in step 1, this should not be required, as we maintain a service configuration setup, and whenever a particular service is loaded we set the configuration before loading it.

Done

Assignee

Albert Strasheim

Labels

None