The other day we discovered a bug in older Linux kernel versions and I thought you would like to know about it.
Use Linux kernel 3.19+ when running Hazelcast on AWS. With older kernel versions, TCP connections can get stuck: they appear to be fine, but no data flows through them. This can result in hard-to-explain timeouts.
The Gory Details
tl;dr: There is a bug in the Xen network driver, and AWS happens to use Xen for virtualization. A workaround is to disable the buggy features (sudo ethtool -K eth0 sg off), but it comes with a performance price, so it’s better to use a kernel version that includes the fix, i.e. 3.19 or newer. I assume Linux distribution vendors back-ported the fix to older kernel versions, but I have not checked that.
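If you want a quick way to see which side of the 3.19 line a machine is on, something like the following could work. It is only a rough sketch: the function name kernel_has_fix is made up for this post, and it compares nothing but the version number reported by uname -r, so it cannot tell whether a distribution back-ported the fix to an older kernel.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hazelcast or any distribution):
# succeeds if the given kernel release string is 3.19 or newer.
# It looks only at the major.minor numbers, so a vendor kernel with a
# back-ported fix will still be reported as "older than 3.19".
kernel_has_fix() {
    release="$1"                  # e.g. "3.13.0-48-generic"
    major="${release%%.*}"        # text before the first dot -> "3"
    rest="${release#*.}"          # text after the first dot  -> "13.0-48-generic"
    minor="${rest%%[!0-9]*}"      # leading digits of the rest -> "13"
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 19 ]; }
}

current="$(uname -r)"
if kernel_has_fix "$current"; then
    echo "kernel $current: 3.19 or newer"
else
    echo "kernel $current: older than 3.19, check with your vendor for a back-ported fix"
fi
```

If the check reports an older kernel and you decide to apply the workaround, ethtool -k eth0 (lowercase k) lists the current offload settings, so you can confirm whether scatter-gather is on or off. The interface name eth0 is the one used in this post; yours may differ.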