Connection Timeout Issue

Connection Timeout Issue

Owen Pahl
Hi All,

Apologies for the length of this post/question. This covers a problem that
I have been attempting to debug for the past few months ...

I am having an ongoing problem with JMeter where I get connection timeout
failures once load passes a threshold on our load generation machines.
The error rate is 0.5-1% of the total number of transactions.

Packet captures show JMeter sometimes not closing connections correctly and
then re-using the port some time later, but before the other end has cleaned
up the half-open connection.
(I am also open to the possibility that this is a symptom rather than a
cause.)

Further details:
In this particular test scenario, I am testing a load-balanced application
that is fronted by a VIP hosted on an F5.

From the trace mentioned above, at loads above some threshold (yet to be
determined), when the F5 initiates the connection tear-down the JMeter
machine responds with an ACK but never sends a FIN to close its side of the
connection.
This leaves a half-open connection on the F5 but a closed connection on the
JMeter machine. Some time later, after the TIME_WAIT period (2 minutes) has
passed but before the idle-connection clean-up job runs on the F5
(5 minutes), a new connection is initiated on the same local port. The F5
sees this as further traffic on a connection it thought it had closed and
re-sends its FIN. This confuses the JMeter side, which thinks it is
standing up a new connection and re-sends its SYN. This repeats with
exponential back-off 7 times, as per the TCP spec, after which the
connection is reported as failing with a connection time of ~127 seconds.
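For what it's worth, the ~127 seconds lines up exactly with the kernel's SYN
retransmission schedule, assuming the usual RHEL 7 default of
net.ipv4.tcp_syn_retries = 6 (i.e. the initial SYN plus 6 retries, doubling
from 1 second):

```python
# SYN retransmission back-off: initial SYN plus SYN_RETRIES retransmits,
# each timeout double the previous one, starting at 1 second.
SYN_RETRIES = 6  # assumed default; check with: sysctl net.ipv4.tcp_syn_retries

timeouts = [2 ** i for i in range(SYN_RETRIES + 1)]
print(timeouts)       # [1, 2, 4, 8, 16, 32, 64]
print(sum(timeouts))  # 127 -- matches the observed ~127 s connect failure
```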

To confirm this as the probable cause, I was able to temporarily have the
clean-up timeout on the F5 reduced to 90 seconds. In this scenario, with
the same applied load, we saw no connection timeout errors.

- The load machine is RedHat 7.7 running, as far as I am aware, the
standard defaults (the host was built and is maintained by a 3rd-party
provider).
- I have tried JMeter 5.0 (the version we currently use at work) and 5.2.1,
again vanilla other than setting a large JVM heap size. No changes to
HttpClient config, JVM args, etc.
- The test is triggered from Jenkins, although I doubt this has any impact
as I am just starting JMeter as I would on the command line. Jenkins just
helps with pulling the results back through many jumps.
- I have tried the HotSpot 1.8.0_51 and OpenJDK 1.8.0_232 JVMs; both give
the same behaviour.
- CPU, memory and GC activity all look low to normal during the test.
- The JMeter host is a virtual machine with 4 CPUs and 16 GB RAM.
- To discount issues with the host application, I have run the same load
across multiple machines (on the same VLAN etc.) and everything works as
expected.
- At the point in the test where the errors are being logged, there are
still over 21k ephemeral ports available.
- The process peaks at ~1500 threads, although I'm not sure this is being
reported correctly by RedHat. I was expecting around 6-7x that number, as I
have configured the script to download embedded resources with 6 threads.
- The machine peaks at ~7500 open files, which roughly matches the number
of open connections reported by ss, as expected.
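To watch the per-state socket counts (TIME_WAIT, CLOSE_WAIT, etc.) build up
while the test runs, something like the following can be left running
alongside ss. It's a rough Linux-only sketch that parses the kernel's
socket table directly; the state codes are the standard /proc/net/tcp ones:

```python
from collections import Counter

# TCP state codes as they appear in column 4 ("st") of /proc/net/tcp
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def socket_state_counts(path="/proc/net/tcp"):
    """Count sockets per TCP state by parsing the kernel's socket table."""
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            state = TCP_STATES.get(fields[3], fields[3])
            counts[state] += 1
    return counts

if __name__ == "__main__":
    for state, n in socket_state_counts().most_common():
        print(f"{state:12} {n}")
```

(For IPv6 sockets the same format is in /proc/net/tcp6.)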

The documentation for Apache HttpClient has a section on connection
management that seems to describe the behaviour I'm seeing.

http://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/connmgmt.html#d5e405
"One of the major shortcomings of the classic blocking I/O model is that
the network socket can react to I/O events only when blocked in an I/O
operation. When a connection is released back to the manager, it can be
kept alive however it is unable to monitor the status of the socket and
react to any I/O events. If the connection gets closed on the server side,
the client side connection is unable to detect the change in the connection
state (and react appropriately by closing the socket on its end)."

It suggests running a separate thread to periodically clean up the
connection pools by calling
HttpClientConnectionManager.closeExpiredConnections(). I can see this
currently gets called, but only at the start of a new thread iteration.
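The evictor pattern the tutorial describes is just a daemon thread that
periodically sweeps the pool. Here is a language-neutral sketch in Python;
the DemoConn/DemoPool interfaces are made up for illustration, and in real
HttpClient the sweep would call closeExpiredConnections() (and optionally
closeIdleConnections()) on the connection manager instead:

```python
import threading
import time

class DemoConn:
    """Stand-in for a pooled connection (hypothetical interface)."""
    def __init__(self, last_used):
        self.last_used = last_used
        self.closed = False

class DemoPool:
    """Stand-in for a connection pool (hypothetical interface)."""
    def __init__(self, conns):
        self.conns = conns
    def idle(self):
        return [c for c in self.conns if not c.closed]
    def close(self, conn):
        conn.closed = True

class IdleConnectionEvictor:
    """Daemon thread that closes connections idle longer than max_idle --
    the same job each HttpClient closeExpiredConnections() sweep does."""
    def __init__(self, pool, max_idle=30.0, interval=5.0):
        self.pool = pool
        self.max_idle = max_idle
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def shutdown(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # wait() returns False on timeout, True once shutdown() is called
        while not self._stop.wait(self.interval):
            now = time.monotonic()
            for conn in self.pool.idle():
                if now - conn.last_used > self.max_idle:
                    self.pool.close(conn)
```

Since JMeter doesn't expose such an evictor, it may also be worth checking
the httpclient4.time_to_live and httpclient4.idletimeout properties in
jmeter.properties, which bound how long a pooled connection can be reused.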

This leads to some obvious questions:

- Has anyone seen this behaviour before?
- Is this a bug in JMeter/HttpClient?
- Could this be a bug in Java, or RedHat reusing ports too aggressively?
- Any suggestions for a workaround or further diagnosis?


Cheers,
Owen

Re: Connection Timeout Issue

Mariusz W
Hi,
Did you monitor the count of TIME_WAIT sockets during the test? Did you
check the free file-handle count for the process (jmeter/java)?
I see this (or similar) behaviour (connection timeout) on Windows 10 Pro.
But I noticed that while I get it when I run JMeter on a laptop connected
over VPN, I don't get it when I use a Windows server on the internal
network. The other components are similar (F5 and VIP). My current line of
diagnosis is the TIME_WAIT sockets on the Windows side; I haven't checked
the components in between yet.

Regards,
Mariusz

On Fri, 17 Apr 2020 at 11:48, Owen Pahl <[hidden email]> wrote:

> Hi All,
>
> Apologies for the length of this post/question. This covers a problem that
> I have been attempting to debug for the past few months ...
>
> I am having an on-going problem with JMeter where I get connection time out
> failures once load passes a threshold on our load generation machines.
> The error rate is 0.5-1% of total number of transactions.
>
> Packet captures show JMeter sometimes not closing connections correctly and
> then re-using the port sometime later but before the other end has cleaned
> up the half-open connection.
> (I am also open to the possibility that this is a symptom rather than a
> cause).
>
> Further details:
> In this particular test scenario, I am testing a load-balanced application
> that is fronted by a VIP hosted on an F5.
>
> From the trace mentioned above, at loads above some threshold (yet to be
> determined) when the F5 initiates the connection tear-down the JMeter
> machine responds with an ACK but never sends a FIN to close its side of the
> connection.
> This leaves a half-open connection on the F5 but a closed connection on the
> JMeter machine. Some time later after the timed-wait period (2 mins) is
> passed but before the idle connection clean up job runs on the F5 (5 mins);
> a connection is initiated on the same local port. The F5 sees this as
> further traffic on a connection it thought it closed and re-sends its FIN.
> This confuses the JMeter side that thinks it is standing up a new
> connection and re-sends its SYN. This repeats with exponential back-off 7
> times as per the TCP spec. After this the connection is reported as failing
> with a connection time of ~127 seconds.
>
> To prove this is the probable cause I was able to temporarily get the clean
> up timeout on the F5 reduced to 90 seconds. In this scenario with the same
> applied load we saw no connection timeout errors.
>
> The load machine is RedHat 7.7 running as far as I am aware, the standard
> defaults (the host was built and maintained by a 3rd party provider).
> I have tried JMeter 5.0 (the current version we use at work) and 5.2.1,
> again running vanilla here other than setting a large JVM heap size. No
> changes to HTTPClient config, JVM args etc.
> The test is triggered from Jenkins although I doubt this has any impact as
> I am just starting JMeter as I would on the command line. Jenkins just
> helps with pulling the results back through many jumps.
> I have tried on the HotSpot 1.8.0_51 and openJDK 1.8.0_232 JVMs, both give
> the same behaviour.
> CPU, memory and GC activity all look low to normal during the test.
> The JMeter host is a virtual machine with 4 cpus and 16GB ram.
> And to discount issues with the host application, I have run the same load
> across multiple machines (on the same VLAN etc) and everything works as
> expected.
> During the test at the point where the errors are being logged, there are
> still over 21k ephemeral ports available.
> The process is peaking at ~1500 threads although I'm not sure if this is
> being reported correctly by RedHat. I was expecting around 6-7x that number
> as I have configured the script to download embedded resources with 6
> threads.
> The machine is peaking at ~7500 open files which matches roughly the number
> of open connections reported by ss as expected.
>
> I see in the documentation for Apache HttpClient there is a section on
> connection management that seems to describe the behaviour displayed.
>
>
> http://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/connmgmt.html#d5e405
> "One of the major shortcomings of the classic blocking I/O model is that
> the network socket can react to I/O events only when blocked in an I/O
> operation. When a connection is released back to the manager, it can be
> kept alive however it is unable to monitor the status of the socket and
> react to any I/O events. If the connection gets closed on the server side,
> the client side connection is unable to detect the change in the connection
> state (and react appropriately by closing the socket on its end)."
>
> It suggests running a separate thread to periodically clean up the
> connection pools by calling
> HttpClientConnectionManager.closeExpiredConnections();. I can see this
> currently gets called but only at the start of a new thread iteration.
>
> This leads to some obvious questions:
>
> - is this behaviour anyone has seen before?
> - is this a bug with JMeter/HTTP Client?
> - is this maybe a bug with Java? or potentially RedHat reusing ports too
> aggressively?
> - Any suggestions for a work-around or further diagnosis?
>
>
> Cheers,
> Owen
>