
Troubleshooting: Slow LTR-LMR Connection

Context

During integration, the flow between the LTR and the LMR is sometimes very slow, i.e. only a few hundred Events Per Second (EPS).

Two usual reasons

  • 1) PunchPlatform configuration issue
  • 2) Network issue

Let's explain these issues and how to fix them.

PunchPlatform configuration issue

During log transport, logs are stored in a queue (Kafka) before being sent to the LMR. The component responsible for this function reads the queue and waits for acknowledgements. Frequently, the network offers a large bandwidth but with noticeable latency, so a large number of logs end up stored in memory directly in the component (topology).

To prevent an Out Of Memory error, the topology.max.spout.pending setting limits the number of unacknowledged logs kept in flight.

If topology.max.spout.pending is too low, the component stops reading the queue while waiting for acknowledgements, which produces a significant drop in performance.

We recommend updating this setting according to the following table (see the configuration sketch after the table):


EPS (avg)               topology.max.spout.pending
under 1000              600
between 1000 and 5000   6000
between 5000 and 10k    12000
above 10k               30000
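
As an illustration, here is a minimal sketch of where this setting typically lives in a forwarding topology JSON file. The storm_settings section name and the surrounding layout are assumptions (they may differ between PunchPlatform releases), and the // comment and "..." elisions are for illustration only, not valid strict JSON.

    {
      "spouts": [ ... ],
      "bolts": [ ... ],
      "storm_settings": {
        // recommended value for an average rate between 1000 and 5000 EPS (see table above)
        "topology.max.spout.pending": 6000
      }
    }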


Network Issue

This issue is a more advanced one.

Between the LTR and the LMR, we recommend using an acknowledged application-level protocol such as Lumberjack. This means a non-negligible return flow exists (acknowledgements travelling back from the LMR to the LTR).

In fact, this pattern is easier to see on the traffic rate graphs:

(figure: traffic rates between the LTR and the LMR, alternately increasing and decreasing)

And you should see the following in the worker logs (at the moments the rate increases and decreases):

2017-11-13 16:15:00.283 c.t.s.c.p.c.n.i.NettyLumberJackClientImpl [WARN] message="keepalive timeout reached" now=60046 timeout=60000

The issue is a Lumberjack timeout during the keepalive test. Typically, the acknowledgements never make it back from the LMR to the LTR.
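
To confirm that large packets are being dropped on the path (a classic cause of missing acknowledgements), you can probe the path MTU with ping and the Don't Fragment flag. This is a sketch: the LMR address is a placeholder and the packet sizes are examples to adapt.

    # 1472-byte payload + 28 bytes of headers = 1500-byte packet; fails if the path MTU is lower
    ping -M do -s 1472 <LMR_ADDRESS>

    # 772-byte payload + 28 bytes of headers = 800-byte packet; should pass after the MTU fix below
    ping -M do -s 772 <LMR_ADDRESS>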

The fix is to reduce the MTU, for instance to 800 bytes.
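
As an illustration, on a Linux host this can be done with the ip command. The interface name eth0 is an assumption, and the change is not persistent across reboots, so it should also be applied in your network configuration.

    # temporarily lower the MTU of the interface carrying the LTR-LMR traffic (eth0 is an assumption)
    ip link set dev eth0 mtu 800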