Troubleshooting Slow LTR-LMR Connection
Context
During integration, the flow between the LTR and the LMR is sometimes very slow, i.e. only a few hundred events per second (EPS).
Two usual reasons
- 1) PunchPlatform configuration issue
- 2) Network issue
Each issue is explained below, along with how to fix it.
PunchPlatform configuration issue
During log transport, logs are stored in a queue (Kafka) before being sent to the LMR. The component responsible for this (the topology) reads the queue, forwards the logs, and waits for an acknowledgement. Frequently the network offers high bandwidth but is somewhat slow to return acknowledgements, so a large number of logs awaiting acknowledgement are held in memory directly in the topology.
To prevent an out-of-memory error, the topology.max.spout.pending setting caps the number of logs that may be pending acknowledgement.
If topology.max.spout.pending is too low, the component stops reading the queue while it waits for acknowledgements, which causes a significant drop in performance.
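To see why a low value throttles the flow, a rough bound helps: if at most N logs can be pending and each acknowledgement takes T seconds to return, the topology cannot sustain more than about N / T events per second. The sketch below only illustrates that arithmetic. The function name, the 0.5 s acknowledgement latency, and the single spout task are assumptions made for the example, not measured or documented values.

```python
def max_throughput_eps(max_spout_pending, ack_latency_s, spout_tasks=1):
    """Rough upper bound on events per second.

    Assumes at most `max_spout_pending` events per spout task may be
    un-acknowledged at once, and that each acknowledgement takes
    `ack_latency_s` seconds to come back (hypothetical figure).
    """
    if ack_latency_s <= 0:
        raise ValueError("ack_latency_s must be positive")
    return spout_tasks * max_spout_pending / ack_latency_s


# Hypothetical scenario: acknowledgements take 0.5 s to return.
for pending in (600, 6000, 30000):
    print(pending, max_throughput_eps(pending, ack_latency_s=0.5))
```

With these assumed figures, a limit of 600 caps the flow at roughly 1200 EPS, whatever the available bandwidth.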
We recommend setting it according to the following table:

| EPS (avg)              | topology.max.spout.pending |
|------------------------|----------------------------|
| under 1000             | 600                        |
| between 1000 and 5000  | 6000                       |
| between 5000 and 10000 | 12000                      |
| above 10000            | 30000                      |
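The table can be expressed as a small lookup, for instance to check a configuration in a script. This is a minimal sketch: the function name is hypothetical, and because the table does not say which row owns the exact boundary values (1000, 5000, 10000), the choice made here is an assumption.

```python
def recommended_max_spout_pending(avg_eps):
    """Return the recommended topology.max.spout.pending for an average EPS.

    Boundary handling is an assumption: 1000 and 5000 fall into the
    higher row, 10000 falls into the "between 5000 and 10000" row.
    """
    if avg_eps < 0:
        raise ValueError("avg_eps must not be negative")
    if avg_eps < 1000:
        return 600
    if avg_eps < 5000:
        return 6000
    if avg_eps <= 10000:
        return 12000
    return 30000


print(recommended_max_spout_pending(3500))
```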
Network issue
This issue is a more advanced one.
Between the LTR and the LMR, we recommend using an acknowledged application-level protocol such as Lumberjack. This means there is a non-negligible return flow (the acknowledgements).
The pattern is easiest to see on a graph of the flow rate, which alternately rises and falls.
The worker logs should also show the following warning before and after each rise and fall in rate:
2017-11-13 16:15:00.283 c.t.s.c.p.c.n.i.NettyLumberJackClientImpl [WARN] message="keepalive timeout reached" now=60046 timeout=60000
The cause is a Lumberjack timeout during the keepalive test. Usually, this means the acknowledgements are not making it back from the LMR to the LTR.
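To confirm the symptom across a whole worker log rather than by eye, the warning can be searched for programmatically. The sketch below assumes every occurrence has the same layout as the sample line above. The function name is hypothetical, and the regular expression is derived only from that one sample.

```python
import re

# Pattern derived from the single sample line above (an assumption
# about the general log layout).
KEEPALIVE_RE = re.compile(
    r'(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*?'
    r'message="keepalive timeout reached" '
    r'now=(?P<now>\d+) timeout=(?P<timeout>\d+)'
)


def find_keepalive_timeouts(lines):
    """Return (timestamp, now, timeout) for each keepalive timeout warning."""
    hits = []
    for line in lines:
        match = KEEPALIVE_RE.search(line)
        if match:
            hits.append(
                (match.group("ts"), int(match.group("now")), int(match.group("timeout")))
            )
    return hits


sample = [
    '2017-11-13 16:15:00.283 c.t.s.c.p.c.n.i.NettyLumberJackClientImpl [WARN] '
    'message="keepalive timeout reached" now=60046 timeout=60000',
    'some unrelated log line',
]
print(find_keepalive_timeouts(sample))
```

Warnings recurring at regular intervals, lined up with the drops in rate, point to this network issue rather than to the configuration issue.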
The fix is to reduce the MTU, for instance to 800 bytes.
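On a Linux host, one way to apply this is the standard `ip link set dev <interface> mtu <value>` command. The sketch below wraps it. It assumes a Linux system with iproute2 installed and root privileges, and the interface name `eth0` is a placeholder, since this guide does not say which host or interface to change. A change made this way does not survive a reboot.

```python
import subprocess
from pathlib import Path


def read_mtu(interface):
    """Read the current MTU of a Linux network interface from sysfs."""
    return int(Path(f"/sys/class/net/{interface}/mtu").read_text().strip())


def mtu_command(interface, mtu):
    """Build the iproute2 command that sets the MTU of an interface."""
    if mtu < 68:  # 68 bytes is the minimum MTU for IPv4
        raise ValueError("mtu is below the IPv4 minimum of 68 bytes")
    return ["ip", "link", "set", "dev", interface, "mtu", str(mtu)]


def set_mtu(interface, mtu):
    """Apply the new MTU (requires root); raises if the command fails."""
    subprocess.run(mtu_command(interface, mtu), check=True)


# "eth0" is a placeholder interface name.
print(" ".join(mtu_command("eth0", 800)))
```

Calling `read_mtu` before and after `set_mtu` confirms that the change took effect.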