Troubleshooting Slow LTR-LMR connection¶

Symptoms¶

During integration, the flow between the LTR and the LMR is sometimes very slow, i.e. only a few hundred events per second (EPS).

There are two usual causes:

A punch configuration issue
A network issue

Punch configuration issue¶

During log transport, the logs are stored in a queue (Kafka) before being sent to the LMR. The component responsible for this function reads the queue, forwards the logs, and waits for an acknowledgement. Frequently, the network offers a large bandwidth but is somewhat slow to respond, so many logs are held in memory in the component (the topology) while they await acknowledgement.

To prevent an out-of-memory error, the topology.max.spout.pending setting caps the number of logs that may be awaiting acknowledgement at any time.

If topology.max.spout.pending is too low, the component reaches this cap quickly and stops reading the queue until acknowledgements arrive. This causes a significant drop in performance.
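The ceiling this creates can be estimated with Little's law: if at most N logs may be unacknowledged and an acknowledgement takes T seconds on average to come back, the sustained rate cannot exceed N / T. The sketch below is only a back-of-the-envelope illustration. The function name and the round-trip figure are hypothetical, and it assumes one pending entry per log and a single reader task.

```python
def max_sustainable_eps(max_spout_pending: int, ack_round_trip_s: float) -> float:
    """Upper bound on events per second for a given cap on in-flight logs.

    Little's law: throughput <= items in flight / time each item stays in flight.
    """
    if max_spout_pending <= 0 or ack_round_trip_s <= 0:
        raise ValueError("both arguments must be positive")
    return max_spout_pending / ack_round_trip_s


if __name__ == "__main__":
    # Illustrative figure only: a 2-second acknowledgement round trip.
    for pending in (600, 6000, 30000):
        print(pending, max_sustainable_eps(pending, ack_round_trip_s=2.0))
```

Under these assumptions, a cap of 600 with a 2-second round trip bounds the flow at 300 EPS, which is the order of magnitude described in the symptoms.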

We recommend setting this value according to the following table:

EPS (avg)                 topology.max.spout.pending
under 1000                600
between 1000 and 5000     6000
between 5000 and 10000    12000
above 10000               30000
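For scripted deployments, the table above can be expressed as a small lookup. This is a minimal sketch, not part of the punch tooling: the function name is hypothetical, and because the table does not say which row the boundary values (1000, 5000, 10000) belong to, the code assumes each upper bound is exclusive.

```python
# (upper EPS bound, recommended value) pairs taken from the table above.
# Assumption: upper bounds are exclusive, so exactly 1000 EPS falls in the
# "between 1000 and 5000" row.
_RECOMMENDATIONS = [
    (1_000, 600),
    (5_000, 6_000),
    (10_000, 12_000),
]
_ABOVE_10K = 30_000


def recommended_max_spout_pending(avg_eps: float) -> int:
    """Return the recommended topology.max.spout.pending for an average EPS."""
    if avg_eps < 0:
        raise ValueError("avg_eps must not be negative")
    for upper_bound, value in _RECOMMENDATIONS:
        if avg_eps < upper_bound:
            return value
    return _ABOVE_10K
```

For example, `recommended_max_spout_pending(7500)` falls in the third row of the table.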

Network Issues¶

Between the LTR and the LMR, the acknowledged lumberjack protocol is used. This protocol is two-way, but only a small amount of information (the acknowledgements) travels upstream. It also relies on a keepalive mechanism, which helps you detect a slow network.

Check the worker logs, before and after increasing and decreasing the rate, for messages such as:

2017-11-13 16:15:00.283 c.t.s.c.p.c.n.i.NettyLumberJackClientImpl [WARN] message="keepalive timeout reached" now=60046 timeout=60000

This message indicates a lumberjack timeout during the keepalive test. Check your bandwidth capacity using the punchplatform-injector.sh tool. Also check that the MTU settings of your link are correct.
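To spot these warnings across a large worker log, a short script can help. The sketch below is an assumption-laden illustration: the regular expression is derived only from the single sample line above, the function name is hypothetical, and the timestamp is taken to be the first two whitespace-separated fields of the line.

```python
import re
from typing import Iterable, Iterator, Tuple

# Pattern built from the sample warning shown above; other log layouts
# may need a different expression.
KEEPALIVE_RE = re.compile(
    r'message="keepalive timeout reached"\s+now=(?P<now>\d+)\s+timeout=(?P<timeout>\d+)'
)


def find_keepalive_timeouts(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (timestamp, now, timeout) for each keepalive timeout warning."""
    for line in lines:
        match = KEEPALIVE_RE.search(line)
        if match:
            # Assumes the line starts with "<date> <time>".
            timestamp = " ".join(line.split()[:2])
            yield timestamp, int(match["now"]), int(match["timeout"])


if __name__ == "__main__":
    sample = (
        "2017-11-13 16:15:00.283 c.t.s.c.p.c.n.i.NettyLumberJackClientImpl "
        '[WARN] message="keepalive timeout reached" now=60046 timeout=60000'
    )
    for hit in find_keepalive_timeouts([sample]):
        print(hit)
```

Counting the hits before and after a rate change gives a quick indication of whether the timeouts correlate with load.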