This chapter highlights a typical migration strategy to achieve minimal or no service interruption and no data loss. It is informational.
0. Initial state
Here is a typical production setup. The production flow goes through
LTR refers to the components in charge of forwarding the data from remote sites; LMC refers to those
deployed in the back-end data center to receive and process (i.e. parse) the data.
1. Start of migration
- The Ansible and PunchPlatformAdmin (admin_user) functions are upgraded only on node 1 (ASB01 and LMC01).
- Disable PunchPlatformAdmin (admin_server) on the online node (node 1).
- Deploy the new Storm on LTR2.
- Start [tenant]-a1-ltr2 (a for the Ahmad version, 1 for the first migration), which is composed of one input topology that writes logs to a new topic called [tenant]-a1-[channel].
At this step, the production flow is still processing. The new topologies run only on LTR2 and do not receive any logs yet.
- Start bridge-a1-[channel] from LMC02 (the node that is not yet migrated). It reads logs from the old Kafka topic and sends them to [tenant]-a1-[channel].
At this step, the new Kafka topic starts to fill up with production logs, and the production flow is still processing.
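The naming convention used in these steps can be made explicit with a small helper. This is only an illustrative sketch: the `MigrationNames` class, its fields, and the sample tenant and channel values are hypothetical and not part of the platform; only the name patterns come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationNames:
    """Hypothetical helper that derives the names used during a migration."""

    tenant: str
    channel: str
    version: str = "a"   # "a" stands for the Ahmad version
    migration: int = 1   # 1 for the first migration

    @property
    def tag(self) -> str:
        # Version letter followed by the migration number, e.g. "a1".
        return f"{self.version}{self.migration}"

    @property
    def ltr_channel(self) -> str:
        # Pattern: [tenant]-a1-ltr2
        return f"{self.tenant}-{self.tag}-ltr2"

    @property
    def new_topic(self) -> str:
        # Pattern: [tenant]-a1-[channel]
        return f"{self.tenant}-{self.tag}-{self.channel}"

    @property
    def bridge(self) -> str:
        # Pattern: bridge-a1-[channel]
        return f"bridge-{self.tag}-{self.channel}"


# Sample values, chosen only for the example.
names = MigrationNames(tenant="mytenant", channel="apache")
print(names.ltr_channel)  # mytenant-a1-ltr2
print(names.new_topic)    # mytenant-a1-apache
print(names.bridge)       # bridge-a1-apache
```

Deriving every name from the same tag keeps the input topology, the bridge, and the new topic consistent when a later migration uses a different version letter or number.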
3. Migration Log Processing Logic
- Start a new temporary channel [tenant]-migr-a1 from LMC1 to test the migrated code. This channel provides a way to test the parsing, enrichment, etc. It is based on the first_offset Kafka strategy and stores the parsed logs in a temporary index, [tenant]-migr-a1-YYYY.MM.DD, in the Elasticsearch query cluster.
At this step, the production flow is still processing. The newly parsed logs are stored in the query cluster so that the staff can check that logs are processed correctly.
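The sketch below illustrates the two ideas in this step: the daily name of the temporary index, and why replaying a topic from its first offset does not disturb production. It assumes that first_offset means "start at the earliest retained record"; the functions `temp_index` and `read_from`, the in-memory list standing in for a Kafka topic, and the sample values are all hypothetical.

```python
from datetime import date


def temp_index(tenant: str, day: date, tag: str = "a1") -> str:
    """Build the daily index name [tenant]-migr-a1-YYYY.MM.DD."""
    return f"{tenant}-migr-{tag}-{day:%Y.%m.%d}"


def read_from(topic: list, start_offset: int) -> list:
    """Return the records a consumer sees when starting at start_offset."""
    return topic[start_offset:]


# In-memory stand-in for a Kafka topic (hypothetical data).
topic = ["log-0", "log-1", "log-2", "log-3"]

# Assumption: first_offset means "start at the earliest retained record",
# so the temporary channel re-parses everything still in the topic.
test_channel_view = read_from(topic, 0)

# The production consumer keeps its own position (here offset 3),
# so the replay done by the temporary channel does not affect it.
production_view = read_from(topic, 3)

print(temp_index("mytenant", date(2024, 3, 5)))  # mytenant-migr-a1-2024.03.05
print(test_channel_view)
print(production_view)
```

Because the test channel writes to its own dated index, its output can be inspected and deleted without touching the production indices.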
4. Start with the new Storm
- Update the production configuration.
- Switch the virtual IP to the new input topology on LTR2.
At this step, the production flow is processed by the new version of Storm.
5. End of Migration
- Kill all topologies in the LTR1 and Main Storm clusters.
- Deploy the new Storm in the LTR1 and Main Storm clusters.
- Start the updated production configuration channels.
- Deploy the new PunchPlatformAdmin on both nodes.
During this step, the service is interrupted only for the duration of the IP switch, at most 2 minutes.
- Update Kibana Dashboards with the new metrics.
- Clean up the old topics and channels in ZooKeeper and Kafka.