I have recently been scaling down some Redis deployments for the business, which involves data migration. Migrating Redis data looks fairly simple: for a one-to-one migration, configuring the two parameters masterauth and slaveof on the slave is enough. Of course, the migration can run into other special cases that need special handling.

Once the above steps are done, we wait for the instance switchover. Before switching, however, we still need to check replication status, data consistency, and so on. While checking replication I noticed something strange: of the 540 instances being migrated, some (20 of them) showed a fairly high lag that kept increasing, and, stranger still, their offset value stayed at 0.

After observing for a while, the phenomenon did not go away: lag kept growing over time and offset stayed at 0. Had the slave not applied the data at all?
After connecting to the slave with the CLI, I found that the slave's dbsize was basically the same as the master's. Running monitor on the slave also showed commands being replicated normally, and a key set on the master could be read correctly on the slave. So I looked at the slave's log:
[13739] 10 Aug 15:49:46.017 * Connecting to MASTER xxxx.xxxx.xxx:xxx
[13739] 10 Aug 15:49:46.017 * MASTER <-> SLAVE sync started
[13739] 10 Aug 15:49:46.018 * Non blocking connect for SYNC fired the event.
[13739] 10 Aug 15:49:46.032 * Master replied to PING, replication can
[13739] 10 Aug 15:49:46.092 * Partial resynchronization not possible (no cached
[13739] 10 Aug 15:49:46.120 # Unexpected reply to PSYNC from master: -LOADING Redis is loading the dataset in memory
[13739] 10 Aug 15:49:46.120 * Retrying with SYNC...
[13739] 10 Aug 15:49:46.641 * MASTER <-> SLAVE sync: receiving 26845995 bytes
[13739] 10 Aug 15:49:47.276 * MASTER <-> SLAVE sync: Flushing old data
[13739] 10 Aug 15:49:47.276 * MASTER <-> SLAVE sync: Loading DB in memory
[13739] 10 Aug 15:49:47.620 * MASTER <-> SLAVE sync: Finished with success
[13739] 10 Aug 15:49:47.621 * Background append only file rewriting started by pid 22605
[22605] 10 Aug 15:49:48.724 * SYNC append only file rewrite performed
[22605] 10 Aug 15:49:48.725 * AOF rewrite: 0 MB of memory used by copy-on-write
[13739] 10 Aug 15:49:48.822 * Background AOF rewrite terminated with success
[13739] 10 Aug 15:49:48.822 * Parent diff successfully flushed to the rewritten AOF (1148 bytes)
[13739] 10 Aug 15:49:48.822 * Background AOF rewrite finished successfully
It seems to be normal, too :(

So here is the question: since the data replicates properly, why does the master report that the slave keeps falling further behind?
Opening the source code shows how lag and offset are calculated:
if (slave->replstate == SLAVE_STATE_ONLINE)
    lag = time(NULL) - slave->repl_ack_time;

info = sdscatprintf(info,
    /* ... format string and leading arguments elided in this excerpt ... */
    slave->repl_ack_off, lag);

/* The two fields used above, from the client structure: */
long long repl_ack_off; /* Replication ack offset, if this is a slave. */
long long repl_ack_time;/* Replication ack time, if this is a slave. */

So lag is the master's current time minus the time at which the slave's last ACK was received, and offset is the offset carried by that ACK. At this point we can suspect that the slave has never sent an ACK.
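The calculation can be restated as a minimal sketch. The `slave_view` struct and `slave_lag` function are hypothetical names for illustration only, not Redis source; they just mirror the two fields from the excerpt.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, trimmed-down view of the per-slave state kept by the master. */
typedef struct {
    int online;              /* 1 once the slave has reached the online state */
    long long repl_ack_off;  /* offset carried by the last ACK, 0 if none arrived */
    time_t repl_ack_time;    /* time at which the last ACK arrived */
} slave_view;

/* lag = now - time of the last ACK; 0 for a slave that is not online yet. */
long slave_lag(const slave_view *s, time_t now) {
    assert(s != NULL);
    if (!s->online) return 0;
    return (long)(now - s->repl_ack_time);
}
```

If no ACK ever arrives, repl_ack_time is never refreshed and repl_ack_off is never updated, so under this model lag grows by one every second while offset stays at 0, which matches what was observed.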

Running the monitor command on a master whose lag and offset values are normal shows that the slaves of those instances do send ACK commands back, whereas for the seemingly abnormal nodes no ACK comes back.

That looks like the crux of the problem. So why do these seemingly abnormal slaves not send ACKs to the master? To find out, we have to peel back the slave's behavior layer by layer.

Checking the slave's log again and comparing it with the other instances reveals some differences:

The comparison shows that the log of an instance that sends ACKs to the master normally contains a few more lines, and those lines may be the key to whether ACKs are sent.

Reading the source code gives the following picture:
replicationCron is executed once per second. If the current instance has masterhost configured, it checks the replication state server.repl_state. If the state is REPL_STATE_CONNECT /* Must connect to master */, the instance tries to connect to the master with connectWithMaster(). If the connection is established, synchronization with the master proceeds in syncWithMaster(): the slave first sends a PING to the master and waits for the PONG reply. Once the master has replied and the AUTH step is done, the slave attempts a partial resynchronization through slaveTryPartialResynchronization(), and a full synchronization takes place if the partial one fails. The command sent in this function is PSYNC. When the master's reply is neither +FULLRESYNC nor +CONTINUE, the slave assumes that the master does not understand PSYNC, and, to stay compatible with the old replication protocol, it resends the request to the master using the SYNC command.
psync_result = slaveTryPartialResynchronization(fd);
if (psync_result == PSYNC_CONTINUE) {
    redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial
/* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
 * and the server.repl_master_runid and repl_master_initial_offset are
 * already populated. */
if (psync_result == PSYNC_NOT_SUPPORTED) {
    redisLog(REDIS_NOTICE,"Retrying with SYNC...");

After the command has been sent, if it succeeds, the remaining steps follow: receiving the RDB for the full synchronization, flushing the old data, loading the RDB, and so on.

This explains the difference in log output between the two kinds of instance: one synchronized using PSYNC, the other using SYNC. For an instance that used SYNC, server.repl_master_initial_offset = -1, while for an instance that used PSYNC, server.repl_master_initial_offset = 1. (You can verify this variable with gdb, but gdb has its risks, so use it with care~)
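The fallback decision described above can be sketched as follows. This is only an illustration of the behavior described in this article: the enum and the function name are hypothetical, and the real slaveTryPartialResynchronization() does considerably more (it sends the command and parses the run ID and offset).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome codes, modelled on the PSYNC_* names in the excerpt. */
typedef enum {
    SKETCH_CONTINUE,      /* master accepted a partial resynchronization */
    SKETCH_FULLRESYNC,    /* master will do a full resync under PSYNC */
    SKETCH_NOT_SUPPORTED  /* any other reply: retry with SYNC */
} psync_outcome;

/* Classify the first line of the master's reply to PSYNC. */
psync_outcome classify_psync_reply(const char *reply) {
    assert(reply != NULL);
    if (strncmp(reply, "+FULLRESYNC", 11) == 0) return SKETCH_FULLRESYNC;
    if (strncmp(reply, "+CONTINUE", 9) == 0) return SKETCH_CONTINUE;
    /* Error replies such as -LOADING end up here as well. */
    return SKETCH_NOT_SUPPORTED;
}
```

Under this logic the "-LOADING" reply seen in the slave's log is treated exactly like a master that does not know PSYNC, which is why the log continues with "Retrying with SYNC...".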

Now back to the place where the ACK is sent:
/* Send ACK to master from time to time.
 * Note that we do not send periodic acks to masters that don't
 * support PSYNC and replication offsets. */
if (server.masterhost && server.master &&
    !(server.master->flags & CLIENT_PRE_PSYNC))

#define CLIENT_PRE_PSYNC (1<<16) /* Instance don't understand PSYNC. */

In other words, for an instance that synchronized with SYNC, the CLIENT_PRE_PSYNC flag is set on server.master, so the condition !(server.master->flags & CLIENT_PRE_PSYNC) is false and the function that sends the ACK is never executed.
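The effect of the flag can be checked in isolation with a small sketch. `master_link`, `should_send_ack` and `SKETCH_PRE_PSYNC` are hypothetical stand-ins introduced for illustration; only the bit value (1<<16) and the shape of the condition come from the excerpt.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PRE_PSYNC (1<<16) /* same bit as CLIENT_PRE_PSYNC in the excerpt */

/* Hypothetical stand-in for the server.master client structure. */
typedef struct { int flags; } master_link;

/* Same shape as the condition in the excerpt: an ACK is sent only when a
 * master is configured, the link exists, and the pre-PSYNC flag is clear. */
int should_send_ack(const char *masterhost, const master_link *master) {
    return masterhost != NULL && master != NULL &&
           !(master->flags & SKETCH_PRE_PSYNC);
}
```

With the flag set, should_send_ack() returns 0 no matter how healthy the link is, so the master never receives an ACK from such a slave.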

So we have found the root cause of why the slave does not send ACKs to the master: the slave synchronized in SYNC mode.
Does the master then behave differently for PSYNC and SYNC, and does that affect the replicated data? Show me the code:

if (!strcasecmp(c->argv[0]->ptr,"psync")) {
    if (masterTryPartialResynchronization(c) == C_OK) {
        return; /* No full resync needed, return. */
    } else {
        char *master_replid = c->argv[1]->ptr;
        if (master_replid[0] != '?') server.stat_sync_partial_err++;
    }
} else {
    /* If a slave uses SYNC, we are dealing with an old implementation
     * of the replication protocol (like redis-cli --slave). Flag the client
     * so that we don't expect to receive REPLCONF ACK feedbacks. */
    /* Author's note: this also shows that, in SYNC mode, the master does
     * not expect the slave to send ACKs back. */
    c->flags |= CLIENT_PRE_PSYNC;
}

As you can see, on the master both PSYNC and SYNC enter through the same code path. The difference is that PSYNC can result in a partial resynchronization, while SYNC always results in a full synchronization.

Summary:
A Redis slave can synchronize data in two ways, PSYNC and SYNC. When synchronization with PSYNC fails, the slave falls back to SYNC, and a slave that synchronized with SYNC does not send ACKs to the master, which makes the lag the master reports for that slave inaccurate.

So the lag value cannot always be used to decide whether a slave is delayed, or by how much. We can instead judge this from master_last_io_seconds_ago on the slave, or detect it through external monitoring.
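As a rough sketch of such a check, the function below pulls an integer field out of INFO-style text ("name:value" lines). The function name is hypothetical, the sample text in the usage note is made up for illustration, and it assumes the slave's INFO replication output has already been fetched as a string, where this value appears as master_last_io_seconds_ago.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the integer value of `field` in INFO-style text, or -1 if the
 * field is not present. Only a match at the start of a line that is
 * followed directly by ':' counts, so prefixes of longer names are skipped. */
long info_field_long(const char *info, const char *field) {
    assert(info != NULL && field != NULL);
    size_t flen = strlen(field);
    if (flen == 0) return -1;
    const char *p = info;
    while ((p = strstr(p, field)) != NULL) {
        int at_line_start = (p == info) || (p[-1] == '\n');
        if (at_line_start && p[flen] == ':')
            return strtol(p + flen + 1, NULL, 10);
        p += flen; /* keep searching after this occurrence */
    }
    return -1;
}
```

For example, given the text "role:slave\r\nmaster_last_io_seconds_ago:3\r\n", calling info_field_long(text, "master_last_io_seconds_ago") yields 3, and a monitoring script could raise an alert when the value exceeds a threshold of its choosing.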
