The mysterious Redis lag


Recently I have been scaling down Redis capacity for one of our businesses, which involves data migration. Migrating Redis data looks quite simple: for a one-to-one migration you only need to set two parameters on the slave, masterauth and slaveof. Of course, a migration can also run into special cases that need special handling.

Once those steps are done, we wait for the instance switchover. Before switching, though, we still need to check replication status, data consistency and so on. While checking replication I found something strange: of the 540 instances being migrated, 20 showed a relatively high lag that kept growing and, stranger still, their offset was always 0.
slave0:ip=xxxx,port=xxxx,state=online,offset=0,lag=38392
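
With 540 instances, checking this line by eye does not scale. The following is a minimal sketch of how the symptom could be detected automatically, assuming the `slaveN:` line has already been fetched from the master's INFO output. The helper names `parse_slave_line` and `looks_like_missing_ack` and the `max_lag` threshold are hypothetical, not part of Redis.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Redis: extract offset and lag from one
 * "slaveN:" line of the master's INFO replication output.
 * Returns 1 on success, 0 if either field is missing or malformed. */
static int parse_slave_line(const char *line, long long *offset, long *lag) {
    const char *o = strstr(line, "offset=");
    const char *l = strstr(line, "lag=");
    if (o == NULL || l == NULL) return 0;
    if (sscanf(o, "offset=%lld", offset) != 1) return 0;
    if (sscanf(l, "lag=%ld", lag) != 1) return 0;
    return 1;
}

/* The symptom described above: the master has never recorded an ACK
 * offset, and the lag (in seconds) is past a caller-chosen threshold. */
static int looks_like_missing_ack(long long offset, long lag, long max_lag) {
    return offset == 0 && lag > max_lag;
}
```

Feeding it the line above yields offset 0 and lag 38392, which any reasonable threshold would flag.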

After observing for a while, the phenomenon did not go away: lag kept growing over time and offset stayed at 0. Was the slave not applying the data at all? Connecting to the slave with the CLI showed that its dbsize was basically the same as the master's; running monitor on the slave showed commands being replicated normally; a key set on the master could be read correctly on the slave. Then I looked at the slave's log:
[13739] 10 Aug 15:49:46.017 * Connecting to MASTER xxxx.xxxx.xxx:xxx
[13739] 10 Aug 15:49:46.017 * MASTER <-> SLAVE sync started
[13739] 10 Aug 15:49:46.018 * Non blocking connect for SYNC fired the event.
[13739] 10 Aug 15:49:46.032 * Master replied to PING, replication can continue...
[13739] 10 Aug 15:49:46.092 * Partial resynchronization not possible (no cached master)
[13739] 10 Aug 15:49:46.120 # Unexpected reply to PSYNC from master: -LOADING Redis is loading the dataset in memory
[13739] 10 Aug 15:49:46.120 * Retrying with SYNC...
[13739] 10 Aug 15:49:46.641 * MASTER <-> SLAVE sync: receiving 26845995 bytes from master
[13739] 10 Aug 15:49:47.276 * MASTER <-> SLAVE sync: Flushing old data
[13739] 10 Aug 15:49:47.276 * MASTER <-> SLAVE sync: Loading DB in memory
[13739] 10 Aug 15:49:47.620 * MASTER <-> SLAVE sync: Finished with success
[13739] 10 Aug 15:49:47.621 * Background append only file rewriting started by pid 22605
[22605] 10 Aug 15:49:48.724 * SYNC append only file rewrite performed
[22605] 10 Aug 15:49:48.725 * AOF rewrite: 0 MB of memory used by copy-on-write
[13739] 10 Aug 15:49:48.822 * Background AOF rewrite terminated with success
[13739] 10 Aug 15:49:48.822 * Parent diff successfully flushed to the rewritten AOF (1148 bytes)
[13739] 10 Aug 15:49:48.822 * Background AOF rewrite finished successfully
It seems to be normal, too :(

So here is the question: if the data is replicating properly, why does the master report that the slave keeps falling further behind?
Let's open the code and see how lag and offset are calculated:
if (slave->replstate == SLAVE_STATE_ONLINE)
    lag = time(NULL) - slave->repl_ack_time;

info = sdscatprintf(info,
    "slave%d:ip=%s,port=%d,state=%s,"
    "offset=%lld,lag=%ld\r\n",
    slaveid,slaveip,slave->slave_listening_port,state,
    slave->repl_ack_off, lag);

long long repl_ack_off; /* Replication ack offset, if this is a slave. */
long long repl_ack_time;/* Replication ack time, if this is a slave. */

So lag is the master's current time minus the time it last received an ACK from the slave, and offset is the offset carried by that ACK. That raises a suspicion: has the slave never sent an ACK at all?
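
To make the arithmetic concrete, here is a small model of the two fields, a sketch rather than the Redis implementation. The `slave_model` struct and the `on_ack` and `reported_lag` functions are hypothetical names; the update in `on_ack` reflects my understanding of what the master does when a REPLCONF ACK arrives.

```c
#include <time.h>

/* Simplified stand-in for the two per-slave fields the INFO code reads.
 * This is not the real Redis client struct. */
typedef struct {
    long long repl_ack_off;  /* last offset the slave acknowledged */
    time_t repl_ack_time;    /* when the last ACK arrived */
} slave_model;

/* Assumed master-side reaction to "REPLCONF ACK <offset>":
 * remember the highest offset and refresh the timestamp. */
static void on_ack(slave_model *s, long long offset, time_t now) {
    if (offset > s->repl_ack_off) s->repl_ack_off = offset;
    s->repl_ack_time = now;
}

/* Same arithmetic as the INFO excerpt: seconds since the last ACK. */
static long reported_lag(const slave_model *s, time_t now) {
    return (long)(now - s->repl_ack_time);
}
```

If `on_ack` is never called, `repl_ack_off` stays at 0 and `reported_lag` grows by one every second, no matter how healthy the data stream is. That matches the offset=0 and ever-increasing lag seen on the 20 instances.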

Running monitor on a master whose lag and offset values are normal shows that its slaves do send ACK commands back, whereas on the seemingly abnormal nodes no ACK ever arrives.

That looks like the crux of the problem. So why do these seemingly abnormal slaves not send ACKs to the master? To find out we have to peel back the slave's behavior layer by layer.

Checking the slave's log again and comparing it with the log of another instance reveals some differences:

As the figure above shows, the instances that send ACKs to the master normally have a few more lines in their logs, and those lines may be the key to whether ACKs are sent.

Reading the source code gives the following picture. replicationCron runs once per second. If the instance has masterhost configured, it checks the replication state, server.repl_state. If that state is REPL_STATE_CONNECT /* Must connect to master */, it tries to connect to the master with connectWithMaster(). Once the connection is established, syncWithMaster() takes over: it first sends a PING to the master and waits for the PONG. After the master has answered and AUTH has been performed, the slave attempts a partial resynchronization with slaveTryPartialResynchronization(), and a full synchronization takes place if the partial one fails. The command this function sends is PSYNC. If the master's reply is neither +FULLRESYNC nor +CONTINUE, the slave assumes that the master does not understand PSYNC and, for compatibility with the old replication protocol, resends the request as SYNC. That is what happened in the log above: the master answered PSYNC with -LOADING, so the slave logged "Retrying with SYNC...".
psync_result = slaveTryPartialResynchronization(fd);
if (psync_result == PSYNC_CONTINUE) {
    redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
    return;
}
/* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
 * and the server.repl_master_runid and repl_master_initial_offset are
 * already populated. */
if (psync_result == PSYNC_NOT_SUPPORTED) {
    redisLog(REDIS_NOTICE,"Retrying with SYNC...");
}

After the SYNC command has been sent successfully, the slave moves on to the steps of a full synchronization: receiving the RDB, flushing the old data, loading the RDB, and so on.
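
The reply handling described above can be sketched as a small classifier. This is an illustration of the rule "anything other than +FULLRESYNC or +CONTINUE means PSYNC is unsupported", not the body of slaveTryPartialResynchronization(). The enum and the function name `classify_psync_reply` are hypothetical and only mirror the constants in the excerpt.

```c
#include <string.h>

/* Result codes for this sketch only; they mirror the PSYNC_* constants
 * seen in the excerpt but are defined here, not taken from Redis. */
typedef enum {
    SKETCH_PSYNC_CONTINUE,
    SKETCH_PSYNC_FULLRESYNC,
    SKETCH_PSYNC_NOT_SUPPORTED
} sketch_psync_result;

/* Classify the first line of the master's reply to PSYNC. */
static sketch_psync_result classify_psync_reply(const char *reply) {
    if (strncmp(reply, "+FULLRESYNC", 11) == 0) return SKETCH_PSYNC_FULLRESYNC;
    if (strncmp(reply, "+CONTINUE", 9) == 0) return SKETCH_PSYNC_CONTINUE;
    /* Everything else, including an error such as -LOADING, is treated
     * as "the master does not understand PSYNC". */
    return SKETCH_PSYNC_NOT_SUPPORTED;
}
```

Under this rule a master that is merely busy loading its dataset is indistinguishable from a master that is too old to know PSYNC, which is how a modern master ended up being treated as a legacy one.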

This explains the difference between the two instances' logs: one instance synchronized with PSYNC, the other with SYNC. On the instance that used SYNC, server.repl_master_initial_offset = -1, while on the instance that used PSYNC, server.repl_master_initial_offset = 1. (You can verify this variable with gdb, but use gdb with care.)

Now back to the place where the ACK is sent:
/* Send ACK to master from time to time.
 * Note that we do not send periodic acks to masters that don't
 * support PSYNC and replication offsets. */
if (server.masterhost && server.master &&
    !(server.master->flags & CLIENT_PRE_PSYNC))
    replicationSendAck();

#define CLIENT_PRE_PSYNC (1<<16) /* Instance don't understand PSYNC. */

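In other words, on an instance that synchronized with SYNC the CLIENT_PRE_PSYNC flag is set on server.master (its initial offset of -1 marks the master as not PSYNC-capable), so !(server.master->flags & CLIENT_PRE_PSYNC) is false and replicationSendAck() is never called.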
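
The slave-side chain from initial offset to missing ACK can be modeled in a few lines. This is a sketch under one assumption: that the slave flags its master client as pre-PSYNC when the initial offset is -1, which is how I read the source. The names prefixed with `sketch_` are hypothetical; only the bit value comes from the excerpt.

```c
#define SKETCH_PRE_PSYNC (1 << 16) /* same bit as CLIENT_PRE_PSYNC above */

/* Assumed behavior: an initial offset of -1 marks the master as
 * not PSYNC-capable, so the flag is set on the master client. */
static int sketch_flags_after_sync(int flags, long long initial_offset) {
    if (initial_offset == -1) flags |= SKETCH_PRE_PSYNC;
    return flags;
}

/* The same condition as the periodic ACK check in the excerpt. */
static int sketch_should_send_ack(int has_masterhost, int has_master,
                                  int master_flags) {
    return has_masterhost && has_master &&
           !(master_flags & SKETCH_PRE_PSYNC);
}
```

With an initial offset of -1 the check returns 0 on every tick, so the slave stays silent for as long as the replication link lives.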

So we have found the root cause of why the slave does not send ACKs to the master: it synchronized in SYNC mode.
Does the master treat PSYNC and SYNC differently, then, and what does that mean for the replicated data? Show me the code:
{"sync",syncCommand,1,"ars",0,NULL,0,0,0,0,0},
{"psync",syncCommand,3,"ars",0,NULL,0,0,0,0,0},

if (!strcasecmp(c->argv[0]->ptr,"psync")) {
    if (masterTryPartialResynchronization(c) == C_OK) {
        server.stat_sync_partial_ok++;
        return; /* No full resync needed, return. */
    } else {
        char *master_replid = c->argv[1]->ptr;
        if (master_replid[0] != '?') server.stat_sync_partial_err++;
    }
} else {
    /* If a slave uses SYNC, we are dealing with an old implementation
     * of the replication protocol (like redis-cli --slave). Flag the client
     * so that we don't expect to receive REPLCONF ACK feedbacks. */
    /* Author's note: this also shows that in SYNC mode the master does
     * not expect the slave to send ACKs back. */
    c->flags |= CLIENT_PRE_PSYNC;
}

As you can see, PSYNC and SYNC share the same entry point on the master. The difference is that PSYNC can result in a partial resynchronization, whereas SYNC always means a full synchronization.
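
The master-side branch boils down to a check on the command name. The sketch below isolates just that decision, assuming a POSIX system for strcasecmp. The function name `sketch_flags_for_sync_command` is hypothetical and the partial-resync attempt is deliberately left out.

```c
#include <strings.h> /* strcasecmp (POSIX) */

#define SKETCH_PRE_PSYNC (1 << 16) /* same bit as CLIENT_PRE_PSYNC above */

/* Both "sync" and "psync" reach the same handler; only a client that
 * used plain SYNC gets flagged as not expected to send ACKs. */
static int sketch_flags_for_sync_command(const char *cmd_name, int flags) {
    if (strcasecmp(cmd_name, "psync") != 0) flags |= SKETCH_PRE_PSYNC;
    return flags;
}
```

So both ends agree on the silence: the slave does not send ACKs and the master does not expect them, yet the master's INFO output still computes lag from the last ACK time.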

Summary:
A Redis slave can synchronize data in two ways, PSYNC and SYNC. When synchronizing with PSYNC fails, the slave falls back to SYNC, and a slave that synchronized with SYNC does not send ACKs to the master, so the lag the master reports for that slave is inaccurate.

So lag cannot always be used to tell whether a slave is behind, or by how much. We can instead look at master_last_io_seconds_ago on the slave to judge whether it is delayed, or find out through external monitoring.
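
A minimal sketch of the slave-side check, assuming the INFO text has already been fetched from the slave. The helper names `parse_last_io` and `slave_link_is_stale` and the `max_idle` threshold are hypothetical. The field reports -1 when the link to the master is down, so negative values are treated as stale.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read master_last_io_seconds_ago from the text of
 * a slave's INFO output. Returns 1 and stores the value on success,
 * 0 if the field is absent or malformed. */
static int parse_last_io(const char *info, long *seconds) {
    const char *key = "master_last_io_seconds_ago:";
    const char *p = strstr(info, key);
    if (p == NULL) return 0;
    return sscanf(p + strlen(key), "%ld", seconds) == 1;
}

/* Treat the link as stale when the field is missing, negative
 * (link down), or older than max_idle seconds. */
static int slave_link_is_stale(const char *info, long max_idle) {
    long secs;
    if (!parse_last_io(info, &secs)) return 1;
    return secs < 0 || secs > max_idle;
}
```

This only tells you how recently the slave heard from its master. For a measure of how much data is outstanding, external monitoring that compares the two sides is still needed.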
