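How streaming replication works:
Physical replication is also known as streaming replication. The primary streams its WAL to the standby, and the standby replays the WAL after receiving it.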
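How logical replication works:
Logical replication is also based on the WAL. In logical replication the primary is called the source database and the standby is called the target database. The source decodes the WAL according to predefined logical decoding rules, turning DML operations into logical change information (row-level changes, which can be expressed as standard SQL statements). The source sends these changes to the target, and the target applies them on receipt, which keeps the data in sync.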
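Differences between streaming replication and logical replication:
In asynchronous streaming replication (the default), a commit on the primary does not wait for the standby to acknowledge the WAL. Logical replication is often described as the opposite, but whether a commit waits is governed by synchronous_standby_names and synchronous_commit rather than by the replication type, so treat that claim with caution.
Streaming replication requires the primary and standby to run the same major version. Logical replication can synchronize data across major versions and can also feed heterogeneous databases.
With streaming replication the primary is read/write and the standby is read-only. With logical replication the target database must be read/write.
Streaming replication works at the instance level (the whole PostgreSQL cluster). Logical replication copies a chosen set of tables, so it works at the table level.
Streaming replication carries both the DDL and the DML of the primary. Logical replication carries only DML.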
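Supplement: a brief look at the principles and code of PostgreSQL synchronous streaming replication
Background
How durability in ACID is implemented
The D in ACID stands for durability. It means that once a transaction has been committed from the user's point of view, its data is safe: even if the database crashes, the data can be recovered as long as the hardware is intact.
How does PostgreSQL achieve this? The flow, which the original post sketches in a rough diagram, is as follows.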
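Suppose a transaction performs some operations on the database and produces dirty data. That dirty data first lives in the database's shared buffers.
Producing the dirty data also produces the corresponding redo information. Every redo record has an LSN (think of it as a unique offset in the virtual address space of the redo stream), and this LSN is also recorded in the corresponding dirty page in shared buffers.
walwriter is the process that flushes the WAL buffers to durable storage. It also updates a global variable that records the highest LSN flushed so far.
bgwriter is the process that writes dirty pages from shared buffers to durable storage. When it flushes, besides following the LRU algorithm, it compares the page's LSN against that global LSN to make sure the redo record for the dirty page has already reached durable storage. If the redo has not been made durable yet, it triggers a flush of the WAL buffers first. (In other words, the log always reaches disk before the dirty data.)
When the user commits, a commit redo record is generated as well, and it also carries an LSN. The backend process likewise has to wait until that LSN has been flushed to disk before it reports a successful commit to the user. (The log reaches disk first, then the user is told.)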
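The two ordering rules above (redo before dirty page, redo before commit acknowledgement) boil down to LSN comparisons. Here is a minimal sketch of that idea in C. It is a toy model and not PostgreSQL code: the `Lsn` type, the `WalState` struct and all four function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN: a byte offset into the redo stream. */
typedef uint64_t Lsn;

/* Hypothetical model of the global WAL bookkeeping described in the text. */
typedef struct WalState {
    Lsn insert_lsn;   /* end of the newest redo record in the WAL buffers */
    Lsn flushed_lsn;  /* highest LSN already on durable storage */
} WalState;

/* Append a redo record of `len` bytes and return the LSN at its end. */
static Lsn wal_insert(WalState *wal, uint64_t len)
{
    assert(wal != NULL);
    wal->insert_lsn += len;
    return wal->insert_lsn;
}

/* Flush redo up to `upto`; the flushed LSN never moves backwards. */
static void wal_flush(WalState *wal, Lsn upto)
{
    assert(wal != NULL);
    if (upto > wal->insert_lsn)
        upto = wal->insert_lsn;      /* cannot flush what was never written */
    if (upto > wal->flushed_lsn)
        wal->flushed_lsn = upto;
}

/* Rule 1: a dirty page may be written only if its redo is already durable. */
static bool page_can_be_written(const WalState *wal, Lsn page_lsn)
{
    return page_lsn <= wal->flushed_lsn;
}

/* Rule 2: a commit may be acknowledged only if its commit record is durable. */
static bool commit_can_be_acknowledged(const WalState *wal, Lsn commit_lsn)
{
    return commit_lsn <= wal->flushed_lsn;
}
```

For example, after inserting a 100-byte record and a 20-byte commit record, flushing up to LSN 100 makes the page writable but the commit still has to wait until the flush reaches LSN 120.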
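A brief look at how synchronous replication works
Synchronous streaming replication guarantees that the log reaches disk on both the standby node and the local node.
PostgreSQL uses another set of global variables to record the XLOG LSN that the synchronous standby has received and the XLOG LSN that it has made durable.
After the user issues a commit, the backend process checks not only whether the local WAL is durable but also whether the synchronous standby has received or made durable the XLOG (which of the two is controlled by the synchronous_commit parameter).
If the synchronous standby has not yet received or made durable that XLOG, the backend process goes into a wait state.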
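The wait decision described above can be pictured as one comparison per wait mode. The following C sketch is only an illustration under my own assumptions: `StandbyProgress`, `standby_report` and `backend_must_wait` are hypothetical names, loosely modelled on the idea of one "received" LSN and one "durable" LSN per synchronous standby.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical stand-in for an XLOG position */

/* Which kind of standby confirmation the committing backend waits for. */
typedef enum WaitMode {
    WAIT_WRITE = 0,     /* standby has received and written the XLOG */
    WAIT_FLUSH = 1,     /* standby has made the XLOG durable */
    NUM_WAIT_MODES
} WaitMode;

/* Hypothetical model of the globals that track the sync standby's progress. */
typedef struct StandbyProgress {
    Lsn lsn[NUM_WAIT_MODES];
} StandbyProgress;

/* Record a progress report from the standby; positions only move forward. */
static void standby_report(StandbyProgress *p, Lsn write_lsn, Lsn flush_lsn)
{
    assert(p != NULL);
    if (write_lsn > p->lsn[WAIT_WRITE])
        p->lsn[WAIT_WRITE] = write_lsn;
    if (flush_lsn > p->lsn[WAIT_FLUSH])
        p->lsn[WAIT_FLUSH] = flush_lsn;
}

/* The backend waits while the standby has not yet confirmed its commit LSN. */
static bool backend_must_wait(const StandbyProgress *p, WaitMode mode,
                              Lsn commit_lsn)
{
    assert(p != NULL);
    return commit_lsn > p->lsn[mode];
}
```

With a standby that has written up to 200 but flushed only up to 150, a commit at LSN 180 can return immediately in write mode but has to wait in flush mode.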
A brief look at the synchronous replication code
The relevant code, with explanations, is as follows:
CommitTransaction @ src/backend/access/transam/xact.c
RecordTransactionCommit @ src/backend/access/transam/xact.c
    /*
     * If we didn't create XLOG entries, we're done here; otherwise we
     * should trigger flushing those entries the same as a commit record
     * would.  This will primarily happen for HOT pruning and the like; we
     * want these to be flushed to disk in due time.
     */
    if (!wrote_xlog)    // the transaction produced no redo: return directly
        goto cleanup;

    if (wrote_xlog && markXidCommitted)    // redo was produced: wait for synchronous replication
        SyncRepWaitForLSN(XactLastRecEnd);
SyncRepWaitForLSN @ src/backend/replication/syncrep.c
    /*
     * Wait for synchronous replication, if requested by user.
     *
     * Initially backends start in state SYNC_REP_NOT_WAITING and then
     * change that state to SYNC_REP_WAITING before adding ourselves
     * to the wait queue. During SyncRepWakeQueue() a WALSender changes
     * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
     * This backend then resets its state to SYNC_REP_NOT_WAITING.
     */
    void
    SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
    {
        ...
        /*
         * Fast exit if user has not requested sync replication, or there are no
         * sync replication standby names defined. Note that those standbys don't
         * need to be connected.
         */
        // Not a synchronous transaction, or no synchronous standby defined: return directly.
        if (!SyncRepRequested() || !SyncStandbysDefined())
            return;
        ...
        /*
         * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
         * set.  See SyncRepUpdateSyncStandbysDefined.
         *
         * Also check that the standby hasn't already replied. Unlikely race
         * condition but we'll be fetching that cache line anyway so it's likely
         * to be a low cost check.
         */
        // No synchronous standby defined, or the commit LSN is not beyond the
        // LSN already synced, which means the XLOG has been flushed: return directly.
        if (!WalSndCtl->sync_standbys_defined ||
            XactCommitLSN <= WalSndCtl->lsn[mode])
        {
            LWLockRelease(SyncRepLock);
            return;
        }
        ...
        // Reaching the wait loop means the local xlog has already been flushed;
        // we are only waiting for the redo to be synced to the synchronous standby.
        /*
         * Wait for specified LSN to be confirmed.
         *
         * Each proc has its own wait latch, so we perform a normal latch
         * check/wait loop here.
         */
        // Wait, checking whether the latch says we may stop waiting (the wal
        // sender updates the latch as the redo gets synced).
        for (;;)
        {
            int syncRepState;

            /* Must reset the latch before testing state. */
            ResetLatch(&MyProc->procLatch);

            syncRepState = MyProc->syncRepState;
            if (syncRepState == SYNC_REP_WAITING)
            {
                LWLockAcquire(SyncRepLock, LW_SHARED);
                syncRepState = MyProc->syncRepState;
                LWLockRelease(SyncRepLock);
            }
            if (syncRepState == SYNC_REP_WAIT_COMPLETE)    // XLOG sync finished: stop waiting
                break;

            // If this backend is being terminated, the message says the transaction
            // is durable locally but may not yet be durable on the remote side.
            if (ProcDiePending)
            {
                ereport(WARNING,
                        (errcode(ERRCODE_ADMIN_SHUTDOWN),
                         errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
                         errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
                whereToSendOutput = DestNone;
                SyncRepCancelWait();
                break;
            }

            // If the user cancels the query, the message says the transaction
            // is durable locally but may not yet be durable on the remote side.
            if (QueryCancelPending)
            {
                QueryCancelPending = false;
                ereport(WARNING,
                        (errmsg("canceling wait for synchronous replication due to user request"),
                         errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
                SyncRepCancelWait();
                break;
            }

            // If the postmaster has died, start exiting.
            if (!PostmasterIsAlive())
            {
                ProcDiePending = true;
                whereToSendOutput = DestNone;
                SyncRepCancelWait();
                break;
            }

            // Wait for the wal sender to set our latch.
            /*
             * Wait on latch.  Any condition that should wake us up will set the
             * latch, so no need for timeout.
             */
            WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
        }
        ...
    }
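Note that once a backend has entered the wait state, it leaves this otherwise unbounded wait only if the query is cancelled, the backend is killed (terminated), or the postmaster dies. How to downgrade from synchronous to asynchronous is covered later.
As mentioned above, the backend has to wait for its latch to be set.
So who sets it? The wal sender process. The source and explanation follow: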
src/backend/replication/walsender.c
    StartReplication
      WalSndLoop
        ProcessRepliesIfAny
          ProcessStandbyMessage
            ProcessStandbyReplyMessage
              // Not a cascading walsender: call SyncRepReleaseWaiters to set the
              // latches of the backend processes in the wait queues.
              if (!am_cascading_walsender)
                  SyncRepReleaseWaiters();

SyncRepReleaseWaiters @ src/backend/replication/syncrep.c

    /*
     * Update the LSNs on each queue based upon our latest state. This
     * implements a simple policy of first-valid-standby-releases-waiter.
     *
     * Other policies are possible, which would change what we do here and what
     * perhaps also which information we store as well.
     */
    void
    SyncRepReleaseWaiters(void)
    {
        ...
        // Release the waiters in each queue that satisfy the condition.
        /*
         * Set the lsn first so that when we wake backends they will release up to
         * this location.
         */
        if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
        {
            walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
            numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
        }
        if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
        {
            walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
            numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
        }
        ...
    }
SyncRepWakeQueue @ src/backend/replication/syncrep.c
    /*
     * Walk the specified queue from head.  Set the state of any backends that
     * need to be woken, remove them from the queue, and then wake them.
     * Pass all = true to wake whole queue; otherwise, just wake up to
     * the walsender's LSN.
     *
     * Must hold SyncRepLock.
     */
    static int
    SyncRepWakeQueue(bool all, int mode)
    {
        ...
        while (proc)    // set the latch of each qualifying backend process
        {
            /*
             * Assume the queue is ordered by LSN
             */
            if (!all && walsndctl->lsn[mode] < proc->waitLSN)
                return numprocs;

            /*
             * Move to next proc, so we can delete thisproc from the queue.
             * thisproc is valid, proc may be NULL after this.
             */
            thisproc = proc;
            proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
                                           &(proc->syncRepLinks),
                                           offsetof(PGPROC, syncRepLinks));

            /*
             * Set state to complete; see SyncRepWaitForLSN() for discussion of
             * the various states.
             */
            thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;    // condition met: switch to SYNC_REP_WAIT_COMPLETE
            ...
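To make the "queue ordered by LSN, wake everything up to the confirmed LSN" idea concrete, here is a small stand-alone C sketch. It is a simplified model, not the PostgreSQL implementation: it uses a fixed-size array instead of a shared-memory linked list, takes no locks, sets no latches, and every name in it (`Waiter`, `WaitQueue`, `queue_insert`, `queue_wake`, `MAX_WAITERS`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical stand-in for an XLOG position */

typedef enum WaiterState {
    WAITER_NOT_WAITING,
    WAITER_WAITING,
    WAITER_WAIT_COMPLETE
} WaiterState;

typedef struct Waiter {
    Lsn         wait_lsn;   /* commit LSN this backend is waiting for */
    WaiterState state;
} Waiter;

#define MAX_WAITERS 16      /* arbitrary limit for the sketch */

typedef struct WaitQueue {
    Waiter *items[MAX_WAITERS];   /* kept in ascending wait_lsn order */
    size_t  count;
} WaitQueue;

static void queue_init(WaitQueue *q)
{
    assert(q != NULL);
    q->count = 0;
}

/* Insert a waiter, keeping the queue ordered by LSN. */
static bool queue_insert(WaitQueue *q, Waiter *w)
{
    assert(q != NULL && w != NULL);
    if (q->count >= MAX_WAITERS)
        return false;
    size_t pos = q->count;
    while (pos > 0 && q->items[pos - 1]->wait_lsn > w->wait_lsn) {
        q->items[pos] = q->items[pos - 1];   /* shift larger LSNs right */
        pos--;
    }
    q->items[pos] = w;
    q->count++;
    w->state = WAITER_WAITING;
    return true;
}

/* Wake waiters from the head up to acked_lsn, or all of them if wake_all. */
static size_t queue_wake(WaitQueue *q, Lsn acked_lsn, bool wake_all)
{
    assert(q != NULL);
    size_t woken = 0;
    while (woken < q->count) {
        Waiter *w = q->items[woken];
        if (!wake_all && acked_lsn < w->wait_lsn)
            break;                           /* ordered queue: nothing further qualifies */
        w->state = WAITER_WAIT_COMPLETE;
        woken++;
    }
    for (size_t i = woken; i < q->count; i++)
        q->items[i - woken] = q->items[i];   /* drop the woken entries */
    q->count -= woken;
    return woken;
}
```

Because the queue is ordered, the walk can stop at the first waiter whose LSN is beyond the confirmed position, which is the same early return the real function relies on.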
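How to set the transaction durability level
PostgreSQL lets you set the durability level of transactions within a session.
off means commit does not wait for the WAL to be made durable.
local means commit only waits for the local database's WAL to be made durable.
remote_write means commit waits for the local database's WAL to be made durable and also for the sync standby to finish writing the WAL to its buffer (it does not have to be durable there).
on means commit waits for the local database's WAL to be made durable and also for the sync standby's WAL to be made durable.
One reminder: no setting of synchronous_commit changes the rule that WAL must be made durable before the dirty data in shared buffers. So whichever value you choose, data consistency is not affected.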
    synchronous_commit = off        # synchronization level;
                                    # off, local, remote_write, or on
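The four levels listed above can be summarized as "what does commit wait for". The C sketch below only restates that list as a lookup. It is not how PostgreSQL stores the setting, it covers only the four values named in this post, and the names `SyncCommitLevel`, `CommitWaits`, `parse_level` and `waits_for_level` are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The four synchronous_commit values discussed in the text. */
typedef enum SyncCommitLevel {
    SC_OFF,
    SC_LOCAL,
    SC_REMOTE_WRITE,
    SC_ON
} SyncCommitLevel;

/* What a committing backend waits for at a given level. */
typedef struct CommitWaits {
    bool local_flush;     /* local WAL made durable */
    bool standby_write;   /* sync standby has written the WAL */
    bool standby_flush;   /* sync standby has made the WAL durable */
} CommitWaits;

/* Parse a setting name; returns false for names this sketch does not know. */
static bool parse_level(const char *name, SyncCommitLevel *out)
{
    assert(name != NULL && out != NULL);
    if (strcmp(name, "off") == 0)          { *out = SC_OFF;          return true; }
    if (strcmp(name, "local") == 0)        { *out = SC_LOCAL;        return true; }
    if (strcmp(name, "remote_write") == 0) { *out = SC_REMOTE_WRITE; return true; }
    if (strcmp(name, "on") == 0)           { *out = SC_ON;           return true; }
    return false;
}

/* Map a level to the confirmations commit waits for. */
static CommitWaits waits_for_level(SyncCommitLevel level)
{
    CommitWaits w = { false, false, false };
    switch (level) {
    case SC_OFF:
        break;                       /* commit returns without waiting */
    case SC_LOCAL:
        w.local_flush = true;
        break;
    case SC_REMOTE_WRITE:
        w.local_flush = true;
        w.standby_write = true;
        break;
    case SC_ON:
        w.local_flush = true;
        w.standby_write = true;      /* a flush on the standby implies the write */
        w.standby_flush = true;
        break;
    }
    return w;
}
```

Reading the table this way also shows why local is the natural target when downgrading: it keeps the local durability guarantee and drops only the standby waits.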
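How to downgrade synchronous replication
From the code walkthrough above, once a backend process has entered the wait loop, it only leaves it in response to a few signals. After leaving, it emits a warning saying that the local WAL is durable but that it is not known whether the sync standby has made the WAL durable.
If you have configured only one standby and made it the synchronous standby, any network jitter or failure of that standby will leave synchronous transactions stuck waiting.
So how do you downgrade?
Method 1.
Edit the configuration file and reload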
    $ vi postgresql.conf
    synchronous_commit = local
    $ pg_ctl reload
Then cancel all queries.
    postgres=# select pg_cancel_backend(pid) from pg_stat_activity where pid<>pg_backend_pid();
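Receiving the following message means the transaction committed successfully, and also that it is not known whether the WAL has been synced to the sync standby.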
    WARNING:  canceling wait for synchronous replication due to user request
    DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
    COMMIT
    postgres=# show synchronous_commit ;
     synchronous_commit
    --------------------
     off
    (1 row)
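At the same time the backend reads the global variable synchronous_commit and sees that it is now local.
That completes the downgrade.
Method 2.
The downgrade in method 1 requires cancelling every pid that is currently waiting for WAL sync, which is not very user-friendly.
It can be made friendlier by modifying the code.
In the for loop of SyncRepWaitForLSN, add a check: if the global synchronous_commit variable has changed to local or off, emit a warning and exit the wait. That way nobody has to cancel queries by hand.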
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
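As a rough illustration of method 2, the sketch below models one pass of the wait loop with the extra check added. It is a simulation under my own assumptions and not a patch: the types and names (`WaitInputs`, `WaitOutcome`, `wait_loop_step`, `outcome_needs_warning`) are hypothetical, and a real change would presumably also have to make the waiting backend notice the reloaded setting while it sits in the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The four synchronous_commit values discussed in the text. */
typedef enum SyncCommitLevel {
    SC_OFF,
    SC_LOCAL,
    SC_REMOTE_WRITE,
    SC_ON
} SyncCommitLevel;

/* Why one pass of the (modelled) wait loop ends, if it ends. */
typedef enum WaitOutcome {
    KEEP_WAITING,            /* go back to sleep on the latch */
    REPLICATION_CONFIRMED,   /* the wal sender marked the wait complete */
    CANCELLED_BY_USER,       /* existing behaviour: query cancel */
    DOWNGRADED               /* proposed: setting dropped to local or off */
} WaitOutcome;

/* Hypothetical snapshot of what the loop inspects on each wake-up. */
typedef struct WaitInputs {
    bool            replication_confirmed;
    bool            query_cancel_pending;
    SyncCommitLevel current_level;   /* value after any config reload */
} WaitInputs;

/* One pass of the loop, with the extra downgrade check from method 2. */
static WaitOutcome wait_loop_step(const WaitInputs *in)
{
    assert(in != NULL);
    if (in->replication_confirmed)
        return REPLICATION_CONFIRMED;
    if (in->query_cancel_pending)
        return CANCELLED_BY_USER;
    /* Proposed addition: stop waiting once sync replication is no longer requested. */
    if (in->current_level == SC_OFF || in->current_level == SC_LOCAL)
        return DOWNGRADED;
    return KEEP_WAITING;
}

/* Both early exits leave the standby's state unknown, so both should warn. */
static bool outcome_needs_warning(WaitOutcome outcome)
{
    return outcome == CANCELLED_BY_USER || outcome == DOWNGRADED;
}
```

The point of the design is that the downgrade exit is treated like a cancel: the transaction is already committed locally, so the backend warns and returns instead of waiting forever.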
Original link: https://blog.csdn.net/weixin_42009082/article/details/96481014