CS3223 Notes

AY2022/2023 Semester 2, written by See Toh Jin Wei
#nus #cs3223

Summary Notes

Available here in PDF.

Disk Access Time

  • Command Processing Time (negligible)
  • Seek Time: moving arms to position disk head on track
    • Average seek time: 5-6ms
  • Rotational Delay: waiting for block to rotate under head
    • Depends on RPM; average rotation delay = time for 1/21/2 revolution
    • x RPM => (60000)/x(60'000) / x ms for 1 revs
  • Transfer Time: time to move data to/from data surface
    • ntime for one revolutionnumber of sectors per trackn * \frac{\text{time for one revolution}}{\text{number of sectors per track}}
  • Access time = seek time + rotational delay + transfer time
  • Response time = queueing delay + access time

B+ Tree

  • Leaf nodes are doubly-linked
  • Internal nodes (p0,k1,p1,...,kn,pnp_0, k_1, p_1, ..., k_n, p_n)
  • Order dd: non-root node [d,2d][d, 2d]; root node [1,2d][1, 2d]
  • Maximum order: 2d(#bytes of key)+(2d+1)(#bytes of page address)#bytes of page size2d(\text{\#bytes of key}) + (2d+1)(\text{\#bytes of page address}) \le \text{\#bytes of page size}. Solve for dd
  • Minimum number of leaf nodes: 2(d+1)i12 * (d+1)^{i-1}; Maximum: (2d+1)i(2d + 1)^i; ii is levels of internal


  • Create N0=N/BN_0 = \lceil N / B \rceil sorted runs, N pages
  • Merging: B1B-1 pages for input, 11 for output
  • Total I/O = 2N(logB1(N0)+1)2N*(\lceil \log_{B-1} (N_0) \rceil + 1)
  • Optimised Merging: Bbb\lfloor \frac{B-b}{b} \rfloor for input, bb for output
  • Total IO = 2N(logF(N0)+1)2N*(\lceil \log_{F} (N_0) \rceil + 1), F=(Bboutput)/binputF = \lfloor (B - b_{output})/b_{input} \rfloor
  • Sequential I/O: N/b(#passes)((seek + rotate)+b(transfer))\lceil N/b \rceil * (\text{\#passes}) * ((\text{seek + rotate}) + b * (\text{transfer}))


  • Full Table Scan; Index Scan; Index Intersection (combination), + RID lookup
  • Goal: reduce number of index & data pages retrieved
  • Covering Index II for query QQ if QIQ \subseteq I; no RID lookup; index-only plan
  • Term: A op B; Conjuncts: terms connected by OR; CNF: conjuncts joined by AND
  • B+ Index I=(K1,K2,...)I = (K_1, K_2, ...) matches Predicate pp if (K1,...,Ki)(K_1, ..., K_i) is prefix of II AND only KiK_i can be non-equality
  • Hash Index II match if equal for every attribute
  • Subset that matches is primary conjuncts; Subset that covered is covered conjuncts
  • r||r||: #tuples, r|r|: #pages, bdb_d: #data records (entire tuple), bib_i: #data entries
  • B+ tree cost = (height of internal nodes) + (scan leaf pages) + (RID lookup)
    • Can reduce I/O cost of lookup by sorting
  • Hash cost = (retrieve data entries) + (retrieve data records)
  • Plans: Full Table Scan, Index Scan, Intersections, Unions
    • Primary: can traverse tree / hash (if not: check all leaf)
    • Covered: no RID lookup
    • Intersection: (retrieve leaf entries for both) + Grace-Hash Join + (retrieve data records)


  • Remove unwanted attributes, eliminate dupes
  • Sort: Extract -> Sort -> Remove Dupes (linear scan)
  • Optimised Sort: Create Sorted Runs with attributes L (read N pages, write NL#no of attrsN * \frac{|L|}{|\text{\#no of attrs}}) -> Merge Sorted Runs + Remove Dupes
    • If B>πL(R)B > \sqrt{|\pi^*_L(R)|}, N0πL(R)N_0 \approx \sqrt{|\pi^*_L(R)|}, similar as Hash
  • Hash: Partition into B1B - 1 partitions using hash function -> Remove Dupes in partitions -> Union partitions
    • Partition phase: 11 for input, B1B-1 for output -> remove unwanted attributes -> hash -> flush when buffer is full
    • Dupe Elim phase: Use in-memory hash table with h'
    • Partition overflow problem -> recursively apply partition until can fit in-memory
    • Approximately B>fπL(R)B > \sqrt{f * |\pi^*_L(R)|} to avoid partition overflow
    • If no partition overflow: Partition: R+πL(R)|R| + |\pi^*_L(R)|, Dupe Elim: πL(R)|\pi^*_L(R)|
  • Indexes: Use index scan; If B+ & wanted attributes is prefix: already sorted, so compare adjacent

Nested-Loop Join

  • Smaller should be outer (RR)
  • Tuple-Based: For each outer tuple: check each inner tuple R+RS|R| + ||R|| * |S|
  • Page-Based: For each outer page: check each inner page: compare tuples within these pages R+RS|R| + |R| * |S| (3 buffer pages, 2 input, 1 output)
  • Block-Based: read in B2B-2 sequential pages of R, read in page of S one-by-one
    • R+(RB2S)|R| + (\lceil \frac{|R|}{B-2} \rceil * |S|)
  • Index-Based: for each tuple in R: search S's index
    • Assuming uniform distribution: R+RJ|R| + ||R|| * J, J = tuple search cost
  • Minimum for any join: cost of R+S|R| + |S| with R+1+1|R| + 1 + 1 buffer pages, store entire |R| in memory

Sort-Merge Join

  • Sort both RR and SS, then join
  • Each tuple in RR partition merges with all tuples in matching S-partition
  • Advance pointer pointing to smaller tuple; rewind SS-pointer as necessary
  • I/O cost = (Cost to sort R) + (Cost to sort S) + Merging cost; (merging cost = |R| + |S| if no rewind, |R| + ||R|| * |S| if rewind everytime)
  • Optimised (partial sorting): if B>N(R,i)+N(S,j)B > N(R,i) + N(S,j), stop sorting, N(R,i)=N(R,i) = total number of sorted runs of R after pass i
    • I/O cost if B>2SB > \sqrt{2|S|}, 3(R+S)3*(|R|+|S|) => 2 for creating initial sorted runs (one pass is sufficient), 1 for merge
    • else 3(R+S)+cR+dS3*(|R|+|S|) + c * |R| + d * |S|, where cc and dd is number of merge passes for RR, SS

Grace Hash Join (no Hybrid Hash Join)

  • Split RR and SS into kk partitions each, join these kk partitions together in probing phase
    • read RiR_i to build hash table (build relation) - pick smaller for build (must fit in-memory)
    • read SiS_i to probe hash table (probe relation)
  • partitioning phase: 1 input buffer, kk hash buffers, once full, flush into page on disk
  • probing phase: 1 input buffer, 1 output buffer, 1 hash table; use different h'(.) and build a hash table for each partition, then, probe with SS, if match, add to output buffer
    • build R1R_1, probe S1S_1, build R2R_2, ...
    • once output buffer is full, flush (don't flush between partitions)
  • set k=B1k = B-1 to minimise partition sizes
  • assuming uniform hashing distribution:
    • size of each partition RiR_i: RB1\frac{|R|}{B-1}
    • size of hash table for RiR_i: fRB1\frac{f*|R|}{B-1}, ff is fudge factor
    • during probing phase, B>fRB1+2B > \frac{f*|R|}{B-1}+2, one each for input & output
    • approximately, B>fRB > \sqrt{f*|R|}
  • Partition Overflow Problem: hash table doesn't fit in memory, recursively partition overflowed partitions
  • I/O cost = 3(|R| + |S|) if no partition overflow
  • I/O cost = (c * 2 + 1)(|R| + |S|) where cc is number of partitioning phases

Join Conditions

  • Multiple Equality-Join conditions (R.A=S.A)and(R.B=S.B)(R.A = S.A) and (R.B = S.B)
    • Index Nested Loop Join: use index on all attrs; or only on primary conjuncts, then data lookup uncovered conjuncts
    • Sort-Merge Join: sort on combinations
    • Other algorithms are unchanged
  • Inequality-Join conditions (R.A<S.A)(R.A < S.A)
    • Index Nested Loop Join: requires B+ tree index
    • Sort-Merge Join: N/A (becomes nested loop join)
    • Hash-Based Join: N/A (becomes nested loop join)
    • Other algorithms are unchanged


  • Aggregation: scan table while maintaining running information
  • Group-by aggregation:
    • sort on grouping attributes, scan sorted relation to compute aggregate
    • build hash table on grouping attributes, maintain (group-value, running-information)
  • Index Optimisation: if have covering index, use it; avoids need for sorting

Query Evaluation

  • Materialised (temporary table) Evaluation - waits for everything to be done
    • operator is evaluated only when its operands are completely evaluated or materialised
    • intermediate results are materialised to disk
    • may reduce number of rows
  • Pipelined Evaluation - requires more memory
    • output produced by operator is passed directly to parent (interleaves execution of operators)
    • operator O is blocking operator if it requires full input before it can continue (e.g. external merge sort, sort-merge join, grace-hash join)
    • Iterator Interface: top-down, demand-driven (parent calls getNext() from child)

Query Plans

  • Query has many equivalent logical query plans, which has many physical query plans.
  • Want to avoid BAD plans, not pick the best
    • Ideally minimise size of intermediate results
  • join-plan notation
    • nested-loop: left is outer, right is inner
    • sort-merge: left is outer, right is inner
    • hash-join: left is probe, right is build

Relational Algebra Rules

  1. Commutativity
    1. R×SS×RR \times S \equiv S \times R
    2. RSSRR \Join S \equiv S \Join R
  2. Associativity
    1. (R×S)×TR×(S×T)(R \times S) \times T \equiv R \times (S \times T)
    2. (RS)TR(ST)(R \Join S) \Join T \equiv R \Join (S \Join T)
  3. Idempotence
    1. πL(πL(R))πL\pi_{L'}(\pi_L(R)) \equiv \pi_{L'} if LLattrs(R)L' \subseteq L \subseteq attrs(R)
    2. σp1(σp2(R))σp1p2\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_1 \land p_2}
    3. πL(σp(R))πL(σp(πLattrs(p)(R)))\pi_L(\sigma_p(R)) \equiv \pi_L(\sigma_p(\pi_{L \cup attrs(p)}(R)))
  4. Commutating Selection w/ Binary Ops - pushes operations down to leaf node
    1. σp(R×S)σp(R)×S\sigma_p(R \times S) \equiv \sigma_p(R) \times S if attrs(p)attrs(R)attrs(p) \subseteq attrs(R)
    2. σp(RpS)σp(R)pS\sigma_p(R \Join_{p'} S) \equiv \sigma_p(R) \Join_{p'} S if attrs(p)attrs(R)attrs(p) \subseteq attrs(R)
    3. σp(RS)σp(R)S\sigma_p(R \cup S) \equiv \sigma_p(R) \cup S
  5. Commutating Projection w/ Binary Ops
    • let L=LRLsL=L_R \cup L_s, where LRattrs(R)L_R \subseteq attrs(R) and LSattrs(S)L_S \subseteq attrs(S)
    1. πL(R×S)πLR(R)×πLS(S)\pi_L(R \times S) \equiv \pi_{L_R}(R) \times \pi_{L_S}(S)
    2. πL(RpS)πLR(R)pπLS(S)\pi_L(R \Join_p S) \equiv \pi_{L_R}(R) \Join_p \pi_{L_S}(S) if attrs(p)attrs(R)LRattrs(p) \cap attrs(R) \subseteq L_R and LSL_S
    3. πL(RS)πL(R)πL(S)\pi_L(R \cup S) \equiv \pi_{L}(R) \cup \pi_{L}(S)

Query Optimisation

  1. Search space
  2. Plan enumeration - how to enumerate search space
  3. Cost model

Query Plan Trees

  • Linear if at least one operand for each join is base relation; otherwise it's bushy
  • Left-deep if each right join is base relation
  • Right-deep if each left join is base relation
  • Use DP: compute optimal cost optPlan(Si)optPlan(S_i) for each subset of relations being joined
    • i.e. for RSTR \Join S \Join T, consider optPlan({R,S,T})=min{optPlan(R)+optPlan({S,T}),...}optPlan(\{R,S,T\}) = \min\{optPlan({R}) + optPlan(\{S,T\}), ...\}
    • Enhanced DP: might be worth using sub-optimal if produces sorted order, optPlan(Si,oi)optPlan(S_i, o_i), where oio_i captures attrs sorted (or null)

Cost Estimation

  • Uniformity Assumption: uniform distribution
  • Independence Assumption: independent distribution in different attrs
  • Inclusion Assumption: assumes all rRr \in R maps to some sSs \in S, if πA(R)πB(S)||\pi_A(R)|| \le ||\pi_B(S)||

Size Estimation

For q=σp(e)q = \sigma_p(e), p=t1...tnp = t_1 \land ... t_n, e=R1×...Rne = R_1 \times ... R_n

  • qe×Πi=1n(rf(ti))||q|| \approx ||e|| \times \Pi^n_{i=1}(rf(t_i)), reduction / selectivity factor rf(ti)=σti(e)erf(t_i) = \frac{||\sigma_{t_i}(e)||}{||e||}
  • rf(A=c)1πA(R)rf(A = c) \approx \frac{1}{||\pi_A(R)||} by uniform
  • rf(R.A=S.B)1max{πA(R),πB(S)}rf(R.A = S.B) \approx \frac{1}{max\{||\pi_A(R)||,||\pi_B(S)||\}} by inclusion

Estimation w/ Histogram

  • Equiwidth: each bucket has (almost) equal number of values
  • Equidepth (* better): each bucket has (almost) equal number of tuples; sub-ranges might overlap (can however, e.g. 1-6)
  • MCV: separately keep track of top-k

Transaction Properties

  • Atomicity: all or nothing (by recovery manager)
  • Consistency: if each Xact is consistent, and DB starts consistent, ends consistent
  • Isolation: execution of Xacts are isolated (by concurrency control manager)
  • Durability: once commit, persist (by recovery manager)


  • TjT_j reads OO from TiT_i in a schedule SS if last write action on OO before Rj(O)R_j(O) is Wi(O)W_i(O)
  • TjT_j reads from TiT_i if TjT_j read some object from TiT_i
  • TiT_i performs final write on OO in a schedule SS if last write action on OO in SS is Wi(O)W_i(O)
    • determines final state


  • VE if same read-froms & same final-writes
  • VSS if VE to some serial schedule
    • VSG - (Tj,TiT_j,T_i) if TiT_i reads-from TjT_j, or TiT_i does final-write
    • VSG cyclic => not VSS
    • VSG acyclic & (serial schedule from topo-sort is VE to S) => VSS
  • Conflicting Action if
    • at least one of them is write action
    • and actions are from different transactions
  • Anomalies from Interleaving
    • dirty read problem (WR)
      • W1(x),R2(x)W_1(x), R_2(x)
    • unrepeatable read problem (RW) => same row, different value
      • R1(x),W2(x),C2,R1(x)R_1(x), W_2(x), C_2, R_1(x)
    • lost update problem (WW)
      • R1(x),R2(x),W1(x),W2(x)R_1(x), R_2(x), W_1(x), W_2(x)
  • CE: every pair of conflicting actions are ordered in the same way
  • CSS: CE to some serial schedule (CSS => VSS) ("serialisable" = CSS)
    • CSS     \iff CSG is acyclic
  • blind write: Xact no read before it writes
    • VSS & no blind writes => CSS
  • cascading abort: if TiT_i reads from TjT_j and TjT_j aborts, TiT_i must abort too for correctness
  • Recoverable Schedule (essential): for every Xact TT that commits in SS, TT must commit after TT' if TT reads from TT
  • Cascadeless Schedule: can only read from committed Xacts
  • Strict Schedule (can use before-image): for every Wi(O)W_i(O), OO is not read or written by another Xact until TiT_i abort / commit
  • Strict \subseteq Cascadeless \subseteq Recoverable

Transaction Scheduler

for each input action (read, write, commit, abort):

  1. output action to scheduler (perform the action)
  2. postpone the action by blocking Xact
  3. reject and abort Xact

Lock-based Concurrency Control

  • if lock request not granted, Xact is blocked, Xact is added to O's request queue
  • 2PL => CSS: once release a lock, no more request
  • Strict 2PL => strict & CSS: Xact must hold onto lock until commit / abort
  • Wait-For-Graph: TiTjT_i \to T_j if TiT_i waiting for TjT_j (must remove edge)
  • Timeout mechanism: when Xact start, start timer, if timeout, assume deadlock
  • Deadlock Prevention - older Xact has higher priority (not restarted on kill)
    • suppose TiT_i requests a lock held by TjT_j (Higher-Lower)
    • wait-die: TiT_i wait for TjT_j, TiT_i suicide => may starve
    • wound-wait: kill TjT_j, TiT_i wait for TjT_j
    • if TjT_j dies, TiT_i still waits
  • Lock upgrade: similar to acquiring X
  • Lock downgrade: has not modified OO, has not released any lock
  • Phantom Read Problem: re-executes query for a search condition but obtains different rows due to another recently committed transaction
    • can't lock row if don't exist => perform predicate locking instead, but use index locking in practice for efficiency
  • Isolation Level (Dirty Read, Unrepeatable Read, Phantom Read, Write, Read, Predicate)
  • Read Uncommitted: Y, Y, Y, L, N, N
  • * Read Committed: N, Y, Y, L, S, N
  • Repeatable Read: N, N, Y, L, L, N
  • Serialisable: N, N, N, L, L, Y
  • Granularity: DB, Relation, Page, Tuple
    • higher lock => lower is locked
    • intention lock: must have I-lock on all ancestors
    • acquire top-down
    • to obtain S or IS, must have IS or IX on parent
    • to obtain X or IX, must have IX on parent
    • release bottom-up

MVCC - maintain multiple ver. of each object

  • read-only are never blocked / aborted
  • MVE if same read-from
  • MVSS if MVE to some serial monoversion schedule
    • monoversion: each read action returns the most recently created object version
    • VSS \subseteq MVSS (not other way round)
  • SI: Xact T takes snapshot of committed state of DB at start of T
    • can't read from concurrent Xacts
    • Concurrent if overlap start & commits
    • OiO_i is more recent than OjO_j if TiT_i commit after TjT_j
    • Concurrent Update Property: if multiple concurrency Xact update same object, only one can commit (if not, may not be serialisable)
  • First Committer Win (FCW): check at point of commit
  • First Updater Win (FUW) - locks only used for checking (NOT lock-based)
    • to update OO: request X-lock on OO; when commit / abort, release locks
    • if not held by anyone:
      • if O has been updated by concurrent Xact: abort
      • else: grant lock
    • else: wait for T' to abort / commit
      • if T' commit: abort
      • else: use (if not held by anyone) case
  • Garbage Collection: delete version OiO_i if exists a newer version OjO_j st for every active Xact TkT_k that started after commit of TiT_i, TjT_j commits before TkT_k starts (aka all active Xact can refer to OjO_j)
  • SI performs similarly to Read Committed, but different anomalies; does not guarantee serialisablity too (violates MVSS, but not detected)
    • Write Skew Anomaly
      • Both Xact read from initial value
    • Read-Only Xact Anomaly
      • A Read-Only Xact reads values that shouldn't be possible
  • SSI: keep track of rw dependencies among concurrent Xact
    • TiT_i -rw-> TjT_j -rw-> TkT_k: abort one of them (has false positives)
    • ww from T1T2T_1 \to T_2 if T1T_1 writes to O, then T2T_2 writes immediate successor of O
      • T1T_1 commit before TjT_j and no Xact that commits between them writes to O
    • wr from T1T2T_1 \to T_2 if T1T_1 writes to O, then T2T_2 reads this ver. of O
    • rw from T1T2T_1 \to T_2 if T1T_1 reads a ver. of O, then T2T_2 writes immediate successor of O
  • Dependency Serialisation Graph (dashed if concurrent, solid if not)
    • if S is SI that is not MVSS, then (1) at least one cycle in DSG, (2) for each cycle, exists TiT_i, TjT_j, TkT_k st
      • TiT_i and TkT_k might be same Xact
      • TiT_i and TjT_j are concurrent with Tirw>TjT_i -rw-> T_j
      • AND TjT_j and TkT_k are concurrent with Tjrw>TkT_j -rw-> T_k

Recovery Manager

  • Undo: remove effects of aborted Xact to preserve atomicity
  • Redo: re-installing effects of committed Xact to preserve durability
  • Failure
    1. transaction failure: transaction aborts
      • application rollbacks transaction (voluntary)
      • DBMS rollbacks transaction (e.g. deadlock, violation of integrity constraint)
    2. system crash: loss of volatile memory contents
      • power failure
      • bug in DBMS / OS
      • hardware malfunction
    3. media failures: data is lost / corrupted on non-volatile storage
      • disk head crash / failure during data transfer

Buffer Pool

  • Can evict dirty uncommitted pages? (yes => steal, no => no-steal)
  • Must all dirty pages be flushed before Xact commits? (yes => force => no redo, no => no-force)
  • in practice: use steal, no-force (need undo & need redo)
  • no steal => no undo needed (not practical because not enough buffer pages leads to blocking)
  • force => no redo needed (hurts performance of commit because random I/O)

Log-Based DB Recovery

  • Log (trail / journey): history of actions executed by DBMS - stored as sequential file of records in stable storage - uniquely identified by LSN
  • Algorithm for Recovery and Isolation Exploiting Semantics (ARIES) - designed for steal, no-force approach, assumes strict 2PL
  • Xact Table (TT): one entry per active Xact, contains: XactID, lastLSN (most recent for Xact), status (C or U) because kept until End Log Record
  • Dirty Page Table (DPT): one entry per dirty page, contains: pageID, recLSN (earliest for update that caused dirty)

Normal Processing

Updating TT (Xact ID, lastLSN, status):

  • first log record for Xact T: create new entry in TT with status = U
  • for new log record: update lastLSN
  • when commit: update status = C
  • when end log record: remove from TT

Updating DPT (pageID, recLSN):

  • when page P is updated (and not in DPT): create new entry with recLSN = LSN (don't update this)
  • when flushed: remove from DPT

Log Records

  • default: LSN, type, XactID, prevLSN (for same Xact, first points to NULL)
  • update log record (ULR): pageID, byte offset (within page), length (in bytes of update), before-image (for undo), after-image (for redo)
  • compensation log record (CLR) - made when ULR is undone: pageID, undoNextLSN (prevLSN in ULR), action taken to undo
  • commit log record
  • abort log record - created when aborting Xact: undo is initiated for this Xact
  • end log record - created after book-keeping after commit / abort is done
  • (simple) checkpoint log record: stores Xact table
  • (fuzzy) begin_checkpoint log record: time of snapshot of DPT & TT
  • (fuzzy) end_checkpoint log record: stores DPT & TT snapshots

* only ULR and CLR are redoable log records

Implementing Abort

  • Write-ahead logging (WAL) protocol: do not flush uncommitted update until log record is flushed
  • need to log changes needed for undo
  • to enforce, each DB page contains pageLSN (most recent log record), before flushing page P, ensure all log records up to pageLSN is flushed

Implementing Commit

  • Force-at-commit protocol: do not commit until after-images of all updated records are in stable storage
  • to enforce, write commit log record for Xact, flush all log records (not data)
  • Xact is committed     \iff its commit log record is written to stable storage

Implementing Restarts (order matters)

  • Analysis phase: determines point in log to start Redo phase, identifies superset of dirtied buffer pool pages & active Xacts at time of crash
  • Redo phase: redo actions to restore DB state
  • Undo phase: undo actions of uncommitted Xacts

Analysis Phase

  • init DPT and TT to be empty
  • sequentially scan logs
if r is end log record:
  remove T from TT
  add T to TT if not in TT
  set lastLSN = r's LSN
  status = C if commit log record
if (r is redoable) & (its P not in DPT):
  add P to DPT(pageID = P, recLSN = r)

Redo Phase

  • opt cond (defn already flushed) = (P is not in DPT) or (P's recLSN in DPT > r's LSN)
redoLSN = smallest recLSN in DPT
let r = log record w/ redoLSN
start scan from r:
if (r is ULR | CLR) & (not opt cond):
  fetch page P for r
  if P's pageLSN < r's LSN:
    haven't redo, so redo action
    set P's pageLSN = r's LSN
  else: because <= P's pageLSN is OK
    set P's recLSN in DPT =
      P's pageLSN+1

at end: create end log records for Xacts with status = C, & remove from TT

Undo Phase: abort loser Xacts

init L = lastLSNs (status = U) from TT
repeat until L is empty
  delete largest lastLSN from L
  let r be log record for ^
  if r is ULR:
    create CLR r2
    r2 undoNextLSN = r's prevLSN
    r2 prevLSN = r's LSN
    undo action
    update P's pageLSN = r2's LSN
    UpdateLAndTT(r's prevLSN)
  if r is CLR:
    UpdateLAndTT(r's undoNextLSN)
  if r is abort log record:
    UpdateLAndTT(r's prevLSN)

def UpdateLAndTT(lsn):
  if lsn is not null: add lsn to L
  else: # reached first log => done
    create end log record for T
    remove T from TT

Simple Checkpointing

  • periodically perform checkpointing: suspend normal processing, wait until all current processing is done, flush all dirty pages in buffer (to sync log record & DB), write checkpoint log record containing TT, resume
  • during Analysis Phase, start from begin_ LR, init with TT, DPT in end_

Fuzzy Checkpointing (no suspension)

  • snapshot DPT & TT, write begin_ log record
  • write end_ log record (very slow to write)
  • write special master record containing LSN of begin_ to known location (for fast retrieval) in stable storage
  • during Analysis Phase, start from begin_, init with TT and DPT in end_