CS4221 Notes

AY2023/2024 Semester 2, written by See Toh Jin Wei

#nus #cs4221

DBMS

Generic platforms for implementation and management of database applications.

SQL Examples

INSERT INTO items (i_id, i_im_id, i_name, i_price)
VALUES (501, '58158' 'Bread', 520)
;

DELETE FROM stocks
WHERE w_id = 394
;

UPDATE warehouses
SET s_city = 'Johor Bahru'
WHERE s_city = 'Nusajaya' AND w_country = 'Malaysia'
;

-- UPDATE .. SET .. FROM .. WHERE .. ;
UPDATE target_table
SET customer = subquery.customer,
    address = subquery.address,
    partn = subquery.partn
FROM (SELECT address_id, customer, address, partn
      FROM  /* big hairy SQL */ ...) AS subquery
WHERE dummy.address_id=subquery.address_id
;
-- https://stackoverflow.com/a/6258586

SELECT SUM(s_qty)
FROM items i NATURAL JOIN stocks s
WHERE i.i_name = 'Aspirin'
;

SELECT *
FROM stocks s
WHERE s.s_qty BETWEEN 0 AND 10;

WHERE w.w_city LIKE 'Si%_';
-- % = .*
-- _ = .

-- Extermal Query: Getting the maximum / minimum.
-- DON'T do `ORDER BY LIMIT 1`!! because there might be more than one answer
SELECT *
FROM stocks s1
WHERE s1.s_qty = ALL (
    SELECT MAX(s2.s_qty) FROM stocks s2;
);

UNION ALL is non-duplication elimination version
UNION is duplication elimination version (may be expensive)

-- select items that are in stock in every warehouse
-- select all items where
--    there does not exist a warehouse
--        that doesn't have this item
SELECT *
FROM items i
WHERE NOT EXISTS (
    SELECT 1
    FROM warehouses w
    WHERE NOT EXISTS (
        SELECT 1
        FROM stocks s
        WHERE s.w_id = w.w_id AND s.i_id = i.i_id
    )
);

---

CREATE INDEX i_i_price ON items(i_price);

GRANT UPDATE ON stocks TO jinwei;

CREATE OR REPLACE FUNCTION myage(dob DATE)
RETURNS INTEGER AS $$
BEGIN
    RETURN DATE_PRINT('year', CURRENT_DATE) - DATE_PART('year', dob)::INTEGER;
END;
$$ LANGUAGE plpgsql;

-- trigger function + trigger
CREATE OR REPLACE FUNCTION log_change()
RETURNS TRIGGER AS $$
DECLARE
    action_text TEXT;
BEGIN
    IF TG_OP = 'INSERT' THEN
        action_text := 'INSERT';
    ELSIF TG_OP = 'DELETE' THEN
        action_text := 'DELETE';
        NEW := OLD;
    ELSE
        RETURN NEW;
    END IF;

    INSERT INTO log VALUES (current_timestamp, action_text, TG_TABLE_NAME, NEW::TEXT);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER triglogstock
AFTER INSERT OR DELETE ON stocks
FOR EACH ROW
EXECUTE FUNCTION log_change();

Group By / Window Functions

SELECT
    ROUND( PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY c. population ASC) ::NUMERIC, 0) AS percentile 25,
    ROUND( PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY c. population ASC) ::NUMERIC, 0) AS percentile 50,
    ROUND( PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY c. population ASC) ::NUMERIC, 0) AS percentile 75,
    ROUND( PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY c. population ASC) ::NUMERIC, 0) AS percentile 95
FROM country c;

WITHIN GROUP is an "inline" GROUP BY. For postgres, it is only for SELECT.

row_number() vs rank() vs dense_rank()

SALARY | ROW_NUMBER | RANK | DENSE_RANK
1000   | 1          | 1    | 1
1500   | 2          | 2    | 2
1500   | 3          | 2    | 2
2000   | 4          | 4    | 3
2200   | 5          | 5    | 4
2500   | 6          | 6    | 5
2500   | 7          | 6    | 5
2500   | 8          | 6    | 5
3000   | 9          | 9    | 6

OVER (
    [ <PARTITION BY clause> ]
    [ <ORDER BY clause> ]
    [ <ROWS or RANGE clause> ]
)

LAG(expression [,offset [,default_value]])
OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)
LEAD(expression [,offset [,default_value]])
OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ...
)

FIRST_VALUE ( value anyelement )
LAST_VALUE ( value anyelement )
NTH_VALUE ( value anyelement, n integer )

[[CS4221 Window Functions Practice]]

The ROW clause does it by specifying a fixed number of rows that precede or follow the current row.

The RANGE (only for numeric and date/time) clause, on the other hand, limits the rows logically by the value; it specifies the range of values in relation to the value of the current row.

ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING

ORDER BY price
RANGE BETWEEN 5.0 PRECEDING AND CURRENT ROW

aggregate is over the group
OVER(...) clause changes the aggregate to be over the row / range clause specified
empty OVER() clause makes the window span the entire table
- can be used to bypass limitations of group by and aggregate functions

+-----------+-------+---------+
| badge_num | name  | surname |
+-----------+-------+---------+
|         1 | John  | Smith   |
|         2 | Mark  | Pence   |
|         3 | Steve | Smith   |
|         4 | Bob   | Smith   |
+-----------+-------+---------+

SELECT surname, COUNT(*)
FROM employees
GROUP BY surname;

+---------+----------+
| surname | COUNT(*) |
+---------+----------+
| Smith   |        3 |
| Pence   |        1 |
+---------+----------+

-- empty OVER() clause
SELECT surname, COUNT(*) OVER()
FROM employees
GROUP BY surname;

+---------+-----------------+
| surname | COUNT(*) OVER() |
+---------+-----------------+
| Smith   |               2 |
| Pence   |               2 |
+---------+-----------------+

https://stackoverflow.com/a/61254720

Window Functions can only be used in SELECT and ORDER BY :/

SELECT c.name, e.continent,
    ROW_NUMBER() OVER w_continent AS rank_continent,
    ROW_NUMBER() OVER w_world AS rank_world
FROM country c INNER JOIN encompasses e ON c.code=e.country
-- define the window separately
WINDOW
    w_continent AS (PARTITION BY  e.continent ORDER BY (c.population*e.percentage) DESC),
    w_world AS (ORDER BY c.population DESC)
ORDER BY rank_world;

for ORDER BY, the default frame clause is: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
otherwise (i.e. SELECT), the default frame clause is: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

NOTE: integers will perform integer division! You have to convert one of them to decimal first.

select 1::decimal / 3;

Percentage based on condition

select
    t.request_at as Day,
    sum(case when t.status = 'cancelled_by_driver' or t.status = 'cancelled_by_client' then 1 else 0 end)::decimal / count(*) as "Cancellation Rate"
from trips t
inner join users d on t.driver_id = d.users_id
inner join users c on t.client_id = c.users_id
where d.banned = 'No' and c.banned = 'No'
and t.request_at between '2013-10-01' and '2013-10-03'
group by t.request_at
;

All X where all Y (e.g. all customers who bought all products)

select
    distinct c1.customer_id
from customer c1
where not exists (
    select 1
    from product p1
    where not exists (
        select 1
        from customer c2
        where c1.customer_id = c2.customer_id and c2.product_key = p1.product_key
    )
)
;

OLTP vs OLAP

OLTP = Online Transaction Processing

many short and simple transactions involving updates and point queries
frequent updates
queries only touch one or a few records
data must be up-to-date and consistent at all times!
- concurrency is the biggest performance concern

OLAP = Online Analytic Processing

long transactions involving complex queries
- often different queries!
updates are infrequent (once a quarter / year)
- can manually clean the data
queries touch a large amount of data (often are aggregated results)
approximate / slightly incorrect answers is fine

It is not practical to do both OLTP and OLAP on the same database system!

Data Warehouse Notes from Kimball

Book: "The Data Warehouse Toolkit Third Edition: The Definitive Guide to Dimensional Modeling" by Ralph Kimball and Margy Ross

Kimball Chapter 1
Kimball Case 1 (Retail)

Designing a data warehouse requires one to think as half a (1) administrator, and half a (2) business analyst.

Information as assets

Crudely split into two forms:

operational systems of records: where the data is put in
- turn the wheels of the organisation
- almost always one record at a time
- repeatedly perform same tasks
data warehouse: where the data is retrieved
- watch the wheels of the organisation
- data analysis is performed here
- requires hundreds or thousands of rows compressed into an answer set
- questions / queries are constantly changing

If a data warehouse is just a copy of the operational system, the usability and performance is heavily affected.

Goals of data warehouse

must make an organisation's information easily accessible
- must be understandable: intuitive and obvious to the business user (not merely the developer!)
- needs to be labelled meaningfully
- simple and easy to use, with minimal wait times
must present the organisation's information consistently
- must be credible
- carefully assembled from different sources, cleaned, quality assured, and released only after these facts
- information from different business processes should match
- if two performance measures have the same name, then they must mean the same thing!
- all data must be accounted for and complete
must be adaptive and resilient to change
- changes to the data warehouse should be graceful, and not invalidate existing data or applications
must be a secure bastion that protects our information assets (security)
must serve as the foundation for improved decision making
- data warehouse = decision supporting system
must be accepted by the business community to be successful
- data warehouse usage is sometimes optional (unlike a new operational system)
- more to do with simplicity than anything else

Non-goals of data warehouse

should make the business users want to use it
technology is just a means to the end

Data Source Systems

Would be ideal if source systems were reengineered with consistent views

Data Staging

cleaning data
- correcting misspellings
- resolving domain conflicts
- dealing with missing elements
- parsing into standard formats
combining data from multiple sources
deduplicating data
assigning warehouse keys

Dominated by simple operations like sorting + sequential processing

Possible to use normalised relational DB to handle data staging, but this would mean you have to ETL again into the final data warehouse. If doing this, these tables must be hidden from the end user!

Data Presentation

This is the area that the business community has access to.

It is a series of integrated data marts. Each data mart presents the data from a single business process.

Dimension Modelling

Normalised models (3NF, BCNF, ...) are too complicated for data warehouse queries!

A dimension model contains the same information as a normalised model, but packages the data in a format whose design goals are: user understandability, query performance, and resilience to change.

Bus Network is the best architecture for being robust and integrated.

Star Schema: one fact table + several dimension tables

Data Access Tools

Approximately 80 to 90 percent of the potential users will use pre-built applications to interface with the data, i.e. won't be writing relational queries.

Facts

Additivity is a crucial property of a fact: the rows of the column can be added together (e.g. dollars)

Semi-additive facts can be added only along some dimensions

Non-additive facts cannot be added at all: must use counts or averages to summarise the facts (or print billions of rows .-.)

Some non-additive facts like percentages and ratios can be solved by storing the numerator and denominators separately

The most useful facts are numeric and additive.

Textual Facts

In most cases, it is a description of something, and is from a discrete list (treat as enum). Should be placed into dimensions for better correlation + space-saving

Unless the text is unique for every row in the fact table, it belongs in the dimension table.

A truly textual fact (e.g. user input text) is rare because it is almost impossible to analyse (...but is this still true for modern text analysis and ML analysis?)

Fact tables typically take up 90+% of the total space (a lot of rows, but little columns) => should be careful

Fact Tables

transaction
periodic snapshot
accumulating snapshot

Must avoid null keys

Most business processes can be represented with less than 15 dimensions (if you have more than 25, look to combine the dimensions!) Perfectly correlated attributes should be in the same dimension

Use surrogate keys (autogenerated meaningless keys). Except date dimension! (let Jan 1 of first year be index value of 1, etc...). This allows for partitioning of the fact table by date

some stuff are not unique over long periods of time
- e.g. US SSN are not unique

Dimension Tables

not uncommon to have 50 to 100 attributes!

typically relatively small number of rows (far fewer than 1 million rows)

Serves as primary source of query constraints, groupings, report labels

e.g. if a user wants to see "dollar sales by week by brand"
- brand, week, brand must be available as dimension attributes

If a price change, use a new surrogate key. Thus, preserving historical data!

Four Step Dimensional Design Process

select the business process to model
- business process != business department / function
  - e.g. orders dimension, instead of sales / marketing dimensions
  - ensures better consistency
declare the grain of the business process
- specifying exactly what each fact table row represents
- answers the question: "how do you describe a single row in the fact table?", e.g.
  - a line item on a bill received from a doctor
  - a monthly snapshot for each bank account
- use lowest possible grain because: even though nobody wants to see individual low-level rows, queries need to cut through the details in very precise ways
choose the dimensions that apply to each fact table row
- if the grain is chosen well in step 2, this should be clear
identify the numeric facts that will populate each fact table row (fact table)

Dimensions

Date Dimension (excluding time)

Even 10 years worth of dates is only about 3650 rows (very small!)

include stuff like day of week, day, month, quarters as distinct columns, for better comparisons, e.g. comparing all day 25s across various months
include holiday indicators with the holiday itself (not just YES or NO!)
include weekend indicators
- enables things like comparing holiday weekends VS non-holiday weekends
include selling seasons: e.g. Christmas, Easter, ..., None (even special outside events like Labour Strike)
- note that promotions are normally kept in a separate Promotions dimension
why not just use SQL's date functions?
- joins are cheap
- number of rows is very low
- business people might not know these join capabilities
- we are storing a lot of stuff not computable by date functions
- SQL-based date key is 8 bytes (integer is only 4 bytes)! -> saved in the facts table

Time Dimension

"Date and time are almost completely independent." By combining them, the number of rows increase a lot. (~5million rows for 10 years)

Promotion Dimension

This is a causal dimension because it describes factors thought to cause a change in product sales

Managers are interested in determining if promotions are effective

need to detect "lift" (sales increase compared to baseline)
did pre-promotion sales drop (cancelling the gain in sales during promotions)
did other products drop in sales
is there market growth after the promotions
whether the promotion is profitable?

"The various possible causal conditions are highly correlated. A temporary price reduction usually is associated with an ad and perhaps an end-aisle display. Coupons often are associated with ads. For this reason, it makes sense to create one row in the promotion dimension for each combination of promotion conditions that occurs."

It is possible to separate different causal mechanisms (price reductions, ads, displays, coupons) into separate dimensions (there are tradeoffs)

How to capture products that did not sell during promotions?? (there would be no rows) "In the case of the promotion coverage fact table, we’d load one row in the fact table for each product on promotion in a store each day (or week, since many retail promotions are a week in duration) regardless of whether the product sold or not." This is a factless fact table.

How to store time dimension?

Should it be just stored as an attribute?

Should it be stored with date, under a datetime dimension?

Should it be stored as its own time dimension?

What is the right granularity for the fact table?

"We need the most finely grained data in the presentation area so that users can ask the most precise questions possible. Because users’ requirements are unpredictable and constantly changing, we must provide access to the exquisite details so that they can be rolled up to address the questions of the moment."

"The secret to query flexibility is building the fact tables at the most granular level."

SQL Tuning

Steps to SQL Tuning

Use EXPLAIN command
- see if anything is funny
- are indexes used?
- are the joins in an optimal order?
Look at pg_statistic, pg_stats to see if the frequencies / statistics make sense
- pg_stats is a view of pg_statistic
- the statistics might be outdated
- to update the statistics, run ANALYZE;
Use VACUUM / VACUUM FULL (defragmentation)
Use ANALYZE (gather and update statistics)

 SELECT n.nspname AS schemaname,
    c.relname AS tablename,
    a.attname,
    s.stainherit AS inherited,
    s.stanullfrac AS null_frac,
    s.stawidth AS avg_width,
    s.stadistinct AS n_distinct,
        CASE
            WHEN (s.stakind1 = 1) THEN s.stavalues1
            WHEN (s.stakind2 = 1) THEN s.stavalues2
            WHEN (s.stakind3 = 1) THEN s.stavalues3
            WHEN (s.stakind4 = 1) THEN s.stavalues4
            WHEN (s.stakind5 = 1) THEN s.stavalues5
            ELSE NULL::anyarray
        END AS most_common_vals,
        CASE
            WHEN (s.stakind1 = 1) THEN s.stanumbers1
            WHEN (s.stakind2 = 1) THEN s.stanumbers2
            WHEN (s.stakind3 = 1) THEN s.stanumbers3
            WHEN (s.stakind4 = 1) THEN s.stanumbers4
            WHEN (s.stakind5 = 1) THEN s.stanumbers5
            ELSE NULL::real[]
        END AS most_common_freqs,
        CASE
            WHEN (s.stakind1 = 2) THEN s.stavalues1
            WHEN (s.stakind2 = 2) THEN s.stavalues2
            WHEN (s.stakind3 = 2) THEN s.stavalues3
            WHEN (s.stakind4 = 2) THEN s.stavalues4
            WHEN (s.stakind5 = 2) THEN s.stavalues5
            ELSE NULL::anyarray
        END AS histogram_bounds,
        CASE
            WHEN (s.stakind1 = 3) THEN s.stanumbers1[1]
            WHEN (s.stakind2 = 3) THEN s.stanumbers2[1]
            WHEN (s.stakind3 = 3) THEN s.stanumbers3[1]
            WHEN (s.stakind4 = 3) THEN s.stanumbers4[1]
            WHEN (s.stakind5 = 3) THEN s.stanumbers5[1]
            ELSE NULL::real
        END AS correlation,
        CASE
            WHEN (s.stakind1 = 4) THEN s.stavalues1
            WHEN (s.stakind2 = 4) THEN s.stavalues2
            WHEN (s.stakind3 = 4) THEN s.stavalues3
            WHEN (s.stakind4 = 4) THEN s.stavalues4
            WHEN (s.stakind5 = 4) THEN s.stavalues5
            ELSE NULL::anyarray
        END AS most_common_elems,
        CASE
            WHEN (s.stakind1 = 4) THEN s.stanumbers1
            WHEN (s.stakind2 = 4) THEN s.stanumbers2
            WHEN (s.stakind3 = 4) THEN s.stanumbers3
            WHEN (s.stakind4 = 4) THEN s.stanumbers4
            WHEN (s.stakind5 = 4) THEN s.stanumbers5
            ELSE NULL::real[]
        END AS most_common_elem_freqs,
        CASE
            WHEN (s.stakind1 = 5) THEN s.stanumbers1
            WHEN (s.stakind2 = 5) THEN s.stanumbers2
            WHEN (s.stakind3 = 5) THEN s.stanumbers3
            WHEN (s.stakind4 = 5) THEN s.stanumbers4
            WHEN (s.stakind5 = 5) THEN s.stanumbers5
            ELSE NULL::real[]
        END AS elem_count_histogram
   FROM (((pg_statistic s
     JOIN pg_class c ON ((c.oid = s.starelid)))
     JOIN pg_attribute a ON (((c.oid = a.attrelid) AND (a.attnum = s.staattnum))))
     LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
  WHERE ((NOT a.attisdropped) AND has_column_privilege(c.oid, a.attnum, 'select'::text) AND ((c.relrowsecurity = false) OR (NOT row_security_active(c.oid))));

Explain

need to minimise planning time (otherwise there's no benefit to planning)
- some older versions of postgres used to limit the planning time to 50% of the slowest query plan

EXPLAIN shows 849.87..849.88.

the first one is the startup cost
- startup cost = cost before any output is produced
the second one is the total cost
- this is the one that matters
transmission costs are excluded

The cost is in arbitrary units, but is roughly proportional to the actual execution time. More importantly, it is used to relatively compare the different plans.

NOTE: the query planner is re-planning the query every time a query is made. This is required because the data / stats will keep changing. Even with PREPARE statements, the query planner may re-plan the queries (for modern postgres versions).

Calling ANALYZE; (different from EXPLAIN (ANALYZE)) can be used to collect statistics, but while it is doing this, it will affect the query performance of other queries. However, this must be manually called. Future optimisations: use reinforce learning to estimate the best times to run statistics collections?

Explain Analyze (Explain Analyse)

EXPLAIN (ANALYZE)

Creates the query plan AND actually executes it.

This allows it to include information from an actual execution to detect bad estimations.

NOTES:

need to multiply actual time with loops to get the actual time
there is overhead when doing this
it is a good idea to run this multiple times and take averages

(Intentionally) Slow Queries

Many joins to the same table
CTEs
nested queries
useless WHERE conditions: 1=1, 22=11*2, ...

^ depends heavily on the optimiser

Indexes

-- UNIQUE checks for duplicate values
CREATE [ UNIQUE ] INDEX [ name ] ON table name
[ USING method ] -- btree (default), hash, gist, ...
( { columnname | ( expression ) } )
[ WHERE predicate ]

What internal SQL queries are called to check if a value (to be added) is unique?

Order of attributes in index matter (for B+ tree indexes) for multicolumn indexes!!

Generalized Inverted is for text searching

PostgreSQL has partial indexes, i.e. only index the rows that satisfy some predicate

-- partial index example
CREATE INDEX stocks_s_qty ON stocks(s_qty) WHERE s_qty>600;

Clustering Data

In PostgreSQL, indexes do not affect the data in the file (e.g. order). One reason: there are indexes so what should you order the data by? Some DBs like Oracle can do this?

but, CLUSTER <table> BY <index>. This is useful for data analytics. If the table is updated, this operation has to be run again. PostgreSQL doesn't automatically re-cluster it, so it isn't that useful for transactional DBs.

Bitmap-Heap & Bitmap-Index Scan

https://stackoverflow.com/questions/55651068/why-is-bitmap-scan-faster-than-index-scan-when-fetching-a-moderately-large-perce

Index-Only Scan

PostgreSQL lets you add extra attributes to an index

e.g. an index of (w_id, i_id) has the value of s_qty included

create index stocks_w_id_i_id on stocks(w_id, i_id) include (s_qty);

-- it allows this query to be index-only:
select s_qty
from stocks
where w_id = 1 and i_id = 1

Selectivity & Scans

Not selective: sequential scan
Fairly selective (approx < 15%): bitmap-heap scan
Very selective (approx < 0.1%): index scan
- because we still need to retrieve the data
Index-only scan (all required data can be found in the index data structure)
- don't even need to access the actual data pages

^ these ratios are always changing (and dependent on many factors)

Clustering the table using an index can also change the scan used! (but not very practical outside of data analytics context)

Temporary Tables

Temporary tables do not have indexes when they're copied. PostgreSQL doesn't make statistics for them.

Joins

Nested-Loop joins are at worse block-based, not row-based and not page-based.

Nested Loop Join with Inner Sequential Scan
- outer table is smaller (to fit in the memory)
Nested Loop Join with Materialised Inner Sequential Scan
- can be materialised if there are several iterations of the inner loop
Nested Loop Join with Inner Index Scan
Hash Join
- only good if the hashed table is tiny, ideally fits in-memory
- in general, the optimiser will choose a Hash Join
Merge Join
- if there are no statistics, PostgreSQL might do a Merge Join
  - note: ANALYZE can be used to gather and update statistics
- requires data to be sorted
Semi-Join
- if the inner table is only used for including results
Anti-Join
- if the inner table is only used for excluding results, checks that a row doesn't match any of the rows in the inner table
Outer Join

IN, EXISTS, = ANY, <> ALL, NOT IN, OUTER JOIN are optimised fairly differently (even though they're fairly similar)

PostgreSQL sucks at NOT IN (try using EXISTS or something else for better performance)

Join Order is chosen by the optimiser.

Optimising Performance Notes

basically, try to give the DBMS as much information as possible in order to let it plan better

build statistics
add indexes (or not)

Normalised / Denormalised Schemas

Normalised
- for data integrity / consistency
  - using primary keys & foreign keys
- tradeoff for slower queries (because need to JOIN, which are expensive!)
Denormalised
- insertions, deletions and updates are more complicated + some constraints cannot be maintained
  - need to do manual propagation using triggers
- some select queries might be faster

Views

Views are just used as a subquery, it is purely for convenience, 0 performance difference.

CREATE VIEW vall AS
SELECT *
FROM warehouses w
NATURAL JOIN stocks s
NATURAL JOIN item i
;

-- still needs to join!!
EXPLAIN ANALYZE SELECT * FROM vall v
WHERE v.w name='Agimba'
;

Materialised Views

Middle ground between normalised and denormalised schema. Updates are costly, but they are manually triggered using REFRESH (or using automatic triggers). So, the cost can be controlled.

Unfortunately, currently, PostgreSQL does not do any optimisation to propagate updates (the original SELECT query is ran again; i.e. it does not store any partial state to only update the changed rows).

There is an actual table being created with the results of that query.

CREATE MATERIALIZED VIEW mvall AS
SELECT *
FROM ...
;

-- MUST manually refresh
REFRESH MATERIALIZED VIEW mvall;

Materialised views can be indexed too.

Prepare

Need to replan the query plan every time because data can change between queries.

Now, PostgreSQL still runs the optimiser and tries to see if there's a better query (if there are changes to the DB -> data / indexes / stats).

PREPARE q AS
SELECT s.i_id
FROM stocks s
WHERE s.s_qty > 500;

EXECUTE ANALYZE EXECUTE q;

DEALLOCATE q;

Optimiser Hints

Some DBMSs (e.g. MariaDB) let you use hints (e.g. tell the DBMS to use a certain index). But, not recommended unless the statistics don't change.

SELECT *
FROM warehouses
USE INDEX (i_i_price) WHERE i.i_price < 100
;

IGNORE INDEX

FORCE INDEX

-- force a certain JOIN order
SELECT *
FROM warehouses w STRAIGHT_JOIN stocks s ON ...
;

https://mariadb.com/kb/en/optimizer-hints/

PostgreSQL doesn't have this.

Other Reasons Why Queries are Slow

poor database configuration
- increase work_mem
tuples are scattered, tables and indexes are bloated
- VACUUM, CLUSTER, VACUUM FULL, reindexing
missing indexes
- CREATE INDEX
PostgreSQL chose a bad plan
- ANALYZE

Hard-tune the queries as a last resort and at every users' current and future risk.

Entity Relationship Diagrams

Very natural way to represent the world.

ER diagrams can provide a global view of the schemas.

An object-oriented model uses internal ID to identify an object.
A value-oriented model uses a combination of values to identify an object.
- the ER model is this
- in ER models, objects can be uniquely identified by the combination of values

Null-ary relationship (relationship set with 0 entity sets) is degenerate: it is either the empty set or a single term. Seldom, if ever, used.

ER diagrams can be automatically converted to SQL DDL tables, except:

domains (type) of attributes
choose PK if there are several candidate keys
- others become UNIQUE NOT NULL
write extra CHECK constraints for stuff that cannot be represented / translated from ER

Relationship Sets

Key is the combination of the keys of all entity sets involved. These individual keys are also foreign keys to the entity sets.

Exceptions:

in one-to-many relationship sets:
- change the PK to be the key of the entity set with ONE in the relationship
- or use UNIQUE constraints on the individual keys
- ^ doesn't enforce mandatory
in mandatory one-to-one relationship sets:
- merge the relationship set and the entity set into one table
- AND use the primary key of the entity set
weak entity set
- merge the relationship set and the entity set into one table
- AND use the primary key of the entire weak entity set (combination of weak guy + strong guy)

The stuff generated by DBMS tools are Logical Diagrams representing the implemented models, NOT ER Diagrams!!

XML Technologies

XML has no types. Handle by checking if it looks like a string?
XML is case sensitive! (unlike HTML)
Can be zipped / compressed very easily!

<!-- XML declaration in the prologue !-->

<?xml version="version_number"
    encoding="encoding_declaration"
    standalone="standalone_status" ?>

XML document is well-formed if it complies with the XML basic rules. (not DTD!!)

<?xml version="1.0" standalone="no" ?>
<!DOCTYPE html PUBLIC
"=//W3C//DTD XHTML 1.0 Transitional //EN"
"https://www.w3.org/TR/xhtml1/DTD/xhtml1=transitional.dtd">

Can use XML in (some) Relational DBMS by using the XML domain.

http://exist-db.org/exist/apps/homepage/index.html

XML Structure

An XML document is a labelled (there are annotations attached to each node), unranked (no priori bound on the number of children of a node), ordered (there is an order between children of a node) tree.

Logic might differ based on programming language / library :/

XML Entities

<: <
>: >
&: &
': '
": "

More can be defined in the DTD too.

XML Namespace

<p:custDataLst></p>

The name space is custDataLst

Name space is defined in a URL at the start of the XML document.

XML Literal Section

aka CDATA (Character Data) section

Anything in this section will be implemented as regular text, not markup.

<![CDATA[
    <!-- freely use symbols like <, > etc here -->

    if ((i < 5) && (j > 6))
        printf("error");
]]>

An alternative is to use stuff like < instead.

CData explanation: https://stackoverflow.com/a/2784200

XML Typing

Document Type Definition
XML Schema
RELAX NG (REgular LAnguage for XML Next Generation)

Document Type Definition (DTD)

XML is schema-less by default, but you can include a Document Type Definition (DTD) in the prologue.

can be external or internal
- can be specified externally using an URI
indicates that the XML is valid (follows the DTD)
- different from well-formed

<!DOCTYPE email [
    <!ELEMENT email (header, body)>
    <!ELEMENT header (from, to, cc?)>
    <!ELEMENT to (#PCDATA)>
    <!ELEMENT from (#PCDATA)>
    <!ELEMENT cc (#PCDATA)>
    <!ELEMENT body (paragraph*)>
    <!ELEMENT paragraph (#PCDATA)>
]>
<email>
    <header>
        <from>x@gmail.com</from>
        <to>y@gmail.com</to>
    </header>
    <body />
</email>

XML Schema

More modern alternative to DTD.

<?xml version=”1.0”?>
<xs:schema xmlns:xs=”http://www.w3.org/2001/XMLSchema”>
    <xs : element name=”note”>
        <xs : complexType>
            <xs : sequence>
                <xs:element name=”to” type=”xs:string” minOccurs=’1’ maxOccurs=’1’/>
                <xs : element name=”from” type=”xs : string”/>
                <xs : element name=”heading” type=”xs : string”/>
                <xs : element name=”body” type=”xs : string”/>
            </xs : sequence>
        </xs : complexType>
    </xs : element>
</xs : schema>

XPath

Navigation for the XML tree

Simple SELECT query from one table

Every single XML tree starts with a special root node, which has ONE single child node: the actual root node.

XPath Paths

Absolute paths start with an /, starts from the root node (but it starts from the children of the current node in short syntax). Relative paths don't!

Shorthand syntax is sometimes ambiguous.

<!-- full syntax -->
/child::mondial/child::country

<!-- shorthand syntax (BANNED in cs4221) -->
/mondial/country

axis::nodetest[predicate]

axis: specifies the tree relationship between the nodes selected by the location step and the context node
- self, child, descendant, parent, ancestor, ...
- preceding, following uses document-semantics, not tree-semantics
node test: specifies the node type and expanded-name of the nodes selected by the location step
zero or more predicates: arbitrary expressions to further refine the set of nodes selected by the location step

XPath Duplicates

Removes duplicates from the result
BUT if A is a result and A's child is ALSO a result, both are returned!!

XPath Following, Preceding

following: AFTER the closing of the element
preceding: BEFORE the opening of the element
if nested: neither!

These follow document-semantics, ignore the tree.

XPath Predicates

[attribute::car_code]: element has an attribute car_code
[attribute::car_code="AL"]: element has an attribute car_code of value AL
- has >, <, <=, >=, !=
[child::country]: element has a child element country (can be a path)
[child::country="Singapore"]: element has a child element country (can be a path)
| is a union-kinda operator
- expresses disjunction in path expressions
- /descendant::D | /descendant::C returns descendants that are D or C
and, or, not()

Predicates can be used along any path.

XPath Functions

count(<xpath expression>)
node::name()
node::text()

(XQuery has more functions)

XQuery

"JOIN" kind of query

XQuery 1.0 is a strict superset of XPath 2.0

lets you combine several results
reformat results

FLOWR expressions

for: selects a sequence of nodes
let: binds a sequence to a variable
order by <expression> [ascending / descending]: sort
- defaults to ascending
where: filter
group by <expression>: group by
return: returns the result (gets evaluated once for every node)

Note that XQuery performs auto type casting.

<!-- for -->

<results>
{
for $x in doc("example.xml")/child::R/child::A/child::*
order by data($x/attribute::a) descending
where $x/child::text() > 2
    return <result>{$x/attribute::a}</result>
}
</results>

<!-- let -->

<results>
{
let $x := doc("example.xml")/child::R/child::A/child::*
return <result>{$x}</result>
}
</results>

<!-- group by -->

<results>
{
for $x in doc("example.xml")/descendant::*
let $y := name($x)
group by $y
return <result><element>{$y}</element><count>{count($x)}</count></result>
}
</results>

<!-- join -->

<results>
{
for $x in doc("example.xml")/child::R/child::A/child::*
for $y in doc("example.xml")/child::R/child::A/child::*
where name($x) = name($y) and $x < $y
return <result>{$x} {$y} </result>
}
</results>

XQuery Functions

(some here in XPath too)

count(<expr>)
node::name()
node::text()
max(<expr>)
distint-values(<expr>)
- e.g. distinct-values(doc("data.xml")/child::restaurants/...)
<expr> div <expr>: division, don't use /
- e.g. 4 div 2 = 2

XSLT

Extensible Stylesheet Language Transformations

(not tested)

XSLT uses a subset of Xpath

Interactive / transformative language. Used as a general purpose XML processing language.

Other XML-based markup languages

SVG
- vector image
XBRL
- eXtensible Business Reporting Language
- used for business / accounting
HTML5
and many many more

Dependencies

functional dependencies: generalisation of primary keys
inclusion dependencies: generalisation of foreign keys
primary keys
unique constraints
foreign key constraints
join dependencies

Functional Dependencies

$X \to Y \iff$ if two tuples agree on X-values, they must agree on the Y-values.

trivial $\iff$ $Y \subseteq X$
non-trivial $\iff$ $Y \nsubseteq X$
completely non-trivial $\iff Y \ne \emptyset$ and $Y \cap X = \emptyset$
- Y is not empty, and X and Y has no intersection
superkey = set of attributes that uniquely determines the entire tuple
candidate key = minimal superkey
prime attribute = attribute in some candidate key
closure of $\Sigma$ is $\Sigma+$
two sets of functional dependencies are equivalent $\iff$ their closures are equal
closure of $S \subseteq \Sigma$ is the set of all attributes that are functionally dependent on S
- $\{ A \in R \mid \exists (S -> \{A\}) \in \Sigma+ \}$
- computable by fixed point iteration

Compact Minimal Cover

Compact Minimal Cover = Canonical Cover

Minimal conditions:

RHS is minimal
- $X \to \{A\}$ form
LHS is minimal
- does not exist a $Y \to \{A\}$ in $\Sigma+$ such that Y is a proper subset of X
set itself is minimal
- no FD in the set can be derived from the other FDs

Every set of functional dependencies has a minimal cover (also known as a minimal basis).

Every set of functional dependencies has a compact minimal cover.

Compact conditions:

there are no different functional dependencies with the same LHS

Algorithm:

Simplify RHS of every FD by breaking it into multiple FD
Simplify LHS by removing redundant attributes on LHS
- redundant if obtainable from other FDs
- if some other $X \cup Y$ is a subset of this LHS: set LHS = LHS - Y
Remove FDs obtainable through Armstrong (mostly transitive)
- pretend this FD does not exist
- if we can still derive it, it is useless
- draw hypergraph for fast computation
Regroup all FD with same LHS
- only for compact minimal cover

Multi-valued dependencies

(informal definition)

let tuple be $(X, Y, R - X - Y)$
then, if $(a, b, c)$ and $(a, d, e)$ exists in some instance $r$
- $(a, b, e)$ and $(a, d, c)$ must exist in $r$

Each X-value in $r$ is consistently associated with one set of Y-values.

trivial $\iff$ $Y = R - X$ or $Y \subseteq X$

Inclusion Dependencies

A generalisation of foreign key constraint.

Instances $r_1$ and $r_2$ satisfy the inclusion dependency $R_1[X] \subseteq R_2[Y]$ $\iff \pi_X(r_1) \subseteq \pi_Y(r_2)$

(informal definition)

For all tuples in $r_1$ , there exists a tuple in $r_2$ whereby $t_1.x = t_2.y$

Join Dependencies

Trivial if one of the relations $R_k$ is equal to $R$ .

Armstrong Axioms

Functional Dependency Axioms

Reflexivity: $(Y \subseteq X) \implies (X \to Y)$
- trivial FD
Augmentation: $(X \to Y) \implies (X \cup Z \to Y \cup Z)$
- not completely non-trivial FD
Transitivity: $(X \to Y \land Y \to Z) \implies (X \to Z)$
- redundant FD

These 3 (above) form a sound and complete system.

Weak Reflexivity: $X \to \emptyset$
Weak Augmentation: $(X \to Y) \implies (X \cup Z \to Y)$
Union: $(X \to Y \land X \to Z) \implies (X \to Y \cup Z)$

Multi-valued Dependency Axioms

Complementation: $(X \twoheadrightarrow Y) \implies (X \twoheadrightarrow R - X - Y)$
Augmentation: $((X \twoheadrightarrow Y) \land (V \subseteq W)) \implies (X \cup W \twoheadrightarrow Y \cup V)$
Transitivity: $((X \twoheadrightarrow Y) \land (Y \twoheadrightarrow Z)) \implies (X \twoheadrightarrow Z - Y)$
- $Z-Y$ accounts for $Y \cap Z$
Replication / Promotion: $(X \to Y) \implies (X \twoheadrightarrow Y)$
- FD is special case of MVD
Coalescence: $((X \twoheadrightarrow Y) \land (W \to Z) \land (Z \subseteq Y) \land (W \cap Y = \emptyset)) \implies (X \to Z)$
- special case if Z = Y: $((X \twoheadrightarrow Y) \land (W \to Y) \land (W \cap Y = \emptyset)) \implies (X \to Y)$

These 5 (above) form a sound and complete system.

Multi-valued Union: $((X \twoheadrightarrow Y) \land (X \twoheadrightarrow Z)) \implies (X \twoheadrightarrow Y \cup Z)$
Multi-valued Intersection: $((X \twoheadrightarrow Y) \land (X \twoheadrightarrow Z)) \implies (X \twoheadrightarrow Y \cap Z)$
Multi-valued Difference: $((X \twoheadrightarrow Y) \land (X \twoheadrightarrow Z)) \implies (X \twoheadrightarrow Y - Z)$

Normal Forms

Used to maintain functional dependencies

important notes:

$\Sigma$ is sufficient for all normal forms
or: it only has to satisfy one of the rules

5NF $\subset$ 4NF $\subset$ BCNF $\subset$ 3NF $\subset$ 2NF $\subset$ 1NF

5NF $\ne$ 4NF $\ne$ BCNF $\ne$ 3NF $\ne$ 2NF $\ne$ 1NF

2NF

Non-prime attributes are fully dependent on each candidate key.

2NF $\iff$ for all functional dependency $X \to \{A\} \in \Sigma+$

$X \to \{A\}$ is trivial or
A is a prime attribute or
X is not a proper subset of a candidate key
- if X is a proper subset, it is dependent on a candidate key

3NF

Fixes BCNF's issue of not preserving some FDs

3NF $\iff$ for all functional dependency $X \to \{A\} \in \Sigma+$

$X \to \{A\}$ is trivial or
X is a superkey or
A is a prime attribute
- avoids the problem of losing some FDs in BCNF

* most 3NF is also BCNF

BCNF

No attribute is transitively dependent on any candidate key.

BCNF $\iff$ for all functional dependency $X \to \{A\} \in \Sigma+$

$X \to \{A\}$ is trivial or
X is a superkey

4NF (Multi-Valued Dependencies)

4NF $\iff$ for all multi-valued dependency $X \twoheadrightarrow Y \in \Sigma+$

$X \twoheadrightarrow Y$ is trivial or
- $Y = R - X$ or
- $Y \subset X$
X is a superkey

5NF (Join Dependencies)

5NF $\iff$ for all join dependencies $*(\pi_{R_1}, ..., \pi_{R_n}(R)) \in \Sigma+$ , where $R_i \subset R$

the join dependency is trivial or
- exists $R_i = R$
each $R_i$ is a superkey of R

Normalisation

Decomposition
Synthesis Method
Entity-Relationship Approach
- design with ERD
- might not be fully normalised

Decomposition & Synthesis Method uses only FD => might not match the entities

A probably better way (that is used in practice):

use ERD first
then, decompose / synthesis each of the individual tables that require normalisation
then, fix multi-valued dependencies & join dependencies

Losslessness

The natural join of all fragments is equivalent to the original relation.

The decomposition is lossless if the full outer join of the two tables on the primary = foreign key reconstitutes the original table.

Otherwise, lossy join (too many rows OR missing rows)

To check, use:

Simplified Chase (binary decomposition)
Distinguished Chase (any number of fragments)
Relational Algebra:
- $R = R_1 \cup R_2$ AND
  - $R_1 \cap R_2 \to R_1$
  - $R_1 \cap R_2 \to R_2$

Projected Functional Dependency

For a fragment $R' \subset R$ , its functional dependencies $\Sigma' \subset \Sigma$ is $\{ X \to Y \in \Sigma+ \mid X \subset R' \land Y \subset R' \}$

* note the Sigma +

Dependency Preservation

$\Sigma+ = (\Sigma_1 \cup ... \cup \Sigma_n)+$

To check, use

attribute closure algorithm
chase algorithm
Armstrong axioms ._. or eyeball
- check closures for each decomposed relation
- union the closures
- Armstrong / eyeball to try and find all in $\Sigma+$

Decomposition Algorithm

Works for any of the normal forms.

First, compute:

attribute covers
candidate keys
check if R is in the normal form

Let $X \to Y$ be a functional dependency that violates some normal form. Decompose into 2:

$R_1 = X^+$
$R_2 = (R - X^+) \cup X$

Then, check whether the projected functional dependencies are in BCNF. Repeat if necessary.

Guarantees

guarantees lossless decomposition
does not guarantee dependency preserving

4NF Decomposition

Let $X \twoheadrightarrow Y$ be a 4NF violation. Decompose into 2:

$R_1 = X \cup Y$
$R_2 = R - (Y - X)$

Any relation can be non-loss decomposed into an equivalent collection of 4NF relations.

not dependency preserving
- might not exist a lossless dependency preserving decomposition in 4NF
- may exist, but not reachable by binary decomposition
does not always find all keys

Synthesis Algorithm

compute attribute covers
compute candidate keys
check if R is in 3NF
compute the (compact) minimal cover
- ideally compact
For each function dependency $X \to Y$ in the cover, create a relation $R_i = X \cup Y$ unless it already exists or is a subset of some other relation
Then, if none of the created relations contains (subset) one of the candidate keys
- create a relation with that key
- this handles the case where some attributes are missing in all FDs (these attributes are definitely in all keys)

Guarantees

guarantees lossless decomposition
guarantees 3NF
guarantees dependency preserving
- by construction
often is BCNF

Another 4NF Decomposition Algorithm

Normalise the relation R into a set of 3NF and/or BCNF relations
For each relation not in 4NF, if all attributes belong to the same key and there exists non-trivial multi-valued dependencies in the relation, then decompose the relation into 2 smaller relations (don't if you will lose functional dependencies)

The Chase Algorithm

Returns definite yes or definite no.

Chase Algorithm is sound and complete.

For functional and multi-valued dependencies, the algorithm always terminates.

If the answer is no, the final state produced is the counter-example to what's being tested.

Simple Chase

Start with two rows of all the attributes
1. row 1: a1, b1, c1, ...
2. row 2: a2, b2, c2, ...
To chase A -> D or A ->> D: (prime the algorithm with this query)
1. Set row 2's LHS values (A) to a1, b2, c2, ...
For each functional dependency
1. regular FD: LHS -> RHS
  1. for each pair of same LHS, set RHS to be the same
2. multi-valued FD: LHS ->> RHS
  1. let OTHER = R - LHS - RHS
  2. for all pair with same LHS
    1. create 2 new copies of the pair with same LHS, but swap their OTHER and RHS with the corresponding row
    2. i.e. a1, b1, c1, d1 + a1, b2, c2, d2 (A ->> BC)
    3. add these 2 rows => a1, b2, c2, d1 + a1, b1, c1, d2
Stop when the primed FD is satisfied or nothing else to do
1. If nothing else to do: this is the counterexample
2. Regular FD: Stop when all rows with same LHS, has the same RHS (of the primed query)
3. Multi-valued FD: satisfies the FD

Simple Chasing Lossless

Only works for binary decomposition

let $R = X \cup Y \cup Z$
let $R_1 = X \cup Y$
let $R_2 = X \cup Z$
Use Simplified Chase to decide X ->> Y (or X ->> Z)

Distinguished Chase

Regular FD

Checking X -> Y

row 1: all distinguished
row 2: distinguish all values in X

Chase until all variables in column Y are distinguished (in all rows)

Multi-Valued FD

Check X ->> Y

row 1: distinguish all values in $X \cup Y$
row 2: distinguish all values in $X \cup (R - X - Y)$

Chase until there's a row of distinguished

Join Dependency / Lossless Decomposition

create a row for each of the decomposed tables
distinguish the columns in that row that are in the corresponding table

Chase until there's a row of distinguished

Compact minimal cover
Attribute closure