Generic platforms for implementation and management of database applications.
INSERT INTO items (i_id, i_im_id, i_name, i_price)
VALUES (501, '58158', 'Bread', 520)
;
DELETE FROM stocks
WHERE w_id = 394
;
UPDATE warehouses
SET w_city = 'Johor Bahru'
WHERE w_city = 'Nusajaya' AND w_country = 'Malaysia'
;
-- UPDATE .. SET .. FROM .. WHERE .. ;
UPDATE target_table
SET customer = subquery.customer,
address = subquery.address,
partn = subquery.partn
FROM (SELECT address_id, customer, address, partn
FROM /* big hairy SQL */ ...) AS subquery
WHERE target_table.address_id = subquery.address_id
;
-- https://stackoverflow.com/a/6258586
SELECT SUM(s_qty)
FROM items i NATURAL JOIN stocks s
WHERE i.i_name = 'Aspirin'
;
SELECT *
FROM stocks s
WHERE s.s_qty BETWEEN 0 AND 10;

SELECT *
FROM warehouses w
WHERE w.w_city LIKE 'Si%_';
-- % = .*
-- _ = .
-- Extremal query: getting the maximum / minimum.
-- DON'T do `ORDER BY ... LIMIT 1`, because there might be more than one row with the extreme value.
SELECT *
FROM stocks s1
WHERE s1.s_qty = ALL (
    SELECT MAX(s2.s_qty) FROM stocks s2
);
UNION ALL does not eliminate duplicates.
UNION eliminates duplicates (may be expensive).
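A minimal sketch of the difference, reusing the w_id column that appears in both warehouses and stocks above:
-- duplicates eliminated (one row per distinct w_id; may be expensive)
SELECT w_id FROM warehouses
UNION
SELECT w_id FROM stocks;
-- duplicates kept (no elimination step)
SELECT w_id FROM warehouses
UNION ALL
SELECT w_id FROM stocks;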
-- select items that are in stock in every warehouse
-- select all items where
-- there does not exist a warehouse
-- that doesn't have this item
SELECT *
FROM items i
WHERE NOT EXISTS (
SELECT 1
FROM warehouses w
WHERE NOT EXISTS (
SELECT 1
FROM stocks s
WHERE s.w_id = w.w_id AND s.i_id = i.i_id
)
);
---
CREATE INDEX i_i_price ON items(i_price);
GRANT UPDATE ON stocks TO jinwei;
CREATE OR REPLACE FUNCTION myage(dob DATE)
RETURNS INTEGER AS $$
BEGIN
RETURN (DATE_PART('year', CURRENT_DATE) - DATE_PART('year', dob))::INTEGER;
END;
$$ LANGUAGE plpgsql;
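Usage sketch (note the function only subtracts the year parts, so birthdays later in the year are ignored):
SELECT myage(DATE '1990-05-01');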
-- trigger function + trigger
CREATE OR REPLACE FUNCTION log_change()
RETURNS TRIGGER AS $$
DECLARE
action_text TEXT;
BEGIN
IF TG_OP = 'INSERT' THEN
action_text := 'INSERT';
ELSIF TG_OP = 'DELETE' THEN
action_text := 'DELETE';
NEW := OLD;
ELSE
RETURN NEW;
END IF;
INSERT INTO log VALUES (current_timestamp, action_text, TG_TABLE_NAME, NEW::TEXT);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE TRIGGER triglogstock
AFTER INSERT OR DELETE ON stocks
FOR EACH ROW
EXECUTE FUNCTION log_change();
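The trigger function above inserts four values, so it assumes a log table shaped roughly like this (a sketch; column names are assumptions):
CREATE TABLE log (
    logged_at  TIMESTAMP,
    action     TEXT,
    table_name TEXT,
    row_data   TEXT
);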
SELECT
    ROUND( (PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY c.population ASC))::NUMERIC, 0) AS percentile_25,
    ROUND( (PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY c.population ASC))::NUMERIC, 0) AS percentile_50,
    ROUND( (PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY c.population ASC))::NUMERIC, 0) AS percentile_75,
    ROUND( (PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY c.population ASC))::NUMERIC, 0) AS percentile_95
FROM country c;
WITHIN GROUP is an "inline" GROUP BY. For Postgres, it is only for SELECT.
row_number() vs rank() vs dense_rank():
SALARY | ROW_NUMBER | RANK | DENSE_RANK
1000 | 1 | 1 | 1
1500 | 2 | 2 | 2
1500 | 3 | 2 | 2
2000 | 4 | 4 | 3
2200 | 5 | 5 | 4
2500 | 6 | 6 | 5
2500 | 7 | 6 | 5
2500 | 8 | 6 | 5
3000 | 9 | 9 | 6
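A sketch of the query that would produce the comparison above, assuming an employees table with a salary column:
SELECT salary,
       ROW_NUMBER() OVER (ORDER BY salary) AS row_number,
       RANK()       OVER (ORDER BY salary) AS rank,
       DENSE_RANK() OVER (ORDER BY salary) AS dense_rank
FROM employees
ORDER BY salary;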
OVER (
[ <PARTITION BY clause> ]
[ <ORDER BY clause> ]
[ <ROWS or RANGE clause> ]
)
LAG(expression [,offset [,default_value]])
OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
LEAD(expression [,offset [,default_value]])
OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
FIRST_VALUE ( value anyelement )
LAST_VALUE ( value anyelement )
NTH_VALUE ( value anyelement, n integer )
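Sketch of LAG / LEAD in use, assuming a monthly_sales(month, amount) table:
SELECT month,
       amount,
       amount - LAG(amount, 1, 0) OVER (ORDER BY month) AS change_from_previous,  -- default 0 when there is no previous row
       LEAD(amount) OVER (ORDER BY month) AS next_month_amount                    -- NULL on the last row
FROM monthly_sales;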
[[CS4221 Window Functions Practice]]
The ROWS clause limits the frame by specifying a fixed number of rows that precede or follow the current row.
The RANGE clause (only for numeric and date/time offsets), on the other hand, limits the rows logically by value; it specifies the range of values relative to the value of the current row.
ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING
ORDER BY price
RANGE BETWEEN 5.0 PRECEDING AND CURRENT ROW
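Sketch combining the two frame clauses above, assuming a prices(sale_date, price) table (RANGE with a numeric offset needs PostgreSQL 11+):
SELECT sale_date, price,
       AVG(price) OVER (ORDER BY sale_date
                        ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING) AS avg_of_nearby_rows,
       AVG(price) OVER (ORDER BY price
                        RANGE BETWEEN 5.0 PRECEDING AND CURRENT ROW) AS avg_within_5_below
FROM prices;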
The OVER(...) clause changes the aggregate to be computed over the ROWS / RANGE frame specified.
An empty OVER() clause makes the window span the entire table.
+-----------+-------+---------+
| badge_num | name | surname |
+-----------+-------+---------+
| 1 | John | Smith |
| 2 | Mark | Pence |
| 3 | Steve | Smith |
| 4 | Bob | Smith |
+-----------+-------+---------+
SELECT surname, COUNT(*)
FROM employees
GROUP BY surname;
+---------+----------+
| surname | COUNT(*) |
+---------+----------+
| Smith | 3 |
| Pence | 1 |
+---------+----------+
-- empty OVER() clause
SELECT surname, COUNT(*) OVER()
FROM employees
GROUP BY surname;
+---------+-----------------+
| surname | COUNT(*) OVER() |
+---------+-----------------+
| Smith | 2 |
| Pence | 2 |
+---------+-----------------+
https://stackoverflow.com/a/61254720
Window functions can only be used in SELECT and ORDER BY :/
SELECT c.name, e.continent,
ROW_NUMBER() OVER w_continent AS rank_continent,
ROW_NUMBER() OVER w_world AS rank_world
FROM country c INNER JOIN encompasses e ON c.code=e.country
-- define the window separately
WINDOW
w_continent AS (PARTITION BY e.continent ORDER BY (c.population*e.percentage) DESC),
w_world AS (ORDER BY c.population DESC)
ORDER BY rank_world;
With an ORDER BY in the window, the default frame clause is: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
Without an ORDER BY (e.g. a plain OVER() in the SELECT), the default frame clause is: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
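Sketch showing the two defaults side by side on the stocks table used earlier:
SELECT w_id, i_id, s_qty,
       SUM(s_qty) OVER (ORDER BY i_id) AS running_total,  -- ORDER BY present: frame defaults to UNBOUNDED PRECEDING .. CURRENT ROW
       SUM(s_qty) OVER ()              AS grand_total     -- no ORDER BY: frame defaults to the whole partition
FROM stocks;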
NOTE: integers will perform integer division! You have to convert one of them to decimal first.
select 1::decimal / 3;
Percentage based on condition
select
t.request_at as Day,
sum(case when t.status = 'cancelled_by_driver' or t.status = 'cancelled_by_client' then 1 else 0 end)::decimal / count(*) as "Cancellation Rate"
from trips t
inner join users d on t.driver_id = d.users_id
inner join users c on t.client_id = c.users_id
where d.banned = 'No' and c.banned = 'No'
and t.request_at between '2013-10-01' and '2013-10-03'
group by t.request_at
;
All X where all Y (e.g. all customers who bought all products)
select
distinct c1.customer_id
from customer c1
where not exists (
select 1
from product p1
where not exists (
select 1
from customer c2
where c1.customer_id = c2.customer_id and c2.product_key = p1.product_key
)
)
;
OLTP = Online Transaction Processing
OLAP = Online Analytic Processing
It is not practical to do both OLTP and OLAP on the same database system!
Book: "The Data Warehouse Toolkit Third Edition: The Definitive Guide to Dimensional Modeling" by Ralph Kimball and Margy Ross
Designing a data warehouse requires one to think as half a (1) administrator, and half a (2) business analyst.
Crudely split into two forms:
If a data warehouse is just a copy of the operational system, the usability and performance is heavily affected.
Would be ideal if source systems were reengineered with consistent views
Dominated by simple operations like sorting + sequential processing
Possible to use normalised relational DB to handle data staging, but this would mean you have to ETL again into the final data warehouse. If doing this, these tables must be hidden from the end user!
This is the area that the business community has access to.
It is a series of integrated data marts. Each data mart presents the data from a single business process.
Normalised models (3NF, BCNF, ...) are too complicated for data warehouse queries!
A dimension model contains the same information as a normalised model, but packages the data in a format whose design goals are: user understandability, query performance, and resilience to change.
Bus Network is the best architecture for being robust and integrated.
Star Schema: one fact table + several dimension tables
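A minimal star-schema sketch (a hypothetical retail example; all table and column names are assumptions):
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date DATE, month INT, year INT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, name TEXT, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim,
    product_key   INTEGER REFERENCES product_dim,
    store_key     INTEGER REFERENCES store_dim,
    quantity      INTEGER,          -- additive fact
    dollar_amount NUMERIC(12,2),    -- additive fact
    PRIMARY KEY (date_key, product_key, store_key)
);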
Approximately 80 to 90 percent of the potential users will use pre-built applications to interface with the data, i.e. won't be writing relational queries.
Additivity is a crucial property of a fact: the rows of the column can be added together (e.g. dollars)
Semi-additive facts can be added only along some dimensions
Non-additive facts cannot be added at all: must use counts or averages to summarise the facts (or print billions of rows .-.)
Some non-additive facts like percentages and ratios can be solved by storing the numerator and denominators separately
The most useful facts are numeric and additive.
In most cases, it is a description of something, and is from a discrete list (treat as enum). Should be placed into dimensions for better correlation + space-saving
Unless the text is unique for every row in the fact table, it belongs in the dimension table.
A truly textual fact (e.g. user input text) is rare because it is almost impossible to analyse (...but is this still true for modern text analysis and ML analysis?)
Fact tables typically take up 90+% of the total space (many rows, but few columns) => should be careful
Must avoid null keys
Most business processes can be represented with less than 15 dimensions (if you have more than 25, look to combine the dimensions!) Perfectly correlated attributes should be in the same dimension
Use surrogate keys (autogenerated meaningless keys). Except date dimension! (let Jan 1 of first year be index value of 1, etc...). This allows for partitioning of the fact table by date
not uncommon to have 50 to 100 attributes!
typically relatively small number of rows (far fewer than 1 million rows)
Serves as primary source of query constraints, groupings, report labels
If a price changes, use a new surrogate key, thus preserving historical data!
use lowest possible grain because: even though nobody wants to see individual low-level rows, queries need to cut through the details in very precise ways
Even 10 years worth of dates is only about 3650 rows (very small!)
"Date and time are almost completely independent." By combining them, the number of rows increase a lot. (~5million rows for 10 years)
This is a causal dimension because it describes factors thought to cause a change in product sales
Managers are interested in determining if promotions are effective
"The various possible causal conditions are highly correlated. A temporary price reduction usually is associated with an ad and perhaps an end-aisle display. Coupons often are associated with ads. For this reason, it makes sense to create one row in the promotion dimension for each combination of promotion conditions that occurs."
It is possible to separate different causal mechanisms (price reductions, ads, displays, coupons) into separate dimensions (there are tradeoffs)
How to capture products that did not sell during promotions?? (there would be no rows) "In the case of the promotion coverage fact table, we’d load one row in the fact table for each product on promotion in a store each day (or week, since many retail promotions are a week in duration) regardless of whether the product sold or not." This is a factless fact table.
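A sketch of such a factless coverage table and the "on promotion but did not sell" question (names assumed; sales_fact refers to the star-schema sketch above):
CREATE TABLE promotion_coverage_fact (
    date_key    INTEGER,
    product_key INTEGER,
    store_key   INTEGER,
    promo_key   INTEGER,
    PRIMARY KEY (date_key, product_key, store_key, promo_key)
);  -- one row per product on promotion, per store, per day; no numeric measures
SELECT pc.date_key, pc.store_key, pc.product_key
FROM promotion_coverage_fact pc
WHERE NOT EXISTS (
    SELECT 1
    FROM sales_fact s
    WHERE s.date_key    = pc.date_key
      AND s.store_key   = pc.store_key
      AND s.product_key = pc.product_key
);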
How to store the time dimension?
- Should it be just stored as an attribute?
- Should it be stored with the date, under a datetime dimension?
- Should it be stored as its own time dimension?
What is the right granularity for the fact table?
"We need the most finely grained data in the presentation area so that users can ask the most precise questions possible. Because users’ requirements are unpredictable and constantly changing, we must provide access to the exquisite details so that they can be rolled up to address the questions of the moment."
"The secret to query flexibility is building the fact tables at the most granular level."
EXPLAIN command
pg_statistic, pg_stats: check whether the frequencies / statistics make sense
pg_stats is a view of pg_statistic
ANALYZE;
VACUUM / VACUUM FULL (defragmentation), ANALYZE (gather and update statistics)
The pg_stats view is defined as:
SELECT n.nspname AS schemaname,
c.relname AS tablename,
a.attname,
s.stainherit AS inherited,
s.stanullfrac AS null_frac,
s.stawidth AS avg_width,
s.stadistinct AS n_distinct,
CASE
WHEN (s.stakind1 = 1) THEN s.stavalues1
WHEN (s.stakind2 = 1) THEN s.stavalues2
WHEN (s.stakind3 = 1) THEN s.stavalues3
WHEN (s.stakind4 = 1) THEN s.stavalues4
WHEN (s.stakind5 = 1) THEN s.stavalues5
ELSE NULL::anyarray
END AS most_common_vals,
CASE
WHEN (s.stakind1 = 1) THEN s.stanumbers1
WHEN (s.stakind2 = 1) THEN s.stanumbers2
WHEN (s.stakind3 = 1) THEN s.stanumbers3
WHEN (s.stakind4 = 1) THEN s.stanumbers4
WHEN (s.stakind5 = 1) THEN s.stanumbers5
ELSE NULL::real[]
END AS most_common_freqs,
CASE
WHEN (s.stakind1 = 2) THEN s.stavalues1
WHEN (s.stakind2 = 2) THEN s.stavalues2
WHEN (s.stakind3 = 2) THEN s.stavalues3
WHEN (s.stakind4 = 2) THEN s.stavalues4
WHEN (s.stakind5 = 2) THEN s.stavalues5
ELSE NULL::anyarray
END AS histogram_bounds,
CASE
WHEN (s.stakind1 = 3) THEN s.stanumbers1[1]
WHEN (s.stakind2 = 3) THEN s.stanumbers2[1]
WHEN (s.stakind3 = 3) THEN s.stanumbers3[1]
WHEN (s.stakind4 = 3) THEN s.stanumbers4[1]
WHEN (s.stakind5 = 3) THEN s.stanumbers5[1]
ELSE NULL::real
END AS correlation,
CASE
WHEN (s.stakind1 = 4) THEN s.stavalues1
WHEN (s.stakind2 = 4) THEN s.stavalues2
WHEN (s.stakind3 = 4) THEN s.stavalues3
WHEN (s.stakind4 = 4) THEN s.stavalues4
WHEN (s.stakind5 = 4) THEN s.stavalues5
ELSE NULL::anyarray
END AS most_common_elems,
CASE
WHEN (s.stakind1 = 4) THEN s.stanumbers1
WHEN (s.stakind2 = 4) THEN s.stanumbers2
WHEN (s.stakind3 = 4) THEN s.stanumbers3
WHEN (s.stakind4 = 4) THEN s.stanumbers4
WHEN (s.stakind5 = 4) THEN s.stanumbers5
ELSE NULL::real[]
END AS most_common_elem_freqs,
CASE
WHEN (s.stakind1 = 5) THEN s.stanumbers1
WHEN (s.stakind2 = 5) THEN s.stanumbers2
WHEN (s.stakind3 = 5) THEN s.stanumbers3
WHEN (s.stakind4 = 5) THEN s.stanumbers4
WHEN (s.stakind5 = 5) THEN s.stanumbers5
ELSE NULL::real[]
END AS elem_count_histogram
FROM (((pg_statistic s
JOIN pg_class c ON ((c.oid = s.starelid)))
JOIN pg_attribute a ON (((c.oid = a.attrelid) AND (a.attnum = s.staattnum))))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
WHERE ((NOT a.attisdropped) AND has_column_privilege(c.oid, a.attnum, 'select'::text) AND ((c.relrowsecurity = false) OR (NOT row_security_active(c.oid))));
EXPLAIN shows costs like 849.87..849.88 (startup cost .. total cost).
The cost is in arbitrary units, but is roughly proportional to the actual execution time. More importantly, it is used to relatively compare the different plans.
NOTE: the query planner is re-planning the query every time a query is made. This is required because the data / stats will keep changing. Even with PREPARE
statements, the query planner may re-plan the queries (for modern postgres versions).
Calling ANALYZE; (different from EXPLAIN (ANALYZE)) collects statistics, but while it is running it will affect the performance of other queries. It must be called manually. Future optimisations: use reinforcement learning to estimate the best times to run statistics collection?
EXPLAIN (ANALYZE)
Creates the query plan AND actually executes it.
This allows it to include information from an actual execution to detect bad estimations.
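A sketch on the stocks table from earlier; BUFFERS additionally reports block I/O:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM stocks
WHERE s_qty > 600;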
NOTES:
- Multiply the reported actual time by loops to get the total actual time.
- WHERE conditions like 1=1, 22=11*2, ... may or may not be simplified away; this depends heavily on the optimiser.
-- UNIQUE checks for duplicate values
CREATE [ UNIQUE ] INDEX [ name ] ON table_name
[ USING method ] -- btree (default), hash, gist, ...
( { columnname | ( expression ) } )
[ WHERE predicate ]
What internal SQL queries are called to check if a value (to be added) is unique?
Order of attributes in index matter (for B+ tree indexes) for multicolumn indexes!!
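Sketch: with a B+ tree index on (w_id, i_id), a filter on the leading column can use the index, while a filter on i_id alone generally cannot use it efficiently (index name here is made up):
CREATE INDEX stocks_w_id_i_id_demo ON stocks(w_id, i_id);
EXPLAIN SELECT * FROM stocks WHERE w_id = 1;  -- index scan possible (leading column)
EXPLAIN SELECT * FROM stocks WHERE i_id = 1;  -- likely a sequential scan (non-leading column only)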
GIN (Generalized Inverted Index) is for text searching.
PostgreSQL has partial indexes, i.e. only index the rows that satisfy some predicate
-- partial index example
CREATE INDEX stocks_s_qty ON stocks(s_qty) WHERE s_qty>600;
In PostgreSQL, indexes do not affect how the data is stored in the file (e.g. its order). One reason: there can be multiple indexes, so which one should the data be ordered by? Some DBs like Oracle can do this (index-organised tables).
But there is CLUSTER <table> USING <index>. This is useful for data analytics. If the table is updated, this operation has to be run again; PostgreSQL doesn't automatically re-cluster, so it isn't that useful for transactional DBs.
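Sketch, reusing the i_i_price index created earlier (CLUSTER requires a full, non-partial index):
CLUSTER items USING i_i_price;
-- must be re-run manually after the table changes; PostgreSQL will not keep it clustered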
PostgreSQL lets you add extra attributes to an index, e.g. an index on (w_id, i_id) can have the value of s_qty included:
create index stocks_w_id_i_id on stocks(w_id, i_id) include (s_qty);
-- it allows this query to be index-only:
select s_qty
from stocks
where w_id = 1 and i_id = 1;
^ these ratios are always changing (and dependent on many factors)
Clustering the table using an index can also change the scan used! (but not very practical outside of data analytics context)
Temporary tables do not have indexes when they're copied. PostgreSQL doesn't make statistics for them.
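Sketch: since autovacuum never analyzes temporary tables, collect the statistics manually (table name assumed):
CREATE TEMP TABLE low_stock AS
SELECT * FROM stocks WHERE s_qty < 10;
ANALYZE low_stock;  -- gives the planner row counts / histograms for the temp table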
Nested-loop joins are, at worst, block-based, not row-based and not page-based.
ANALYZE can be used to gather and update statistics.
IN, EXISTS, = ANY, <> ALL, NOT IN, OUTER JOIN are optimised fairly differently (even though they're fairly similar).
PostgreSQL sucks at NOT IN (try using EXISTS or something else for better performance).
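Sketch of the same anti-join written both ways, using the items/stocks tables from earlier (NOT EXISTS is usually planned better and handles NULLs more predictably):
-- NOT IN version
SELECT i.i_id
FROM items i
WHERE i.i_id NOT IN (SELECT s.i_id FROM stocks s);
-- NOT EXISTS version
SELECT i.i_id
FROM items i
WHERE NOT EXISTS (SELECT 1 FROM stocks s WHERE s.i_id = i.i_id);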
Join Order is chosen by the optimiser.
basically, try to give the DBMS as much information as possible in order to let it plan better
Views are just used as a subquery, it is purely for convenience, 0 performance difference.
CREATE VIEW vall AS
SELECT *
FROM warehouses w
NATURAL JOIN stocks s
NATURAL JOIN items i
;
-- still needs to join!!
EXPLAIN ANALYZE SELECT * FROM vall v
WHERE v.w_name = 'Agimba'
;
Middle ground between normalised and denormalised schema. Updates are costly, but they are manually triggered using REFRESH (or using automatic triggers). So, the cost can be controlled.
Unfortunately, PostgreSQL currently does not do any optimisation to propagate updates (the original SELECT query is run again; i.e. it does not store any partial state to only update the changed rows).
There is an actual table being created with the results of that query.
CREATE MATERIALIZED VIEW mvall AS
SELECT *
FROM ...
;
-- MUST manually refresh
REFRESH MATERIALIZED VIEW mvall;
Materialised views can be indexed too.
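For example (column name assumed to exist in the materialised view):
CREATE INDEX mvall_w_name ON mvall(w_name);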
Need to replan the query plan every time because data can change between queries.
Even so, PostgreSQL still runs the optimiser and tries to see if there's a better plan (if there are changes to the DB -> data / indexes / stats).
PREPARE q AS
SELECT s.i_id
FROM stocks s
WHERE s.s_qty > 500;
EXPLAIN ANALYZE EXECUTE q;
DEALLOCATE q;
Some DBMSs (e.g. MariaDB) let you use hints (e.g. tell the DBMS to use a certain index). But, not recommended unless the statistics don't change.
SELECT *
FROM items i
USE INDEX (i_i_price) WHERE i.i_price < 100
;
Other hints: IGNORE INDEX, FORCE INDEX
-- force a certain JOIN order
SELECT *
FROM warehouses w STRAIGHT_JOIN stocks s ON ...
;
https://mariadb.com/kb/en/optimizer-hints/
PostgreSQL doesn't have this.
Tuning knobs / maintenance operations:
- work_mem
- VACUUM, CLUSTER, VACUUM FULL, reindexing
- CREATE INDEX
- ANALYZE
Hard-tune the queries as a last resort, and at every user's current and future risk.
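A sketch of session-level tuning with work_mem (the value is arbitrary):
SET work_mem = '256MB';  -- affects only this session, not postgresql.conf
EXPLAIN (ANALYZE) SELECT * FROM stocks ORDER BY s_qty;  -- check whether the sort now stays in memory
RESET work_mem;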
Very natural way to represent the world.
ER diagrams can provide a global view of the schemas.
Null-ary relationship (relationship set with 0 entity sets) is degenerate: it is either the empty set or a single term. Seldom, if ever, used.
ER diagrams can be automatically converted to SQL DDL tables, except:
UNIQUE NOT NULL
Key is the combination of the keys of all entity sets involved. These individual keys are also foreign keys to the entity sets.
Exceptions:
The stuff generated by DBMS tools are Logical Diagrams representing the implemented models, NOT ER Diagrams!!
<!-- XML declaration in the prologue -->
<?xml version="version_number"
encoding="encoding_declaration"
standalone="standalone_status" ?>
XML document is well-formed if it complies with the XML basic rules. (not DTD!!)
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Can use XML in (some) relational DBMSs by using the XML domain.
http://exist-db.org/exist/apps/homepage/index.html
An XML document is a labelled (there are annotations attached to each node), unranked (no a priori bound on the number of children of a node), ordered (there is an order between children of a node) tree.
Logic might differ based on programming language / library :/
&lt; : <
&gt; : >
&amp; : &
&apos; : '
&quot; : "
More can be defined in the DTD too.
<p:custDataLst></p:custDataLst>
The namespace prefix is p; the element name is custDataLst.
The namespace itself is bound to a URI, declared near the start of the XML document (e.g. with an xmlns:p="..." attribute).
aka CDATA (Character Data) section
Anything in this section will be treated as regular text, not markup.
<![CDATA[
<!-- freely use symbols like <, > etc here -->
if ((i < 5) && (j > 6))
printf("error");
]]>
An alternative is to use entity references like &lt; instead.
CData explanation: https://stackoverflow.com/a/2784200
XML is schema-less by default, but you can include a Document Type Definition (DTD) in the prologue.
<!DOCTYPE email [
<!ELEMENT email (header, body)>
<!ELEMENT header (from, to, cc?)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT cc (#PCDATA)>
<!ELEMENT body (paragraph*)>
<!ELEMENT paragraph (#PCDATA)>
]>
<email>
<header>
<from>x@gmail.com</from>
<to>y@gmail.com</to>
</header>
<body />
</email>
More modern alternative to DTD.
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string" minOccurs='1' maxOccurs='1'/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
Navigation for the XML tree; comparable to a simple SELECT query from one table.
Every single XML tree starts with a special root node, which has ONE single child node: the actual root node.
Absolute paths start with a / and are evaluated from the root node (in the shorthand syntax the first step then selects children of the root). Relative paths don't start with a /; they are evaluated from the current (context) node.
Shorthand syntax is sometimes ambiguous.
<!-- full syntax -->
/child::mondial/child::country
<!-- shorthand syntax (BANNED in cs4221) -->
/mondial/country
axis::nodetest[predicate]
following: AFTER the closing of the element
preceding: BEFORE the opening of the element
These follow document-order semantics, ignoring the tree.
[attribute::car_code]: element has an attribute car_code
[attribute::car_code="AL"]: element has an attribute car_code of value AL
Comparisons: >, <, <=, >=, !=
[child::country]: element has a child element country (can be a path)
[child::country="Singapore"]: element has a child element country with value "Singapore" (can be a path)
| is a union-kinda operator: /descendant::D | /descendant::C returns descendants that are D or C
Boolean operators: and, or, not()
Predicates can be used along any path.
count(<xpath expression>)
node::name()
node::text()
(XQuery has more functions)
"JOIN" kind of query
XQuery 1.0 is a strict superset of XPath 2.0
FLWOR expressions
- for: selects a sequence of nodes
- let: binds a sequence to a variable
- order by <expression> [ascending / descending]: sort
- where: filter
- group by <expression>: group by
- return: returns the result (gets evaluated once for every node)
Note that XQuery performs automatic type casting.
<!-- for -->
<results>
{
for $x in doc("example.xml")/child::R/child::A/child::*
order by data($x/attribute::a) descending
where $x/child::text() > 2
return <result>{$x/attribute::a}</result>
}
</results>
<!-- let -->
<results>
{
let $x := doc("example.xml")/child::R/child::A/child::*
return <result>{$x}</result>
}
</results>
<!-- group by -->
<results>
{
for $x in doc("example.xml")/descendant::*
let $y := name($x)
group by $y
return <result><element>{$y}</element><count>{count($x)}</count></result>
}
</results>
<!-- join -->
<results>
{
for $x in doc("example.xml")/child::R/child::A/child::*
for $y in doc("example.xml")/child::R/child::A/child::*
where name($x) = name($y) and $x < $y
return <result>{$x} {$y} </result>
}
</results>
Functions (some available in XPath too):
- count(<expr>)
- node::name()
- node::text()
- max(<expr>)
- distinct-values(<expr>), e.g. distinct-values(doc("data.xml")/child::restaurants/...)
- <expr> div <expr>: division, don't use /; e.g. 4 div 2 = 2
Extensible Stylesheet Language Transformations
(not tested)
XSLT uses a subset of Xpath
Interactive / transformative language. Used as a general purpose XML processing language.
if two tuples agree on X-values, they must agree on the Y-values.
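A standard formal statement of this (r is an instance of relation schema R, and X, Y ⊆ R):
$$ X \to Y \text{ holds in } r \iff \forall t_1, t_2 \in r:\ t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y] $$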
Compact Minimal Cover = Canonical Cover
Minimal conditions:
Every set of functional dependencies has a minimal cover (also known as a minimal basis).
Every set of functional dependencies has a compact minimal cover.
Compact conditions:
Algorithm (standard minimal-cover construction): split every right-hand side into single attributes; remove extraneous attributes from each left-hand side; remove redundant functional dependencies.
(informal definition)
Each X-value in the relation is consistently associated with one set of Y-values.
A generalisation of the foreign key constraint.
Instances r of R and s of S satisfy the inclusion dependency R[X] ⊆ S[Y] if (informal definition):
For all tuples t in r, there exists a tuple u in s whereby t[X] = u[Y].
Trivial if both sides are the same relation and attribute sequence, i.e. R[X] ⊆ R[X].
These 3 (above) form a sound and complete system.
These 5 (above) form a sound and complete system.
Used to maintain functional dependencies
important notes:
5NF ⊂ 4NF ⊂ BCNF ⊂ 3NF ⊂ 2NF ⊂ 1NF (each normal form is stricter than the one to its right).
Non-prime attributes are fully dependent on each candidate key.
2NF ⟺ for all functional dependencies X → A in Σ+, either A is a prime attribute or X is not a proper subset of a candidate key.
Fixes BCNF's issue of not preserving some FDs
3NF ⟺ for all functional dependencies X → A in Σ+, either X → A is trivial, X is a superkey, or A is a prime attribute.
* most 3NF is also BCNF
No attribute is transitively dependent on any candidate key.
BCNF ⟺ for all functional dependencies X → Y in Σ+, either X → Y is trivial or X is a superkey.
4NF ⟺ for all multi-valued dependencies X ↠ Y in Σ+, either X ↠ Y is trivial or X is a superkey.
5NF ⟺ for all join dependencies ⋈(R1, …, Rn), where R1 ∪ … ∪ Rn = R: either the join dependency is trivial (some Ri = R) or every Ri is a superkey of R.
Decomposition & Synthesis Method uses only FD => might not match the entities
A probably better way (that is used in practice):
The natural join of all fragments is equivalent to the original relation.
The decomposition is lossless if the full outer join of the two tables on the primary = foreign key reconstitutes the original table.
Otherwise, lossy join (too many rows OR missing rows)
To check, use: for a binary decomposition of R into R1 and R2, the decomposition is lossless if (R1 ∩ R2) → R1 ∈ Σ+ or (R1 ∩ R2) → R2 ∈ Σ+.
For a fragment Ri, its functional dependencies are the projection σRi(Σ+) = { X → Y ∈ Σ+ | X ∪ Y ⊆ Ri }.
* note the Σ+ (project from the closure, not from Σ itself)
To check, use: the decomposition is dependency-preserving if (σR1(Σ+) ∪ … ∪ σRn(Σ+))+ = Σ+.
Works for any of the normal forms.
First, compute Σ+ and the candidate keys.
Let X → Y be a functional dependency in Σ+ that violates the normal form. Decompose R into 2 fragments: R1 = X ∪ Y and R2 = X ∪ (R \ Y).
Then, check whether the projected functional dependencies are in BCNF. Repeat if necessary.
Guarantees
Let X ↠ Y be a multi-valued dependency that violates 4NF. Decompose R into 2 fragments: R1 = X ∪ Y and R2 = X ∪ (R \ Y).
Any relation can be non-loss decomposed into an equivalent collection of 4NF relations.
Guarantees
Returns definite yes or definite no.
Chase Algorithm is sound and complete.
For functional and multi-valued dependencies, the algorithm always terminates.
If the answer is no, the final state produced is the counter-example to what's being tested.
Only works for binary decomposition
Checking X -> Y: chase until all variables in column Y are distinguished (in all rows).
Checking X ->> Y: chase until there's a row of distinguished variables.
Checking a lossless-join decomposition: chase until there's a row of all distinguished variables.