Database Systems

Summary

Summary

Table of Contents

1. Databases

Data vs Information

File processing systems vs databases

File processing systems

Database systems

2. Database development lifecycle

database development lifecycle

3. Conceptual Design

Entities and relationships

Constraints

Weak Entity

Ternary relationships

ternary-relationship

Attributes

4. Translating ER diagrams

Relational Data Model

Keys

Primary key

Foreign key

Integrity constraints

ER to Relational model

db-design-cycle

multi-valued-attribute

Example

Conceptual design associative-entity

Logical design

Employee(ssn PK, name, age)
Department(did PK, dname, budget)
Works_In(ssn PFK, did PFK, since)

Physical Design

Employee(ssn CHAR(11) PK, 
         name VARCHAR(20), 
         age INTEGER)
Department(did INTEGER PK, 
           dname VARCHAR(20),
           budget DECIMAL(10, 2))
Works_In(ssn CHAR(11) PFK, 
         did INTEGER PFK, 
         since DATE)

Implementation

CREATE TABLE Employee (
    ssn CHAR(11), 
    name VARCHAR(20),
    age INTEGER,
    PRIMARY KEY (ssn)
);

CREATE TABLE Department (
    did INTEGER,
    dname VARCHAR(20),
    budget DECIMAL(10, 2),
    PRIMARY KEY (did)
);

CREATE TABLE Works_In (
    ssn CHAR(11),
    did INTEGER,
    since DATE
    PRIMARY KEY (ssn, did),
    FOREIGN KEY (ssn) REFERENCES Employee,
    FOREIGN KEY (did) REFERENCES Department
);

Example: key and participation constraints

Conceptual model example-er-model

Implementation

CREATE TABLE Department (
    did INTEGER, 
    dname VARCHAR, 
    budget DECIMAL(10, 2),
    since DATE, 
    ssn CHAR(11) NOT NULL,
    PRIMARY KEY (did),
    FOREIGN KEY (ssn) REFERENCES Employee
        ON DELETE NO ACTION
);

Example: Weak Entity

Logical design

Dependent(dname PK, age, cost, ssn PFK)

Implementation

CREATE TABLE Dependent (
    dname CHAR(20) NOT NULL,
    age INTEGER NULL,
    cost DECIMAL(7, 2) NOT NULL,
    ssn CHAR(11) NOT NULL,
    PRIMARY KEY (dname, ssn),
    FOREIGN KEY (ssn) REFERENCES Employee
        ON DELETE CASCADE
);

5. MySQL Workbench

Conceptual to logical checklist

  1. Flatten composite and multi-valued attributes, or introduce lookup table for multi-valued attributes when number of values is unknown or variable.
  2. Resolve many-many relationships
  3. Add foreign keys at crows foot end of relationship (many side)

Convert from Logical

Binary relationships

Unary relationships

Ternary relationship

6. Relational Algebra

Basic Operations

Compound Operators

8, 9. SQL

SQL

SELECT statement overview

# List columns/expressions returned from query
SELECT [ALL|DISTINCT] select_expr [, select_expr ...]
# tables from which data is obtained
FROM table_references
# filtering conditions
WHERE where_condition
# categorisation of results
GROUP BY { col_name | expr } [ASC | DESC], ...
# filtering conditions for categories.  Can only be used with a GROUP BY
HAVING where_condition
# sort
ORDER BY {col_name | expr | position} [ASC | DESC]
# limit which rows are returned
LIMIT {[offset, ] row_count | row_count OFFSET offset}];

LIKE Clause

LIKE "<regex>"

e.g.

SELECT * FROM Customer
# any values starting with "a"
WHERE CustomerName LIKE 'a%'
# any values starting with "a", at least 3 values long
WHERE CustomerName LIKE 'a_%_%'
# any values starting with "a", ends with "o"
WHERE CustomerName LIKE 'a%o'

Aggregate Functions

MySQL Group By Functions

Order By

ORDER BY XXX ASC/DESC

Joins

# Cross product
SELECT * FROM R1, R2;

# Natural join: join tables on keys.  Attributes must have same name
SELECT * 
FROM R1 NATURAL JOIN R2;

# Inner join: only retain matching rows
SELECT *
FROM R1 INNER JOIN R2
    ON R1.id = R2.id;
k
# Outer join:: left/right; includes records that don't match from the outer table
SELECT *
FROM R1 LEFT OUTER JOIN R2
    ON R1.id = R2.id;

Comparison and logical operators

Set operators

Other Statements

INSERT Statement

# insert single record
INSERT INTO NewEmployee SELECT * FROM Employee;

# insert multiple records
INSERT INTO Employee VALUES
    (DEFAULT, "A", "address A", "2012-02-12", NULL, "S"),
    (DEFAULT, "B", "address B", "2012-02-12", NULL, "R");

# insert specific columns
INSERT INTO Employee
    (Name, Address, DateHired, EmployeeType)
    VALUES
        ("D", "address D", "2012-02-12", "C"),
        ("E", "address E", "2012-02-12", "C");

UPDATE

UPDATE Salaried
    SET AnnualSalary = 
        CASE
            WHEN AnnualSalary <= 100000
            THEN AnnualSalary * 1.1
            ELSE AnnualSalary * 1.05
        END;

REPLACE

DELETE

DELETE FROM Employee
    WHERE Name = "Grace";

Views

CREATE VIEW nameOfView AS validSqlStatement

More DDL

ALTER

ALTER TABLE TableName ADD AttributeName AttributeType;
ALTER TABLE TABLENAME DROP AttributeName;

RENAME

RENAME TABLE CurrentName TO NewName;

Approach for writing SQL

  1. Use design as a map to help format queries
  2. Create skeleton SELECT statement as a template
  3. Fill in parts to build query

10. Storage and Indexing

Files in DBMS

Indexes

unclustered-index

Clustered vs Unclustered

clustered-index-zoom

unclustered-index-zoom

Primary vs Secondary

Single Key vs Composite Key

Index Type

hash-based-index

tree-based-index

11. Query Processing

Query Processing

Relational Operations

Selection

Projection

External Merge Sort

external-merge-sort

Sort-based Projection

Cost =  ReadTable +             # read entire table, keep projected attributes
        WriteProjectedPages +   # write back to disk 
        SortingCost +           # external merge sort 
        ReadProjectedPages      # discard adjacent duplicates

WriteProjectedPages = NPages(R) * PF
SortingCost = 2*NumPasses*ReadProjectedPages

Hash-based projection

projection-hashing

Cost = ReadTable + WriteProjectedPages + ReadProjectedPages

12. Query Processing II

Joins

Simple nested loop join

for each tuple r in R:
  for each tuple s in S:
    if ri = sj:
       add <r, s> to the result
Cost(SNLJ) = NPages(Outer) + NTuples(Outer) * NPages(Inner)

Page-Oriented Nested Loop Join

for each page b_r in R:
  for each page b_s in S:
    for each tuple r in b_r:
      for each tuple s in b_s:
        if r_i = s_j, add <r,s> to the result
Cost(PNLJ) = NPages(Outer) + NPages(Outer) * NPages(Inner)

Block Nested Loop Join

for each block in R:
  for each page in S:
    for each matching tuple r in R-block, s in S-page:
      add `<r, s>` to result
\[\text{NBlocks(Outer)} = \lceil\frac{\text{NPages(Outer)}}{B-2}\rceil\]
Cost(BNLJ) = NPages(Outer) + NBlocks(Outer)*NPages(Inner)

Sort-Merge Join

Cost(SMJ) = Sort(Outer) + Sort(Inner)     # Sort inputs
          + NPages(Outer) + NPages(Inner) # Merge inputs

Sort(R) = external sort cost = 2*NumPasses*NPages(R)

Hash Join

hash-join

Cost(HJ) = 2*NPages(Outer) + 2*NPages(Inner)  # Create partitions
            + NPages(Outer) + NPages(Inner)   # Match partitions

General Joins

13. Query Optimisation

Query plan

query-plan

Query optimisation steps

  1. break query into blocks
  2. convert block to relational algebra
  3. consider alternative query plans for each block
  4. select plan with lowest estimated cost

Break query into blocks

Relational algebra equivalences

Cascade

\[\sigma_{c_1 \wedge ... \wedge c_n}(R) = \sigma_{c_1}(...(\sigma_{c_n}(R)))\]

Commute

\[\sigma_{c_1}(\sigma_{c_2}(R)) = \sigma_{c_2}(\sigma_{c_1}(R))\]

Cascade

\[\pi_{a_1}(R) = \pi_{a_1}(...(\pi_{a_n}(R)))\]

Where $a_i$ is a set of attributes of R and $a_i \subseteq a_i+1$ for $i = 1..n-1$

Associative \(R\bowtie (S \bowtie T) = (R\bowtie S)\bowtie T\)

Commutative \(R\bowtie S = S\bowtie R\)

Statistics and Catalogs

Cost Estimation

\[\text{ResultSize} = \Pi_{j=1...k}\text{NTuples}(R_j)\Pi_{i=1...n} RF_i\]

Reduction Factors

Condition Reduction Factor
col = value $\frac{1}{\text{NKeys(col)}}$
col > value $\frac{\text{High(col)}-\text{value}}{\text{High(col)}-\text{Low(col)}}$
col < value $\frac{\text{value}-\text{Low(col)}}{\text{High(col)}-\text{Low(col)}}$
Col_A = Col_B (join) $\frac{1}{\max{(\text{NKeys}(Col_A), \text{NKeys}(Col_B))}}$
no info Use magic number 1/10

14. Query Optimisation 2

Single-relation plans

Multi-relation plans

  1. select order of relations
  2. select join algorithm for each join
  3. for each input relation select access method (heap scan, …)

left-deep-join-tree

15. Normalisation

Motivation for Normalisation

denormalised-data

Functional dependency

Consider A(X PK, Y PK, Z, D):

Steps in Normalisation

Wikipedia: normalisation

Normalisation vs Denormalisation

16. Database Administration

Capacity Planning

Transaction load

Storage requirements

Backup and Recovery

Motivation

Failure

Some failure categories include:

Backup Taxonomy

Physical vs Logical

Online vs Offline Backup

Full vs Incremental Backup

Onsite vs Offsite Backup

Backup Policy

backup-policy

17. Transactions

Transactions

transaction: logical unit of work

Defining a unit of work

START TRANSACTION;  # also BEGIN
  # <SQL statements>
COMMIT;          # commit whole transaction
ROLLBACK;       # undoes everything

ACID: Transaction properties

Concurrent Access

lost-update-problem

uncommitted-data-problem

Serialisability

Concurrency Control Methods

Locking

Lock Granularity

Lock types

Deadlock

Timestamping

Optimistic

Transaction Log

18. Data Warehousing

Data Warehousing

Motivation

Analytical Queries

DW Characteristics

data-warehouse-architecture

Dimensional Modelling

e.g. How much revenue did product G generate in the last 3 months, broken down by month for south easter sales region, by individual stores, broken down by _promotions, compared to estimates and to the previous product version

Dimensional model

fact-table

star-schema

Designing a dimensional model

  1. choose a business process (subject)
  2. choose measured facts (usually numeric, additive quantities)
  3. choose granularity of fact table
  4. choose dimensions
  5. complete dimension tables

Embedded Hierarchies

embedded-hierarchy

Snowflake schema

snowflake-schema

Denormalisation

19. Distributed Databases

Distributed Databases

distributed-database

cluster-manager

Advantages

workload-distributed

Disadvantages

Objectives and Trade-offs

Distribution Options

Data Replication

Synchronous Update

Asynchronous Update

Horizontal Paritioning

Vertical partitioning

Comparing configurations

comparing-distributed-db-configurations

Functions of a distributed DBMS

distributed-dbms-architecture

Date’s 12 Commandments for Distributed Databases

Independence

  1. Local site independence: each local site can act as an independent, autonomous, DBMS. Each site is responsible for security, concurrency, control, backup, recovery.
  2. Central site independence: no site in the network relies on a central site, or on any other site. Each site has the same capabilities.
  3. Failure independence: system is not affected by node failures. The system is in continuous operation, even in the case of a node failure, or expansion of the network.
  4. Hardware independence
  5. Operating System independence
  6. Network independence
  7. Database independence

Transparency

  1. Location transparency: user doesn’t need to know location of data to retrieve it
  2. Fragmentation transparency: data fragmentation is transparent to the user, who sees only one logical database.
  3. Replication transparency: user sees only one logical database. DDBMS transparently selects database fragments to access.
  4. Distributed query processing: query may be executed at several different sites. Query optimisation is performed transparently by DDBMS.
  5. Distributed transaction processing: transaction may update data at several sites, and transaction is executed transparently.

20. NoSQL

Motivation for NoSQL

NoSQL Goals

Object-relational impedance mismatch

normalised-invoice-eg

Wiki

Various types of mismatch, e.g.

3Vs of Big Data

Big Data Characterisitics

schema-on-read

NoSQL Features

NoSQL Variants - Aggregate oriented

nosql-variants

Key-value stores

key-value-store

Document database

document-database

Column family

column-family

NoSQL Variants - Graph databases

CAP Theorem

cap-theorem

cap-theorem-2

ACID vs BASE


Edit this page.