INTRODUCTION

According to (Elmasri & Navathe, 1994), the normalization process,
as first proposed by Codd (1972), takes a relation schema through a series
of tests to "certify" whether or not it belongs to a certain **normal
form.** Initially, Codd proposed three normal forms, which he called
**first,
second,** and **third normal form.** A stronger definition of 3NF
was proposed later by Boyce and Codd and is known as **Boyce-Codd normal
form (BCNF).** All these normal forms are based on the functional dependencies
among the attributes of a relation. Later, a **fourth normal form (4NF)**
and **a fifth normal form (5NF) **were proposed, based on the concepts
or multivalued dependencies and join dependencies, respectively.

**Normalization of data** can be looked on as a process during which
unsatisfactory relation schemas are decomposed by breaking up their attributes
into smaller relation schemas that possess desirable properties. One objective
of the original normalization process is to ensure that the update anomalies
do not occur.

Normal forms provide database designers with:

· A series of tests that can be carried
out on individual relation schema so that the relational database can be
**normalized**
to any degree. When a test fails, the relation violating that test must
be decomposed into relations that individually meet the normalization tests.

· The dependency preservation property,
which ensures that all *functional dependencies* are represented in
some of the individual resulting relations.

DEFINITIONS

A relation is defined as a *set of tuples*. By definition, all
elements of a set are distinct; hence, all tuples in a relation must also
be distinct. This means that no two tuples can have the same combination
of values for *all* their attributes.

Any set of attributes of a relation schema is called a **superkey**.
Every relation has at least one superkey—the set of all its attributes.
A **key **is a *minimal superkey*, i.e., a superkey from which
we cannot remove any attribute and still have the uniqueness constraint
hold.

In general, a relation schema may have more than one key. In this case,
each of the keys is called a **candidate key.** It is common to designate
one of the candidate keys as the **primary key **of the relation. A
**foreign
key** is a key in a relation R but it's __not__ a key (just an attribute)
in other relation R' of the same schema.

**Integrity Constraints:** the **entity integrity constraint **states
that no primary key value can be null. This is because the primary key
value is used to identify individual tuples in a relation; having null
values for the primary key implies that we cannot identify some tuples.
The **referential integrity constraint** is specified between two relations
and is used to maintain the consistency among tuples of the two relations.
Informally, the referential integrity constraint states that a tuple in
one relation that refers to another relation must refer to an *existing
tuple* in that relation.

An attribute of a relation schema R is called a **prime attribute**
of the relation R if it is a member of *any key* of the relation R.
An attribute is called **nonprime** if it is not a prime attribute—that
is, if it is not a member of any candidate key.

A **functional dependency,** denoted by X->Y, between two sets of
attributes X and Y that are subsets of R specifies a constraints on the
possible tuples that can form a relation instance of R.

NORMAL FORMS

**First Normal Form (1NF)**

**First normal form** is now considered to be part of the formal
definition of a relation; historically, it was defined to disallow *multivalued
attributes, composite attributes, and their combinations.* It states
that the domains of attributes must include only
*atomic*
(simple, indivisible) *values* and that the value of any attribute
in a tuple must be a single *value* from the domain of that attribute.

__Practical Rule__^{1}:
*"Eliminate
Repeating Groups,"* i.e., make a separate table for each set of related
attributes, and give each table a primary key.

Formal Definition^{2}:
A relation is in **first normal form** (1NF) if and only if all underlying
simple domains contain atomic values only.

**Second Normal Form (2NF)**

**Second normal form** is based on the concept of *fully functional
dependency.* A functional X->Y is a **fully functional dependency**
is removal of any attribute A from X means that the dependency does not
hold any more. A relation schema is in **2NF**
if every nonprime attribute in relation is *fully functionally dependent*
on the primary key of the relation. It also can be restated as:
a
relation schema is in **2NF** if every nonprime attribute in relation
is not partially dependent on *any key* of the relation.

__Practical Rule__^{1}:
*"Eliminate
Redundant Data,"* i.e., if an attribute depends on only part of a multivalued
key, remove it to a separate table.

Formal Definition^{2}:
A relation is in **second normal form** (2NF) if and only if it
is in 1NF and every nonkey attribute is fully dependent on the primary
key.

**Third Normal Form (3NF)**

**Third normal form** is based on the concept of *transitive dependency.*
A functional dependency X->Y in a relation is a **transitive dependency
**if
there is a set of attributes Z that is *not a subset* of any key of
the relation, and both X->Z and Z->Y hold. In other words, a
relation is in **3NF **if, whenever a functional dependency X->A holds
in the relation, either (a) X is a superkey of the relation, or (b) A is
a prime attribute of the relation.

__Practical Rule__^{1}:
*"Eliminate
Columns not Dependent on Key,"* i.e., if attributes do not contribute
to a description of a key, remove them to a separate table.

Formal Definition^{2}:
A relation is in **third normal form** (3NF) if and only if it is in
2NF and every nonkey attribute is nontransitively dependent on the primary
key.

**Boyce-Codd Normal Form (BCNF)**

**Boyce-Codd normal form** is stricter than 3NF, meaning that every
relation in BCNF is also in 3NF; however, a relation in 3NF is *not necessarily
*in
BCNF. A relation schema is an **BCNF** if whenever
a functional dependency X->A holds in the relation, then X is a superkey
of the relation. The only difference between BCNF and 3NF is that
condition (b) of 3NF, which allows A to be prime if X is not a superkey,
is absent from BCNF.

Formal Definition^{2}:
A relation is in **Boyce/Codd normal form** (BCNF) if and only if every
determinant is a candidate key. [A determinant is any attribute on which
some other attribute is (fully) functionally dependent.]

**Fourth Normal Form (4NF)**

Multivalued dependencies are a consequence of first normal form, which disallowed an attribute in a tuple to have a set of values. If we have two or more multivalued independent attributes in the same relation schema, we get into a problem of having to repeat every value of one of the attributes with every value of the other attribute to keep the relation instances consistent.

**Fourth normal form** is based on multivalued dependencies, which
is violated when a relation has undesirable multivalued dependencies, and
hence can be used to identify and decompose such relations. A
relation scheme R is in **4NF** with respect to a set of dependencies
F is, for every *nontrivial* multivalued dependency X->F, X is a superkey
for R.

__Practical Rule__^{1}:
*"Isolate
Independent Multiple Relationships,"* i.e., no table may contain two
or more 1:n or n:m relationships that are not directly related.

Formal Definition^{2}:
A relation R is in **fourth normal form** (4NF) if and only if, whenever
there exists a multivalued dependency in the R, say A->>B, then all attributes
of R are also functionally dependent on A.

**Fifth Normal Form (5NF)**

In some cases there may be no losses join decomposition into two relation
schemas but there may be a losses join decomposition into more than two
relation schemas. These cases are handled by the *join dependency*
and **fifth normal form**, and it’s important to note that these cases
occur very rarely and are difficult to detect in practice.

__Practical Rule__^{1}:
*"Isolate
Semantically Related Multiple Relationships,"* i.e., there may be practical
constraints on information that justify separating logically related many-to-many
relationships.

Formal Definition^{2}:
A relation R is in **fifth normal form** (5NF)—also called projection-join
normal form (PJNF)—if and only if every join dependency in R is a consequence
of the candidate keys of R.

**A join dependency** (JD) specified on a relations schema R, specifies
a constraint on instances of R. The constraint states that *every legal
instance* of R should have a losses join decomposition into sub-relations
of R, that when reunited make the entire relation R. A
relation schema R is in **fifth normal form (5NF)** (or **project-join
normal form (PJNF))** with respect to a set F of functional, multivalued,
and join dependencies if, for every nontrivial join dependency JD(R1, R2,
…, Rn) in F (implied by F), every Ri is a superkey of R.

**Domain Key Normal Form (DKNF)**

We can also always define stricter forms that take into account additional
types of dependencies and constraints. The idea behind **domain-key normal
form** is to specify, (theoretically, at least) the "ultimate normal
form" that takes into account all possible dependencies and constraints.
A
relation is said to be in **DKNF** if all constraints and dependencies
that should hold on the relation can be enforced simply by enforcing the
domain constraints and the key constraints specified on the relation.

For a relation in DKNF, it becomes very straightforward to enforce the constraints by simply checking that each attribute value in a tuple is of the appropriate domain and that every key constraint on the relation is enforced. However, it seems unlikely that complex constraints can be included in a DKNF relation; hence, its practical utility is limited.

__Notes:__

1. Most of the modern DBMS systems offer "some kind" of DKNF normalization, by giving the designer the opportunity to assign domain and specific properties (such as key specification) to each attribute of a relation part of a schema. However, that's NOT a guarantee that the resulted relation is in DKNF.

2. The relationship between the 7 levels of normalization (1 through 5, plus BCNF and DKNF) can intuitively be represented as layered, concentric circles, with the largest circle as the 1NF, then a smaller, inside circle as the 2NF, and so on, the smallest circle being the DKNF. This is because a relation is defined as being in a given normal form (let's say 2NF) if and only if it is already in the immediately previous normal form (i.e., 1NF) and satisfies additional requirements.

REFERENCES

Elmasri, R., & Navathe, S. (1994). *Fundamentals of Database Systems*.
2nd ed. Redwood City, CA: The Benjamin/Cummings Publishing
Co. pp. 143 – 144, 401, 407 – 409, 435, 438, 440, 442 - 443.

^{1}Rettig, Marc. (1995). *Database Programming
& Design.* Miller Freeman, Inc.

^{2}Date, C. J. (1990). *An Introduction
to Database Systems*. 5th ed. Volume I. Reading, MA: Addison-Wesley
Publishing Company.