Normalization
Normalization is a design technique
that is widely used as a guide in
designing relational databases.
Normalization is essentially a two step
process that puts data into tabular form
by removing repeating groups and then
removes duplicated data from the
relational tables.
Normalization theory is based on the
concepts of normal forms. A
relational table is said to be a
particular normal form if it satisfied a
certain set of constraints. There are
currently five normal forms that have
been defined. In this section, we will
cover the first three normal forms that
were defined by E. F. Codd.
Basic Concepts
The goal of normalization is to
create a set of relational tables that
are free of redundant data and that can
be consistently and correctly modified.
This means that all tables in a
relational database should be in the
third normal form (3NF). A relational
table is in 3NF if and only if all
non-key columns are (a) mutually
independent and (b) fully dependent upon
the primary key. Mutual independence
means that no non-key column is
dependent upon any combination of the
other columns. The first two normal
forms are intermediate steps to achieve
the goal of having all tables in 3NF. In
order to better understand the 2NF and
higher forms, it is necessary to
understand the concepts of functional
dependencies and lossless decomposition.
Functional Dependencies
The concept of functional
dependencies is the basis for the first
three normal forms. A column, Y, of the
relational table R is said to be
functionally dependent upon column X
of R if and only if each value of X in R
is associated with precisely one value
of Y at any given time. X and Y may be
composite. Saying that column Y is
functionally dependent upon X is the
same as saying the values of column X
identify the values of column Y. If
column X is a primary key, then all
columns in the relational table R must
be functionally dependent upon X.
A short-hand notation for describing
a functional dependency is:
R.x —>; R.y
which can be read as in the
relational table named R, column x
functionally determines (identifies)
column y.
Full functional dependence
applies to tables with composite keys.
Column Y in relational table R is fully
functional on X of R if it is
functionally dependent on X and not
functionally dependent upon any subset
of X. Full functional dependence means
that when a primary key is composite,
made of two or more columns, then the
other columns must be identified by the
entire key and not just some of the
columns that make up the key.
Overview
Simply stated, normalization is the
process of removing redundant data from
relational tables by decomposing
(splitting) a relational table into
smaller tables by projection. The goal
is to have only primary keys on the left
hand side of a functional dependency. In
order to be correct, decomposition must
be lossless. That is, the new tables can
be recombined by a natural join to
recreate the original table without
creating any spurious or redundant data.
Sample Data
Data taken from Date [Date90] is used
to illustrate the process of
normalization. A company obtains parts
from a number of suppliers. Each
supplier is located in one city. A city
can have more than one supplier located
there and each city has a status code
associated with it. Each supplier may
provide many parts. The company creates
a simple relational table to store this
information that can be expressed in
relational notation as:
FIRST (s#, status, city, p#, qty)
where
| s# |
supplier identifcation
number (this is the primary
key) |
| status |
status code assigned to city
|
| city | name
of city where supplier is
located |
| p# | part
number of part supplied
|
| qty> |
quantity of parts supplied
to date |
In order to uniquely
associate quantity supplied (qty)
with part (p#) and supplier (s#), a
composite primary key composed of s#
and p# is used.
First Normal Form
A relational table,
by definition, is in first normal
form. All values of the columns are
atomic. That is, they contain no
repeating values. Figure1 shows the
table FIRST in 1NF.
Figure 1: Table
in 1NF
Although the table
FIRST is in 1NF it contains
redundant data. For example,
information about the supplier's
location and the location's status
have to be repeated for every part
supplied. Redundancy causes what are
called update anomalies.
Update anomalies are problems that
arise when information is inserted,
deleted, or updated. For example,
the following anomalies could occur
in FIRST:
- INSERT. The fact that a certain
supplier (s5) is located in a
particular city (Athens) cannot
be added until they supplied a
part.
- DELETE. If a row is deleted,
then not only is the information
about quantity and part lost but
also information about the
supplier.
- UPDATE. If supplier s1 moved
from London to New York, then
six rows would have to be
updated with this new
information.
Second Normal Form
The definition of
second normal form states that only
tables with composite primary keys
can be in 1NF but not in 2NF.
A relational table is in
second normal form 2NF if it is in
1NF and every non-key column is
fully dependent upon the primary
key.
That is, every
non-key column must be dependent
upon the entire primary key. FIRST
is in 1NF but not in 2NF because
status and city are functionally
dependent upon only on the column s#
of the composite key (s#, p#). This
can be illustrated by listing the
functional dependencies in the
table:
|
s#
| —> city,
status |
|
city
| —> status |
|
(s#,p#)
| —> qty |
The process for
transforming a 1NF table to 2NF is:
- Identify any determinants other
than the composite key, and the
columns they determine.
- Create and name a new table for
each determinant and the unique
columns it determines.
- Move the determined columns from
the original table to the new
table. The determinate becomes
the primary key of the new
table.
- Delete the columns you just
moved from the original table
except for the determinate which
will serve as a foreign key.
- The original table may be
renamed to maintain semantic
meaning.
To transform FIRST into
2NF we move the columns s#, status, and
city to a new table called SECOND. The
column s# becomes the primary key of
this new table. The results are shown
below in Figure 2.
Figure 2: Tables in
2NF
Tables in 2NF but not in
3NF still contain modification
anomalies. In the example of SECOND,
they are:
INSERT. The fact that a
particular city has a certain status
(Rome has a status of 50) cannot be
inserted until there is a supplier
in the city.
DELETE. Deleting any row in
SUPPLIER destroys the status
information about the city as well
as the association between supplier
and city.
Third Normal Form
The third normal form
requires that all columns in a
relational table are dependent only upon
the primary key. A more formal
definition is:
A relational table is in third
normal form (3NF) if it is already
in 2NF and every non-key column is
non transitively dependent upon its
primary key. In other words, all
nonkey attributes are functionally
dependent only upon the primary key.
Table PARTS is already
in 3NF. The non-key column, qty, is
fully dependent upon the primary key
(s#, p#). SUPPLIER is in 2NF but not in
3NF because it contains a transitive
dependency. A transitive dependency
is occurs when a non-key column that is
a determinant of the primary key is the
determinate of other columns. The
concept of a transitive dependency can
be illustrated by showing the functional
dependencies in SUPPLIER:
|
SUPPLIER.s#
| —>
SUPPLIER.status |
|
SUPPLIER.s#
| —> SUPPLIER.city
|
|
SUPPLIER.city
| —> SUPPLIER.status
|
Note that
SUPPLIER.status is determined both
by the primary key s# and the
non-key column city. The process of
transforming a table into 3NF is:
Identify any
determinants, other the
primary key, and the columns
they determine.
- Create and name a new table for
each determinant and the unique
columns it determines.
- Move the determined columns from
the original table to the new
table. The determinate becomes
the primary key of the new
table.
Delete the
columns you just moved from
the original table except
for the determinate which
will serve as a foreign key.
- The original table may be
renamed to maintain semantic
meaning.
To transform
SUPPLIER into 3NF, we create a new
table called CITY_STATUS and move
the columns city and status into it.
Status is deleted from the original
table, city is left behind to serve
as a foreign key to CITY_STATUS, and
the original table is renamed to
SUPPLIER_CITY to reflect its
semantic meaning. The results are
shown in Figure 3 below.
Figure 3:
Tables in 3NF

The results of
putting the original table into 3NF
has created three tables. These can
be represented in "psuedo-SQL" as:
PARTS (#s, p#, qty)
Primary Key (s#,#p)
Foreign Key (s#) references
SUPPLIER_CITY.s#
SUPPLIER_CITY(s#,
city)
Primary Key (s#)
Foreign Key (city) references
CITY_STATUS.city
CITY_STATUS (city,
status)
Primary Key (city)
Advantages of Third
Normal Form
The advantage of
having relational tables in 3NF is
that it eliminates redundant data
which in turn saves space and
reduces manipulation anomalies. For
example, the improvements to our
sample database are:
INSERT. Facts
about the status of a city, Rome
has a status of 50, can be added
even though there is not
supplier in that city. Likewise,
facts about new suppliers can be
added even though they have not
yet supplied parts.
DELETE. Information about
parts supplied can be deleted
without destroying information
about a supplier or a city.
UPDATE. Changing the location of
a supplier or the status of a
city requires modifying only one
row.
To go beyond Third Normal Form, let's look at Advanced
Normalization |