Identifying Data Objects and Relationships
In order to begin constructing the basic
model, the modeler must analyze the information
gathered during the requirements analysis for
the purpose of:
- classifying data objects as either
entities or attributes
- identifying and defining relationships
between entities
- naming and defining identified entities,
attributes, and relationships
- documenting this information in the data
document
To accomplish these goals the modeler must
analyze narratives from users, notes from
meeting, policy and procedure documents, and, if
lucky, design documents from the current
information system.
Although it is easy to define the basic
constructs of the ER model, it is not an easy
task to distinguish their roles in building the
data model. What makes an object an entity or
attribute? For example, given the statement
"employees work on projects". Should employees
be classified as an entity or attribute? Very
often, the correct answer depends upon the
requirements of the database. In some cases,
employee would be an entity, in some it would be
an attribute.
While the definitions of the constructs in
the ER Model are simple, the model does not
address the fundamental issue of how to identify
them. Some commonly given guidelines are:
- entities contain descriptive information
- attributes either identify or describe
entities
- relationships are associations between
entities
These guidelines are discussed in more detail
below.
- Entities
- Attributes
- Validating Attributes
- Derived Attributes and Code Values
- Relationships
- Naming Data Objects
- Object Definition
- Recording Information in Design Document
Entities
There are various definitions of an entity:
"Any distinguishable person, place,
thing, event, or concept, about which
information is kept" [BRUC92]
"A thing which can be distinctly
identified" [CHEN76]
"Any distinguishable object that is to be
represented in a database" [DATE86]
"...anything about which we store
information (e.g. supplier, machine tool,
employee, utility pole, airline seat, etc.).
For each entity type, certain attributes are
stored". [MART89]
These definitions contain common themes about entities:
- an entity is a "thing", "concept" or,
object". However, entities can sometimes
represent the relationships between two
or more objects. This type of entity is
known as an associative entity.
- entities are objects which contain
descriptive information. If an data
object you have identified is described
by other objects, then it is an entity.
If there is no descriptive information
associated with the item, it is not an
entity. Whether or not a data object is
an entity may depend upon the
organization or activity being modeled.
- an entity represents many things which
share properties. They are not single
things. For example, King Lear and
Hamlet are both plays which share common
attributes such as name, author, and
cast of characters. The entity
describing these things would be PLAY,
with King Lear and Hamlet being
instances of the entity.
- entities which share common properties
are candidates for being converted to
generalization hierarchies (See below)
- entities should not be used to
distinguish between time periods. For
example, the entities 1st Quarter
Profits, 2nd Quarter Profits, etc.
should be collapsed into a single entity
called Profits. An attribute specifying
the time period would be used to
categorize by time
- not every thing the users want to
collect information about will be an
entity. A complex concept may require
more than one entity to represent it.
Others "things" users think important
may not be entities.
Attributes
Attributes are data objects that
either identify or describe entities. Attributes
that identify entities are called key
attributes. Attributes that describe an
entity are called non-key attributes. Key
attributes will be discussed in detail in a
latter section.
The process for identifying attributes is
similar except now you want to look for and
extract those names that appear to be
descriptive noun phrases.
Validating Attributes
Attribute values should be atomic,
that is, present a single fact. Having
disaggregated data allows simpler programming,
greater reusability of data, and easier
implementation of changes. Normalization also
depends upon the "single fact" rule being
followed. Common types of violations include:
- simple aggregation - a common example is
Person Name which concatenates first
name, middle initial, and last name.
Another is Address which concatenates,
street address, city, and zip code. When
dealing with such attributes, you need
to find out if there are good reasons
for decomposing them. For example, do
the end-users want to use the person's
first name in a form letter? Do they
want to sort by zip code?
- complex codes - these are attributes
whose values are codes composed of
concatenated pieces of information. An
example is the code attached to
automobiles and trucks. The code
represents over 10 different pieces of
information about the vehicle. Unless
part of an industry standard, these
codes have no meaning to the end user.
They are very difficult to process and
update.
- text blocks - these are free-form text
fields. While they have a legitimate
use, an over reliance on them may
indicate that some data requirements are
not met by the model.
- mixed domains - this is where a value of
an attribute can have different meaning
under different conditions
Derived Attributes and Code Values
Two areas where data modeling experts
disagree is whether derived attributes and
attributes whose values are codes should be
permitted in the data model.
Derived attributes are those created by a
formula or by a summary operation on other
attributes. Arguments against including derived
data are based on the premise that derived data
should not be stored in a database and therefore
should not be included in the data model. The
arguments in favor are:
- derived data is often important to both
managers and users and therefore should
be included in the data model
- it is just as important, perhaps more
so, to document derived attributes just
as you would other attributes
- including derived attributes in the data
model does not imply how they will be
implemented
A coded value uses one or more letters or
numbers to represent a fact. For example, the
value Gender might use the letters "M" and "F"
as values rather than "Male" and "Female". Those
who are against this practice cite that codes
have no intuitive meaning to the end-users and
add complexity to processing data. Those in
favor argue that many organizations have a long
history of using coded attributes, that codes
save space, and improve flexibility in that
values can be easily added or modified by means
of look-up tables.
Relationships
Relationships are associations between
entities. Typically, a relationship is indicated
by a verb connecting two or more entities. For
example:
employees are assigned to projects
As relationships are identified they should
be classified in terms of cardinality,
optionality, direction, and dependence. As a
result of defining the relationships, some
relationships may be dropped and new
relationships added. Cardinality quantifies the
relationships between entities by measuring how
many instances of one entity are related to a
single instance of another. To determine the
cardinality, assume the existence of an instance
of one of the entities. Then determine how many
specific instances of the second entity could be
related to the first. Repeat this analysis
reversing the entities. For example:
employees may be assigned to no
more than three projects at a time; every
project has at least two employees assigned
to it.
Here the cardinality of the relationship from
employees to projects is three; from projects to
employees, the cardinality is two. Therefore,
this relationship can be classified as a
many-to-many relationship.
If a relationship can have a cardinality of
zero, it is an optional relationship. If it must
have a cardinality of at least one, the
relationship is mandatory. Optional
relationships are typically indicated by the
conditional tense. For example:
an employee may be assigned to a
project
Mandatory relationships, on the other hand,
are indicated by words such as must have. For
example:
a student must register for at
least three course each semester
In the case of the specific relationship form
(1:1 and 1:M), there is always a parent entity
and a child entity. In one-to-many
relationships, the parent is always the entity
with the cardinality of one. In one-to-one
relationships, the choice of the parent entity
must be made in the context of the business
being modeled. If a decision cannot be made, the
choice is arbitrary.
Naming Data Objects
The names should have the following
properties:
- unique
- have meaning to the end-user
- contain the minimum number of words
needed to uniquely and accurately
describe the object
For entities and attributes, names are
singular nouns while relationship names are
typically verbs.
Some authors advise against using
abbreviations or acronyms because they might
lead to confusion about what they mean. Other
believe using abbreviations or acronyms are
acceptable provided that they are universally
used and understood within the organization.
You should also take care to identify and
resolve synonyms for entities and attributes.
This can happen in large projects where
different departments use different terms for
the same thing.
Object Definition
Complete and accurate definitions are
important to make sure that all parties involved
in the modeling of the data know exactly what
concepts the objects are representing.
Definitions should use terms familiar to the
user and should precisely explain what the
object represents and the role it plays in the
enterprise. Some authors recommend having the
end-users provide the definitions. If acronyms,
or terms not universally understood are used in
the definition, then these should be defined .
While defining objects, the modeler should be
careful to resolve any instances where a single
entity is actually representing two different
concepts (homonyms) or where two different
entities are actually representing the same
"thing" (synonyms). This situation typically
arises because individuals or organizations may
think about an event or process in terms of
their own function.
An example of a homonym would be a case where
the Marketing Department defines the entity
MARKET in terms of geographical regions while
the Sales Departments thinks of this entity in
terms of demographics. Unless resolved, the
result would be an entity with two different
meanings and properties.
Conversely, an example of a synonym would be
the Service Department may have identified an
entity called CUSTOMER while the Help Desk has
identified the entity CONTACT. In reality, they
may mean the same thing, a person who contacts
or calls the organization for assistance with a
problem. The resolution of synonyms is important
in order to avoid redundancy and to avoid
possible consistency or integrity problems.
Some examples of definitions are:
| Employee | A
person who works for and is paid by
the organization. |
| Est_Time | The
number of hours a project manager
estimates that project will require
to complete. Estimated time is
critical for scheduling a project
and for tracking project time
variances. |
| Assigned |
Employees in the organization may be
assigned to work on no more than
three projects at a time. Every
project will have at least two
employees assigned to it at any
given time. |
Recording Information in
Design Document
The design document records
detailed information about each object used
in the model. As you name, define, and
describe objects, this information should be
placed in this document. If you are not
using an automated design tool, the document
can be done on paper or with a word
processor. There is no standard for the
organization of this document but the
document should include information about
names, definitions, and, for attributes,
domains.
Two documents used in the
IDEF1X method of modeling are useful for
keeping track of objects. These are the
ENTITY-ENTITY matrix and the
ENTITY-ATTRIBUTE matrix.
The ENTITY-ENTITY matrix is
a two-dimensional array for indicating
relationships between entities. The names of
all identified entities are listed along
both axes. As relationships are first
identified, an "X" is placed in the
intersecting points where any of the two
axes meet to indicate a possible
relationship between the entities involved.
As the relationship is further classified,
the "X" is replaced with the notation
indicating cardinality.
The ENTITY-ATTRIBUTE matrix
is used to indicate the assignment of
attributes to entities. It is similar in
form to the ENTITY-ENTITY matrix except
attribute names are listed on the rows.
Figure 1 shows examples of
an ENTITY-ENTITY matrix and an
ENTITY-ATTRIBUTE matrix.
Figure 1:
Summary
The first step in creating
the data model is to analyze the information
gathered during the requirements analysis
with the goal of identifying and classifying
data objects and relationships. Our next step is
to develop the Basic
Schema. |