Database Systems

Database Systems Design, Implementation, and Management, 6e
Chapter 1: Database Systems

ISBN: 061921323X Author: Peter Rob, Carlos M. Coronel
copyright © 2005 Course Technology
Database Systems
Learning Objectives

In this chapter, you will learn:
* The difference between data and information
* What a database is, about different types of databases, and why they are valuable assets for decision making
* Why database design is important
* How modern databases evolved from files and file systems
* About flaws in file system data management
* How a database system differs from a file system, and how a DBMS functions within the database system

Preview

Good decisions require good information, derived from raw facts known as data. Data are likely to be managed most efficiently when they are stored in a database. In this chapter you will learn what a database is, what it does, and why it yields better results than some other data management methods. You will also begin to appreciate why database design is so important.

Databases evolved from computer file systems. Although file system data management is now largely outmoded, understanding the basic characteristics of these file systems is important because they are the source of some serious data management limitations. If you understand a file system’s flaws, you are more likely to understand why and how specific database properties are especially useful, and to employ these features appropriately. Finally, you will see how basic database features help eliminate most of the file system’s data management shortcomings.
1.1 Data vs. Information

To better understand what drives the design of databases, you must understand the difference between data and information. Data are raw facts. The word “raw” is used to indicate that the facts have not yet been processed to reveal their meaning. For example, suppose that ROBCOR Company tracks all sales for its two divisions through invoices. Each of the invoices contains raw facts such as these:

invoice number = 300124 invoice date = 12-JAN-2004 sales amount = $125.98

Further suppose that ROBCOR’s two divisions have generated 1,380,456 and 1,453,907 invoices, respectively, between the first quarter of 1999 and the first quarter of 2004. Therefore, ROBCOR’s raw data include 2,834,363 invoice numbers, 2,834,363 invoice dates, and 2,834,363 sales amounts. Given such an abundant data environment, coupled with employee data for each of the two divisions for each of these quarters, how likely is it that ROBCOR’s managers can draw useful conclusions about sales productivity per employee for each of the two divisions? Examining these 2,834,363 invoices individually will merely overwhelm ROBCOR’s managers. On the other hand, if they process these facts to yield total sales per quarter for each of the two divisions, and then divide these quarterly sales summaries by the quarterly employee count, as shown in Figure 1.1, ROBCOR’s managers will have information, that is, data processed to reveal meaning. These results make it quite plain that the employees of Division 1 have greater sales productivity per employee than those of Division 2. Moreover, it’s easy to see that the sales productivity gap is widening. Such information can then be used as the foundation for decision making. Data processing may be as simple as organizing the data to reveal patterns or as complex as statistical modeling to make forecasts or draw inferences.

Figure 1.1 Sales per Employee for Each of Robcor's Two Divisions

[Click to enlarge]

Our era is called the “information age.” This term recognizes that the production of accurate, relevant, and timely information is the key to good decision making. In turn, good decision making is the key to business survival in a global market.

Let’s summarize some key points:

* Data constitute the building blocks of information.
* Information is produced by processing data.
* Information is used to reveal the meaning of data.
* Accurate, relevant, and timely information is the key to good decision making.
* Good decision making is the key to organizational survival in a global environment.

Timely and useful information requires accurate data. Such data must be generated properly, and it must be stored properly in a format that is easy to access and process. And, like any basic resource, the data environment must be managed carefully. Data management is a discipline that focuses on the proper generation, storage, and retrieval of data. Given the crucial role played by data, it should not surprise you that data management is a core activity for any business, government agency, service organization, or charity.
1.2 Introducing the Database and the DBMS

Efficient data management typically requires the use of a computer database. A database is a shared, integrated computer structure that houses a collection of:

* End user data, that is, raw facts of interest to the end user.
* Metadata, or data about data, through which the data are integrated and managed.

The metadata provide a description of the data characteristics and the set of relationships that link the data found within the database. In a sense, a database resembles a very well-organized electronic filing cabinet in which powerful software, known as a database management system, helps manage the cabinet’s contents. A database management system (DBMS) is a collection of programs that manages the database structure and controls access to the data stored in the database. The DBMS makes it possible to share the data in the database among multiple applications or users.

Because data are the crucial raw material from which information is derived, in our information-based society we must have a good way of managing such data. As you will discover in this book, the DBMS helps make data management much more efficient and effective. The following points are worth stressing:

* The DBMS helps create an environment in which end users have better access to more and better-managed data than they did before the DBMS became the data management standard. Such access makes it possible for end users to respond quickly to changes in their environment. The availability of data, combined with the tools that transform data into usable information, empowers end users to make quick, informed decisions that can make the difference between success and failure in the global economy.
* Wider access to well-managed data promotes an integrated view of the organization’s operations and a clearer view of the “big picture.” It becomes much easier to see how actions in one segment of the company affect other segments.
* The probability of data inconsistency is greatly reduced in a properly designed database that is managed through a DBMS. Better data make it possible to generate better information, on which better decisions are based.
* The DBMS makes it possible to produce quick answers to ad hoc queries. (A query is a question, and an ad hoc query is a spur-of-the-moment question.) For example, end users, when dealing with large amounts of sales data, might want quick answers to questions (ad hoc queries) such as:
o What was the dollar volume of sales by product during the past six months?
o What is the sales bonus figure for each of our salespeople during the past three months?
o How many of our customers have credit balances of $3,000 or more?

The advantages of using a DBMS are not limited to the few we have just listed. In fact, you will discover many more advantages as you learn more about the technical details of databases and their proper design.

Figure 1.2 illustrates that the DBMS serves as the intermediary between the user and the database by translating user requests into the complex code required to fulfill those requests. The DBMS hides much of the database’s internal complexity from the application programs that use the database. The application program might be written by a programmer using a programming language such as COBOL, Visual Basic, or C++, or it might be created through a DBMS utility program.

Figure 1.2 The DBMS Manages the Interaction Between the End User and the Database

[Click to enlarge]
1.2.1 Types of Databases

A DBMS can support many different types of databases. Databases can be classified according to the number of users, the database site location(s), and the expected type and extent of use.

The number of users determines whether the database is classified as single-user or multiuser. A single-user database supports only one user at a time. In other words, if user A is using the database, users B and C must wait until user A has completed his/her database work. If a single-user database runs on a personal computer, it is also called a desktop database. In contrast, a multiuser database supports multiple users at the same time. If the multiuser database supports a relatively small number of users (usually fewer than 50) or a specific department within an organization, it is called a workgroup database. If the database is used by the entire organization and supports many users (more than 50, usually hundreds) across many departments, the database is known as an enterprise database.

The database site location might also be used to classify the database. For example, a database that supports data located at a single site is called a centralized database. A database that supports data distributed across several different sites is called a distributed database. The extent to which a database can be distributed and the way in which such distribution is managed is addressed in detail in Chapter 10, “Distributed Database Management Systems.”

The most popular way of classifying databases today, however, is based on how they will be used, and on the time sensitivity of the information gathered from them. For example, transactions such as product or service sales, payments, and supply purchases reflect critical day-to-day operations. Such transactions are time-critical and must be recorded accurately and immediately. A database that is primarily designed to support a company’s day-to-day operations is classified as a transactional database or a production database. In contrast, a data warehouse database focuses primarily on the storage of data used to generate information required to make tactical or strategic decisions. Such decisions typically require extensive “data massaging” (data manipulation) to extract information from historical data to formulate pricing decisions, sales forecasts, market positioning, and so on. Because most decision support information is based on historical data, the time factor is not likely to be as critical as for transactional databases. Additionally, the data warehouse database can store complex data derived from many sources. To make it easier to retrieve such complex data, the data warehouse database structure is quite different from that of a transaction-oriented database. We cover the design, implementation, and use of data warehouse databases and their “offspring” in detail in Chapter 12, “The Data Warehouse.”
NOTE

Most of the database design, implementation, and management issues addressed in this book are based on production (transaction) databases. Our focus on production databases is based on two considerations. First, they are the databases you most frequently encounter in common activities such as enrolling in a class, registering your car, buying a product, or making a bank deposit or withdrawal. Second, data warehouse databases derive most of their data from production databases, and if production databases are poorly designed, the data warehouse databases based on them will lose their reliability and value as well.
1.3 Why Database Design is Important

A good database does not just happen; the structure of its contents must be designed carefully. In fact, database design is such a crucial aspect of working with databases that most of this book is dedicated to the development of good database design techniques. Even a good DBMS will perform poorly with a badly designed database.
NOTE

Keep in mind that the DBMS is the database software you buy commercially; you do not have the option of making design changes to it. Therefore, when we speak of database design, we mean the design of the database structure that will be used to store and manage data, rather than the design of the DBMS software. Once the database design is completed, the DBMS handles all the complicated activities required to translate the designer’s view of the structures that are usable by the computer.

Proper database design requires the database designer to precisely identify the database’s expected use. Designing a transactional database emphasizes data integrity, data consistency, and operational speed. The design of a data warehouse database recognizes the use of historical and aggregated data. Designing a database to be used in a centralized, single-user environment requires a different approach from that used in the design of a distributed, multiuser database. In this book we will emphasize the design of transactional, centralized, single-user, and multiuser databases. However, we will also examine critical issues confronting the designer of distributed and data warehouse databases in Chapter 10 and Chapter 12.

A well-designed database facilitates data management and becomes a valuable information generator. A poorly designed database will likely become a breeding ground for redundant data, that is, unnecessarily duplicated data. Redundant data are often the source of difficult-to-trace information errors. A database contains redundant data when the same data about the same entity are kept in different locations. For example, data redundancy exists when a customer’s telephone number is stored in a customer file, a sales agent file, and an invoice file. If the customer’s phone number changes, the correction might not be made in all the locations where it occurs. Therefore, the existence of redundant data can produce uncorrected data entries, and you probably won’t know which value is the correct one. Reports might yield different results, depending on which version of the data was used. In short, uncontrolled data redundancies are typical of a poorly designed database.

A poorly designed database tends to generate errors that are likely to lead to bad decisions, and bad decisions can lead to the failure of an organization. Database design is simply too important to be left to luck. That’s why college students study database design, why organizations of all types and sizes send personnel to database design seminars, and why database design consultants often make an excellent living. In this book, we take a practical approach to database design. If you want to learn how to design a useful database, this book will point you in the right direction. If you want to develop a thorough understanding of the outer limits of database theory, this is not the book for you.

In addition to including the material necessary to build an appropriate database vocabulary and to understand basic database concepts, we have also developed two complete applications through the logical design stage. (The logical design is the structural blueprint of the database.) We believe that exposure to the practical aspects of the design process will properly set the stage for your own successful designs.
1.4 The Historical Roots of the Database: Files and File Systems

Although file systems as a way of managing data are now largely obsolete, there are several good reasons for studying them in some detail:

* An understanding of the relatively simple characteristics of file systems makes the complexity of database design easier to understand.
* An awareness of the problems that plagued file systems can help you avoid such pitfalls with DBMS software.
* If you intend to convert an obsolete file system to a database system, knowledge of the file system’s basic limitations will be useful.

In the recent past, a manager of almost any small organization was (and sometimes still is) able to keep track of necessary data by using a manual file system. Such a file system was traditionally composed of a collection of file folders, each properly tagged and kept in a filing cabinet. Organization of the data within the file folders was determined by the data’s expected use. Ideally, the contents of each file folder were logically related. For example, a file folder in a doctor’s office might contain patient data, one file folder for each patient. All the data in that file folder described only that particular patient’s medical history. Similarly, a personnel manager might organize personnel data by category of employment (clerical, technical, sales, administrative, etc.). Therefore, a file folder labeled “Technical” would contain data pertaining to only those people whose duties were properly classified as technical.

As long as a data collection was relatively small and an organization’s managers had few reporting requirements, the manual system served its role well as a data repository. However, as organizations grew and as reporting requirements became more complex, keeping track of data in a manual file system became more difficult. In fact, finding and using data in growing collections of file folders became such a time-consuming and cumbersome task that it became less and less likely that such data would ever generate useful information. Consider just these few questions that a retail business owner might want to answer:

* What products sold well during the past week, month, quarter, or year?
* What is the current daily, weekly, monthly, quarterly, or yearly sales dollar volume?
* How do the current period’s sales compare to those of last week, last month, or last year?
* Were the various cost categories increasing, decreasing, or remaining stable during the past week, month, quarter, or year?
* Did sales show trends that could change the inventory requirements?

The list of questions tends to be long and growing!

Unfortunately, report generation from a manual file system can be slow and cumbersome. In fact, some business managers faced government-imposed reporting requirements that required weeks of intensive effort each quarter, even when a well-designed manual system was used. Consequently, the pressure built to design a computer-based system that would track data and produce required reports.

The conversion from a manual file system to a matching computer file system could be technically complex. (Because we are accustomed to today’s relatively user-friendly computer interfaces, we have already forgotten how painfully hostile computers used to be!) Consequently, a new kind of professional, known as a data processing (DP) specialist, had to be hired or “grown” from the current staff. The DP specialist created the necessary computer file structures, often wrote the software that managed the data within those structures, and designed the application programs that produced reports based on the file data. Thus, numerous homegrown computerized file systems were born.

Initially, the computer files within the file system were similar to the manual files. A simple example of a customer data file for a small insurance company is shown in Figure 1.3. (You will discover later that the file structure shown in Figure 1.3, although typically found in early file systems, is unsatisfactory for a database. The search for solutions to structure-induced problems helped pave the way for the development of different data storage concepts and methods that are discussed later in this chapter.)

Figure 1.3 Contents of the CUSTOMER File

[Click to enlarge]
ADDITIONAL CD CONTENT

The databases used in the chapters are available on the student CD bundled with this book. Each chapter’s databases are stored in a folder that bears the chapter label. For example, this chapter’s databases are stored in the Databases\Student\Ch01 folder. Throughout the book, Additional CD Content boxes will highlight related material located on the CD.

We have used a Microsoft Access database to simulate the file structures shown in Figures Figure 1.3 and Figure 1.4. For example, the file contents shown in Figures Figure 1.3 and Figure 1.4 are shown in the Ch01_Text database as CUSTOMER_INIT and AGENT_INIT.

The description of computer files requires a specialized vocabulary. Every discipline develops its own jargon to enable its practitioners to communicate very precisely. The basic file vocabulary summary shown in Table 1.1 will help you understand subsequent discussions more easily.

Table 1.1 Basic File Terminology
TERM DEFINITION
Data
“Raw” facts, such as a telephone number, a birth date, a customer name, and a year-to-date (YTD) sales value. Data have little meaning unless they have been organized in some logical manner. The smallest piece of data that can be “recognized” by the computer is a single character, such as the letter A, the number 5, or a symbol such as /. A single character requires one byte of computer storage.
Field
A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.
Record
A logically connected set of one or more fields that describes a person, place, or thing. For example, the fields that constitute a record for a customer named J.D. Rudd might consist of J.D. Rudd’s name, address, phone number, date of birth, credit limit, and unpaid balance.
File
A collection of related records. For example, a file might contain data about vendors of ROBCOR Company; or a file might contain the records for the students currently enrolled at Gigantic University.

Using the proper file terminology given in Table 1.1, we can identify the file components shown in Figure 1.3. Note that the CUSTOMER file shown in Figure 1.3 contains 10 records. Each record is composed of nine fields: C_NAME, C_PHONE, C_ADDRESS, C_ZlP, A_NAME, A_PHONE, TP, AMT, and REN. The 10 records are stored in a named file. Because the file in Figure 1.3 contains customer data, its filename is CUSTOMER.

Using the CUSTOMER file’s contents, the DP specialist wrote programs that produced very useful reports for the sales department:

* Monthly summaries that showed the types and amounts of insurance sold by each agent. (Such reports might be used to analyze each agent’s productivity.)
* Monthly checks to determine which customers must be contacted for renewal.
* Reports that analyzed the ratios of insurance types sold by each agent.
* Periodic customer contact letters designed to summarize coverage and to provide various customer relations bonuses.

As time went on, additional programs were written to produce new reports. Although it took some time to specify the report contents and to write the programs that produced the reports, the sales department manager did not miss the old manual system—using the computer saved much time and effort. The reports were impressive, and the ability to perform complex data searches yielded the information needed to make sound decisions.

Then the sales department created a file named SALES, which helped track daily sales efforts. Additional files were created as needed to produce even more useful reports. In fact, the sales department’s success was so obvious that the personnel department manager demanded access to the DP specialist to automate payroll processing and other personnel functions. Consequently, the DP specialist was asked to create the AGENT file shown in Figure 1.4. The data in the AGENT file were used to write checks, keep track of taxes paid, summarize insurance coverage, and so on.

Figure 1.4 Contents of the AGENT File

[Click to enlarge]

As the number of files increased, a small file system, like the one shown in Figure 1.5, evolved. Each file in the system used its own application programs to store, retrieve, and modify data. And each file was owned by the individual or the department that commissioned its creation.

Figure 1.5 A Simple File System

[Click to enlarge]

As the file system grew, the demand for the DP specialist’s programming skills grew even faster. Therefore, the DP specialist was authorized to hire additional programmers. The size of the file system also required a larger, more complex computer. The new computer and the additional programming staff caused the DP specialist to spend less time programming and more time managing technical and human resources. Therefore, the DP specialist’s job evolved into that of a data processing (DP) manager, who supervised a DP department. In spite of these organizational changes, however, the DP department’s primary activity remained programming, and the DP manager inevitably spent much time as a supervising senior programmer and program troubleshooter.
1.5 Problems with File System Data Management

Although the file system method of organizing and managing data was a definite improvement over a manual system, many problems and limitations became evident in this approach, as described in this section. However, as you learn about these problems, remember that the file system served a useful purpose in data management for over two decades, a very long timespan in the computer era. Our critique serves two major purposes:

1. Understanding the shortcomings of the file system enables us to understand the development of modern databases.
2. Many of the problems we will catalog are not unique to file systems. Failure to understand such problems is likely to lead to their duplication in a database environment, even though database technology makes it relatively easy to avoid them.

The first and most glaring problem with the file system approach is that even the simplest data-retrieval task requires extensive programming in a third-generation language (3GL). A 3GL requires the programmer to specify both what must be done and how it is to be done. Examples of 3GLs include Common Business-Oriented Language (COBOL), Beginner’s All-purpose Symbolic Instruction Code (BASIC), and FORmula TRANslation (FORTRAN). As you will learn in upcoming chapters, more modern databases use a fourth-generation language (4GL). A 4GL allows the user to specify what must be done without specifying how it must be done.

Programming in a 3GL can be a time-consuming, high-skill activity. Because the simple file system depicted in Figure 1.5 is quite different from the way the computer physically stores the data on disk, the programmer must be familiar with the physical file structure, that is, how and where the files are stored in the computer. Therefore, every file reference in a program requires the programmer to define access paths to the data. Such access paths use complex coding to establish the precise location of the various file and system components and their data characteristics. As file systems become more complex, the access paths become difficult to manage and tend to produce system malfunctions. For example, if you were working in a COBOL environment with the CUSTOMER file illustrated in Figure 1.3, you would need to include the following lines in your program:

21. SELECT CUSTOMER-DATA-FILE.
22. ASSIGN TO “C:\DATAFILES\CUSTOMER.DAT”
23. ORGANIZATION IS RECORD SEQUENTIAL.
24. DATA DIVISION.
25. FILE SECTION.
26. FD CUSTOMER-DATA-FILE
27. LABEL RECORDS ARE STANDARD.
28. 01 CUST-DATA-REC.
29. 05 C_NAME PIC X(20).
30. 05 C_PHONE PIC X(10).
31. 05 C_ADDRESS PIC X(30).
32. 05 C_ZIP PIC X(5).
33. 05 A_NAME PIC X(20).
34. 05 A_PHONE PIC X(10).
35. 05 TP PIC X(2).
36. 05 AMT PIC 9(3)V99.
37. 05 REN PIC X(11).

Even if you don’t know how to write COBOL programs, the brief excerpt shown here illustrates that the programmer must predefine the physical location of the file (line 22) and also the data access path characteristics (line 23). If the CUSTOMER data file location is changed, all the programs using it must be changed to reflect the new location.

The need to write 3GL programs to produce even the simplest reports makes ad hoc queries impossible. Harried DP specialists or DP managers who work with mature file systems often receive numerous requests for new reports. They are often forced to say that the report will be ready “next week” or even “next month.” If you need the information now, getting it next week or next month will not serve your information needs well.

Another problem, related to the need for extensive programming, is that as the number of files in the system expands, system administration becomes difficult. Each file must have its own file management system, composed of programs that allow the user to:

* Create the file structure.
* Add data to the file.
* Delete data from the file.
* Modify the data contained in the file.
* List the file contents.

Even a simple file system of only 20 files requires 5 × 20 = 100 file management programs. If each of the files is accessed by 10 different reporting programs, an additional 20 × 10 = 200 programs must be written. Because ad hoc queries are not possible, the file reporting programs can multiply quickly. And, because each department in the organization owns its data by creating its own files, the number of files can multiply rapidly.

Making changes in an existing structure can be difficult in a file system environment. For example, changing just one field in the original CUSTOMER file (such as changing the AMT field in the preceding COBOL example to handle amounts over 999.99) requires a program that:

1. Puts the new file structure into a special portion of the computer’s memory known as a buffer.
2. Opens the original file, using a different buffer.
3. Reads a record from the original file.
4. Transforms the original data to conform to the new structure’s storage requirements.
5. Writes the transformed data into the new file structure.
6. Deletes the original file.
7. Modifies all the programs that use the CUSTOMER file to fit the revised file structure.

In fact, any file structure change, no matter how minor, forces modifications in all the programs that use the data in that file. Modifications are likely to produce errors (bugs), and additional time can be spent using a debugging process to find those errors. The advice “think before you do” is especially valid in the file system environment, because all data access programs are subject to change when even minor changes in the file structure are made.

Another fault is that security features such as effective password protection, locking out parts of files or parts of the system itself, and other measures designed to safeguard data confidentiality are difficult to program and are, therefore, often omitted in a file system environment. Even when an attempt is made to improve system and data security, the security devices tend to be limited in scope and effectiveness.

To summarize the limitations of file system data management so far:

* It requires extensive programming.
* System administration can be complex and difficult.
* It is difficult to make changes to existing structures.
* Security features are likely to be inadequate.

These limitations in turn lead to problems of structural and data dependency.
1.5.1 Structural and Data Dependence

In the preceding sections you saw how a change in any file’s structure, such as the addition or deletion of a field, requires the modification of all programs using that file. Such modifications are required because the file system exhibits structural dependence; that is, access to a file is dependent on its structure.

Even changes in file data characteristics, such as changing a field from integer to decimal, require changes in all programs that access the file. Because all data access programs are subject to change when any of the file’s data characteristics change, the file system is said to exhibit data dependence. (Conversely, structural independence exists when it is possible to make changes in the database structure without affecting the application program’s ability to access the data.)

The practical significance of data dependence is the difference between the logical data format (how the human being views the data) and the physical data format (how the computer “sees” the data). Any program that accesses a file system’s file must not only tell the computer what to do, but also how to do it. Consequently, each program must contain lines that specify the opening of a specific file type, its record specification, and its field definitions. Data dependence thus makes the file system extremely cumbersome from a programming and data management point of view.
1.5.2 Field Definitions and Naming Conventions

At first glance the CUSTOMER file shown in Figure 1.3 appears to have served its purpose well: requested reports usually could be generated. But suppose we want to create a customer phone directory based on the data stored in the CUSTOMER file. Storing the customer name as a single field turns out to be a liability because the directory must break up the field contents to list the last names, first names, and initials in alphabetical order. Or suppose we want to get a customer listing by area code. Including the area code in the phone number field is inefficient.

Similarly, producing a listing of customers by city is a more difficult task than is necessary. From the user’s point of view, a much better (more flexible) record definition would be one that anticipates reporting requirements by breaking up fields into their component parts. Thus the CUSTOMER file’s fields might be listed as shown in Table 1.2.

Table 1.2 Sample CUSTOMER File Fields
FIELD CONTENTS SAMPLE ENTRY
CUS_LNAME
Customer last name
Ramas
CUS_FNAME
Customer first name
Alfred
CUS_INITIAL
Customer initial
A
CUS_AREACODE
Customer area code
615
CUS_PHONE
Customer phone
234-5678
CUS_ADDRESS
Customer street address or box number
123 Green Meadow Lane
CUS_CITY
Customer city
Murfreesboro
CUS_STATE
Customer state
TN
CUS_ZIP
Customer zip code
37130

Selecting proper field names is also important. For example, make sure that the field names are reasonably descriptive. As you examine the file structure shown in Figure 1.3, it is not obvious that the field name REN represents the customer’s insurance renewal date. Using the field name CUS_RENEW_DATE would be better for two reasons. First, the prefix CUS can be used as an indicator of the field’s origin, which is the CUSTOMER file. Therefore, we know that the field in question yields a CUSTOMER property. Second, the RENEW_DATE portion of the field name is more descriptive of the field’s contents. With proper naming conventions, the file structure becomes self-documenting. That is, by simply looking at the field names we can determine which files the fields belong to and what information the fields are likely to contain.

Some software packages place restrictions on the length of field names, so it is wise to be as descriptive as possible within those restrictions. In addition, remember that very long field names make it difficult to fit more than a few fields on a page, thus making output spacing a problem. For example, the field name CUSTOMER_ INSURANCE_RENEWAL_DATE, while being self-documenting, is less desirable than CUS_RENEW_DATE.

Another problem in Figure 1.3’s CUSTOMER file is the difficulty of finding desired data efficiently. Note that the CUSTOMER file currently does not have a unique record identifier. For example, it is possible to have several customers named John B. Smith. Consequently, the addition of a CUS_ACCOUNT field that contains a unique customer account number would be appropriate.

Although we have just criticized the field definitions and naming conventions shown in the file structure of Figure 1.3, the points we raised are not unique to file systems. Because such conventions will prove to be important later, we want you to be introduced to them early. You will revisit field definitions and naming conventions when you learn about database design in Chapter 4, “Entity Relationship (ER) Modeling,” when you learn about database implementation issues in Chapter 8, “The Database Design Life Cycle,” and when you see an actual database design implemented in Appendixes D and E (“The University Lab” design and implementation). Regardless of the data environment, the design—whether it involves a file system or a database— must always reflect the designer’s documentation needs and the end user’s reporting and processing requirements. Both types of needs are served best by adhering to proper field definitions and naming conventions.
NOTE

Keep in mind that no naming convention can fit all requirements for all systems. Some words or phrases are reserved for the DBMS’s internal use. For example, the name ORDER generates an error in some DBMSs. Similarly, your DBMS might interpret a dash (-) as a command to subtract. Therefore, the field CUS-NAME would be interpreted as a command to subtract the NAME field from the CUS field. Because neither field exists, you would get an error message. On the other hand, CUS_NAME would work fine, because it uses an underscore.
1.5.3 Data Redundancy

The file system’s structure and lack of security make it difficult to pool data. The organizational structure promotes data ownership, thus promoting the storage of the same basic data in different locations. (Database professionals use the term islands of information to label such scattered data locations.) Because it is unlikely that data stored in different locations will always be updated consistently, the islands of information often contain different versions of the same data. For example, in Figures Figure 1.3 and Figure 1.4, the agent names and phone numbers occur in both the CUSTOMER and the AGENT files. You need only one correct copy of the agent names and phone numbers. Having them occur in more than one place unnecessarily produces data redundancy. (In short, a condition of data redundancy exists when the data environment contains redundant—unnecessarily duplicated—data).

Uncontrolled data redundancy sets the stage for:

1.

Data inconsistency.Data inconsistency exists when different and conflicting versions of the same data appear in different places. For example, suppose we change an agent’s phone number or address in the AGENT file. If we forget to make corresponding changes in the CUSTOMER file, the files contain different data for the same agent. Reports yield inconsistent results, depending on which version of the data is used. Data that display data inconsistency are also referred to as data that lack data integrity.

Data entry errors are much more likely to occur when complex entries (such as 10-digit phone numbers) are made in several different files and/or recur frequently in one or more files. In fact, the CUSTOMER file shown in Figure 1.3 contains just such an entry error: the third record in the CUSTOMER file has a transposed digit in the agent’s phone number (615-882-2144 rather than 615-882-1244).

It is possible to enter a nonexistent sales agent’s name and phone number into the CUSTOMER file, but customers are not likely to be impressed if the insurance agency supplies the name and phone number of an agent who does not exist. And should the personnel manager allow a nonexistent agent to accrue bonuses and benefits? In fact, a data entry error such as an incorrectly spelled name or an incorrect phone number yields the same kind of data integrity problems.
2.

Data anomalies. The dictionary defines “anomaly” as an abnormality. Ideally, a field value change should be made only in a single place. Data redundancy, however, fosters an abnormal condition by forcing field value changes in many different locations. Look at the CUSTOMER file in Figure 1.3. If agent Leah F. Hahn decides to get married and move, the agent name might change, and the address and phone will most likely change. Instead of making just a single name and/or phone/address change in a single file (AGENT), we must make the change each time that agent’s name, phone number, and address occur in the CUSTOMER file, too. We could be faced with the prospect of making hundreds of corrections, one for each of the customers served by that agent! The same problem occurs when an agent decides to quit. Each customer served by that agent must be assigned a new agent. Any change in any field value must be correctly made in many places to maintain data integrity. A data anomaly develops when all the required changes in the redundant data are not made successfully. The data anomalies found in Figure 1.3 are commonly defined as:
* Modification anomalies. If agent Leah F. Hahn has a new phone number, that new number must be entered in each of the CUSTOMER file records in which Ms. Hahn’s phone number is shown. In this case, only three changes must be made. In a large file system, such changes might occur in hundreds or even thousands of records. Clearly, the potential for data inconsistencies is great.
* Insertion anomalies. To add each new customer in the CUSTOMER file, we must also add the corresponding agent data. If we add several hundred new customers, we must also enter several hundred agent names and telephone numbers. Again, the potential for creating data inconsistencies is great.
* Deletion anomalies. If agent Alex B. Alby quits and is deleted from the payroll, all the customers in the CUSTOMER file refer to a nonexistent agent. To resolve this problem, we must modify all records in which Mr. Alby’s name and phone number appear.

1.6 Database Systems

The problems inherent in file systems make using a database system very desirable. Unlike the file system, with its many separate and unrelated files, the database consists of logically related data stored in a single logical data repository. (The “logical” label reflects the fact that, although the data repository appears to be a single unit to the end user, its contents may actually be physically distributed among multiple data storage facilities and/or locations.) Because the database’s data repository is a single logical unit, the database represents a major change in the way end user data are stored, accessed, and managed. The database’s DBMS, shown in Figure 1.6, provides numerous advantages over file system management, shown in Figure 1.5, by making it possible to eliminate most of the file system’s data inconsistency, data anomalies, data dependency, and structural dependency problems. Better yet, the current generation of DBMS software stores not only the data structures, but also the relationships between those structures and the access paths to those structures, all in a central location. The current generation of DBMS software also takes care of defining, storing, and managing all the required access paths to those components.

Remember that the DBMS is just one of several crucial components of a database system. Perhaps it is even appropriate to refer to the DBMS as the database system’s heart. However, just as it takes more than a heart alone to make a human being function, it takes more than a DBMS to make a database system function. In the sections that follow, you’ll learn what a database system is, what its components are, and how the DBMS fits into the database system picture.

Figure 1.6 Contrasting Database and File Systems

[Click to enlarge]
1.6.1 The Database System Environment

The term database system refers to an organization of components that define and regulate the collection, storage, management, and use of data within a database environment. From a general management point of view, the database system is composed of the five major parts shown in Figure 1.7: hardware, software, people, procedures, and data.

Let’s take a closer look at the five components shown in Figure 1.7:

1. Hardware. Hardware refers to all the system’s physical devices. The database system’s main and most easily identified hardware component is the computer, which might be a microcomputer, a minicomputer, or a mainframe computer. The hardware also includes all of the computer peripherals, which are the physical devices that control computer input and output, such as keyboards, mice, modems, and printers. Hardware also includes devices that are used to connect two or more computers, thereby producing a computer network. Networks are an essential part of modern database systems, because data are likely to be accessed from a local network, from remote locations such as airplane reservation systems and automatic teller machines, or over the Internet, or all of the above.
2. Software. Software refers to the collection of programs used by the computers within the database system. Although the most readily identified software is the DBMS itself, to make the database system function fully it takes three types of software: operating system software, DBMS software, and application programs and utilities.
* Operating system software manages all hardware components and makes it possible for all other software to run on the computers. Examples of operating system software include DOS, Microsoft Windows versions 95 / 98 / ME / NT / 2000 / XP, Linux, MacOS; UNIX, and MVS.
* DBMS software manages the database within the database system. Some examples of DBMS software include Microsoft Access and SQL Server, Oracle Corporation’s Oracle, and IBM’s DB2.

Figure 1.7 The Database System Environment

[Click to enlarge]

* Application programs and utility software are used to access and manipulate the data in the DBMS and to manage the computer environment in which data access and manipulation take place. Application programs are most commonly used to access the data found within the database to generate reports, tabulations, and other information to facilitate decision making. Utilities are the software tools used to help manage the database system’s computer components. For example, all the major DBMS vendors now provide GUI interfaces to help database administrators create database structures, control database access, and monitor database operations.
3. People. This component includes all users of the database system. On the basis of primary job functions, we can identify five types of users in a database system: systems administrators, database administrators, database designers, systems analysts/programmers, and end users. The members of each user type perform both unique and complementary functions:
* Systems administrators oversee the database system’s general operations.
* Database administrators, also known as DBAs, manage the DBMS’s use and ensure that the database is functioning properly. The DBA’s role is sufficiently important to warrant a detailed exploration in Chapter 15, “Database Administration.”
* Database designers design the database structure. They are, in effect, the database architects. If the database design is poor, even the best application programmers and the most dedicated DBAs cannot produce a useful database environment. To use an old analogy, the finest bricklayers and carpenters cannot produce a good building from a bad blueprint. Because organizations strive to optimize their data resources, the database designer’s job description has expanded to cover new dimensions and growing responsibilities.
* Systems analysts and programmers design and implement the application programs. They design and create the data entry screens, reports, and procedures through which end users access and manipulate the database’s data.
* End users are the people who use the application programs to run the organization’s daily operations. For example, sales clerks, supervisors, managers, and directors are all classified as end users. High-level end users employ the information obtained from the database to make tactical and strategic business decisions.
4. Procedures. Procedures are the instructions and rules that govern the design and use of the database system. Procedures are a critical, although occasionally forgotten, component of the system. Procedures play an important role in a company, because they enforce the standards by which business is conducted within the organization and with customers. Procedures also are used to ensure that there is an organized way to monitor and audit both the data that enter the database and the information that is generated through the use of such data.
5. Data. The word “data” covers the collection of facts stored in the database. Because data are the raw material from which information is generated, the determination of which data are to be entered into the database and how such data are to be organized is a vital part of the database designer’s job.

The existence of a database system adds a new dimension to an organization’s management structure. Just how complex this managerial structure is depends on the organization’s size, its functions, and its corporate culture. Therefore, database systems can be created and managed at quite different levels of complexity and with widely varying adherence to precise standards. For example, compare a local movie rental system with a national insurance claims system. The movie rental system might be managed by two people, the hardware used is probably a single microcomputer, the procedures are probably simple, and the data volume will tend to be low. The national insurance claims system is likely to have at least one systems administrator, several full-time DBAs, and many designers and programmers; the hardware probably includes several mainframes at multiple locations throughout the United States; the procedures are likely to be numerous, complex, and rigorous; and the data volume will tend to be very high.

In addition to the fact that different levels of database system complexity are dictated by the organizational activities and the environment within which those activities take place, managers must also take another important fact into account: database solutions must be cost-effective as well as tactically and strategically effective. Producing a million-dollar solution to a thousand-dollar problem is hardly an example of good database system selection or of good database design and management. Finally, the database technology already in use is likely to affect the selection of a database system.
1.6.2 DBMS Functions

A DBMS performs several important functions that guarantee the integrity and consistency of the data in the database. Most of these functions are transparent to end users, and most can be achieved only through the use of a DBMS. They include data dictionary management, data storage management, data transformation and presentation, security management, multiuser access control, backup and recovery management, data integrity management, database access languages and application programming interfaces, and database communication interfaces.

1. Data dictionary management. The DBMS stores the definitions of the data elements and their relationships (metadata) in a data dictionary. In turn, all programs that access the data in the database work through the DBMS. The DBMS uses the data dictionary to look up the required data component structures and relationships, thus relieving us from having to code such complex relationships in each program. Additionally, any changes made in a database structure are automatically recorded in the data dictionary, thereby freeing us from having to modify all the programs that access the changed structure. In other words, the DBMS provides data abstraction, and it removes structural and data dependency from the system. For example, Figure 1.8 shows an example of how Microsoft Access presents the data definition for the CUSTOMER table. Note the definition of the field properties for the CUS_RENEW_DATE.

Figure 1.8 Illustrating Metadata with Microsoft Access

[Click to enlarge]

2. Data storage management. The DBMS creates and manages the complex structures required for data storage, thus relieving us from the difficult task of defining and programming the physical data characteristics. A modern DBMS system provides storage not only for the data, but also for related data entry forms or screen definitions, report definitions, data validation rules, procedural code, structures to handle video and picture formats, and so on. Data storage management is also important for database performance tuning. Performance tuning relates to the activities that make the database perform more efficiently in terms of storage and access speed. Although the user sees the database as a single data storage unit, the DBMS actually stores the database in multiple physical data files. (See Figure 1.9.) Such datafiles may even be stored on different storage media. Therefore the DBMS doesn’t have to wait for one disk request to finish before the next one starts. In other words, the DBMS can fulfill database requests concurrently.
3. Data transformation and presentation. The DBMS transforms entered data to conform to the data structures that are required to store the data. Therefore, the DBMS relieves us of the chore of making a distinction between the data logical format and the data physical format. By maintaining data independence, the DBMS translates logical requests into commands that physically locate and retrieve the requested data. That is, the DBMS formats the physically retrieved data to make it conform to the user’s logical expectations. In other words, a DBMS provides application programs with software independence and data abstraction. For example, imagine a enterprise database used by a multinational company. An end user in England would expect to enter data such as July 11, 2004 as “11/07/2004”. In contrast, the same date would be entered in the United States as “07/11/2004”. Regardless of the data presentation format, the DBMS must manage the date in the proper format for each country.

Figure 1.9 Illustrating Data Storage Management with Oracle

[Click to enlarge]

4. Security management. The DBMS creates a security system that enforces user security and data privacy within the database. Security rules determine which users can access the database, which data items each user may access, and which data operations (read, add, delete, or modify) the user may perform. This is especially important in multiuser database systems where many users can access the database simultaneously. Chapter 15, “Database Administration,” examines data security and privacy issues in greater detail. All database users are authenticated to the DBMS through the use of a user name and a password. The DBMS then uses this information to assign access priveleges to various database components such as queries and reports.
5. Multiuser access control. The DBMS creates the complex structures that allow multiple users to access the data. In order to provide data integrity and data consistency, the DBMS uses sophisticated algorithms to ensure that multiple users can access the database concurrently without compromising the integrity of the database. Chapter 9, “Transaction Management and Concurrency Control,” covers the details of the multiuser access control.
6. Backup and recovery management. The DBMS provides backup and data recovery procedures to ensure data safety and integrity. Current DBMS systems provide special utilities that allow the DBA to perform routine and special backup and restore procedures. Recovery management deals with the recovery of the database after a failure, such as a bad sector in the disk or a power failure. Such capability is critical to the preservation of the database’s integrity. Chapter 15 covers backup and recovery issues.
7. Data integrity management. The DBMS promotes and enforces integrity rules to eliminate data integrity problems, thus minimizing data redundancy and maximizing data consistency. The data relationships stored in the data dictionary are used to enforce data integrity. Ensuring data integrity is especially important in transaction-oriented database systems. Transaction and data integrity issues are addressed in Chapter 9.
8. Database access languages and application programming interfaces. The DBMS provides data access through a query language. A query language is a nonprocedural language—one that lets the user specify what must be done without having to specify how it is to be done. The DBMS’s query language contains two components: a data definition language (DDL) and a data manipulation language (DML). The DDL defines the structures in which the data are housed, and the DML allows end users to extract the data from the database. The DBMS also provides data access to programmers via procedural (3GL) languages such as COBOL, C, PASCAL, Visual Basic, and others. The DBMS also provides administrative utilities used by the DBA and the database designer to create, implement, monitor, and maintain the database.
9. Database communication interfaces. Current-generation DBMSs provide communications interfaces designed to allow the database to accept end-user requests within a computer network environment. For example, the DBMS might provide communications functions to access the database through the Internet, using Web browsers such as Netscape Navigator or Internet Explorer. In this environment, communications can be accomplished in several ways:
* End users can generate answers to queries by filling in screen forms through their preferred Web browser.
* The DBMS can automatically publish predefined reports on the Internet, using a Web format that enables any Web user to browse it.
* The DBMS can connect to third-party systems to distribute information via e-mail or other productivity applications such as Lotus Notes.

We examine database communication interfaces in greater detail in Chapter 10, “Distributed Database Management Systems,” Appendix F, “Client/Server Systems,” and especially in Chapters 13 and 14, “Databases in Electronic Commerce” and “Web Database Development.”
1.6.3 Managing the Database System: a Shift in Focus

As you can see, the DBMS handles many of the complex data management chores handled by the file system’s DP specialist, and it performs them with far greater finesse and sophistication. Better yet, the DBMS allows those chores to be performed without the tedious and time-consuming programming required in the file management system. In addition, the DBMS provides a framework in which strict procedures and standards can be enforced. Consequently, the role of the DP specialist or the DP manager changes from an emphasis on programming to a focus on the broader aspects of managing the organization’s data resources and on the administration of the complex database software itself.

Because the file system’s DP manager performs broader managerial functions in a database environment, (s)he might be promoted to systems administrator. Initially, while the database environment is simple, the systems administrator performs both managerial and hands-on database administration tasks. As the database environment expands, the system administrator usually appoints one or more database administrators who evaluate database designs, maintain the database software, assign the right to use the database(s) to selected individuals, and coordinate the development of applications based on the data resources. If the data resources require the use of more than one database, the systems administrator may appoint a database administrator to administer each of the databases.
1.6.4 Database Design and Modeling

The availability of a DBMS makes it possible to tackle far more sophisticated uses of the data resources, if the database is designed to make use of that available power. The kinds of data structures created within the database and the extent of the relationships among them play a powerful role in determining how effective the DBMS is. Therefore, database design becomes a crucial activity in the database environment.

You will learn in the next chapter that database design is made much simpler when you use models, which are simplified abstractions of real-world events or conditions. For example, such abstractions will enable us to explore the characteristics of entities and the relationships that can be created among such entities. If the models are not logically sound, the database designs derived from them will not deliver the database system’s promise of effective information drawn from an efficient database. We will, therefore, endeavor to develop good modeling techniques in this book. Good models yield good database designs that are the basis for good applications. Conversely, you cannot expect to build good applications with bad database design based on poor models.
Summary

Information is derived from data, which are usually stored in a database. To implement a database and to manage its contents you need commercial software known as a database management system (DBMS). Database design defines the database structure; the DBMS stores the facts about the structure in the database itself. The database thus contains the data you have collected and “data about data” known as metadata. Good database design is important because even a good DBMS will perform poorly with a poorly designed database.

Databases were preceded by file systems. Because file systems lack a DBMS, file management becomes difficult as the file system grows. Each file requires its own set of basic data management programs, and, because files are usually “owned” by those who commissioned them, the number of files tends to grow. Many of the files often contain redundant data, thus leading to data inconsistency, data anomalies, and a lack of data integrity. Because each file can be used by many application programs, a mature file system might have generated hundreds or even thousands of programs. Serious file system data management problems usually stem from data dependency—access to any file is dependent on data characteristic and storage formats; therefore, even a minor change in a data structure within a file requires that all the programs accessing that file must be modified, too.

Database management systems were developed to address the file system’s inherent weaknesses. Rather than depositing data within independent files, a DBMS presents the database to the end user as a single data repository. This arrangement promotes data sharing, thus at least potentially eliminating the “islands of information” problem. In addition, the DBMS enforces data integrity, eliminates redundancy, and promotes data security.

The most effective database designs use models, simplified abstractions of real-world events or conditions.
Key Terms

* ad hoc query
* centralized database
* data
* data anomaly
* data definition language (DDL)
* data dependence
* data dictionary
* data inconsistency
* data independence
* data integrity
* data management
* data manipulation language (DML)
* data processing (DP) manager
* data processing (DP) specialist
* data redundancy
* data warehouse database
* database
* database administrator (DBA)
* database design
* database management system
* (DBMS)
* database system
* desktop database
* distributed database
* enterprise database
* field
* file
* fourth-generation language (4GL)
* information
* islands of information
* logical data format
* logical design
* metadata
* multiuser database
* performance tuning
* physical data format
* production database
* query
* query language
* record
* redundant data
* single-user database
* structural dependence
* structural independence
* systems administrator
* third-generation language (3GL)
* transactional database
* workgroup database

Review Questions

1. Discuss each of the following terms:
1. data
2. field
3. record
4. file
2. What is data redundancy, and which characteristics of the file system can lead to it?
3. Discuss the lack of data independence in the file systems.
4. What is a DBMS, and what are its functions?
5. What is data independence, and why is it important?
6. Explain the difference between data and information.
7. Explain what data inconsistency is and why it occurs.
8. Discuss the different types of databases.
9. What are the main components of a database system environment?
10. What is metadata?
11. Explain why database design is important.
12. Discuss the differences between transaction databases and data warehouse databases.

Problems
ADDITIONAL CD CONTENT

The file structures you see in this problem set are simulated in a Microsoft Access database named Ch01_Problems. This database is located in the \Databases\Student\Ch01 folder located on the student CD that accompanies this book.

Given the file structure shown in Figure P1.1, answer Problems 1 through 4.

Figure P1.1 The File Structure for Problems 1–4

[Click to enlarge]

1.

How many records does the file contain, and how many fields are there per record?
2.

What problem would you encounter if you wanted to produce a listing by city? How would you solve this problem by altering the file structure?
3.

If you wanted to produce a listing of the file contents by last name, area code, city, state, or zip code, how would you alter the file structure?
4.

What data redundancies do you detect, and how could these redundancies lead to anomalies?
5.

Identify and discuss the serious data redundancy problems exhibited by the file structure shown in Figure P1.5.

Figure P1.5 The File Structure for Problems 5–8

[Click to enlarge]

6.

Looking at the EMP_NAME and EMP_PHONE contents in Figure P1.5, what change(s) would you recommend?
7.

Identify the different data sources in the file you examined in Problem 5.

Given your answer to Problem 7, what new files should you create to help eliminate the data redundancies found in the file shown in Figure P1.5?
8.

Identify and discuss the serious data redundancy problems exhibited by the file structure shown in Figure P1.9. (The file is meant to be used as a teacher class assignment schedule. One of the many problems with data redundancy is the likely occurrence of data inconsistencies—note that two different initials have been entered for the teacher named Maria Cordoza.)

Figure P1.9 The File Structure for Problems 9–10

[Click to enlarge]

9.

Given the file structure shown in P1.9, what problem(s) might you encounter if building KOM were deleted?

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License