Denormalization

Source: Wikipedia: Denormalization


Denormalization

From Wikipedia, the free encyclopedia
Jump to:navigation, search
This article does not cite any references or sources.
Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (May 2008)

Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data[1][2]. In some cases, denormalization helps cover up the inefficiencies inherent in relational database software. A relational normalized database imposes a heavy access load over physical storage of data even if it is well tuned for high performance.

A normalized design will often store different but related pieces of information in separate logical tables (called relations). If these relations are stored physically as separate disk files, completing a database query that draws information from several relations (a join operation) can be slow. If many relations are joined, it may be prohibitively slow. There are two strategies for dealing with this. The preferred method is to keep the logical design normalized, but allow the database management system (DBMS) to store additional redundant information on disk to optimize query response. In this case it is the DBMS software's responsibility to ensure that any redundant copies are kept consistent. This method is often implemented in SQL as indexed views (Microsoft SQL Server) or materialized views (Oracle). A view represents information in a format convenient for querying, and the index ensures that queries against the view are optimized.

The more usual approach is to denormalize the logical data design. With care this can achieve a similar improvement in query response, but at a cost—it is now the database designer's responsibility to ensure that the denormalized database does not become inconsistent. This is done by creating rules in the database called constraints, that specify how the redundant copies of information must be kept synchronized. It is the increase in logical complexity of the database design and the added complexity of the additional constraints that make this approach hazardous. Moreover, constraints introduce a trade-off, speeding up reads (SELECT in SQL) while slowing down writes (INSERT, UPDATE, and DELETE). This means a denormalized database under heavy write load may actually offer worse performance than its functionally equivalent normalized counterpart.

A denormalized data model is not the same as a data model that has not been normalized, and denormalization should only take place after a satisfactory level of normalization has taken place and that any required constraints and/or rules have been created to deal with the inherent anomalies in the design. For example, all the relations are in third normal form and any relations with join and multi-valued dependencies are handled appropriately.

Examples of denormalization techniques include:

* materialized views, which may implement the following:
o storing the count of the "many" objects in a one-to-many relationship as an attribute of the "one" relation
o adding attributes to a relation from another relation with which it will be joined
* star schemas which are also known as fact-dimension models and have been extended to snowflake schemas
* prebuilt summarization or OLAP cubes

Denormalization techniques are often used to improve the scalability of Web applications[3].
[edit] References

1. ^ G. L. Sanders and S. K. Shin. Denormalization effects on performance of RDBMS. In Proceedings of the HICSS Conference, January 2001.
2. ^ S. K. Shin and G. L. Sanders. Denormalization strategies for data retrieval from data warehouses. Decision Support Systems, 42(1):267-282, October 2006.
3. ^ Z. Wei, J. Dejun, G. Pierre, C.-H. Chi and M. van Steen. Service-Oriented Data Denormalization for Scalable Web Applications. In Proceedings of the International World-Wide Web conference, April 2008.

[edit] See also

* Cache
* Scalability

[hide]
v • d • e
Topics in Database normalization
First normal form · Second normal form · Third normal form · Boyce-Codd normal form · Fourth normal form · Fifth normal form · Domain/key normal form · Sixth normal form
Denormalization
Retrieved from "http://en.wikipedia.org/wiki/Denormalization"
Categories: Database normalization


Source: [http://www.siue.edu/~dbock/cmis564/denormal.htm


Douglas B. Bock and John F. Schrage, Department of Computer Management and Information Systems, Southern Illinois University Edwardsville, published in the 1996 Proceedings of the Decision Sciences Institute, Orlando, Florida, November, 1996

BENEFITS OF DENORMALIZED RELATIONAL DATABASE TABLES

ABSTRACT

Heuristics for denormalizing relational database tables are examined with an objective of improving processing performance for data insertions, deletions and selection. Client-server applications necessitate consideration of denormalized database schemas as a means of achieving good system performance where the client-server interface is graphical (GUI) and the network capacity is limited by the network channel.

INTRODUCTION

Relational database table design efforts encompass both the conceptual and physical modeling levels of the three-schema architecture. Conceptual diagrams, either entity-relationship or object-oriented, are a precursor to designing relational table structures. CASE tools will generate relational database tables at least to the third normal form (3NF) based on conceptual models, but have not advanced to the point that they produce table structures that guarantee acceptable levels of system processing performance.

As firms move away from the mainframe toward cheaper client-server platforms, managers face a different set of issues in the development of information systems. One critical issue is the need to continue to provide a satisfactory level of system performance, usually reflected by system response time, for mission-critical, on-line, transaction processing systems.

A fully normalized database schema can fail to provide adequate system response time due to excessive table join operations. It is difficult to find formal guidance in the literature that outlines approaches to denormalizing a database schema, also termed usage analysis. This paper focuses on identifying heuristics or rules of thumb that designers may use when designing a relational schema.

Denormalization must balance the need for good system response time with the need to maintain data, while avoiding the various anomalies or problems associated with denormalized table structures. Denormalization goes hand-in-hand with the detailed analysis of critical transactions through view analysis. View analysis must include the specification of primary and secondary access paths for tables that comprise end-user views of the database. Additionally, denormalization should only be attempted under conditions that allow designers to collect detailed performance measurements for comparison with system performance requirements [1]. Further, designers should only denormalize during the physical design phase and never during conceptual modeling.

Relational database theory provides guidelines for achieving an idealized representation of data and data relationships. Conversely, client-server systems require physical database deign measures that optimize performance for specific target systems under less than ideal conditions [1]. The final database schema must be adjusted for characteristics of the environment such as hardware, software, and other constraints.

We recommend following three general guidelines to denormalization [1]. First, perform a detailed view analysis in order to identify situations where an excessive number of table joins appears to be required to produce a specific end-user view. While no hard rule exists for defining "excessive," any view requiring more than three joins should be considered as a candidate for denormalization. Beyond this, system performance testing by simulating the production environment is necessary to prove the need for denormalization.

Second, the designer should attempt to reduce the number of foreign keys in order to reduce index maintenance during insertions and deletions. Reducing foreign keys is closely related to reducing the number of relational tables.

Third, the ease of data maintenance provided by normalized table structures must also be provided by the denormalized schema. Thus, a satisfactory approach would not require excessive programming code (triggers) to maintain data integrity and consistency.

DENORMALIZING FIRST NORMAL FORM (1NF) TO UNNORMALIZED TABLES

There are more published formal guidelines for denormalizing 1NF than for any other normal form. This stems from the fact that repeating fields occur fairly often in business information systems; therefore, designers and relational database theoreticians have been forced to come to grips with the issue. It is helpful to examine an example problem for those new to the concept of normalization and denormalization. Consider a situation where a customer has several telephone numbers that must be tracked. The third normal form (3NF) solution and the denormalized table structure is given below with telephone number (Phone1, Phone2, Phone3, …) as a repeating field:

3NF:

CUSTOMER (CustomerId, CustomerName,…)

CUST_PHONE (CustomerId, Phone)

Denormalized:

CUSTOMER (CustomerId, CustomerName, Phone1, Phone2, Phone3, …)

Which approach is superior? The answer is that it depends on the processing and coding steps needed to store and retrieve the data in these tables. In the denormalized solution, code must be written (or a screen designed) to enable any new telephone number to be stored in any of the three Phone fields. Clearly this is not a difficult task and can be easily accomplished with modern CASE tools; still, this denormalized solution is usually only appropriate if one can guarantee that a customer will have a limited finite number of telephone numbers, or if the firm makes a management decision not store more than "X" number of telephone numbers.

Both solutions provide good data retrieval of telephone numbers as indicated by the following SQL statements. The denormalized solution requires a simpler, smaller index structure for the CustomerId key field. The normalized solution would require at least two indices for the Cust_Phone table - one on the composite key to ensure uniqueness integrity, and one on the CustomerId field to provide a fast primary access path to records.

Select * from Cust_Phone where CustomerId = '3344';

Select * from Customer where CustomerId = '3344';

If the customer name is also required in the query, then the denormalized solution is superior as no table joins are involved.

Select * from Cust_Phone CP, Customer C where CP.CustomerId = '3344' and CP.CustomerId = C.CustomerId;

The effort required to extract telephone numbers for a specific customer based on the customer name is also more difficult for the 3NF solution as a table join is required.

Select * from Cust_Phone CP, Customer C where C.CustomerName = 'Tom Thumb' and CP.CustomerId = C.CustomerId;

If the telephone number fields represent different types of telephones, for example, a voice line, a dedicated fax line, and a dedicated modem line, then the appropriate table structures are:

3NF:

CUSTOMER (CustomerId, CustomerName, … )

CUST_PHONE (CustomerId, Phone)

PHONE (Phone, PhoneType)

Denormalized:

CUSTOMER (CustomerId, CustomerName, VoicePhone, FaxPhone, ModemPhone, …)

Again the denormalized solution is simpler for data storage and retrieval and, as before, the only factor favoring a 3NF solution is the number of potential telephone numbers that an individual customer may have. The Phone table would be at least 30 to 50 percent the size of the Cust_Phone table. The decision to denormalize is most crucial when the Customer table is large in a client-server environment; for example, one hundred thousand customers, each having two or three telephone numbers. The join of Customer, Cust_phone, and Phone may be prohibitive in terms of processing efficiency.

DENORMALIZING TO SECOND NORMAL FORM (2NF) TO 1NF

The well-known order entry modeling problem involving Customers, Orders, and Items provides a realistic situation where denormalization is possible without significant data maintenance anomalies. Consider the following 3NF table structures.

3NF:

CUSTOMER (CustomerId, CustomerName,…)

ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId)

ORDERLINE (OrderId, ItemId, QtyOrdered, OrderPrice)

ITEM (ItemId, ItemDescription, CurrentPrice)

The many-to-many relationship between the Order and Item entities represents a business transaction that occurs repeatedly over time. Further, such transactions, once complete, are essentially "written in stone" since the transaction records would never be deleted. Even if an order is partially or fully deleted at the request of the customer, other tables not shown above will be used to record the deletion of an item or an order as a separate business transaction for archive purposes.

The OrderPrice field in the Orderline table represents the storage of data that is time-sensitive. The OrderPrice is stored in the Orderline table because this monetary value may differ from the value in the CurrentPrice field of the Item table since prices may change over time. While the OrderPrice field is functionally determined by the combination of ItemId and OrderDate, storing the OrderDate field in the Orderline table is not necessary, since the OrderPrice is recorded at the time that the transaction takes place. Therefore, while Orderline is not technically in 3NF, most designers would consider the above solution 3NF for all practical purposes. A true 3NF alternative solution would store price data in an ItemPriceHistory table, but such a solution is not central to the discussion of denormalization.

In the proposed denormalized 1NF solution shown below (the Customer and Order tables remain unchanged) the Item.ItemDescription field is duplicated in the Orderline table. This solution violates second normal form (2NF) since the ItemDescription field is fully determined by ItemId and is not determined by the full key (OrderId + ItemId). Again, this denormalized solution must be evaluated for the potential effect on data storage and retrieval.

Denormalized 1NF:

ORDERLINE (OrderId, ItemId, QtyOrdered, OrderPrice, ItemDescription)

ITEM (ItemId, ItemDescription, CurrentPrice)

Orderline records for a given order are added to the appropriate tables at the time that an order is made. Since the OrderPrice for a given item is retrieved from the CurrentPrice field of the Item table at the time that the sale takes place, the additional processing required to retrieve and store the ItemDescription value in the Orderline table is negligible.

The additional expense of storing Orderline.ItemDescription must be weighed against the processing required to produce various end-user views of these data tables. Consider the production of a printed invoice. The 3NF solution requires joining four tables to produce an invoice view of the data. The denormalized 1NF solution requires only the Customer, Order, and Orderline tables. Clearly, the Order table will be accessed by an indexed search based on the OrderId field. Similarly, the retrieval of records from the Orderline and Item tables may also be via indexes; still, the savings in processing time may offset the cost of extra storage.

An additional issue concerns the storage of consistent data values for the ItemDescription field in the Orderline and Item tables. Suppose, for example, an item description of a record in the Item table is changed from "Table" to "Table, Mahogany." Is there a need to also update corresponding records in the denormalized Orderline table? Clearly an argument can be made that these kinds of data maintenance transactions are unnecessary since the maintenance of the non-key ItemDescription data field in the Orderline table is not critical to processing the order.

DENORMALIZING THIRD NORMAL FORM (3NF) TO 2NF

An example for denormalizing from a 3NF to a 2NF solution can be found by extending the above example to include data for salespersons. The relationship between the Salesperson and Order entities is one-to-many (many orders can be processed by a single salesperson, but an order is normally associated with one and only one salesperson). The 3NF solution for the Salesperson and Order tables is given below, along with a denormalized 2NF solution.

3NF:

SALESPERSON (SalespersonId, SalespersonName,…)

ORDER (OrderId, OrderDate, DeliveryDate, Amount, SalespersonId)

Denormalized 2NF:

SALESPERSON (SalespersonId, SalespersonName,…)

ORDER (OrderId, OrderDate, DeliveryDate, Amount, SalespersonId, SalespersonName)

Note that the SalespersonId in the Order table is a foreign key linking the Order and Salesperson tables. Denormalizing the table structures by duplicating the SalespersonName in the Order table results in a solution that is 2NF because the non-key SalespersonName field is determined by the non-key Salesperson field. What is the effect of this denormalized solution?

By using view analysis for a typical printed invoice or order form, we may discover that most end-user views require printing of a salesperson's name, not their identification number on order invoices and order forms. Thus, the 3NF solution requires joining the Salesperson and Order two tables, in addition to the Customer and Orderline tables from the denormalized 1NF solution given in the preceding section, in order to meet processing requirements.

As before, a comparison of the 3NF solution and the denormalized 2NF solution reveals that the salesperson name could easily be recorded to the denormalized Order table at the time that the order transaction takes place. Once a sales transaction takes place, the probability of changing the salesperson credited with making the sales is very unlikely.

One should also question the need to maintain the consistency of data between the Order and Salesperson tables. In this situation, we find that the requirement to support name changes for salespeople is very small, and only occurs, for the most part, when a salesperson changes names due to marriage. Furthermore, the necessity to update the Order table in such a situation is a decision for management to make. An entirely conceivable notion is that such data maintenance activities may be ignored, since the important issue for salesperson's usually revolves around whether or not they get paid their commission, and the denormalized 2NF solution supports payroll activities as well as the production of standard end-user views of the database.

DENORMALIZING HIGHER NORMAL FORMS

The concept of denormalizing also applies to the higher order normal forms (fourth normal form - 4NF or fifth normal form - 5NF), but occurrences of the application of denormalization in these situations are rare. Recall that denormalization is used to improve processing efficiency, but should not be used where there is the risk of incurring excessive data maintenance problems. Denormalizing from 4NF to a lower normal form would almost always lead to excessive data maintenance problems. By definition, we normalize to 4NF to avoid the problem of having to add multiple records to a single table as a result of a single transaction. Data anomalies associated with 4NF violations only tend to arise when sets of binary relationships between three entities have been incorrectly modeled as a ternary relationship. The resulting 4NF solution, when modeled in the form of an E-R diagram usually results in two binary one-to-many relationships. If denormalization offers the promise of improving performance among the entities that are paired in these binary relationships, then the guidance given earlier under each of the individual 1NF, 2NF, and 3NF sections applies; thus, denormalization with 4NF would not require new heuristics.

While denormalization may also be used in 5NF modeling situations, the tables that result from the application of 5NF principles are rarely candidates for denormalization. This is because the number of tables required for data storage have already been minimized. In essence, the 5NF modeling problem is the mirror-image of the 4NF problem. A 5NF anomaly only arises when a database designer has modeled what should be a ternary relationship as a set of two or more binary relationships.

SUMMARY

This article has described situations where denormalization can lead to improved processing efficiency. The objective is to improve system response time without incurring a prohibitive amount of additional data maintenance requirements. This is especially important for client-server systems. Denormalization requires thorough system testing to prove the effect that denormalized table structures have on processing efficiency. Furthermore, unseen ad hoc data queries may be adversely affected by denormalized table structures. Denormalization must be accomplished in conjunction with a detailed analysis of the tables required to support various end-user views of the database. This analysis must include the identification of primary and secondary access paths to data.

Additional consideration may be given table partitioning that goes beyond the issues that surround table normalization. Horizontal table partitioning may improve performance by minimizing the number of rows involved in table joins. Vertical partitioning may improve performance by minimizing the size of rows involved in table joints. A detailed discussion of table partititioning my be found elsewhere [1].

REFERENCES

[1] Physical Database Design for Sybase SQL Server by Rob Gillete, Dean Muench, and Jean Tabaka, Prentice-Hall Publishing, Englewood Cliffs, NJ, 1995.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License