View Single Post
Old 03-29-2005, 01:01 PM
loquin's Avatar
loquin loquin is offline
Google Hound

Retired Moderator
* Guru *
Join Date: Nov 2001
Location: Arizona, USA
Posts: 12,400
Default Just how normal ARE you ?

“Normal is as Normal does, Forest.”

That’s what Mama always said. Of course she was always talking ‘bout chocolates, too.

One of the things you will often hear, at least when discussing databases, is normalization. Just what in the heck is normalization, anyway?

Normalization refers to the level to which redundant data has been eliminated from a relational database. Generally, the higher the level of normalization, the more redundant data has been eliminated. Changes in normalization are achieved by changing the design of the relational model of your database.

As we’ve discussed in earlier installments, time spent planning is time well spent. In the case of data normalization, time spent in ensuring that your data is normalized can have great benefit when storing, modifying, or retrieving data in your database, as well as reducing time and costs to implement changes in the database design later.

First Normal Form (1NF)

One definition of First Normal Form (1NF) states “All column values must be atomic.” This means that any one field in the table can only hold one value. Suppose you had an order table where the ITEMS field held “2 Hammer, 3 Chisel, 1 Saw”. This table is NOT in 1NF, because the ITEMS field is not atomic – it does not hold just one value. If you ever tried to build a report to summarize the parts and quantities sold, you would have a hard time with this table structure. First normal form also means that you can’t have repeating groups (or arrays) of fields in a single table. For instance, suppose you tried to fix your earlier order table by dropping the ITEMS field and replacing it with ITEM1, DESCR1, ITEM2, DESCR2, and ITEM3, DESCR3 fields. Since the first customer to order 4 items could break this table, the first problem with this design is obvious. Even though you could keep adding fields up to the maximum probable order size, you would still be artificially limited as to the maximum number of items in an order, you would waste a lot of space, and it would still be difficult to summarize your sales records.

In order to move a table to 1NF, you will need to break up the multi-value fields and remove the repeating groups. This is done by taking the information that was stored horizontally, and instead, store it vertically. In the above case, you would have a single ITEM and DESCRIPTION field, and add an OrderItem field. Our first order above, which consisted of three items, would then have a total of three records to store the data about these three items.

Second Normal Form (2NF)

A table is said to be in Second Normal Form (2NF) if “it is in 1NF and every non-key column is fully dependent on the entire primary key.” This means that tables should only be storing information pertaining to the one entity that is described by the primary key.

As an example, suppose we’ve expanded our tblOrders table a bit, so that the fields are now OrderID, OrderItem, CustomerID, OrderDate, Quantity, ProductID, and ProductDescription. Will this table definition meet 2NF rules? Well, the answer to this question falls back to the definition of 2NF: Is every other field in the table fully dependent on the primary key (OrderID and OrderItem?) Fully dependent means that you can only determine the value of a field if you know the value of the primary key. In this case, the answer is NO. The CustomerID and Order date are dependent on the OrderID only, and not on the combination of OrderID and OrderItem. It is very easy to spot this problem by looking at the sorted data from the table. Note that OrderID and CustomerID are always the same for a given order, no matter how many items are in the order.

This situation is not a good one, since the user is forced to enter lots of redundant information; The customer number and the order date are stored with every order detail record, resulting in higher data storage requirements (and thus expense) and in additional opportunities for erroneous entry. We can resolve this issue simply, by breaking the table into two tables; tblOrder (OrderID, CustomerID, and OrderDate,) and tblOrderDetail (OrderID,OrderItem, Quantity, ProductID, and ProductDescription.)

The table tblOrder describes the entire order; it’s primary key is OrderID. The table tblOrderDetail describes only the order items; it’s primary key is a composite of OrderID and OrderItem. Note: The act of breaking a table down into their normalized forms is called decomposition. No information was lost when a table is normalized in this fashion, as we can always reconstruct the original table by running a query against the two tables created by normalization, or even better, by creating a view if using a database server.

Third Normal Form (3NF)

The definition of a table said to be in Third Normal Form (3NF) is that “it is in 2NF and if all non-key columns are mutually independent.” This means that all non-key fields must not have inter-field dependencies. They must be fully dependent upon just the primary key, and not on each other.

If we look at the tblOrderDetail we created in the 2NF discussion, it appears to have an interdependency between ProductID and ProductDescription. In other words, wherever you have a particular value of ProductID, you would see an identical value of ProductDescription. This table design is not an optimum one for several reasons. First, you are storing information about two “things” (order detail information, as well as product information) in one table. Every time a user wishes to refer to a screwdriver, for instance, he/she would have to enter both its code and its description. This is both inefficient, and error prone. Second, if you decided that you needed to change the description of a product, you would need to change it everywhere it is used in the entire orders table. The third, and potentially, the most dangerous problem though, is if you were to delete all detail records that included a particular ProductID and Description. In this case, you would lose all reference to this part, as it is no longer stored in the database.

The Third Normal Form also forbids the creation of tables that contain calculated fields. The reason for this is that they waste space, and they can easily get out of sync with the source data. Normally, it’s better to generate calculated values only when you need them – within queries and reports, for instance.

In order to move from 2NF to 3NF, just break out another table. In the tblOrderDetail case, add a table called tblProduct which contains a primary key of ProductID, and which also stores specific information to the product itself (Description, Units of measure, minimum stocking quantities, etc.) Then, in table tblOrderDetail, just store a foreign key reference to the tblProduct table.

Codd only presented three normal forms, but since, others have shown that further normalizations levels can exist. In complicated databases, these higher levels of normalization may need to be considered, but they will not be addressed in this tutorial.

Note that, while normalization is generally a good thing, there are times that you may need to denormalize your data – that is, to not follow a strict adherence to normal forms. However, when doing so, you should always start from a strictly normalized database, and then denormalize in order to make your database work better in the real world. The usual reason for denormalizing a database is to improve performance. As an example, suppose that you are developing a crossword puzzle dictionary app. In a crossword, you generally need to know the length of the word you are searching for. Even though SQL offers a length function to calculate the length of a word in a query, if you need to perform this calculation hundreds of thousands of times for each and every search, it would probably be best to intentionally denormalize your database by adding a calculated field (WordLength.)

There are tradeoffs in denormalization, though. When you store a calculated field in your table you must take the responsibility to ensure that the data remains up-to-date. At any point that users may be able to add or edit table data that impacts the calculated field, you must ensure that the calculated data is correct. Otherwise, if the users makes a change, not only is your data denormalized, it is also WRONG. And don’t assume that denormalization will always result in a performance improvement. Even though the performance may be improved in one area (typically a select query,) it will certainly be reduced whenever you update the data.

As a general rule, normalize your database whenever you can, and denormalize only when you must. If you decide that you must denormalize, follow these rules:
  • Start with a normalized structure.
  • Denormalize deliberately, and not by accident.
  • Have a good reason for denormalizing.
  • Be fully aware of the tradeoffs that are involved.
  • Document the deviation. Thoroughly.
  • Make sure to create necessary application adjustments to ensure data integrity.
If you do become involved in the modeling of a complex, real-world application, database design can become messy. In this case, you may wish to consider using tools, like Asymetrix’s InfoModeler, or Logicworks’ Erwin, or for simpler applications, MS Visio. In order to use them effectively, you should already understand the relational model.

Now, 'bout those chocolates...

Note: Ref Paul Litwin's Fundamentals of Relational Database Design - I did. Also, Wikipedia has a nice article dealing with normalization.
"I have my standards. They may be low, but I have them!" ~ Bette Middler
"It's a book about a Spanish guy called Manual. You should read it." ~ Dilbert
"To understand recursion, you must first understand recursion." ~ unknown

Last edited by loquin; 04-25-2007 at 02:27 PM.
Reply With Quote