Normalizing is Abnormal

 

By Jeff Fowler, President, Decision Software


If I had to choose the single most common mistake I've seen in marketing database design, it would be over normalization; or, as we dummies say: too many tables.  Generally speaking, the more tables there are in a database, the more difficult it becomes for marketing people to ask marketing questions.  And until the day arrives when we can talk to our database and get accurate answers (something my wife claims she can't even do with me), we're stuck with having to know just a little about how it works. 


Why be normal? 


Database administrators like to normalize because that's what good DBAs do.  There are formal rules governing relational database design, and breaking one is as inconceivable to them as watching a sporting event without drinking beer is to me.  So why do these maddeningly-precise DBAs like taking a perfectly good file and splitting it up into different tables?  For two very good reasons: to keep your database clean and accurate, and to save storage space. 


Saving space sounds like a good idea.  Suppose you have a table holding all of the items purchased by your customers.  Would it make sense to include a customer's name and address in a table each time she bought something from us?  Most print shops use 30-character (or byte) fields for the name and two address lines, and another 20 bytes for the city.  If a customer buys a hundred items, we're potentially looking at wasting over 10,000 bytes.  With a million customers we're looking at using ten gigabytes to store something that could have fit into around a hundred megs.  Ouch. 


Keeping your database clean is also a good thing to do.  In our example above, if a customer changes her address, do we want to fix it a hundred times, or only once?  The answer is obvious even to us dummies.  These issues and others are precisely why a DBA may choose to store item data in an item table and name/address data in a customer table.  The problems occur when this efficiency theme goes a little too far. 


Efficiency run amok


Let's delve further into our scenario and take a single customer file, household it, append demographics, and turn it over to our trusty DBA.  Meticulously applying the normalization logic we saw before, he takes the file and proceeds to create a household table, a customer table, a household demographic table, and a customer demographic table.  Why?  Because clearly if a household has three people in it, we waste space by repeating the address three times, when in fact only their names are different.  And after all, don't some demographics than the individual?  Also, because we had about a 70% match rate, many of the demographic fields are missing, influencing the decision to store demographic data in separate tables. 


Next, being a thorough and diligent fellow, our DBA finds out that we've got some business data mixed into our file, and - you guessed it - suddenly we find ourselves with a separate business table along with a business demographic table,complete with data elements like SIC code, number of employees, and annual revenue.  Before we know it, the customer file we started with has become seven database tables!  Now let's talk about some of the consequences of all this efficiency: slower performance, greater complexity, and increased risk of error. 


Poor performance- is your database a Slacker? 


Generally speaking, queries using table joins run more slowly than queries without joins, because there is a certain amount of overhead involved in linking tables together.  To give you an idea, I took some test data and loaded a table of about a million rows, then split it into three related tables: household, customer, and demographics, indexing appropriate columns for join performance. (Think of an "index" as a shortcut for databases to access data more quickly.) Then I ran a series of typical marketing queries, first against the original master table, and again against the three normalized tables joined together.  What I found is that my queries consistently took 2-3 times longer running against the three-table join than they did against the original table. 


Complexity causes perplexity 


There is an inherent problem to consider whenever you link tables together: what to do with the non-matches. That is, rows in table A that have no match in table B. In our above scenario, customers for whom no demographic data exists are not included in the demographic table at all. This forces us to make a decision when linking the two tables back together: do we want to see all customers whether or not they have demographic information, or do we only want those customers who do?  Depending on our answer we either need to create an inner join or an outer join.  Worse yet, there are occasions when table B might have a record that is not in table A, in which case we may need to create a full outer join, allowing all data to be processed whether it comes from A, B, or both.  To a DBA, these situations are routine, but for us mortals, figuring out how to link several tables together to get the right answer can be about as enjoyable as doing our taxes. 


Misinformation 


One of the problems I've always had with computers (and my programmers) is their annoying tendency to do exactly what I say, not what I mean.  Say we want to run a marketing campaign to female homeowners, sending upper income households one offer and all other incomes another.  Marketing databases commonly contain fields such as Gender, Dwelling_Type, and Income, and as a long as all three of these are in one table our campaign is easy to do.  Consider, however, what might happen if Gender and Dwelling_Type are in a customer table and Income is stored in a separate demographic table.  An inner join returns only customer data for which demographic information exists, while an outer join returns customer data whether or not there's a match to the demographic table.  The problem is that either way we still get an answer from the database when we ask for women of certain incomes.  However, our counts from the inner join won't include females for which no income information exists, and consequently, if we link the tables wrong we aren't soliciting 30% of our file (remember that we had a 70% match rate for demographics). 


What's a marketer to do? 


The trick is to find the proper balance between marketing's desire to make one Great Big Table and a DBA's tendency to make a hundred.  Now that you can buy zillion gigabyte drives for $3, saving storage space simply isn't that much of an issue.  And as far as changing one record vs. a hundred, remember that by far most marketing databases are read/only, fed nightly or weekly by operational systems or data warehouses.  As such, an interactive update is generally done in the warehouse first, and a batch load process updates the marketing database with all accumulated changes at periodic intervals.  Thus, the design of your warehouse typically is (and should be) much more normalized than your marketing database. 


In our example, I'd advise three tables instead of seven: an address table (I like keeping all the bulky name/address data in a separate address table since it's not needed for marketing queries), a customer table (one row for each customer plus demographics, if any), and a business demographic table containing data that applies only to businesses. 


The bottom line is that when building a marketing database, it's OK to break a few rules in the name of marketing convenience.  Just don't let your DBA find out.



Copyright 2002, PRIMEDIA Business Magazines & Media Inc.  All rights reserved.