I’ve written several times about the problems that can occur when joining database tables together in your marketing database. To refresh your memory, joining tables can be thought of as linking them together based on one or more common fields, such as a customer number. The problems we’ve discussed so far include getting different answers based on whether the join is an Inner Join, which is simply the intersection between two tables, or an Outer Join which behaves like a union in that it returns data from one table whether or not there’s a match in the other. Today’s discussion is on another interesting situation involving joins that can give marketers headaches, and it involves one of our favorite subjects: money.
Before I go further, let me emphasize that I’m not saying you shouldn’t join tables together. Indeed, you don’t have much of a choice, unless your database consists of a single Great Big Table (GBT to us savvy technical types). As I’ve mentioned before, DBAs like to store data in separate tables for efficiency. For example, if one customer purchases three items, it’s not a good idea to repeat her name and address three times. Instead, a good DBA will store the name and address once in a Customer table, and then write each of the three items out to a separate Items table. The fun begins when we need to link these tables back together again in order to run a selection that needs data from each table.
Whenever a row in one table matches more than one row in another, SQL databases repeat the values from the first row for each matching row in the other table. So in the example above, the row in Customer links to three rows in Items, causing the database to repeat the name/address data for each item, precisely as if our DBA had stored them together to begin with! Under normal circumstances, this is exactly what we want, because it lets us generate reports that show item sales by location and so on. On occasion, however, joining tables can cause some really funky results, and you need to be aware of them before you rush in and tell your boss about the spectacular results you’ve achieved with your creative genius.
Let's use a catalog database as our example. Most of them are very straightforward, containing three main tables:
- Customer - containing names and addresses along with demographic data;
- Orders - data such as catalog number, promotion code, order number, order date, and payment type (cash vs. credit card);
- Items - order line item detail such as SKU number, quantity purchased, amount, and tax.
To make marketing queries easier and faster, what many people like to do - myself included - is generate aggregate totals at various levels of detail. For example, you might calculate a total lifetime paid amount, average order size, and most recent order date and store these on the Customer table for fast and convenient access. It's also pretty common to calculate a total order amount (for all items in an order) and place that on the Orders table. Our problems arise when we join a table containing one of these calculated fields with another table, and the aggregated value gets repeated for each matching row.
Say we ran a promotion that offered a free gift to everyone placing an order of $15 or more, and we want to see how well it did. We’ll produce a report showing the number of orders, total revenue, number of items shipped, and average order size. Because this report requires order as well as item information, we need to start out by joining the tables together. And this is precisely where you can get into trouble if you’re not careful.
All databases contain standard built in functions for aggregating data. Three that are frequently used in marketing queries are: COUNT (to count how many rows match a given condition), SUM (to summarize a numeric field), and AVG (to compute an average). So to produce our report, it seems reasonable to assume that we'll COUNT the number of orders, SUM the order amount, SUM the item quantity, and AVG the order amount. And if you like to watch game show television, you're doubtless familiar with the sound. BZZZZT.
The reason our assumption doesn’t work is made clear in Figure 1. Note that customer Bonnie has placed a single order for items A, B, and C. The total order amount for these three items is $28.50. But because the order contains three items, this total gets repeated for each one when the two tables are joined. This means when we SUM the order amount, we’re producing a total that is three times too large! Sounds like a nifty new Enron accounting technique. Sorry boss…
Further study of the diagram shows that the order amount is simply a sum of the item amounts on each order, and what we should be totaling to get the right answer is the item amount, by entering SUM(Item_Amount). But there are a couple of other problems that warrant discussion: how to count the number of orders, and how to calculate average order size.
There are two "flavors" of the COUNT function: Count Occurrences and Count Distinct. Counting occurrences (coded as COUNT(*) in SQL) works just the way it sounds; it simply counts the number of records. So if we count how many rows in the join have an order number, COUNT(*) returns a 4. But unfortunately this is not what we want. It represents the number of items, not the number of orders. That's where COUNT DISTINCT comes in handy. Since each order is given a different order number, what we need to do is find out how many unique order numbers there are, which gives us the number of orders that were placed. When we do this entering COUNT (DISTINCT Order_Number), it correctly returns a 2.
We're cooking now! That leaves us with one final challenge: how to calculate the average order size. This is where my two years of 7th grade Algebra class stands me in good stead. Notice that entering AVG(Order_Amount) won't give us the correct answer, again because the amount is being repeated for each item, meaning it appears that three orders were placed for $28.50 instead of one. Neither can we enter AVG(Item_Amount), since that returns the average spent on each item, not the entire order. What we want to do is step back and recall how the average should be calculated in the first place; that is, take the total order amount and divide it by the number of orders. Bingo! You guessed it. We do this by entering SUM(Item_Amount) / COUNT(DISTINCT Order_Number).
I've never been a big fan of having marketing users join tables together, because it seems to cause more problems than it fixes. And besides, our much-maligned DBA is all too eager to help us with problems like this. You've heard me extol the virtues of "views" before, and this is yet another case where our DBA can set up a view that pre-joins the tables we need, and omit the fields that get us in trouble with the boss.
Today's message is: watch where you put your money. All our confusion could have been avoided had we NOT included the total order amount in the join. Had it not been present, we wouldn't have had the opportunity to screw up in the first place.
Copyright 2002, PRIMEDIA Business Magazines & Media Inc. All rights reserved.
|