Sunday, July 27, 2008

Joins In SQL

Inner Join Review:
The most commonly used join is an INNER JOIN. This type of join combines rows from two tables only when they match on the joining condition. Usually the primary key from one table matches a foreign key on another table, but join conditions can be more complex than that.

INNER JOIN will retrieve a results row only where there is a perfect match between the two tables in the join condition. You will also often see one row from one of the tables matching multiple rows in the other table. For example, one customer can have many orders. One order can have many order details. The data on the one side will be repeated for each row on the many side. The following query is an example showing how the information from the Sales.SalesOrderHeader is repeated on each matching row:
-------------------------------------------------------------------------
SELECT s.SalesOrderID, OrderDate,ProductID
FROM Sales.SalesOrderHeader AS s
INNER JOIN Sales.SalesOrderDetail AS d ON s.SalesOrderID = d.SalesOrderID
ORDER BY s.SalesOrderID, ProductID
-------------------------------------------------------------------------
Outer Join Introduction:
OUTER JOIN is used to join two tables even if there is not a match. An OUTER JOIN can be used to return a list of all the customers and the orders even if no orders have been placed for some of the customers. A keyword, RIGHT or LEFT, is used to specify which side of the join returns all possible rows. I like using LEFT because it makes sense to me to list the most important table first. Except for one example demonstrating RIGHT OUTER JOIN, this article will use left joins. Just a note: the keywords INNER and OUTER are optional.

The next example returns a list of all the customers and the SalesOrderID for the orders that have been placed, if any.
--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
---------------------------------------------------------------------------
It uses the LEFT keyword because the Sales.Customer table is located on the left side and we want all rows returned from that table even if there is no match in the Sales.SalesOrderHeader table. This is an important point. Notice also that the CustomerID column is the primary key of the Sales.Customer table and a foreign key in the Sales.SalesOrderHeader table. This means that there must be a valid customer for every order placed. Writing a query that returns all orders and the customers if they match doesn’t make sense. The LEFT table should always be the primary key table when performing a LEFT OUTER JOIN.

If the location of the tables in the query are switched, the RIGHT keyword is used and the same results are returned:
---------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID
FROM Sales.SalesOrderHeader s
RIGHT OUTER JOIN Sales.Customer c ON c.CustomerID = s.CustomerID
---------------------------------------------------------------------------

If I have a LEFT OUTER JOIN, what is returned from the table on the right side of the join where there is not a match? Each column from the right side will return a NULL. Try this query which lists the non-matching rows first:
----------------------------------------------------------------------------

SELECT c.CustomerID, s.SalesOrderID
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
ORDER BY s.SalesOrderID
---------------------------------------------------------------------------
By adding a WHERE clause to check for a NULL SalesOrderID, you can find all the customers who have not placed an order. My copy of AdventureWorks returns 66 customers with no orders:
--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
--------------------------------------------------------------------------

WHERE s.SalesOrderID IS NULLOccasionally, you will need to be more specific. How can you find all the customers who have not placed an order in 2002? There are several ways to solve this problem. You could create a view of all the orders placed in 2002 and join the view on the Sales.Customer table. Another option is to create a CTE, or Common Table Expression, of the orders placed in 2002. This example shows how to use a CTE to get the required results:

WITH s AS
( SELECT SalesOrderID, customerID
FROM Sales.SalesOrderHeader
WHERE OrderDate between '1/1/2002' and '12/31/2002'
)
--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID
FROM Sales.Customer c
LEFT OUTER JOIN s ON c.customerID = s.customerID
WHERE s.SalesOrderID IS NULL
--------------------------------------------------------------------------
My favorite technique to solve this problem is much simpler. Additional criteria, in this case filtering on the OrderDate, can be added to the join condition. The query joins all customers to the orders placed in 2002. Then the results are restricted to those where there is no match. This query will return exactly the same results as the previous, more complicated query:
--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
and s.OrderDate between '1/1/2002' and '12/31/2002'
WHERE s.SalesOrderID IS NULL
--------------------------------------------------------------------------
Using Aggregates with Outer Joins:
Aggregate queries introduce another pitfall watch out for. The following example is an attempt to list all the customers and the count of the orders that have been placed. Can you spot the problem?
--------------------------------------------------------------------------
SELECT c.CustomerID, count(*) OrderCount
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID
ORDER BY OrderCount
--------------------------------------------------------------------------
Now the customers with no orders look like they have placed one order. That is because this query is counting the rows returned. To solve this problem, count the SalesOrderID column. NULL values are eliminated from the count.
--------------------------------------------------------------------------
SELECT c.CustomerID, count(SalesOrderID) OrderCount
FROM Sales.Customer c LEFT OUTER JOIN Sales.SalesOrderHeader s
ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID
ORDER BY OrderCount
--------------------------------------------------------------------------
Multiple Joins:
Once more than two tables are involved in the query, things get a bit more complicated. When a table is joined to the RIGHT table, a LEFT OUTER JOIN must be used. That is because the NULL rows from the RIGHT table will not match any rows on the new table. An INNER JOIN causes the non-matching rows to be eliminated from the results. If the Sales.SalesOrderDetail table is joined to the Sales.SalesOrderHeader table and an INNER JOIN is used, none of the customers without orders will show up. NULL cannot be joined to any value, not even NULL.

To illustrate this point, when I add the Sales.SalesOrderDetail table to one of the previous queries that checked for customers without orders, I get back no rows at all.
--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID, d.SalesOrderDetailID
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
INNER JOIN Sales.SalesOrderDetail d ON s.SalesOrderID = d.SalesOrderID
WHERE s.SalesOrderID IS NULL
--------------------------------------------------------------------------
To get correct results, change the INNER JOIN to a LEFT JOIN.

--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID, d.SalesOrderDetailID
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
LEFT OUTER JOIN Sales.SalesOrderDetail d ON s.SalesOrderID = d.SalesOrderID
WHERE s.SalesOrderID IS NULL
--------------------------------------------------------------------------
What about additional tables joined to Sales.Customer, the table on the left? Must outer joins be used? If it is possible that there are some rows without matches, it must be an outer join to guarantee that no results are lost. The Sales.Customer table has a foreign key pointing to the Sales.SalesTerritory table. Every customer’s territory ID must match a valid value in Sales.SalesTerritory. This query returns 66 rows as expected because it is impossible to eliminate any customers by joining to Sales.SalesTerritory:

--------------------------------------------------------------------------
SELECT c.CustomerID, s.SalesOrderID, t.Name
FROM Sales.Customer c
LEFT OUTER JOIN Sales.SalesOrderHeader s ON c.CustomerID = s.CustomerID
INNER JOIN Sales.SalesTerritory t ON c.TerritoryID = t.TerritoryID
WHERE SalesOrderID IS NULL
--------------------------------------------------------------------------
Sales.SalesTerritory is the primary key table; every customer must match a valid territory. If you wanted to write a query that listed all territories, even those that had no customers, an outer join will be used. This time, Sales.Customers is on the right side of the join.
--------------------------------------------------------------------------
SELECT t.Name, CustomerID
FROM Sales.SalesTerritory t
LEFT OUTER JOIN Sales.Customer c ON t.TerritoryID = c.TerritoryID

Conclusion:
Queries with outer joins can be tricky to write. Extra time and care must be spent making sure the results are correct. Think about the relationship between the tables and make sure that the outer join is continued down the path. This article covered almost everything you need to know about outer joins.

No comments: