Converting One-to-Many Relationships to One-to-One in Data Warehousing

In a transactional database, one-to-many relationships are very common. For example, one product may belong to multiple categories, and one person may have multiple addresses. In data warehousing, however, we often need only a one-to-one relationship for data analytics. Here are some common ways to convert a one-to-many relationship in the source table.

1. Add Multiple Columns

One way to convert a one-to-many relationship to a one-to-one relationship is by adding more columns to the target table. This method is sometimes called “flattening.” For example, if a customer has up to three phone numbers and is represented in a source table CustomerPhone as follows:

CustomerPhoneID

CustomerID

Phone

1

123

555-1111

2

123

555-2222

3

123

555-3333

In the target table, we can have three columns for Phone: Phone1, Phone2 and Phone3. To "flatten" the source table, we can use PIVOT in T-SQL as follows:

WITH cte AS 
(
    SELECT CustomerID, Phone, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CustomerPhoneID) AS row_num
    FROM CustomerPhone
)
SELECT CustomerID, [1] AS Phone1, [2] AS Phone2, [3] AS Phone3
FROM cte
PIVOT (MAX(Phone) FOR row_num IN ([1], [2], [3])) AS pvt

The result is shown shown below:

CustomerID

Phone1

Phone2

Phone3

123

555-1111

555-2222

555-3333

While this approach is straightforward, it can become impractical if the “many” part of the one-to-many relationship has a large number of values. (I’ve seen cases where 16 columns were added in a project.) Additionally, the number of columns must be fixed, which may lead to data loss—what if a customer has more than three phone numbers?

2. Concatenate Multiple Values into One String

Another way to convert a one-to-many relationship to a one-to-one relationship is by concatenating the multiple values into a string with a separator. The easiest way to do this in SQL Server 2017 and later is by using the STRING_AGG function. Using the above example, we can write the following T-SQL query (with ' | ' as the separator):

SELECT STRING_AGG(CONVERT(VARCHAR(MAX), Phone), ' | ') AS Phones  
FROM CustomerPhone  
GROUP BY CustomerID

The result looks like this:

CustomerID

Phones

123

555-1111 | 555-2222 | 555-3333

With this approach, we are not limited to a fixed number of values in the one-to-many relationship. However, one potential drawback is that parsing the values in the target table may require additional processing for end users.

3. Select Only One Value

This is the most commonly used approach. Often, in data warehousing, we do not need to keep all source data and instead select only one value based on business rules. These rules might involve using columns such as active_flag, priority_seq, or last_updated_datetime to uniquely identify a single record from a one-to-many relationship.

In our example, assuming CustomerPhoneID is the primary key of CustomerPhone, we can use the following T-SQL query to retrieve only one phone number per customer:

WITH cte AS  
(  
SELECT *, RANK() OVER (PARTITION BY CustomerID ORDER BY CustomerPhoneID DESC) AS rnk  
FROM CustomerPhone  
)  
SELECT CustomerID, Phone  
FROM cte  
WHERE rnk = 1  

The result will be:

CustomerID

Phone

123

555-3333

If there is no clear business rule to determine a single record, we can randomly select one using the ROW_NUMBER() function:

WITH cte AS  
(  
SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY (SELECT NULL)) AS row_num  
FROM CustomerPhone  
)  
SELECT CustomerID, Phone  
FROM cte  
WHERE row_num = 1  

Conclusion

Converting a one-to-many relationship to a one-to-one relationship is a common task in many data warehousing ETL projects. This article describes three approaches, with the last one (picking only one value) being the most commonly implemented.

Be extremely careful when selecting only one record in your implementation to avoid unintended duplicate records in the target table—a mistake often made by less experienced ETL developers.


Comments