Handling many-to-many joins using a bridge table (Part 1)

August 3, 2017 by: David Lai

As a data warehouse architect, star schemas are great until we encounter the dreaded many-to-many fact to dimension scenario.

For example, we have a Ticket Fact table and a Tag Dimension table. The cost metric is calculated at the ticket level, so tickets have the lowest granularity and hence the ticket fact table.

However tickets can be tagged with multiple describers. A ticket can be interesting, excellent, and complicated.

If we flatten the tags into the fact table, you’ll notice that when we sum up the costs, it is no longer correct since we are double counting the costs.

There are several workarounds to solve this issue, but at the end of the day there is no solution that will not require additional training to the end user which I will explain later on.

The first method of working around a many-to-many fact to dimension scenario is to use a bridge table.

Our goal here is to ensure that the fact table remains at 1 row per ticket. In addition, we want to ensure that the tag dimension remains the same size as well.

Create a bridge table that will help keep fact and dimension table as is. The bridge table will consist of groups of tags. These groups correspond to the tags associated to tickets.

You will also need to add an extra column (Bridge ID) to the Ticket Fact so that you can link to the bridge table.

We can see that the Ticket Fact and Tag Dimension remain the same size.

Also, group of tags can belong to multiple tickets. For example Group 1 can consist of interesting, excellent, and complicated, that belongs to ticket 1. If another ticket shares the same combination of tickets, then Group 1 will also belong to that ticket (ticket 3).

Keep in mind that you will have to code your ETL so that it processes all the unique combinations of tag groups. In our example above, we have 4 unique tag groups.

For those of you who are familiar with Ralph Kimball’s teachings, you’ll ask why we don’t add a weighting factor to the bridge table. In the next example I will show you why that does not work.

Test out the bridge table!

Okay so now that everything is done, let’s test it out.

Let’s create a simple report that takes objects from the ticket fact and tag dimension.

We’ll show the ticket cost breakdown by tag. As you can see, if I want to know what the total cost of tickets with the interesting tag, we get the correct answer of $550.

Unfortunately, the problem comes when we want to retrieve the grand total cost of tickets. If my ticket has more than 1 tag, it will double count that cost. When we look at the Grand Total, you will see $2550 instead of $1000.

Going back to my comment on Ralph Kimball’s teachings of adding a weighting factor to the bridge table, this would work okay when we want to get the grand total ticket cost.

However, this would not work very well when we want to display total cost by tag. Unless we were to create another measure that does not multiply the cost by the weight factor, which makes it cumbersome for the user.

In addition, the weighting factor would not work very well if we were to ask the question.

“What is the total cost of tickets that are tagged with interesting and excellent?”

Ticket 1 falls into both interesting and excellent, however when we sum the grand total, we would only get (0.3 * cost) 2 times, instead of 3 because the complicated tag was not in the question.

Solution

Here’s a set of solutions to get the best of both worlds. Being able to show total ticket cost by tag, and to show the grand total ticket cost without double counting.

Please keep in mind that we cannot provide one perfect solution that handles EVERY case (If you know of one, please post your solution in the comments below). However, in my opinion the solutions below are quite user friendly and not difficult to implement.

Use case 1

We want to be able to get the total ticket cost for each tag, and then a grand total cost below.

We can see the total ticket cost for each tag looks fine.

Now to get the grand total cost, we will need to sum up the total ticket cost for each ticket and then divide it by the number of tags on each ticket.

Sum( (total ticket cost) / (number of tags) for each ticket)

First create a variable called m_tag_count that gets the count of tags per ticket

Then create a grand total variable m_gtotal_ticket_cost that will sum up the total cost correctly for each ticket.

Note: You must have the ticket object available in the report. Otherwise you will not be able to do a foreach cost on each ticket to prevent double counting

Use case 2

We want to be able to select tags from a picklist and then display the total cost correctly. We don’t want to show any tag properties on the report itself.

Normally we would just drag the tag object into the filter area and set a prompt.

However the problem here is that the query will join to the bridge and tag table, causing ticket cost to double count if a ticket contains more than 1 tag.

The solution here is to create a tag filter that will not join to the bridge and tag tables.

The tag filter will instead collect the bridge ids that associated to the selected tags, and we only get one line per bridge id from the fact table.

Thus ensuring that we do not double count the ticket cost.

Conclusion

As you can see, handling a many-to-many fact to dimension scenario is not so straight forward (Create bridge table, grand total variable when required, tag filter object).

However with the correct user training and guidance, you can tackle these scenarios successfully.

Be sure to document objects and filters clearly that fall in this category. For example, have warnings in the description or name of the tag object so that users are aware of the many-to-many scenario.

Then they will know that they must use the tag filter object instead of dragging the tag object into the query filter if the tag object does not show up in the report.

Thoughts on bridge table

Here are my thoughts on using a bridge table to handle many-to-many fact to dimension scenarios.

Pros

Easy to understand, since we are using regular joins
Low complexity

Cons

Requires an extra bridge table to join through which affects performance
Bridge tables can be extremely large if there are a lot of ticket to tag combinations

In part 2 of this series, I will be writing about an alternative method which in my humble opinion is a bit more elegant.

Stay Tuned!

Filed under: Best Practices,Business Objects,SQL Server,Universe Design
Tags: data warehouse, universe, Universe Design

Comments (2)

Comments

2 Responses to “Handling many-to-many joins using a bridge table (Part 1)”

Namo00 says:

July 23, 2020 at 12:18 pm

Thank you David for the informative article on the bridge tables and using the example with sum instead of counts. Using the weighting is not always an options as you mentioned because you can’t get amount per tag. Its great if you only have to worry one many to many relation table.

In many reports/analytics we do, we have similar situations where credits need to be given to multiple parties. Also, we have quite a few many to many relation tables. One of the options, I have used is without bridge table and in the fact table we stored lowest granularity but then created separate amounts.

So in your example, fact table will have ticket key, tag key, ticket cost and ticket cost per tag and regular ticket amount will be stored on all the combination of tickets and tags however in the ticket amount only one row will have ticket amount. So for example ticketID 1 has two rows due to two tags but amount will be stored on the first combination.

It allows you to sum the total on the tickets but if you would like to see amount by tags then you will used ticket cost per tag. Also, it allows you to count tags per ticket and total tickets etc. I think it may be confusing for users sometimes but I am hoping with the right training it will work out. Many of our reports need to see all these separate amounts in the report and if we create different facts or multiple bridge tables then it is also confusing. So not sure what is the best method.

I was wondering what your thoughts on this method. Is this more confusing and error pron?

Thank you
Namrata

Reply
Chelsea says:

December 9, 2020 at 10:02 pm

Hello,

Where is part two of this post? We are using bridge tables and suffering performance issues. Looking for another solution

Reply

David Lai's Business Intelligence Blog

Handling many-to-many joins using a bridge table (Part 1)

Test out the bridge table!

Solution

Use case 1

Use case 2

Conclusion

Thoughts on bridge table

Comments

Leave a Reply

About David Lai

SAP BusinessObjects Dashboards 4 Cookbook

Categories

Tag Cloud

Archives