Handling many-to-many joins using a bridge table (Part 1)

August 3, 2017 by: David Lai

As a data warehouse architect, star schemas are great until we encounter the dreaded many-to-many fact to dimension scenario.

For example, we have a Ticket Fact table and a Tag Dimension table. The cost metric is calculated at the ticket level, so tickets have the lowest granularity and hence the ticket fact table.

Ticket fact and Tag dimension

 

However tickets can be tagged with multiple describers. A ticket can be interesting, excellent, and complicated.

many to many example

If we flatten the tags into the fact table, you’ll notice that when we sum up the costs, it is no longer correct since we are double counting the costs.

flattening table with incorrect values

There are several workarounds to solve this issue, but at the end of the day there is no solution that will not require additional training to the end user which I will explain later on.

The first method of working around a many-to-many fact to dimension scenario is to use a bridge table.

Our goal here is to ensure that the fact table remains at 1 row per ticket. In addition, we want to ensure that the tag dimension remains the same size as well.

Create a bridge table that will help keep fact and dimension table as is. The bridge table will consist of groups of tags. These groups correspond to the tags associated to tickets.

You will also need to add an extra column (Bridge ID) to the Ticket Fact so that you can link to the bridge table.

bridge table join

We can see that the Ticket Fact and Tag Dimension remain the same size.

Also, group of tags can belong to multiple tickets.  For example Group 1 can consist of interesting, excellent, and complicated, that belongs to ticket 1. If another ticket shares the same combination of tickets, then Group 1 will also belong to that ticket (ticket 3).

Keep in mind that you will have to code your ETL so that it processes all the unique combinations of tag groups. In our example above, we have 4 unique tag groups.

For those of you who are familiar with Ralph Kimball’s teachings, you’ll ask why we don’t add a weighting factor to the bridge table. In the next example I will show you why that does not work.

Test out the bridge table!

Okay so now that everything is done, let’s test it out.

Let’s create a simple report that takes objects from the ticket fact and tag dimension.

many to many query1

We’ll show the ticket cost breakdown by tag. As you can see, if I want to know what the total cost of tickets with the interesting tag, we get the correct answer of $550.

Unfortunately, the problem comes when we want to retrieve the grand total cost of tickets. If my ticket has more than 1 tag, it will double count that cost. When we look at the Grand Total, you will see $2550 instead of $1000.

bridge ex 1

Going back to my comment on Ralph Kimball’s teachings of adding a weighting factor to the bridge table, this would work okay when we want to get the grand total ticket cost.

bridge ex 2

However, this would not work very well when we want to display total cost by tag. Unless we were to create another measure that does not multiply the cost by the weight factor, which makes it cumbersome for the user.

bridge ex 3

In addition, the weighting factor would not work very well if we were to ask the question.

“What is the total cost of tickets that are tagged with interesting and excellent?”

Ticket 1 falls into both interesting and excellent, however when we sum the grand total, we would only get (0.3 * cost) 2 times, instead of 3 because the complicated tag was not in the question.

Solution

Here’s a set of solutions to get the best of both worlds. Being able to show total ticket cost by tag, and to show the grand total ticket cost without double counting.

Please keep in mind that we cannot provide one perfect solution that handles EVERY case (If you know of one, please post your solution in the comments below). However, in my opinion the solutions below are quite user friendly and not difficult to implement.

Use case 1

We want to be able to get the total ticket cost for each tag, and then a grand total cost below.

We can see the total ticket cost for each tag looks fine.

bridge use case 1a

Now to get the grand total cost, we will need to sum up the total ticket cost for each ticket and then divide it by the number of tags on each ticket.

Sum( (total ticket cost) / (number of tags) for each ticket)

First create a variable called m_tag_count that gets the count of tags per ticket

bridge use case 1b

Then create a grand total variable m_gtotal_ticket_cost that will sum up the total cost correctly for each ticket.

bridge use case 1c

bridge use case 1d

Note: You must have the ticket object available in the report. Otherwise you will not be able to do a foreach cost on each ticket to prevent double counting

Use case 2

We want to be able to select tags from a picklist and then display the total cost correctly. We don’t want to show any tag properties on the report itself.

Normally we would just drag the tag object into the filter area and set a prompt.

bridge use case 2a

However the problem here is that the query will join to the bridge and tag table, causing ticket cost to double count if a ticket contains more than 1 tag.

bridge use case 2b

The solution here is to create a tag filter that will not join to the bridge and tag tables.

The tag filter will instead collect the bridge ids that associated to the selected tags, and we only get one line per bridge id from the fact table.

Thus ensuring that we do not double count the ticket cost.

bridge use case 2c

bridge use case 2d

Conclusion

As you can see, handling a many-to-many fact to dimension scenario is not so straight forward (Create bridge table, grand total variable when required, tag filter object).

However with the correct user training and guidance, you can tackle these scenarios successfully.

Be sure to document objects and filters clearly that fall in this category. For example, have warnings in the description or name of the tag object so that users are aware of the many-to-many scenario.

Then they will know that they must use the tag filter object instead of dragging the tag object into the query filter if the tag object does not show up in the report.

Thoughts on bridge table

Here are my thoughts on using a bridge table to handle many-to-many fact to dimension scenarios.

Pros

  • Easy to understand, since we are using regular joins
  • Low complexity

Cons

  • Requires an extra bridge table to join through which affects performance
  • Bridge tables can be extremely large if there are a lot of ticket to tag combinations

In part 2 of this series, I will be writing about an alternative method which in my humble opinion is a bit more elegant.

Stay Tuned!

Leave a Reply


7 × eight =