Skip to content

Loading Hierarchical Data

2012 February 2
tags:
by Shannon Lowder

Earlier this week the question came  up on how can we insert data into a table that has a hierarchical data structure.  If you’re not familiar with a hierarchical data structures, here’s a quick explanation.

Consider the staging table to the right.  This table is used to load new accounts into our system.  Each account has a unique identifier and a name.  Since one company can own another, we have to be able to show that in our system.  In this system we have parent companies, and children companies.  The children companies’ accounts are related to a parent account.

So in our destination table, this relationship would be enforced by adding a primary key on the AccountID, and then adding a foreign key on ParentAccountID to relate it to AccountID.

Fairly straightforward, right?

Well, when you’re pushing thousands of rows into the table at a time, you can’t control the order of the inserts.  As soon as you try to insert a record that has a ParentAccountID that hasn’t yet been defined with a row using that AccountID, you’re going to get an error message.

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY SAME TABLE constraint "fk_AccountBaseDestination_AccountBaseDestination__ParentAccountID_AccountID". The conflict occurred in database "RV_DataConversion_Stage", table "dbo.AccountBaseDestination", column 'AccountID'.
The statement has been terminated.

So now we’re to the problem at hand.  How can we make sure we don’t try to insert a row with a parentID that’s not defined?

Use A Cursor

Curser.It was proposed we use a cursor to run through the rows, grab a parent and all it’s children, and insert the parent first, then insert all it’s children.  Then return to the beginning of the set and repeat.  While that would avoid the foreign key violation, it would cost a lot of I/O (disk resources) to do it that way.

BTW: while this option was being presented, the presenter said we won’t use cursors to do this.  But at the same time he was advocating a row-by-row operation.

If you’ve known me for any length of time, you know my stance on row-by-row operations in SQL.  I’m not a fan.

 

Insert all the Records, Then Update the ParentAccountID

Another proposal was to insert all the records, and not pass in the ParentAccountID.  Then we could run an update statement against the Destination table where that parent had been defined.  If there were any rows that needed a ParentAccountID where that AccountID was not defined, we simply wouldn’t update that row.

Basically the update would be an update with an INNER JOIN, so we’d avoid the violation that way.

Not bad, but if the destination table was being used, users might see a ton of new accounts at the root level, not at whatever level they’re supposed to be at.  You also would require locks on the destination table during the update that might escalate from row locks, to page locks, to table locks.  You’d have to test a load where you were adding 25 – 50% new records.  So if your destination table had 100 records, test a load of 25 new records.  Then test a load of 50 records.  During the load watch what locks are being taken?

If you see it escalate to a table lock, you may not want to do that to your users, if they are on the destination table regularly.

Use a CTE

A rather interesting idea that came up was to write a CTE that would select the staging data and add on a LevelNumber column.  This column would indicate what level in the hierarchy the account was on.  For example, level 1 would be the root node, level 2 would be children of accounts on level 1, level 3 would be children of level 2, and so on.

Once we know what levels each account is on, we could load all the level 1 accounts, then all the level 2 accounts, etc.  Not a bad idea.  Especially if you have your insert has to go through an SSIS package, or web service to do the actual insert into the database, you could be sure you’re only adding records in a integrity-safe way!

This is the method we ended up going with, due to the fact the insert was going through a webservice.  But… there is another way.

WHILE EXISTS Loop

The last method we discussed was first, we insert all the root nodes (level 1 from the previous image), and then we use a WHILE EXISTS loop to insert children where their parent was defined.  The end result is very similar to the STE solution, in fact the costs are almost identical.  You have to “loop” through the code once for every level.  The difference here is in order to use the WHILE EXISTS method, you have to be able to run the actual INSERT statements.  And that’s why we didn’t chose this method this time.

In my next training article I’m going to go through this method, so you can see the design pattern.

Conclusion

In the end we went with the method that used the least impact on the server, hits all of the requirements, and gets the job done with the least effort.  There’s always a lot of back and forth in these design meetings.  It’s all a part of the process of designing good code.  You can’t get too invested in “my way is best, I won’t do it any other way.”  That’s just counter-productive.

You have to open your ideas up to scrutiny and participate in the give and take.  In the end, you have to work with these people long after the process has been implemented.  You’re going to have to go through another design meeting.  You don’t want to make the next one even harder because you had to have your way…right?

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS