
PDW Architecture: The Data Rack

2010 June 30
by Brian Mitchell

In my previous post, PDW Architecture: The Control Rack, I gave a high-level description of the Control Rack within the PDW appliance. In this post, we’ll spend some time with the Data Rack. Each SQL Server 2008 R2 Parallel Data Warehouse appliance comes with at least one Data Rack. If the Control Nodes in the Control Rack are considered the brains of the operation, the Compute Nodes in the Data Rack are certainly the brawn. It is here within the Data Rack that all user data is stored and processed during query execution. Each Data Rack has between 8 and 10 Compute Nodes. Additionally, the Data Rack uses Microsoft Failover Clustering to gain high availability. This is accomplished by having a spare node within the rack that acts as a passive node within the cluster. Essentially, each Compute Node has its affinity set to fail over to the spare node if that active Compute Node fails.

PDW Architecture

Each Compute Node runs an instance of SQL Server and owns its own dedicated storage array. User data is stored on the dedicated Storage Area Network. The local disks on the Compute Node are used for TempDB. User data is stored in one of two configurations: replicated tables or distributed tables. A replicated table is duplicated in whole on each Compute Node in the appliance. When you think replicated tables in PDW, think small tables, usually dimension tables. Distributed tables, on the other hand, are hash distributed across multiple nodes. This horizontal partitioning breaks the table up into 8 partitions, called distributions, per Compute Node. Thus, on a PDW appliance with eight Compute Nodes, a distributed table will have 64 physical distributions. Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in PDW. To swag some numbers, if you have a 1.6 TB fact table that you distribute across an eight-node Data Rack, you would have 64 individual 25 GB distributions, each with dedicated CPU and disk space. This is how the appliance can break down a large table into manageable sizes to find the data needed to respond to queries. I’ll speak to this in more detail in the future.
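
To make the distribution math concrete, here is a minimal Python sketch of the arithmetic above. It is only an illustration, not how PDW works internally: the function names are hypothetical, MD5 is a stand-in for whatever hash function the appliance actually uses, and 1.6 TB is treated as 1,600 GB. The only figure taken from the appliance itself is the 8 distributions per Compute Node.

```python
import hashlib
from typing import Tuple

# From the post: each Compute Node holds 8 distributions of a distributed table.
DISTRIBUTIONS_PER_NODE = 8


def distribution_count(compute_nodes: int) -> int:
    """Total physical distributions for a distributed table."""
    return compute_nodes * DISTRIBUTIONS_PER_NODE


def size_per_distribution_gb(table_gb: float, compute_nodes: int) -> float:
    """Average distribution size, assuming the hash spreads rows evenly."""
    return table_gb / distribution_count(compute_nodes)


def distribution_for_key(key: str, compute_nodes: int) -> Tuple[int, int]:
    """Map a distribution key to (node index, distribution index on that node).

    Assumption: MD5 is used here only as a deterministic stand-in hash;
    the real PDW hash function is not described in this post.
    """
    total = distribution_count(compute_nodes)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % total
    return divmod(slot, DISTRIBUTIONS_PER_NODE)


if __name__ == "__main__":
    print(distribution_count(8))              # distributions on an eight-node rack
    print(size_per_distribution_gb(1600, 8))  # GB per distribution for a 1.6 TB table
    print(distribution_for_key("customer-42", 8))
```

The same key always lands in the same distribution, which is what lets each Compute Node work on its own slice of the table independently.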

If your data set is too large to store on a single Data Rack, you can add another. By adding an additional Data Rack, you not only expand your storage but also significantly increase your processing power, and the data will be spread across more distributions. The current target size of an appliance is up to forty nodes, which would be 4 or 5 Data Racks, depending on the manufacturer. Larger appliance sizes are expected in the future.
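
The rack counts above fall out of simple division. The sketch below is a rough illustration with hypothetical function names, and it assumes that the 8-distributions-per-node figure holds at every appliance size and that every rack from a given manufacturer holds the same number of Compute Nodes.

```python
# From the post: 8 distributions per Compute Node, 8 to 10 Compute Nodes per rack.
DISTRIBUTIONS_PER_NODE = 8


def racks_needed(total_nodes: int, nodes_per_rack: int) -> int:
    """Data Racks required to house the given number of Compute Nodes."""
    # Ceiling division without floating point.
    return -(-total_nodes // nodes_per_rack)


def total_distributions(total_nodes: int) -> int:
    """Distributions of a distributed table across the whole appliance."""
    return total_nodes * DISTRIBUTIONS_PER_NODE


if __name__ == "__main__":
    for nodes_per_rack in (8, 10):
        print(nodes_per_rack, racks_needed(40, nodes_per_rack))
    print(total_distributions(40))
```

Under those assumptions, a forty-node appliance needs five racks at 8 nodes per rack or four racks at 10 nodes per rack, and a distributed table would be split into 320 distributions.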