PDW Architecture: The Data Rack

2010 June 30
by Brian Mitchell

In my previous post, PDW Architecture: The Control Rack, I gave a high-level description of the Control Rack within the PDW appliance.  In this post, we’ll spend some time with the Data Rack.  Each SQL Server 2008 R2 Parallel Data Warehouse appliance comes with at least one Data Rack.  If the Control Nodes in the Control Rack are considered the brains of the operation, the Compute Nodes in the Data Rack are certainly the brawn.  It is here within the Data Rack that all user data is stored and processed during query execution.  Each Data Rack has between 8 and 10 Compute Nodes.  Additionally, the Data Rack uses Microsoft Failover Clustering for high availability.  This is accomplished by having a spare node within the rack that acts as a passive node within the cluster.  Essentially, each active Compute Node is configured to fail over to the spare node in the event of a failure.
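To make the active/passive idea concrete, here is a toy model of a Data Rack with one spare node. It is only a sketch: the `DataRack` class, the node names, and the `fail` method are hypothetical and have nothing to do with the actual Microsoft Failover Clustering interfaces. It only illustrates that any one active node can hand its workload to the single passive node.

```python
class DataRack:
    """Toy model: N active Compute Nodes plus one passive spare (hypothetical)."""

    def __init__(self, compute_nodes, spare="spare"):
        # Map each active node name to the SQL Server instance it hosts.
        self.hosting = {name: f"instance-{name}" for name in compute_nodes}
        # The passive node hosts nothing until a failover happens.
        self.spare = spare

    def fail(self, node):
        """Move the failed node's instance to the spare, if the spare is free."""
        if node not in self.hosting:
            raise KeyError(f"{node} is not an active node")
        if self.spare is None:
            raise RuntimeError("no passive node left to fail over to")
        instance = self.hosting.pop(node)
        self.hosting[self.spare] = instance
        promoted, self.spare = self.spare, None
        return promoted


rack = DataRack([f"cn{i}" for i in range(1, 9)])
promoted = rack.fail("cn3")
print(promoted, rack.hosting[promoted])
```

In this model a second failure before the first node is repaired has nowhere to go, which is the trade-off of a single shared spare per rack.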

PDW Architecture

Each Compute Node runs an instance of SQL Server and has its own dedicated storage array.  User data is stored on the dedicated Storage Area Network.  The local disks on the Compute Node are used for TempDB.  The user data will be stored in one of two configurations: replicated tables or distributed tables.  A replicated table is duplicated in whole on each Compute Node in the appliance.  When you think replicated tables in PDW, think small tables, usually dimension tables.  Distributed tables, on the other hand, are hash distributed across multiple nodes.  This horizontal partitioning breaks the table up into 8 partitions per Compute Node.  Thus, on a PDW appliance with eight Compute Nodes, a distributed table will have 64 physical distributions.  Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in PDW.  To swag some numbers, if you have a 1.6 TB fact table that you distribute across an eight-node Data Rack, you would have 64 individual 25 GB distributions, each with dedicated CPU and disk space.  This is how the appliance can break down a large table into manageable sizes to find the data needed to respond to queries.  I’ll speak to this in more detail in the future.
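The arithmetic above is easy to check with a few lines of Python. The sketch below also shows one way a row key could be mapped to a distribution. The function names are made up for illustration, and the SHA-256 hash is purely a stand-in, since this post does not describe the hash function PDW actually uses.

```python
import hashlib

DISTRIBUTIONS_PER_NODE = 8  # figure taken from the post


def distribution_count(compute_nodes):
    """Total physical distributions for a distributed table."""
    return compute_nodes * DISTRIBUTIONS_PER_NODE


def distribution_size_gb(table_size_gb, compute_nodes):
    """Average distribution size, assuming the hash spreads rows evenly."""
    return table_size_gb / distribution_count(compute_nodes)


def assign_distribution(key, compute_nodes):
    """Map a distribution key to (node index, distribution index on that node).

    Assumption: SHA-256 stands in for whatever hash the appliance really uses.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    dist = int.from_bytes(digest[:8], "big") % distribution_count(compute_nodes)
    return divmod(dist, DISTRIBUTIONS_PER_NODE)


print(distribution_count(8))            # 64
print(distribution_size_gb(1600, 8))    # 25.0 (1.6 TB taken as 1600 GB)
print(assign_distribution("customer-42", 8))
```

The even 25 GB split only holds if the distribution key has enough distinct values to hash evenly; a skewed key would leave some distributions much larger than others.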

If your data set is too large to store on a single Data Rack, you can add another.  By adding a Data Rack, you not only expand your storage but also significantly increase your processing power, because the data is spread across more distributions.  The current target size of an appliance is up to forty nodes, which would be either four or five Data Racks, depending on the manufacturer.  Larger appliance sizes are expected in the future.
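As a quick check on those numbers, here is a small sketch that works out how many racks a forty-node appliance needs at 8 or 10 Compute Nodes per rack, and how many distributions a distributed table would then have. The helper names are hypothetical; the figures of 8 to 10 nodes per rack and 8 distributions per node come from this post.

```python
import math

DISTRIBUTIONS_PER_NODE = 8  # figure taken from the post


def racks_needed(total_nodes, nodes_per_rack):
    """Number of Data Racks required to hold the given number of Compute Nodes."""
    return math.ceil(total_nodes / nodes_per_rack)


def total_distributions(total_nodes):
    """Physical distributions of a distributed table across the whole appliance."""
    return total_nodes * DISTRIBUTIONS_PER_NODE


for nodes_per_rack in (8, 10):
    print(nodes_per_rack, "nodes per rack:", racks_needed(40, nodes_per_rack), "racks")

print(total_distributions(40), "distributions at forty nodes")
```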

3 Responses
  1. Dave
    December 11, 2010

    Great article. I am struggling to get specifics on the HP hardware configuration for the PDW. I cannot seem to find a document or web page that tells me very specifically what the options are for the Data Rack. I did see some presentations from Tech Ed but they were from June and I would love to see an official/definitive document from either HP or Microsoft that tells me:

    1: what types of servers/processors are used for the Control Node
    2: same question for compute node although I have seen a mention of 2X6core Westmere chips but no mention of memory capacity.
    3: How about storage on the Storage Node, how many disks in each node, what is the capacity of each drive and speed etc.

    Why is this so hard to find?

  2. Brian Mitchell
    December 16, 2010

    Unfortunately, your questions are not so easily answered. The reason for this is that PDW is a reference architecture. Thus, Microsoft works with each vendor to come up with multiple hardware versions of the architecture that differ depending on what the customer is trying to do. For example, do you have 20 TB of data that you want incredible query response times on, or are you happy just to get it all onto a system that you can query in a reasonable time? For the first, you might have a two-Data-Rack system with fast, relatively small drives, and for the second you may have a single-Data-Rack system with 1 TB drives on the storage nodes. Thus, I believe that there will be several appliance types to choose from with each vendor, plus the ability to add Data Racks as necessary. Also, hardware changes, and the reference architectures will continually be updated.

    To try and answer your questions a bit more directly, in one of the current reference architectures for HP, a control node is a DL380 G7 with 6-core 3.33 GHz processors. Compute nodes would have 6-core 2.93 GHz processors with 96 GB of memory. Disks on a compute node include both internal and external disks. Internally, you would have eight disks per compute node. Currently these would be 300 GB 10K RPM disks. Externally, there would be 10 disks plus one hot spare for each compute node. This is where you could choose to use 1 TB 7200 RPM disks or something smaller and faster like 300 GB 15K RPM disks, depending on your needs. Feel free to contact me and I’ll try and get you in touch with the right people at Microsoft/HP so that you can get all the answers to any questions you have about PDW’s configuration.

Trackbacks and Pingbacks

  1. Parallel Data Warehousing (PDW) Explained | James Serra's Blog
