What MPP means to SQL Server Parallel Data Warehouse
I stopped by the home page of SQL Server 2008 R2 Parallel Data Warehouse today and noticed this quote:
Microsoft SQL Server 2008 R2 Parallel Data Warehouse (previously code named project “Madison”) is a highly scalable appliance that delivers performance at low cost through a massively parallel processing (MPP).
I’m sure that for many people their first thought is what is Massively Parallel Processing (MPP)? Their second may be why does Microsoft need to build a MPP data warehouse solution anyway?
Up until now, SQL Server has been a Symmetric Multiprocessing (SMP) solution. SMP involves two or more processors tied to a single main memory and are controlled by a single operating system instance. Think of a single server. Most operating systems are configured this way. SMP systems allow any processor to work on any task no matter where in memory that task is located. Each task can only be managed by one processor at any given time. To effectively manage larger and more complex sets of data on a SMP database system, scalability can be achieved by scaling up. When you scale up a database server such as SQL Server, you add more processors, memory, and additional disks on that single server. However, there is a point where one gets diminishing performance returns and run into prohibitively expensive hardware. Currently that point of diminishing returns is somewhere around the 10 TB mark, depending on the database workload.
Massive Parallel Processing systems provide the needed scalability and query performance by running independent servers in parallel. The word Massive refers to the concept of dozens or hundreds of computers tied together achieving a single process. MPP systems scale linearly as you continue to add nodes. The trade-off between SMP and MPP data warehousing servers is between ease of administration and scalability. SMP offers ease of administration, but the addition of another processor does not provide a linear improvement in performance. In an MPP architecture, additional nodes provide near linear improvement in processing power with the additional burden of administrative costs. SQL Server has been a leader in the DBMS industry in making it easy for administrators to manage their systems and SQL Server 2008 R2 Parallel Data Warehouse is no exception. With the ability to add compute nodes as necessary, SQL PDW will be able to handle hundreds of terabyte’s of data volume.
SQL Server PDW implements a shared-nothing architecture where each compute node functions autonomously by controlling its own memory and disk. MPP database systems, such as SQL Server PDW, then distribute parts of the data set on each server participating in the MPP system. On SQL Server PDW we call these servers the compute nodes. This divide and conquer method significantly improves the scalability and performance of a system trying to answer complex queries on large data sets.
How is this achieved in reality? Each compute node starts off as typical SQL Server installation. On an appliance with 10 compute nodes, we have 10 separate installs of SQL Server, with its own processors, memory, and SAN attached disks. There is also a control node, which is where queries are submitted by users. This control node is also a separate install of SQL Server, but it stores no user data. What makes SQL Server PDW different from standard SQL Server are the PDW Engine and the Data Movement Service (DMS). The DMS is essentially the brain that ties all the nodes together. When you submit a query to the control node, it is the PDW Engine that determines what the query plan will be on each individual compute node and then submits that query to all the compute nodes through the DMS. In the simplest of cases, each compute node then acts as an individual SQL Server. The compute node returns rows that satisfy the query back to the control node which gathers the rows together from all compute nodes and streams them back to the user.
With data sets continuing to grow in size, MPP is here to stay. The necessity of MPP appliances will only continue to grow. Getting comfortable with the concepts and technology available now will put you in the drivers seat when your company is ready for the next step in data warehousing.