Ladies and Gentlemen, TechED 2013 is less than a month away. In case you didn’t know, TechED is in New Orleans this year. I haven’t been to New Orleans in almost 20 years, so I’m looking forward to it. I started out the year with the modest goal of getting a TechED session accepted. Well, one thing led to another and the next thing you know, I had agreed to do three sessions at TechED NA. Luckily, the material has come together pretty well, and at this point I’m really concentrating on what the demos will look like.
A really cool aspect of my TechED sessions is their breadth; the three interest me on very different levels. My first session is on Monday, when I’ll be introducing a roomful of people to SQL Server 2012 Parallel Data Warehouse. Even if you thought you knew SQL Server PDW, come to this session to find out about the radical changes the product group has made to the architecture to make it faster and more accessible at the same time. As many of you know, I moved over to the Big Data Center of Expertise within Microsoft Services last year, so you won’t be surprised that I have a session on Big Data. If you are still grappling with what Big Data is and want to better understand what the heck you might do with it, come to my session on Tuesday. Finally, on Thursday I have a session on Columnstore Indexing. I promise you won’t be disappointed, as we’ll not only discuss what we have today but also take a quick peek into what the next release of SQL Server has for Columnstore. Below are the abstracts; see you there!
If you happen to come across this after TechED, or if you aren’t going to be able to make it, you can find all three of my sessions here on Channel 9.
DBI-B304 Large-Scale Data Warehousing and Big Data with SQL Server Parallel Data Warehouse V2
Breakout Monday June 3 4:45 PM – 6:00 PM
The new version of SQL Server Parallel Data Warehouse has many new features and functionality that make it a compelling solution for your data warehouse. In this session we will explore the features that make this a groundbreaking release for Microsoft.
DBI-B325 Do You Have Big Data? (Most Likely!)
Breakout Tuesday June 4 8:30 AM – 9:45 AM
The meaning of Big Data, the hot buzzword of 2012, is still a mystery to many in the field. This session walks you through many scenarios in which companies are leveraging Big Data Tools. We discuss the sources of data that are often involved in Big Data solutions and how those sources provide additional insight for an organization. In this session, we explore concrete examples and demos of various data sets and determine whether or not they are Big Data. With each example, once we’ve come to a conclusion of the type of data it is, we explore when and why to choose a schema on write solution (traditional SQL/PDW) or a schema on read solution (Hadoop).
DBI-B322 What’s New for Columnstore Indexes and Batch Mode Processing?
Breakout Thursday June 6 1:00 PM – 2:15 PM
Have you experienced the blazingly fast query performance enabled by Columnstore indexes and batch mode processing? Are you wondering what’s next for these revolutionary data warehouse features? We’re adding some new query processing enhancements to extend the benefits of batch mode processing. More query types will benefit from batch mode, and larger proportions of your complex queries will be executed in batch mode. In this session, learn about the new capabilities for processing data from Columnstore indexes and how to get access to these benefits.
It seems like most of the information you will find around Hadoop focuses on how to use the tools in its ecosystem; there is much less documentation on architectural best practices. I believe this is by design, because Hadoop is supposed to just work. But if you’ve used it for any length of time, you know that this really isn’t true, and some guidance on improving performance is always helpful. Microsoft has released three new white papers on this subject for its version of Hadoop, HDInsight. They are well written and provide good guidance for three subject areas that are going to come up time and time again for anyone implementing HDInsight.
Compression in Hadoop
Hadoop Performance in Hyper-V
Job Optimization in Hadoop
Ok kids, time to get excited. There is a great deal of activity around SQL Server 2012 Parallel Data Warehouse. As of March 1st, you can order what was previously known as “PDW V2”. This post collects pointers to several other pages and posts providing good information on the product.
The SQL Server PDW landing page has been updated with information around the 2012 product.
What’s New in SQL Server 2012 Parallel Data Warehouse
Up to 50x performance gains with the xVelocity updateable columnstore.
Up to 100x performance gains over legacy warehouses.
Seamless integration with “Big Data” using PolyBase.
Multi-petabyte data capacity.
2.5x lower price per terabyte than SQL Server 2008 R2 PDW and lowest price per terabyte in industry.
I’ll do some separate posts on the specs of the hardware and the like in the near future. What is clear here is that SQL Server PDW now packs much more punch in a much smaller footprint. A single data rack appliance now literally comes in a single rack, whereas before it took two racks (control + compute). This smaller footprint means a much lower cost for the appliance, even as PDW maintains all the high availability standards you’ve come to expect from an appliance. I’ll tell you this much right now: this is a game changer for Microsoft and SQL Server PDW.
In a nutshell, PolyBase is a technology in SQL Server 2012 PDW that allows you to query data from both your SQL Server PDW and Hadoop environments at the same time. You can even join that data together for new insights that aren’t easy to get today.
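To make that concrete, here is a minimal sketch of what a PolyBase query could look like. The table names, columns, and HDFS path are my own illustrations rather than anything from a real system, so check the PDW documentation for the exact external table options on your appliance:

    -- Hypothetical sketch: expose a delimited file set in Hadoop as an external
    -- table, then join it to a regular PDW table in a single T-SQL query.
    CREATE EXTERNAL TABLE dbo.ClickStream
    (
        UserId    INT,
        EventDate DATE,
        Url       VARCHAR(500)
    )
    WITH
    (
        LOCATION = 'hdfs://hadoop-head-node:8020/logs/clickstream/',  -- assumed HDFS location
        FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
    );

    -- Join the Hadoop data to a relational dimension for new insight.
    SELECT c.CustomerAlternateKey, COUNT(*) AS PageViews
    FROM dbo.DimCustomer AS c
    JOIN dbo.ClickStream AS cs
        ON cs.UserId = c.CustomerKey
    GROUP BY c.CustomerAlternateKey;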
I’m sure I’ll do a technical rundown of PolyBase in a future post, but for now you’ll have to check out some material posted by Microsoft:
Want to know more about who created PolyBase? There is a cool blog post with interviews of the Jim Gray Systems Lab team at the University of Wisconsin.
Is there more?
Did you miss Ted Kummert’s Keynote on SQL Server 2012 PDW from the PASS Summit? Watch this video.
PASS Business Analytics Conference
There are a couple of important sessions on SQL Server 2012 PDW at the PASS Business Analytics Conference. I’ll be there, will you?
PDW Architecture Gets Real: Customer Implementations, Brian Walker and Murshed Zaman
Relational and Non-relational Data Living in Peace and Harmony, Artin Avanes
Choosing your distribution key is one of the most important decisions you make when architecting any database on SQL Server PDW. A distributed table in SQL Server PDW is a table with rows distributed across several physical storage destinations. Each of these physical locations is known as a distribution. For each compute node you have in the appliance, you have 8 distributions, and any row inserted into the distributed table will live in one and only one of those distributions. The location of that row is determined by a deterministic hash function applied to the value of the column you choose as the distribution key.
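As a quick illustration, here is what declaring a distributed table looks like in PDW T-SQL. The table and column names are made up for the example:

    -- Each row is hashed on OrderKey to exactly one distribution (8 per compute
    -- node), so all rows sharing an OrderKey value land in the same distribution.
    CREATE TABLE dbo.FactOrders
    (
        OrderKey     BIGINT NOT NULL,
        OrderDateKey INT    NOT NULL,
        CustomerKey  INT    NOT NULL,
        SalesAmount  MONEY  NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH (OrderKey)
    );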
Guidance for choosing a good distribution key includes:
Choose a column with high cardinality
- Choose a column with many more unique values than the number of distributions in your appliance, ideally by an order of magnitude or more. If you have 80 distributions, choose a column that has several hundred or more unique values. If you have 320 distributions in your appliance, you should be looking for a distribution key that has at least 500 unique values. Either way, you should validate your choice by using DBCC PDW_SHOWSPACEUSED (see the sketch after this list).
Choose a column that is not skewed
- Skew means there are many more rows on average in one or more distributions than in the others.
Choose a column that is not usually used as a filter in the WHERE clause
- For example, if you choose a date key as the distribution key and then filter on OrderDate = 20130214, you are querying a single distribution while the others sit idle.
Choose a column that does not allow NULLs
- If you do choose a column that does allow NULLs, ideally the number of NULLs will not exceed the average number of rows for any other distribution value.
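To validate a candidate key against these rules, DBCC PDW_SHOWSPACEUSED reports row counts and space per distribution, which makes skew easy to spot. A minimal sketch, using the hypothetical table from above:

    -- One result row per distribution; large differences in the row counts
    -- across distributions indicate a skewed distribution key.
    DBCC PDW_SHOWSPACEUSED ("dbo.FactOrders");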
What happens when the distribution key is nullable and you know there will be many NULLs? If there are many more NULL keys than other keys, you will be introducing skew into your database. For example, if 25% of the rows in your table have NULL for the distribution key, then all of those rows will fall into a single distribution. If that skew is significant enough, it will cause performance degradation for queries hitting this distributed table.
Here you have a couple of options:
Choose a different distribution key
- Depending on your situation, this may be an easy option. Usually the question comes down to: are you using that same distribution key in many other distributed tables, hoping to get locality when doing distributed-table-to-distributed-table (often fact-to-fact) joins? If you are, it may be more difficult to change this one table to another key, because you will lose the locality that you desire.
Keep the NULLs but break out your table.
Create two tables.
- The first is with the original schema including the nullable distribution key.
- The second is a new one with the same schema except choosing another column for the distribution key based on the rules above.
Change your ETL (ELT)
- All rows with a value for the distribution key go into the original table. This table is now evenly distributed.
- All rows with a NULL value for this column go into the second table with the new distribution key. This table will be evenly distributed.
Create a view over both tables with the original table name.
- This will minimize data movement when you join this table to the other distributed tables based on the original distribution key.
Let’s use the venerable AdventureWorksDW database to illustrate this. But let’s assume that instead of selling just bikes, they sold all sorts of products: most physical, and some virtual like games and music downloads. What would our FactInternetSales table look like? There are lots of keys that could serve as the distribution key.
Imagine a scenario where, for whatever reason, we really wanted SalesTerritoryKey to be our distribution key. Most likely it is because we are using it in many other tables in the database: it satisfies all the criteria above and provides a useful distribution key everywhere except in this one table. In FactInternetSales, we don’t know the SalesTerritoryKey for digital downloads, so that value is null for 20% of the rows in this table. This makes SalesTerritoryKey a poor choice for a distribution key in this table, but we really don’t want to change it because it works so well in all the other tables in our data warehouse. This is where you can use the solution above to create two new tables with two different distribution keys and place the view over them, as sketched below.
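Here is a sketch of that pattern against our imagined FactInternetSales. The column list is trimmed for readability, and the choice of CustomerKey as the alternate distribution key is just an assumption for the example:

    -- Rows with a known territory keep the desired distribution key.
    CREATE TABLE dbo.FactInternetSales_Territory
    (
        ProductKey        INT   NOT NULL,
        OrderDateKey      INT   NOT NULL,
        CustomerKey       INT   NOT NULL,
        SalesTerritoryKey INT   NOT NULL,
        SalesAmount       MONEY NOT NULL
    )
    WITH (DISTRIBUTION = HASH (SalesTerritoryKey));

    -- Digital downloads have no territory; distribute them on another
    -- high-cardinality column instead.
    CREATE TABLE dbo.FactInternetSales_NoTerritory
    (
        ProductKey        INT   NOT NULL,
        OrderDateKey      INT   NOT NULL,
        CustomerKey       INT   NOT NULL,
        SalesTerritoryKey INT   NULL,      -- always NULL in this table
        SalesAmount       MONEY NOT NULL
    )
    WITH (DISTRIBUTION = HASH (CustomerKey));

    -- The view keeps the original name so existing queries and
    -- fact-to-fact joins on SalesTerritoryKey still work.
    CREATE VIEW dbo.FactInternetSales AS
    SELECT ProductKey, OrderDateKey, CustomerKey, SalesTerritoryKey, SalesAmount
    FROM dbo.FactInternetSales_Territory
    UNION ALL
    SELECT ProductKey, OrderDateKey, CustomerKey, SalesTerritoryKey, SalesAmount
    FROM dbo.FactInternetSales_NoTerritory;

Your ETL (ELT) then routes rows with a NULL SalesTerritoryKey to the second table and everything else to the first.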
Do I suggest using this solution on a regular basis? No. I would always try to refactor the table to use an appropriate distribution key that handles the situation without breaking it up into two tables. There is usually a good solution that doesn’t require this kind of major architectural change to your database. But every once in a while you will find yourself stuck and this can be an effective solution.
Microsoft and Hortonworks recently announced the release of the beta for Hortonworks Data Platform (HDP) for Windows. In short: Hadoop on Windows, a 100% Apache Hadoop-compatible distribution that runs on Windows servers. This means Hadoop users can port their current Hadoop applications from a Linux environment to Windows Server today. Hortonworks has a very good webpage dedicated to HDP on Windows.
You can download HDP 1.1 for Windows here.
Is this the HDInsight Server that we’ve heard so much about?
No, HDP for Windows is the foundation for HDInsight Server, which will be available from Microsoft at a future date. HDInsight Server is an on-prem solution that has full integration with other Microsoft enterprise tools such as Active Directory and System Center for deployment and provisioning. Hortonworks Data Platform for Windows is ideal for those who have some experience with Hadoop and want to deploy on Windows today.
Download HDInsight Server Developer Preview here.
How is that different from HDInsight Service?
HDInsight Service is Microsoft’s cloud offering for organizations that want lower upfront costs and elasticity in their Big Data solution.
All three solutions are 100% Apache Hadoop compatible. So you can build on HDP today, move it to HDInsight Server tomorrow, and move the solution to the cloud on HDInsight Service when you are ready.
Try out HDInsight Service here.
Why have three solutions?
Some people like to build their own solutions from scratch, and they can use HDP. Some want to eliminate most complexity but still want some control over their environment, and they will likely use HDInsight Server on premises. Finally, some couldn’t care less about the technology and just want a system that will get them the results they are looking for; these users will likely choose HDInsight Service on Azure.
I’m an occasional cyclist and you can pretty much put most cyclists into three camps:
Those who build their bike from scratch. They buy a frame from one vendor, handlebars from another, wheels from another. They spend hours mulling over the decisions of which part is a couple grams lighter than another. They love spending the time in the garage assembling their masterpiece.
Those who buy a really nice carbon bike but make several modifications, such as new carbon wheels and a specific crankset. They spend some time in the shop adjusting everything to get it just right for them and their style. They don’t spend nearly as much time as those in camp #1 preparing their ride, but they don’t take the decisions lightly either.
Those who go buy a Trek Madone, adjust the seat, and go ride.
At the end of the day, all three bikes are very comparable. When you are riding in a group, it doesn’t really matter which of the three camps you fall into. We are all doing the same thing: pedaling at about 100 rpm, hydrating regularly, drafting as much as possible, and finally giving it our all in the last few miles. How your bike was constructed has little bearing on the ride, because let’s face it, that factory-built Madone is a damn good bike; but your comfort level with how your ride was built affects your mindset and performance. Do you want to spend lots of time building and riding your bike, or do you just want to ride?
Whether you choose to build your own solution today with Hortonworks Data Platform for Windows, wait for HDInsight Server to deploy a Hadoop solution with Microsoft tools, or use HDInsight Service in Azure to spin up a cluster in 10 minutes, you will be running the same MapReduce jobs. Some of you will want to build a specialized solution that is perfect for your organization, and others will just want to code cool solutions and won’t care as much about the platform that gets them done. Microsoft provides choice because people naturally demand it.