Barbara Kess and Dan Kogan have written an amazing document on the state of SQL Server 2012 Parallel Data Warehouse. This document steps through PDW’s design and describes why you should expect great results for your data warehouse experience. More importantly, what I liked most about the document was the many customer references describing their experience with SQL Server PDW. These next few quotes just blew me away:
Our queries are completing 76 times faster on PDW. This was after PDW compressed 1.5 TB to 134 GB.
Our daily load now takes 5.5 minutes on PDW whereas it used to take 2.5 hours without PDW. The load is 27 times faster on PDW.
5.5 TB (uncompressed) compressed to 400 GB (14x compression).
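As a quick sanity check on those numbers, here is a small Python sketch that recomputes the load speedup and the compression ratios from the figures quoted above. The helper name `ratio` and the constant `GB_PER_TB` are my own for illustration, and the sketch assumes decimal units (1 TB = 1000 GB), which the whitepaper quotes do not specify.

```python
def ratio(before, after):
    """Return how many times larger `before` is than `after`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Assumption: decimal units (1 TB = 1000 GB); the quotes do not say which.
GB_PER_TB = 1000

# Daily load: 2.5 hours before PDW, 5.5 minutes on PDW (both in minutes).
load_speedup = ratio(2.5 * 60, 5.5)

# Compression: 1.5 TB down to 134 GB, and 5.5 TB down to 400 GB.
compression_a = ratio(1.5 * GB_PER_TB, 134)
compression_b = ratio(5.5 * GB_PER_TB, 400)

print(f"Load speedup:     {load_speedup:.1f}x")
print(f"Compression (a):  {compression_a:.1f}x")
print(f"Compression (b):  {compression_b:.1f}x")
```

Under that assumption the load works out to roughly 27 times faster, and the two compression ratios to roughly 11x and 14x, which lines up with the customer quotes.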
You’ll have to read the document yourself to find out what other customers are saying, including some good links to full reference stories.
Download the whitepaper today
I have a couple sessions in which I’ll be speaking about data warehousing at the PASS Summit in Charlotte, NC. One of them will be a deep dive into SQL Server 2014 Clustered Columnstore Indexes. If you are interested in how clustered columnstore indexes can dramatically improve the performance of your data warehouse on SQL Server 2014, please join us.
SQL Server 2014 Clustered Columnstore Indexes: Friday, Oct. 18th, 2013 1:00 pm Room Ballroom B
Have you experienced the blazingly fast query performance enabled by columnstore indexes and batch mode processing? Are you wondering what’s next for these revolutionary data warehouse features? In this session, we’ll examine new query processing enhancements to extend the benefits of batch mode processing, including updatable clustered columnstore indexes. More query types will benefit from batch mode, and larger proportions of your complex queries will be executed in batch mode. Come learn about these new capabilities for processing data from columnstore indexes and how to take advantage of the benefits.
Here’s a link to the session:
Once again, you may have noticed I’ve been remiss in posting lately. Many things have been going on both professionally and personally… it’s been a busy time. Hopefully this will be the first of many consistent posts about Big Data and SQL Server PDW.
Next week the PASS Summit is in Charlotte, NC. I will be speaking there on a couple different sessions about data warehousing. One of them will be a deep dive on Polybase within SQL Server 2012 Parallel Data Warehouse. If you are interested in how PDW and Hadoop play well together, please join us.
Polybase Deep Dive: Thursday, Oct. 17th, 2013 5:00 pm Room 217 B-C
With the increasing role of Hadoop in capturing and processing raw, non-relational data, integrating Hadoop into existing SQL Server products is becoming essential as we move toward one comprehensive data platform. Polybase provides a set of features for SQL Server 2012 Parallel Data Warehouse (PDW) that enables you to combine Hadoop and relational data in a seamless and fully parallelized fashion via T-SQL. This session will present a deep dive of the Polybase features shipped in SQL Server 2012 PDW, followed by a comprehensive overview of additional Polybase features planned for the next appliance updates.
Here’s a link to the session:
I have a third session at TechED 2013 North America next week. The session is titled What’s New for Columnstore Indexes and Batch Mode Processing, and it takes place on Thursday, June 6th, at 1:00 PM CDT.
We’ll quickly walk through columnstore indexing in general to level set everyone on the technology and how Microsoft implements it in SQL Server. We’ll also speak to batch mode processing and how it helps improve performance significantly. Next we’ll speak to what’s next for SQL Server and columnstore indexes: clustered columnstore indexes. We’ll again speak to improvements in batch mode processing and a new feature for SQL Server vNext: Archival Compression. Want details? Attend the session!
Did I mention demos? Lots and lots of demos will be included throughout to show the benefits of all of these improvements. Finally, we’ll finish the session with best practices that will help you improve the performance of your queries that are hitting columnstore indexes.
If you missed the session, you can catch the recording on Channel 9 here.
TechED is only a week away and, as I’ve noted in a previous post, I have several sessions. I thought I would spend the week running up to TechED 2013 North America whetting your appetite for the presentations, whether you catch them in person or on Channel 9. Today I’ll discuss the session titled Do You Have Big Data? The session is on Tuesday, June 4th, 2013 at 8:30 am. If you are attending TechED, please shake off the previous night on Bourbon Street and head over early to this session on Big Data.
Let me start off with the obvious by saying it would be a real letdown if I had a session called Do You Have Big Data? and it turned out I thought the answer was no. So yes, I think most organizations have it and they just haven’t begun the journey in earnest yet. What the heck is Big Data, you say?
Traditionally we describe it along the three V’s of Velocity, Volume, and Variety.
- Most define Velocity as the speed at which data arrives. I personally think this misses the real issue, as we can handle a great deal of velocity by this definition with traditional data handling methods. But if you define Velocity as the speed at which you can respond to the data, then that is a different question.
- Volume is usually defined as the total amount of data stored. The problem is how much Volume is needed to qualify as big? I would say it varies with the other two V’s, along with your capability to handle it with traditional means.
- Variety is usually described as a data set in which you don’t know the structure of the data as it arrives. Typically I think of documents and web logs as examples that many customers may be encountering.
Some say if you meet two of the three V’s, you have Big Data. Meh.
The problem with the Volume, Velocity, Variety description is that you could be talking about
The same could be said for Water, Humans, or many other objects.
How does any of that help you explain to Business Users how Big Data can solve a problem they have? I don’t have a clue. We are doing what we technologists do all the time… answering a technology question with technical answers. But that doesn’t tell us what something is. So let’s come at it from another angle.
Organizations have been collecting data for a long time now. Data about Customers, Sales, Marketing, Inventory, Production Processes, Fleet Management and on and on. They have struggled for years creating data warehouses in an attempt to get to the single version of the truth. Companies have sprouted Business Intelligence organizations to provide insight into the state of the business. Generally BI Reports tell us what happened. This is called Performance Analytics, and it is a good goal for any organization to have, but the reality is these reports simply replaced reporting an organization previously did within ledgers. Maybe the reporting is slightly prettier, but I probably see as many legacy reports from customers that look like they are straight ports of a ledger report from 1985 as I see really exciting BI Reports that tell me a story.
Predictive Analytics, on the other hand, is a game changer for businesses. Predictive Analytics is all about determining causality between events. If you listen carefully enough, your treasure trove of data will speak to you.
- If I market to a customer, will he buy?
- If I use certain words in my online data profile, will that improve my response rate?
- What products are customers likely to buy during a hurricane watch?
Big Data is a New Paradigm
A different way to think about Big Data is that it refers to things you can do at scale that cannot be done with traditional data methodologies. By leveraging your historical data and combining it with additional data sources, you can create a new form of data in a way that allows the business to see its markets, customers, or prices in a new light.
What can you do with your data so that you can improve your business? Look at the questions above and ask yourself: do you know how to answer the ones that might be relevant to your organization? I understand the derivation may be different, but you should get the idea by now. Let me give you a hint: you’ll usually need to combine multiple data sets, and here is where the essence of Big Data lies.
Big Data is the ability to use the three V’s to your advantage. Use the Volume, Variety, and Velocity to be able to answer new questions that will improve your business. How do you answer what products customers will buy during a hurricane watch? Well, you’ll need lots of historical purchase information, mashed up with some weather data. Walmart determined that in the United States one of those items is Pop Tarts. Now they place them prominently at the front of the store to make it easier for us to get a sugar high during the stress of hurricane season.
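To make the mash-up idea a bit more concrete, here is a minimal Python sketch of joining purchase history with weather data and comparing average daily sales on hurricane-watch days against normal days. Everything in it is hypothetical: the function name `sales_lift`, the product names, and the toy numbers are invented for illustration and are not Walmart’s data or method. A real analysis would run over far more history and would need to control for things like seasonality.

```python
from collections import defaultdict

# Hypothetical toy data, invented purely for illustration.
# Each purchase row is (date, product, units_sold).
purchases = [
    ("2013-06-01", "toaster pastries", 100),
    ("2013-06-01", "bottled water", 200),
    ("2013-06-02", "toaster pastries", 300),
    ("2013-06-02", "bottled water", 500),
    ("2013-06-03", "toaster pastries", 100),
    ("2013-06-03", "bottled water", 200),
]

# Hypothetical weather feed: date -> was a hurricane watch in effect?
weather = {
    "2013-06-01": False,
    "2013-06-02": True,
    "2013-06-03": False,
}


def sales_lift(purchases, weather):
    """Join purchases to weather by date and return, per product, the ratio
    of average daily units on watch days to average daily units on normal days."""
    watch_days = sum(1 for on_watch in weather.values() if on_watch)
    normal_days = len(weather) - watch_days
    if watch_days == 0 or normal_days == 0:
        raise ValueError("need at least one watch day and one normal day")

    # totals[product][True/False] = units sold on watch / normal days
    totals = defaultdict(lambda: {True: 0, False: 0})
    for date, product, units in purchases:
        if date not in weather:
            continue  # inner-join semantics: skip days with no weather record
        totals[product][weather[date]] += units

    lift = {}
    for product, units in totals.items():
        normal_avg = units[False] / normal_days
        if normal_avg > 0:  # skip products with no baseline to compare against
            lift[product] = (units[True] / watch_days) / normal_avg
    return lift


if __name__ == "__main__":
    ranked = sorted(sales_lift(purchases, weather).items(),
                    key=lambda item: item[1], reverse=True)
    for product, lift in ranked:
        print(f"{product}: {lift:.1f}x normal daily sales during a watch")
```

Neither data set answers the question on its own. The signal only shows up once the two are joined by date, which is the point of combining sources.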
In my session at TechED we will discuss these concepts in more depth. We’ll then spend a good deal of time walking through examples of what other organizations are doing with Big Data. We’ll explore the algorithms that Data Scientists use to answer these questions. Finally we’ll discuss what tools Microsoft is making available to democratize this process and make it easier to be a practitioner in this field.
If you missed it, you can catch the recording on Channel 9 here.