For the last few months I’ve been hunkered down in my bunker trying to figure out Big Data. Let me tell you, it is an elusive animal. Trying to explain Big Data to people is difficult. Ask two different people what it is and you will get two different answers. Ask the same person on two different days and you will likely get two different answers. Heck, I may give you two different answers in this blog post. About six months ago, I moved to a group within Microsoft called the Big Data Center of Expertise. Sometimes we call ourselves the Big Data / Business Intelligence Center of Expertise. Now that’s a mouthful. But I think it’s kind of instructive, because what we did was create a COE around two topics that have been notoriously hard to define. Let’s take ourselves back a decade…
Explaining Business Intelligence (BI)
Remember the early days of Microsoft Business Intelligence?
“Which product is it?”
Well, it’s not a product at all, it’s a concept. On the other hand, it’s a collection of products. Microsoft BI includes SQL Server Analysis Services, SQL Server Reporting Services, and SQL Server Integration Services. In reality, it also includes Excel as a key component. As a concept, Business Intelligence is a set of practices, architectures, and processes, aligned with specific technologies, to transform transactional data into reporting that supports better decision making. It grew from there into actionable dashboards, alerts, data mining, and many of the other things we now think of as BI.
In the beginning this was hard to describe, but over time people seemed to understand that BI was most everything you did to get data either into or out of a data warehouse and provide it back to the business in some really cool ways. Back in the day, I would sometimes get so exasperated trying to explain BI that I would give up and say, “It’s basically a marketing term to describe a set of products that many customers don’t yet know they need in order to stay competitive.” Let me tell you, that didn’t help them understand it at all.
Now we mostly know that BI often sits on top of a data warehouse, but it doesn’t have to. Additionally, not all data warehouses have any BI on top of them. Sometimes BI is driven directly out of the transactional system, and Microsoft has gone to great pains to make sure that organizations can do this with many of our products (PowerPivot is a great example). Also, our stack of BI products has grown to include PowerPivot, Power View, and SSAS Tabular mode. Many of these tools can now be hosted natively in SharePoint, providing a central location for users to access the intelligence we are extracting. BI has grown up, and people generally get it.
Explaining Big Data
Here we are in the early days of the Big Data Revolution.
“Which product is it?”
Well, it’s not a product at all, it’s a concept (we’ll come back to this). On the other hand, it’s a collection of products. The Microsoft Big Data stack includes HDInsight Service, HDInsight Server, StreamInsight, and some would include SQL Server Parallel Data Warehouse. In reality, Excel is also a key component, as we include a Hive ODBC driver that allows Excel to import data directly from Hive on HDInsight. One might argue that to describe Big Data you would say, “Big Data is a set of products that many customers don’t yet know they need in order to stay competitive.” Many of these tools will be hosted natively in Microsoft Azure and SharePoint.
See what I did there?
Explaining Big Data, Part II
Big Data is any data that is difficult to process using traditional relational database technologies and methods due to the size and complexity of the data. What that means is that we need new tools to handle it. In order to scale out appropriately, we will use technologies that are massively parallel in nature. Organizations may use a distributed data warehouse appliance such as SQL Server Parallel Data Warehouse in their first foray into big data. Some will argue this isn’t truly big data yet, since it can’t scale into the dozens of petabytes. I say let’s apply the previous definition to it: if a customer has gotten to the point that their traditional technologies and methodologies no longer allow them to handle the data sizes they are currently experiencing, and SQL Server PDW helps alleviate this pain, then this is Big Data for them. We can call it Medium Data if you want. As the complexity of the data increases or the size of the data continues to grow, we will likely need another scalable technology such as Hadoop (HDInsight). There are other tools available for specific scenarios, like handling fast-moving streams of data. For that kind of data you may need a complex event processing tool like StreamInsight, which allows you to report on the data before you store it in the warehouse.
At the risk of being too cute with analogies, this might help those who frequent Starbucks understand the scope of data:
A more appropriate way to visualize it may be this:
Big Data is the natural evolution of new sets of tools and methodologies to handle new data sizes and complexities. We’ll slowly learn them and incorporate them into our organizations. A decade from now we’ll wonder what all the hubbub was about. Sometime between now and then I’ll probably be blogging about the new tool to handle Trenta-sized data.
PowerPivot and Power View are two of the most touted and demoed tools from the Microsoft Business Intelligence 2012 stack. You may have heard that Excel 2013 includes both of these add-ins by default. So you’ve fired up that new copy of Excel and are wondering where the heck they are. Well, the add-ins are installed by default, but they are not enabled. Here’s how to enable them:
1. Open Excel 2013
2. Click on File:
3. Click Options:
4. Click Add-Ins:
5. Change the Manage dropdown to COM Add-ins and click the Go button.
6. Ensure both Microsoft Office PowerPivot for Excel 2013 and Power View are checked and then click OK. (Why one has such a long and stuffy name while the other one is much more informal, I have no clue. Why one has a space between “Power” and the verb and the other one doesn’t is another mystery to me.)
7. Once enabled, you’ll find PowerPivot on the Home ribbon:
You’ll find Power View under Insert:
I can count on one hand the books I have in both physical and digital format: the Bible, Don DeLillo’s Underworld, and all of Ralph Kimball’s books. Specifically, both my copies of The Data Warehouse Toolkit are severely marked up and dog-eared. When it comes to data warehouse modeling, no book comes close to providing as much usable material as this tome. So I was wondering exactly what more I could learn by attending the Kimball University course. Could I really understand fact and dimension tables any better than I already did from reading Kimball material regularly over the last ten years? Yes, evidently I could. A couple weeks ago, I attended Kimball’s Dimensional Modeling in Depth class in San Diego, CA, and am very glad I did.
The course took place over four days, the first two taught by Margy Ross and the second two taught by Ralph Kimball. They both did an excellent job of teaching the fundamentals of dimensional modeling and then getting into more esoteric cases as the week went on. Below you will see the high-level course agenda from their website. Don’t let it fool you; there are a lot of details included in between the lines.
Dimensional Modeling Fundamentals
Retail Sales Case Study
Order Management Design Workshop
Inventory Case Study
Billing Design Review Exercise
Slowly Changing Dimensions
Credit Card Design Workshop
Insurance Case Study
Financial Applications, General Ledger, and Budgeting Value Chain
Retail Bank Account Tracking Workshop
ETL Backroom Dimensional Designs
Customer Behavior Patterns and Modeling Challenges
Real Time Tracking
At the end of the week, I felt like I went from understanding dimensional modeling from the Kimball perspective to it being a deeper, ingrained part of my DNA. By that, I mean that I now understand that there really isn’t ONE way to do things; often you can do something correctly using a couple of different techniques. The key is understanding the fundamentals and living with the modeling choices you make. Before, I would see two choices and wonder which one was correct. Now I understand that sometimes both can be right. There were multiple lab and group exercises that really brought this to light.
One technique I had never given much thought to, but which was referred to quite a bit during the class, is the simultaneous Type 1/Type 2 Slowly Changing Dimension. In this case you want to be able to report all history in terms of current profiles while at the same time maintaining history using the correct historical profiles. This comes in handy when there are multiple applications in the customer environment that want to report on the data differently.
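To make that concrete, here is a minimal sketch of what such a dimension table might look like. The customer dimension, its column names, and the sales territory attribute are hypothetical examples of mine, not something taken from the course material:

create table DimCustomer (
    CustomerKey int identity(1,1) primary key,        -- surrogate key
    CustomerID varchar(20) not null,                   -- natural/business key
    CustomerName varchar(100) not null,
    SalesTerritoryHistorical varchar(50) not null,     -- Type 2: preserved on each row as it was at the time
    SalesTerritoryCurrent varchar(50) not null,        -- Type 1: overwritten on all of the customer's rows when it changes
    RowEffectiveDate date not null,
    RowExpirationDate date not null,
    CurrentRowFlag char(1) not null default 'Y'
);

Facts joined to this dimension can then be rolled up by either the historical profile (SalesTerritoryHistorical) or the current profile (SalesTerritoryCurrent), which is exactly the dual reporting requirement described above.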
As for the accommodations, I have no complaints. The training took place at the San Diego Bayfront Hilton. Breakfast and lunch were included as part of the training tuition and were generally pretty good. When we arrived, there was a black portfolio engraved with “Kimball University” waiting for us. Does anyone use these obtrusive writing pads anymore? I have a whole pile of portfolios I’ve received over the last few years from different events. I do everything in OneNote and don’t have much use for them. I probably should have just given it back. Also included were two books from the Kimball Library. You can choose between hard copy and Kindle editions of any of their currently published books. I chose a hard copy of The Kimball Group Reader and a Kindle edition of The Data Warehouse ETL Toolkit. I already have a copy of The Kimball Group Reader, but it will be a great gift for the right person.
I highly recommend this course to anyone who wants to get more deeply engaged in dimensional modeling and learn from the best.
There are a few ways to get data into SQL Server from HDInsight Service. You could use Sqoop to push data from HDInsight to SQL Server, use SSIS to pull data from HDInsight to SQL Server, or set up a linked server on SQL Server to query Hive. In this post, I’m going to explore the latter option. There are a few steps you need to take to get the configuration set up correctly. At a high level you need to:
- Configure HDInsight to allow ODBC connections.
- Set up the ODBC connection on your SQL Server.
- Set up the linked server on the SQL Server.
- Query to your heart’s desire.
Configure HDInsight to allow ODBC connections.
Notice the little lock on the open ports tile. This indicates that our ODBC connection is locked down. Click on the Open Ports tile.
Toggle the ODBC Server entry so that its status is set to Open. That’s it; we’ve configured the cluster to accept ODBC connections.
Set up the ODBC Connection on your SQL Server.
From your SQL Server, click on the downloads tile. (On a separate note, notice the lock on the Open Ports tile?)
Click on the appropriate link for your SQL Server. Since this is your SQL Server, I’m assuming the appropriate link for you is the 64 bit version. If not, shame on you – It’s not 2004 anymore. Follow the directions to get it installed.
On your SQL Server, click Start and type ODBC. This will bring up the Data Sources dialog. Go to the System DSN tab and click Add. Configure it with a data source name, your host information, and a username. I called mine HiveOnAzure so that I can differentiate the @server, @srvproduct, and @datasrc values in the linked server command below.
Set up the Linked Server
Go to SQL Server Management Studio and execute the following query. I replaced the contents of User ID and Password with stars. Use your HDInsight username and password for your cluster here.
EXEC master.dbo.sp_addlinkedserver @server = N'HiveDW', @srvproduct=N'HIVE', @provider=N'MSDASQL', @datasrc=N'HiveOnAzure',
@provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=*****; Password=******;'
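Before querying, it can be worth a quick sanity check that the linked server responds. A minimal sketch, using the built-in sp_testlinkedserver procedure against the linked server name defined above:

EXEC sp_testlinkedserver N'HiveDW';   -- returns an error if the linked server can't be reached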
Now for the good stuff. If you don’t have any data set up in Hive yet, you can always query the sample table:
select * from openquery (hivedw, 'select * from hivesampletable')
I had a table created already; here is my result set:
Now, if I need to load a table in SQL Server with data from HDInsight, I can do it without having to use Sqoop or access my cluster.
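One way to do that, sketched below, is a SELECT ... INTO over OPENQUERY; the target table name dbo.HiveSampleImport is just a hypothetical example, and the source is the sample table queried above:

select *
into dbo.HiveSampleImport                                   -- hypothetical target table, created by SELECT INTO
from openquery(HiveDW, 'select * from hivesampletable');

From there, the data lives in an ordinary SQL Server table and can be indexed, joined, and loaded like anything else.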
I’m going to have a series of posts that essentially answer use case questions. “How do I…” kind of posts. Running through my head, I want to know how to load data into HDInsight Service from SQL Server and how to load data into SQL Server from HDInsight Service. Should I use SSIS? Should I use Sqoop? Should I do something else? We’ll answer these one at a time. Today the use case is going to be that I have data in a SQL Server VM on Windows Azure and I want to load it into a Hive table on the HDInsight Service. There are a few configuration steps I need to take to set up connectivity between the two systems, and then we’ll load data.
The data set I’m using is from the Research and Innovative Technology Administration’s Bureau of Transportation Statistics. It’s essentially a compilation of 20 years’ worth of airline arrival on-time statistics, a topic near and dear to many of us consultants.
To get started setting up a VM in Windows Azure, check out this post: http://www.windowsazure.com/en-us/manage/windows/common-tasks/install-sql-server/. This provides great guidance around opening up the proper firewall rules and creating a SQL user that you will need later when accessing the data remotely. If you can connect SQL Server Management Studio from your laptop to the VM in the cloud and browse the SQL Server objects, then Sqoop will be able to access it as well.
On the SQL VM, I created a database called AirStats and then a table called ontime (below) and loaded two years’ worth of data (1.3 GB of data). I’ll do a larger test later, but just wanted to get something working for this test.
create table ontime (
    Year int, Month int, DayofMonth int, DayOfWeek int,
    DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int,
    UniqueCarrier varchar(5), FlightNum int, TailNum varchar(8),
    ActualElapsedTime int, CRSElapsedTime int, AirTime int,
    ArrDelay int, DepDelay int, Origin varchar(3), Dest varchar(3), Distance int,
    TaxiIn int, TaxiOut int, Cancelled int, CancellationCode varchar(1), Diverted varchar(1),
    CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int
);
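The post doesn’t show the load step itself. As one hedged sketch, if the downloaded on-time extracts are comma-delimited files sitting on the VM, a BULK INSERT per yearly file would do it (the file path and header assumption below are hypothetical):

bulk insert dbo.ontime
from 'C:\data\ontime_2011.csv'        -- hypothetical path to one yearly extract on the VM
with (
    fieldterminator = ',',
    rowterminator = '\n',
    firstrow = 2                      -- skip the header row, if the file has one
);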
Next it’s time to run Sqoop from the head node on the HDInsight cluster. Log into the remote desktop console of HDInsight (click on the big square that says Remote Desktop) and launch the Hadoop Command Line prompt from the desktop. A couple of notes about the code below: I’ve starred out the username and password for obvious security reasons; you would actually include your username and password in the command. --table is the table that will be imported. --target-dir is the HDFS destination directory. The -m switch is the number of map tasks to import in parallel.
c:\apps\dist\sqoop-1.4.2\bin>sqoop-import.cmd --connect "jdbc:sqlserver://brianwmitchell1.cloudapp.net;username=*****;password=*****;database=AirStats" --table ontime --target-dir asv://brianwmitchell/ontime -m 1
It took 4 minutes and 11 seconds to move the data over. We’ll worry about performance later by playing with the parallelism switch, adding a clustered index to the source table, and whatever else I can think of. Once the data is safely in Azure storage, you can create a Hive table over it. Here I create an external Hive table over the data:
create external table ontime (
    -- same columns as the SQL Server ontime table above, with varchar mapped to Hive string
    Year int, Month int, DayofMonth int, DayOfWeek int, DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int, UniqueCarrier string, FlightNum int, TailNum string, ActualElapsedTime int, CRSElapsedTime int, AirTime int,
    ArrDelay int, DepDelay int, Origin string, Dest string, Distance int, TaxiIn int, TaxiOut int, Cancelled int, CancellationCode string, Diverted string, CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int )
row format delimited
fields terminated by ','
stored as textfile
location 'asv://brianwmitchell/ontime';
Now let’s check on the data. Running a quick query from the Hive interactive console shows that I now have data.
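For example, a quick row count over the new external table from the interactive console is enough to confirm the import (a minimal check, nothing fancy):

select count(*) from ontime;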
Now I’ve got to figure out something to do with it. That’s next.