Pointer Corporation
The Information Technology Architects
A Decision Maker's Overview of Business Intelligence in SQL Server 2005

Here's the opportunity to wear both an architect's and a developer's hat at the same time.


I think the best way to start thinking about Business Intelligence (BI) is to put ourselves in a decision maker's shoes (which could be the boss, a client, or whoever is hiring us to build a BI system for them).  From their point of view, BI is all about connecting Point A to Point B.

Point A represents an all-inclusive umbrella under which the available enterprise data is stored.

Point B, on the other hand, is an Information Booth where all sorts of management questions are answered (in the form of a report, a graph, etc.) within a reasonable response time.

Nevertheless, it would be premature at this point to explore the new wizards in BI Development Studio.  Yes, there are a number of powerful tools for BI development, and we should definitely learn about every aspect of them in detail.  On the other hand, until we're familiar enough with some fundamental aspects of a BI development project, we shouldn't just open this wonderful toolbox and start playing with all the gadgets it's got.

So how do we get from Point A to Point B?

Understanding the Needs of the Information Booth at Point B

The primary goal of our Information Booth is answering strategic questions from executives as well as tactical questions from operational managers.  The retrieval mechanism for these answers should also enable them to drill down into details and trends, and even explore "what if" situations.

How is all of this possible?  In order to build an effective Information Booth, we need three fundamental elements:

  1. A place to store the data
  2. A mechanism to retrieve the data
  3. A tool to format and present the data

The storage model for a BI Information Booth is commonly known as a data cube, and over the course of BI development we'll focus on creating these cubes.

For our retrieval tool we'll use Analysis Services, and for our presentation we'll use Reporting Services (both of which are part of SQL Server 2005).  There are also other options available in these areas, which I'll mention later.

Checking Under the Umbrella at Point A

What have we got under the umbrella at Point A?  Let's take an inventory and try to categorize the sources in a logical way.  One of the easiest ways to do so is to group them by the technology in which they originated: SQL Server database tables, other database tables, text files, XML files, and so on.

Connecting Point A to Point B

Here are some fundamental questions: How do we load the source data into the cubes, and what are the prerequisites for that source data to be acceptable to the cubes?  Here's a short list:

  1. Source data must be clean.  The Information Booth shouldn't need to worry about data cleanup or missing values.
  2. If there is heterogeneous data from various sources, it must already have been converted to a uniform format.  The Information Booth shouldn't worry about data conversion.
  3. Since time is of the essence, at least part of the grouping and summarization work must have been done ahead of time.  By the time a question reaches the Information Booth, the wait time to run summary functions will usually be unacceptable.

All of the above concerns may necessitate the creation of a staging area between Point A (where the raw data is) and Point B (where the answers are retrieved in a timely manner).

The intermediate staging point between the source data and reportable data is called the Data Mart.
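
To make the three requirements above concrete, here's a minimal T-SQL sketch of a cleansing and pre-aggregation step that loads source rows into a Data Mart.  All table and column names (RawSales, Staging.Sales, and so on) are hypothetical placeholders, not part of any real system.

    -- A minimal cleansing/conversion sketch (requirements 1 and 2).
    -- RawSales is a hypothetical source table; Staging.Sales is a
    -- hypothetical Data Mart staging table.
    INSERT INTO Staging.Sales (ItemNumber, SaleDate, Location, UnitsSold, SalesAmount)
    SELECT
        LTRIM(RTRIM(r.ItemNumber)),            -- trim stray whitespace
        CONVERT(datetime, r.SaleDate, 112),    -- unify dates (yyyymmdd style)
        UPPER(ISNULL(r.Location, 'UNKNOWN')),  -- fill missing values, unify casing
        ISNULL(r.UnitsSold, 0),
        CAST(r.SalesAmount AS money)           -- unify the amount's data type
    FROM dbo.RawSales AS r
    WHERE r.SalesAmount IS NOT NULL;           -- reject rows we cannot repair

    -- Requirement 3: summarize ahead of time, so the Information Booth
    -- never waits on summary functions.
    SELECT ItemNumber, Location,
           YEAR(SaleDate)  AS SaleYear,
           MONTH(SaleDate) AS SaleMonth,
           SUM(UnitsSold)   AS TotalUnits,
           SUM(SalesAmount) AS TotalSales
    INTO   Staging.MonthlySalesSummary
    FROM   Staging.Sales
    GROUP BY ItemNumber, Location, YEAR(SaleDate), MONTH(SaleDate);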

Is a Data Mart Necessary?

In theory, a Data Mart isn't always necessary.  In practice, however, this depends on a very fundamental test: How do we create cubes for Analysis Services?  SQL Server 2005 introduces the Unified Dimensional Model (UDM), which populates the cubes.  Therefore, the question of whether or not we need a Data Mart boils down to the following: if our source data is acceptable to the UDM, we may not need a Data Mart; if not, we definitely do.

Here are four fundamental requirements the UDM imposes for populating the cubes.  If any of these conditions is not met, a Data Mart will be necessary, so that the UDM can read the data from the Data Mart and create the cubes.

  1. Dirty data is out of the question for the UDM.  Any validation and/or adjustment of the source data must be done before it reaches the UDM.
  2. All non-database sources, such as text or XML files, are out of the UDM's reach and will require a Data Mart (see the sketch after this list).
  3. Even for database sources, the UDM requires an OLE DB provider.  Therefore, if a legacy database doesn't have an OLE DB provider, the UDM won't be able to read it.
  4. In addition to the OLE DB provider, the connection to the database source must be permanent and physical.  In the absence of such a connection, the UDM won't work.
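
As a concrete illustration of requirement 2, here's a hedged T-SQL sketch that lands a text-file source in the Data Mart so the UDM can reach it.  The file path, delimiters, and the DataMart.SalesImport table are all hypothetical.

    -- Import a comma-delimited text file into a hypothetical Data Mart table.
    BULK INSERT DataMart.SalesImport
    FROM 'C:\Imports\daily_sales.txt'
    WITH (
        FIELDTERMINATOR = ',',   -- comma-separated fields
        ROWTERMINATOR   = '\n',  -- one record per line
        FIRSTROW        = 2      -- skip the header row
    );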

Needless to say, even if the data sources we have today pass the UDM qualification tests, we never know what might be needed tomorrow, or in the next couple of years.  Therefore, as a strategic decision toward a long-lasting BI system, it would be wise to plan on creating a Data Mart from the beginning.

Selecting Between a Top-Down or a Bottom-Up Approach for Designing the Cubes 

As we'll find out shortly, one of the major steps in our upcoming Development Roadmap will be designing the cubes.  Before we do so, however, we must decide whether we'll be using a top-down or a bottom-up approach.  Here's the difference in a nutshell:

In a top-down approach, we'll first design the cubes without necessarily knowing their datasource.  Later on, once the cube datasources are identified and finalized, we'll go back to the cube design and add the datasource.

In a bottom-up approach, on the other hand, we'll focus on the datasources first.  Once all the necessary datasources are known, we'll proceed to the cube design.  That way, we'll be able to tie each cube to its datasource while designing the cube for the first time.

There's no magic rule as to which approach is better or more suitable for our cube design process.  For a simple architecture, or if we're just building a prototype for evaluation and approval purposes, the bottom-up approach may work just fine.  In a real-life complex situation, on the other hand, we probably can't afford to wait for all datasources to be identified before starting the cube design, so the top-down approach will likely fit that situation better.

The Terminology Maze

We have already come across many new terms and acronyms in this article.  As with any new technology, when building a BI system we're not just dealing with the tools of the trade but also with numerous terms that are new to us -- and it doesn't end there, either.  Part of the challenge is sorting out these new terms and acronyms, and being prepared to discover that not everyone on our team will have the same understanding of and definition for them.

So before it gets too late, let's wear the architect's hat and familiarize ourselves with some of the fundamental terminology:

  • Online Transaction Processing (OLTP) encompasses all databases that contain the BI system's source data, as generated during the day-to-day operation of the business.
  • Online Analytical Processing (OLAP) is the counterpart of OLTP in the BI world and must contain verified, clean, and stable data that is readily available for timely retrieval and reporting.
  • A Measure is a numerical value that will be summarized in different ways before it can be presented to the user.  Units Sold and Dollar Amount of Sales are typical examples of a measure.
  • A Fact Table is the storage place for the detailed values of a measure, before they are summarized.  One Fact Table may contain detail values for multiple measures.  For example, a Fact Table may consist of the following five columns, representing detail values for Units Sold and Dollar Amount of Sales before they are summarized (a star-schema sketch built around this table follows this list):
    1. Item Number
    2. Date of Sales 
    3. Location
    4. Units Sold
    5. Dollar Amount of Sales  
  • A Dimension is the categorization by which a measure is grouped and aggregated.  If you're familiar with the SQL SELECT statement, think of a Dimension as your "GROUP BY" clause.  Typical examples of a dimension are Products, Services, Time (by month, by quarter, by year), and Location (by state, by region, by time zone).
  • A Data Mart is where the measures and dimensions are stored before they are loaded into the OLAP cubes.  Data Marts are usually built based on either a Star Schema or a Snowflake Schema.
  • A Data Mart based on a Star Schema has two types of tables: a Fact Table at the center of the star (where measures are stored) and many Dimension Tables (where dimensions are stored).  In a Star Schema, each dimension table is joined only to the fact table.  A Star Schema Data Mart is denormalized in order to reduce the number of table joins and improve retrieval speed.
  • A Data Mart based on a Snowflake Schema also has two types of tables: a Fact Table at the center of the snowflake (just like the star schema) and many Dimension Tables that are chained together (unlike the star schema) and connected to the fact table only at one end of the chain.  A Snowflake Schema Data Mart looks more like the normalized relational databases commonly seen in OLTP designs.
  • A Cube is a summarized representation of one or more measures by a set of dimensions (the classic mental picture uses three).  For example, we can have a cube that summarizes Dollar Amount of Sales by Item, Location, and Month.
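
To tie these terms together, here's a minimal star-schema sketch in T-SQL, built around the five-column Fact Table described above.  All table and column names are hypothetical.

    -- Three dimension tables; in a Star Schema each joins only to the fact table.
    CREATE TABLE DimItem     (ItemNumber  int PRIMARY KEY, ItemName varchar(50));
    CREATE TABLE DimDate     (DateKey     int PRIMARY KEY, SaleDate datetime,
                              SaleMonth   int, SaleYear int);
    CREATE TABLE DimLocation (LocationKey int PRIMARY KEY, City varchar(50),
                              State char(2), Region varchar(20));

    -- The Fact Table stores the detail values of both measures.
    CREATE TABLE SalesFact (
        ItemNumber  int REFERENCES DimItem (ItemNumber),
        DateKey     int REFERENCES DimDate (DateKey),
        LocationKey int REFERENCES DimLocation (LocationKey),
        UnitsSold   int,         -- measure: Units Sold
        SalesAmount money        -- measure: Dollar Amount of Sales
    );

    -- A Dimension really is the GROUP BY clause: summarize Dollar Amount
    -- of Sales by Region, year, and month.
    SELECT l.Region, d.SaleYear, d.SaleMonth,
           SUM(f.SalesAmount) AS TotalSales
    FROM SalesFact AS f
    JOIN DimDate     AS d ON d.DateKey     = f.DateKey
    JOIN DimLocation AS l ON l.LocationKey = f.LocationKey
    GROUP BY l.Region, d.SaleYear, d.SaleMonth;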

Now let's wear the developer's hat for a moment, where we'll be dealing with an additional set of new terminology and jargon.  For example, once we start using the tools and wizards in the BI Development Studio, sooner or later we'll discover that the underlying language isn't just the familiar Transact-SQL (T-SQL).  Instead, we have to learn a few additional terms, such as the following:

  • Multidimensional Expressions (MDX) is a new language that provides mathematical and navigational functions for OLAP cubes.  MDX can be called from T-SQL (see the sketch after this list).
  • Data Mining Extensions (DMX) is also a new language, used for data mining.  DMX can likewise be called from T-SQL.
  • XML for Analysis (XMLA) is a new protocol for interacting with Analysis Services data over an HTTP connection.  XMLA uses the Simple Object Access Protocol (SOAP).
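
As a small illustration of how MDX can be called from T-SQL, here's a hedged sketch that queries a cube through a linked server defined with the MSOLAP OLE DB provider.  The linked-server, catalog, cube, and member names are all hypothetical.

    -- Point a linked server at a local Analysis Services database.
    EXEC sp_addlinkedserver
        @server     = 'SSAS_LINK',
        @srvproduct = '',
        @provider   = 'MSOLAP',
        @datasrc    = 'localhost',
        @catalog    = 'SalesOLAP';

    -- Pass an MDX query through OPENQUERY: Dollar Amount of Sales by month.
    SELECT *
    FROM OPENQUERY(SSAS_LINK,
        'SELECT [Measures].[Sales Amount] ON COLUMNS,
                [Dim Date].[Month].Members ON ROWS
         FROM [Sales Cube]');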

Architectural Models of OLAP

When the time comes to implement OLAP, we'll have to be familiar with its three architectural models:

  1. Relational OLAP (ROLAP) -- In this model, the leaf-level measures and the preprocessed aggregates are left in relational databases (or Data Marts), and only the dimensional data is stored in cubes.  This results in a faster load time but a slower retrieval time.
  2. Multidimensional OLAP (MOLAP) -- In this model, the leaf-level measures, the preprocessed aggregates, and the dimensional data are all stored in cubes.  This results in a slower load time but a faster retrieval time.
  3. Hybrid OLAP (HOLAP) -- In this model, the leaf-level measures are left in relational databases (or Data Marts), while the preprocessed aggregates and the dimensional data are stored in cubes.  We can call this the best of both worlds.

Necessary Tools

I think we can categorize the basic tools for building and maintaining a BI system as follows:

  • Maintenance and/or conversion of the source data into a suitable format for loading into the BI databases:
    1. Visual Studio .NET (for C# and VB.NET Windows applications and ASP.NET web applications)
    2. Other tools and technologies according to the needs of the source systems
  • BI Development Studio
    1. Integration Services (for helping with conversion of source data from various formats into a unified format and populating the Data Marts)
    2. Analysis Services (for defining data mining models and creating OLAP Cubes)
    3. Reporting Services (for presenting analysis results to the users)
  • SQL Server Management Studio, for day-to-day maintenance of:
    1. Relational Databases
    2. OLAP Databases

Alternative Tools

Although the main focus of this article is to highlight BI development using Microsoft SQL Server and Visual Studio, there are third-party products that can be used as alternatives.  One area in particular where alternative tools are commonly adopted is the reporting and charting of the retrieved data.


The Development Roadmap

I would like to conclude this introductory article with an overall roadmap -- a look at where we came from and where we're going.  Although there's no single way of building a roadmap for BI development, this should at least give us a good idea:

  • Gain general familiarity with the objectives, tools, and terminologies.
  • Design the Data Marts.
  • Use Integration Services to populate the Data Marts.
  • Design the cubes, using the Cube Wizard.
  • Based on the cube's structure, use the Schema Generation Wizard to build its datasource.  The datasource will be a relational database that will hold the data before being loaded into the cube.
  • Load data into the relational databases.
  • Load data into the cubes.  It's important to remember that we'll be spending a lot of time designing and populating these OLAP cubes; they will be the heart of our BI system, where pre-calculated aggregates of the measures (based on their corresponding dimensions) are stored.
  • Deploy and Process the cubes.
  • Browse the cubes to verify the dimensions and measures within each, using the Browser in the Cube Designer.
  • Build reporting and presentation objects.
  • Deploy the solutions for test and user-acceptance.
  • Run necessary tests and obtain sign-off.
  • Deploy to production.

Concluding Thoughts

Developing a new Business Intelligence system is a major undertaking for any organization -- large or small.  Aside from the costs and other resources involved, there's an inevitable element of uncertainty that cannot be ignored, particularly since BI crosses many organizational levels and boundaries.  Therefore, it is highly recommended to start a new BI system by launching a pilot project to create an evaluation prototype first.  Once everyone signs off on the prototype, full-scale development can start with a lot more confidence and a much higher probability of success.

Suggested Readings:

  • Delivering Business Intelligence with Microsoft SQL Server 2005 - Brian Larson
  • Microsoft SQL Server 2005 Analysis Services Step by Step - Reed Jacobson
  • Applied Microsoft Analysis Services 2005 - Teo Lachev
