Here's the opportunity to wear both an architect's and a
developer's hat at the same time.
I think the best way to start thinking about Business Intelligence (BI) is to
put ourselves in a decision maker's shoes (which could be the boss, a client, or
whoever is hiring us to build a BI system for them). From their point of
view, BI is all about connecting Point A to Point B .
Point A represents an all-inclusive umbrella under which the available
enterprise data is stored.
Point B, on the other hand, is an Information Booth where all sorts of
management questions are answered (in form of a report, graph, etc.) with
reasonable response time.
Nevertheless, it would be premature at this point to explore the new Wizards in
BI Development Studio. Yes there are a number of powerful tools for BI
development, and we should definitely learn about every detail aspect of them.
On the other hand, until we're familiar enough with some fundamental aspects of
a BI development project, we don't just want to open this wonderful toolbox and
start playing with all the gadgets it's got.
So how do we get from Point A to Point B?
Understanding the Needs of the Information Booth at Point B
The primary goal of our Information Booth is answering strategic questions from
executives as well as tactical questions from operational managers. The
retrieval mechanism for these answers should also enable them to drill down into
details and trends, and even explore "what if" situations.
How is all of of this possible? In order to build an effective Information
Booth, we need three fundamental elements:
- A place to store the data
- A mechanism to retrieve the data
- A tool to format and present the data
The storage model for a BI Information Booth is commonly known as a data cube,
and over the course of BI development, we'll focus on creation of these cubes:
For our retrieval tool, we'll use the Analysis Services and for our
presentation, we'll use the Reporting Services (both of which are part of SQL
Server 2005). There are also other options available in these areas, which
I will mention later.
Checking Under the Umbrella at Point A
What have we got under the umbrella at Point A? Let's take an inventory
and try to categorize them in a logical way. One of the easiest ways
to do so would be to go by the technology where they were originated. For
example, SQL Server database tables, other database tables, text files, XML
Connecting Point A to Point B
Here are some fundamental questions: How do we load the source data into the
cubes and what are the pre-requisites for such source data being acceptable to
enter the cubes? Here's a short list:
- Source data must be clean. The Information Booth shouldn't need to worry about
data cleanup or missing values.
- If there are heterogeneous data from various sources, they must have already
been converted to a uniform format. The Information Booth shouldn't worry about
- Since time is of the essence, at least part of data groupings and summarizations
must have been done ahead of time. By the time a question reaches the
Information Booth, the wait time to run summary functions will usually be
All of the above concerns may necessitate the creation of another staging
station between Point A (where raw data is) and Point B (where the answers are
retrieved in a timely manner).
The intermediate staging point between the source data and reportable data is
called the Data Mart.
Is a Data Mart Necessary?
In theory, a Data Mart isn't always necessary. In reality, however, this
depends on a very fundamental test. How do we create cubes for Analysis
Services? There's a new tool called the Unified Dimensional Model (UDM)
which populates the cubes. Therefore, the answer to the question of
whether or not we need a Data Mart will really boil down to the following: If
our source data are acceptable to UDM, we may not need a Data Mart; if not, we
definitely need a Data Mart.
Here are four fundamental requirements of UDM for populating the cubes. If
any of these conditions are not met, then a Data Mart will be necessary, so that
UDM can read the data in the Data Mart and create the cubes.
- Dirty data is out of question for UDM. Any validation and/or adjustment to the
source data must have been done before it reaches UDM.
- All non-database sources, such as text or XML, are out of the reach of UDM and
will require a Data Mart.
- Even for database sources, UDM requires an OLEDB provider. Therefore, if a
legacy database doesn't have an OLEDB provider, UDM won't be able to read it.
- In addition to the OLEDB provider, the connection to the database source must be
permanent and physical. In the absence of such a connection, UDM won't work.
Needless to say, even if data sources we have pass the UDM qualification tests,
we'll never know what might be needed tomorrow, or in the next couple of years.
Therefore, as a strategic decision to build a long-lasting BI system, it would
be wise to plan on creating a Data Mart from the beginning.
Selecting Between a Top-Down or a Bottom-Up Approach for Designing the
As we'll find out shortly, one of the major steps in our upcoming Development
Roadmap will be designing the cubes. Before we do so, however, we must
know whether we'll be using a top-down or a bottom-up approach. Here the
difference in a nutshell:
In a top-down approach, we'll first design the cubes without necessarily knowing
their datasource. Later on, once the cube datasources are identified and
finalized, we'll go back to the cube design and add the datasource.
In a bottom-up approach, on the other hand, we'll focus on the datasources
first. Once all the necessary datasources are known, we'll proceed to the
cube design. That way, we'll be able to tie each cube to its datasource
while designing the cube for the first time.
There's no magic rule as to which approach is better or more suitable for
our cube design process. For a simple architecture, or if we're just building a
prototype for evaluation and approval purposes, the bottom-up approach may work
just fine. In a real-life complex situation, on the other hand, we
probably cannot afford to wait for identification of all datasources
before starting the cube design. Therefore, the top-down approach will
more likely fit that situation better.
The Terminology Maze
We have already come across many new terms and acronyms in this article.
Like any new technology, when building a BI system, we're not just dealing with
tools of the trade but also with numerous terms that are new to us -- and it
doesn't end there either. Part of the challenge will also be to sort out
these new terms and acronyms, and also to be prepared for discovering that not
everyone on our team will have the same understanding of and definition for
So before it gets too late, let's wear the architect's hat and familiarize
ourselves with some of the fundamental terminologies:
- Online Transaction Processing (OLTP) encompasses all databases which contain the
BI's source data, as they are generated during day-to-day operation of the
- Online Analytical Processing (OLAP) is the counterpart of OLTP in the BI world
and must contain verified, clean, and stable data readily available for timely
retrieval and reporting.
- A Measure is a numerical value that will be summarized in different ways, before
it can be presented to the user. Units Sold and Dollar
Amount of Sales are typical examples of a measure.
- A Fact Table is the storage place for detailed values of a measure, before they
are summarized. Furthermore, one Fact Table may contain detail values for
multiple measures. For example, a Fact Table may consist of the following five
columns, representing detail values for Units Sold and
Dollar Amount of Sales before they are summarized:
- Item Number
- Date of Sales
- Units Sold
- Dollar Amount of Sales
- A Dimension is the grouping categorization of an aggregated measure. If you're
familiar with the SQL SELECT statement, think of the Dimension as your "GROUP
BY" clause. Typical examples of a dimension are Products,
Services, Time (by month, by quarter, by year)
and Location (by State, by Region, by Time-zone).
- A Data Mart is where the measures and dimensions are stored, before they are
loaded into the OLAP cubes. Data Marts are usually built based on either a Star
Schema or a Snowflake Schema .
- A Data Mart which is based on a Star Schema has two types of tables: A Fact
Table at the center of the star (where Measures are stored) and many Dimension
Tables (where Dimensions are stored). In a Star Schema, each dimension table is
only joined to the fact table. A Star Schema Data Mart is de-normalized in
order to reduce the number of table joins and improve retrieval speed.
- A Data Mart which is based on a Snowflake Schema also has two types of tables: A
Fact Table at the center of the snowflake (just like the star schema) and many
Dimension Tables that are chained together (unlike the star schema) and then
connected to the fact table only at one end of the chain. A Snowflake Schema
Data Mart looks more like normalized relational databases that are commonly seen
in OLTP designs.
- A Cube is a summarized representation of a measure by three dimensions. For
example, we can have a cube which summarizes Dollar Amount of Sales by Item,
Location, and Month.
Now let's wear the developer's hat for a moment, where we'll be dealing with an
additional set of new terminology and jargons. For example, once we start
using the tools and wizards in the BI Development Studio, sooner or later we'll
discover the underlying language isn't just the familiar Transact-SQL (T-SQL).
Instead, we have to learn a few additional terms such as the following:
- Multi-Dimensional Expression (MDX) is a new language which provides mathematical
and navigational functions for OLAP Cubes. MDX can be called from T-SQL.
- Data Mining Expression (DMX) is also a new language which is used for Data
Mining. DMX can also be called by T-SQL.
- XML for Analysis Services (XMLA) is a new protocol for interacting Analysis
Services data over an HTTP connection. XMLA uses the Simple Object Access
Architectural Models of OLAP
When the time comes to implement OLAP, we'll have to be familiar with its three
- Relational OLAP (ROLAP) -- In this model, the leaf-level measures and the
preprocessed aggregates are left in relational databases (or data marts) and
only the dimensional data are stored in cubes. This results in a faster load
time but a slower retrieval time.
- Multi-dimensional OLAP (MOLAP) -- In this model, the leaf-level measures, the
preprocessed aggregates, and the dimensional data are all stored in cubes. This
results in a slower load time but a faster retrieval time.
- Hybrid OLAP (HOLAP) -- In this model, the leaf-level measures are left in
relational databases (or data marts) and the preprocessed aggregates and the
dimensional data are stored in cubes. We can call this the best of both
I think we can categorize the basic tools for building and maintaining a BI
system as follows
- Maintenance and/or conversion of the source data into a suitable format for
loading into the BI databases:
- Visual Studio .NET (for C# and VB.Net Windows applications and ASP.Net web
- Other tools and technologies according to the needs of the source systems
- BI Development Studio
- Integration Services (for helping with conversion of source data from various
formats into a unified format and populating the Data Marts)
- Analysis Services (for defining data mining models and creating OLAP Cubes)
- Reporting Services (for presenting analysis results to the users)
- SQL Server Management Studio, for day-to-day maintenance of:
- Relational Databases
- OLAP Databases
Although the main focus of this article is to highlight BI development using
Microsoft SQL Server and Visual Studio, there are other 3rd party products that
can be used as alternative options. One area in particular where alternate
tools are commonly adopted is reporting and charting of the retrieved data.
Here are two examples in this area:
The Development Roadmap
I would like to conclude this introductory article with an overall roadmap and
look at where we came from and where we're going to. Although there's no
single way of building a roadmap for BI development, this should at least give
us a good idea:
- Gain general familiarity with the objectives, tools, and terminologies.
- Design the Data Marts.
- Use Integration Services to Populate the Data Marts.
- Design the cubes, using the Cube Wizard.
- Based on the cube's structure, use the Schema Generation Wizard to build its
datasource. The datasource will be a relational database that will hold the
data before being loaded into the cube.
- Load data into the relational databases.
- Load data into the cubes. It's important to remember that we'll be spending a
lot of time on designing and populating these OLAP cubes, which will be the
heart of our BI system where pre-calculated aggregates of the measures (based on
their corresponding dimensions) will be stored.
- Deploy and Process the cubes.
- Browse the cubes to verify the dimensions and measures within each, using the
Browser in the Cube Designer.
- Build reporting and presentation objects.
- Deploy the solutions for test and user-acceptance.
- Run necessary tests and obtain sign-off.
- Deploy to production.
Developing a new Business Intelligence system is a major undertaking for any
organization -- large or small. Aside from costs and other resources
involved, there's an inevitable element of uncertainty that cannot be ignored,
particularly since BI crosses many organizational levels and boundaries.
Therefore, it is highly recommended to start off a new BI system by launching a
pilot project to create an evaluation prototype first. Once everyone signs
off on the prototype, then a full-scale development can start with a lot more
confidence and probability of success.
- Delivering Business Intelligence with Microsoft SQL Server 2005 - Brian Larson
- Microsoft SQL Server 2005 ANALYSIS SERVICES Step by Step - Reed Jacobson
- Applied Microsoft Analysis Services 2005 - Teo Lachev
For further information, please refer to our feedback