3.1 A Data Warehouse Defined
3.2 Different Data For Different Uses
3.3 Need For Better Analysis Tools
3.4 Types of Data Warehousing Applications
3.5 Evolution of Data Across Three Server Platforms
3.6 Benefits of Data Warehousing
   3.6.1 Has a subject area orientation
   3.6.2 Integrates data from multiple, diverse sources
   3.6.3 Allows for analysis of data over time
   3.6.4 Adds ad hoc reporting and enquiry
   3.6.5 Provides analysis capabilities to decision makers
   3.6.6 Relieves the development burden on IT
   3.6.7 Provides improved performance for complex analytical queries
   3.6.8 Relieves processing burden on transaction oriented databases
   3.6.9 Allows for a continuous planning process
   3.6.10 Converts corporate data into strategic information
3.7 Costs of Data Warehousing
   3.7.1 Time spent in careful analysis of measurable needs
   3.7.2 Design and implementation effort
   3.7.3 Hardware costs
   3.7.4 Software costs
   3.7.5 On-going support and maintenance
   3.7.6 Resulting re-engineering effort
3.8 Data Warehousing Processes and Functions
3.9 Important Data Warehousing Concepts
   3.9.1 OLTP: OnLine Transaction Processing
   3.9.2 OLAP: OnLine Analytical Processing
   3.9.3 MDD: Multi-Dimensional Database
   3.9.4 ROLAP: Relational OLAP
   3.9.5 Facts and Dimensions
   3.9.6 Drill-down and Roll-up
   3.9.7 Aggregation and Granularity
3.10 Multi-dimensional Models - Data Cubes and Star Schema
   3.10.1 The Data Cube Model
   3.10.2 Star Schema
   3.10.3 Evaluation Rules for Multi-dimensional Analysis Tools
3.14 Commercially Available Data Warehousing Tools
   3.14.1 The Data Warehousing Institute "Roadmap"
   3.14.2 The "Data Warehousing Information Center"
This section summarises data warehousing concepts found in White Papers published on the web by vendors, consultants, practitioners and researchers, as well as material found in journals and reference books and from experience. Where possible, examples and illustrations are taken from experiences with the Avondale College data warehouse.
In 1993, the "father of data warehousing", Bill Inmon,
gave this definition of a data warehouse:
Figure 2. Definition of a Data Warehouse(2)
Many of the concepts and practices of data warehousing have existed for years, but it is only within the last few years that the term has acquired "buzz word" status. While it is true that software is available for automating some of the data warehouse processing, a data warehouse is not a product - it is not something that can be purchased from a vendor. Rather, it is a model of a corporation's data, put together in such a way that it answers the corporation's business questions.
According to this definition, a data warehouse differs from an operational database in four important ways.

| Data Warehouse | Operational Database |
| --- | --- |
| subject oriented | application oriented |
| integrated | multiple, diverse sources |
| time variant | current data only |
| non-volatile | continuously updated |
An operational database is designed primarily to support day to day operations. A data warehouse is designed to support strategic decision making.
One of the fundamental assumptions of data warehousing is that operational data needs to be stored separately in a different format in order to support data analysis. This diagram illustrates the fact that different sets of users access the data, using different sets of applications and for different purposes.
Figure 3. Operational and Informational Data(3)
A statement from The IBM Information Warehouse Solution describes the clear distinction that needs to be made between operational and informational data.
"Most organizations need two different data environments, one optimized for operational applications and one optimized for informational applications. For example, operational applications and databases are typically optimized for fast response time and typically cannot tolerate the impact on response time created when accessed by an informational application. The two types of applications are fundamentally different. If the same data environment is used to support both, the performance, capability and benefit of both will be compromised."(4)
Within a data warehouse implementation itself, the following types of data will be required to support typical uses:
These are described more fully in The IBM Information Warehouse Solution.
Oracle has published a useful table that compares three different types of database: OLTP, report/enquiry DSS and OLAP.

| Characteristics | RDBMS OLTP | RDBMS DSS | MDBMS OLAP |
| --- | --- | --- | --- |
| Level of analytical requirements | Low | Medium | High |
| Amount of data per transaction | Small | Small to large | Large |
| Data level | Detail | Detail to summary | Mostly summary |
| Age of data | Current | Historical and current | Historical, current and projected |
As data processing systems record more and more data, it is not enough for IT to produce more and more management reports to be viewed by decision makers. In "Relational OLAP: Expectations & Reality", an Arbor Software White Paper, the author states that "organizations are attempting to maximize the business value of the data that are available in ever increasing volumes from operational systems, spreadsheets, external databases and business partners. It is not enough to simply view this data - business value comes from using it to make better informed decisions more quickly, and creating more realistic business plans. In the past, these decisions have often been made based on 'gut feel' and experience rather than solid data, analyses and tested hypotheses. With the flattening of management structures, re-engineered businesses and globalization, the need for better analysis tools is greater than ever."(6)
Data warehousing systems target at least three different types of applications: personal productivity applications, data query and reporting applications, and planning and analysis applications.
These are described well in a White Paper from Arbor Software entitled "The Role of the Multidimensional Database in a Data Warehousing Solution".
"As with all information systems, it is best to view data warehousing's core components against a framework that focuses not on technology, but on the business applications the system is designed to address. In general, the applications served by data warehousing can be placed in one of three main categories.
"Personal productivity applications such as spreadsheets, statistical packages and graphics tools, are useful for manipulating and presenting data on individual PCs. Developed for a standalone environment, these tools address applications requiring only small volumes of warehouse data.
"Data query and reporting applications deliver warehouse-wide data access through simple, list-oriented queries, and the generation of basic reports. These reports provide a view of historical data but do not address the enterprise need for in-depth analysis and planning.
"Planning and analysis applications address such essential business requirements as budgeting, forecasting, product line and customer profitability, sales analysis, financial consolidations and manufacturing mix analysis--applications that use historical, projected and derived data.
"These planning and analysis requirements, referred to as on-line analytical processing (OLAP) applications, share a set of user requirements that cannot be met by applying query tools against the historical data maintained in the warehouse repository. The planning and analysis function mandates that the organization look not only at past performance but, more importantly, at the future performance of the business. It is essential to create operational scenarios that are shaped by the past, yet also include planned and potential changes that will impact tomorrow's corporate performance. The combined analysis of historical data with future projections is critical to the success of today's corporation."(7)
The following table relates multidimensional databases (OLAP) to OLTP and relational database servers.
In this table, the warehouse repository and the multidimensional database together constitute the decision support servers.

| | OLTP Server | Relational Database (Warehouse repository) | Multidimensional Database (OLAP) |
| --- | --- | --- | --- |
| Business significance | Mission critical | Informational critical | Management critical |
| Access mode | Singular, simple update queries | Singular, simple list-oriented queries | Iterative, comparative, analytical investigation |
| Access process | IS-supported queries | IS-assisted or preplanned queries | IS-independent ad hoc navigation, investigation and drill-down |
| Response characteristics | Fast update, varied query response | Varied query response | Fast/consistent response |
| Content scope | Application specific; limited historical data | Historical/archival data | Projected/"what if" and derived data |
| Data detail level | Transaction detail | Cleansed & summarized | Summarized, aggregated & consolidated using complex computations (ratios, allocations, variance, time series and gross margin %) |
| Data structure | Normalized | Denormalized | Multidimensional, hierarchical |
| Data structure design goal | Update | Query | Analysis |
| Adaptability | Limited | Low; changes require significant resources | High; easily modified |
| Computer hardware investment required | Extensive/high cost hardware | Moderate/medium cost hardware | Minimal/low cost hardware |
Data warehousing is being hailed as one of the most strategically significant developments in information processing in recent times. One of the reasons for this is that it is seen as part of the answer to information overload.
Some of the benefits of data warehousing that were seen as relevant to the Avondale College project are listed here. The points highlighted in Bill Inmon's definition give some of the reasons why data warehousing is regarded as important.
Data is arranged by subject rather than by application, and is more intuitive for users to navigate. This is closer in concept to the way decision makers think about their business.
One of the initial motivations for this data warehousing project at Avondale College was that we had multiple, diverse sources of data, and integrating them into a single administrative suite was known to be a long term project for which there was neither the staff, the funds nor the time. Many information needs were not being met, and the integration provided by a data warehouse was seen as very desirable.
Table 1 (Information System Components and Platforms) illustrates the diversity of data formats and platforms currently in use at Avondale College. Data that is currently stored in separate indexed flat file and Databus emulation systems and that is relevant to the pilot project has been integrated in the data warehouse. This is of immediate benefit to knowledge workers who are now able to formulate their own queries and write their own ad hoc reports. It is also used as the basis for the aggregation and summarization that makes up the multi-dimensional database.
The operational database provides detailed current information, often with last year's data available for comparison, but analytical queries typically require much more. Plotting trends and looking for relationships over time, at all of the possible levels of aggregation, is not provided for in the operational database.
With a data warehouse, data snapshots taken at times that are significant to the decision making process make it possible to analyze trends over time. A typical example is to monitor the numbers of applications received, accepted and rejected, and the number of applicants who actually enroll. Snapshots can be taken at specified points in time over the months, weeks and days leading up to and immediately following registration. These can be plotted over recent years, and provide one way to assess the impact of promotional programs, changes in course offerings, and changes in government regulations. This can readily indicate areas where further analysis is warranted.
In addition to the points highlighted in the definition, a data warehouse is designed to be accessible with end-user tools, and this allows ad hoc reporting and analysis by end-users.
Until quite recently, management information meant hardcopy summaries and exception reports. Some of these were produced by automated processes running daily, but others required considerable manual preparation and could be delivered weeks or even months after the relevant time period had closed. A growing need to work interactively with the data has been identified, and is referred to generically as analysis.
"Today's markets are much more competitive and dynamic than those in the past. Business enterprises prosper or fail according to the sophistication and speed of their information systems, and their ability to analyze and synthesize information using those systems. The number of individuals within an enterprise who have a need to perform more sophisticated analysis is growing."(9)
Knowledge workers have been attempting to meet this need using spreadsheets and simple, general purpose report writers based on data extracts prepared by the IT department.
"Most notably lacking has been the ability to consolidate, view, and analyze data according to multiple dimensions, in ways that make sense to one or more specific enterprise analysts at any given point in time. This requirement is called 'multi-dimensional data analysis.' Perhaps a better and more generic name for this type of functionality is on-line analytical processing (OLAP), wherein multi-dimensional data analysis is but one of its characteristics." (10)
"OLAP is made up of numerous, speculative 'what-if' and/or 'why' data model scenarios executed within the context of some specific historical basis and perspective." (11)
"There are typically a number of different dimensions from which a given pool of data can be analyzed. This plural perspective, or Multi-Dimensional Conceptual View appears to be the way most business persons naturally view their enterprise." (12)
The multiple data dimensions correspond to data consolidation paths. This is what allows the end-user to choose to view certain data dimensions while aggregating over others. The choices of which dimensions to view and whether to "drill down" or "roll up" are made interactively, and may reveal new or unanticipated relationships in the data. This is referred to as dynamic data analysis.
"Dynamic data analysis can provide an understanding of the changes occurring within a business enterprise, and may be used to identify candidate solutions to specific business challenges as they are uncovered, and to facilitate the development of future strategic and tactical formulae." (13)
Figure 4. Sample Data Consolidation Paths(14)
Each level in the above diagram (State, Gender, Year in course, etc.) corresponds to a dimension in a model used to analyze facts such as student performance. While the diagram seems to imply a fixed hierarchy, this is actually not the case: any combination of any number of dimensions can be analyzed and displayed in graphical or tabular form using typical MDA tools.
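The way MDA tools let a user view chosen dimensions while rolling up over the rest can be sketched in a few lines of Python. The student records and dimension names below are hypothetical, not taken from the Avondale data:

```python
# Hypothetical fact records: each carries a value for every dimension
# plus a measure ('enrolled') to be aggregated.
facts = [
    {"state": "NSW", "gender": "F", "year": 1, "enrolled": 40},
    {"state": "NSW", "gender": "M", "year": 1, "enrolled": 35},
    {"state": "VIC", "gender": "F", "year": 2, "enrolled": 20},
    {"state": "VIC", "gender": "M", "year": 1, "enrolled": 25},
]

def view(facts, dims):
    """Show the measure by the chosen dims, rolling up over all others."""
    out = {}
    for f in facts:
        key = tuple(f[d] for d in dims)
        out[key] = out.get(key, 0) + f["enrolled"]
    return out

# Keep only 'state'; gender and year are rolled up automatically.
print(view(facts, ["state"]))   # {('NSW',): 75, ('VIC',): 45}
```

Changing the `dims` argument is, in effect, the choice the desktop tool makes each time the user drags a dimension into or out of the viewing area.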
At Avondale, there was a wealth of corporate data that was virtually inaccessible to users because each request for information required code to be written by IT, which already had a large backlog of requests. With a data warehouse and users trained in the use of appropriate desktop tools, users can find answers to their own questions.
Online Transaction Processing (OLTP) systems are optimized for update and reporting with relatively simple cross-table joins. Multi-dimensional analysis (MDA) based on OLTP databases suffers in performance for two reasons. Firstly, the normalized nature of OLTP databases and the multi-dimensional nature of typical query and analysis operations mean that many joins have to be done just to look up or decode coded data such as course names and student acceptance status meanings. Secondly, and more importantly as noted in Aggregate-Query Processing in Data Warehousing Environments, "aggregate queries are frequent in decision support applications, where large history tables often are joined with other tables and aggregated. Because the tables are large, better optimization of aggregate queries has the potential to result in huge performance gains. Unfortunately, aggregation operators behave differently from standard relational operators like select, project, and join."(15)
A data warehouse provides improved performance for complex analytical queries by denormalization and aggregation. Frequently used aggregates are often precomputed and materialized in views commonly known as summary tables. These materialized views provide fast access to integrated data, regardless of the original data sources.
Precomputing the data integration and aggregation is sometimes known as the eager or in-advance approach; the approach that must be taken when no data warehouse is available is referred to as lazy or on-demand. Taking the eager approach is one of the fundamental benefits of data warehousing.
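The difference between the eager and lazy approaches can be sketched as follows. The detail rows and period values are hypothetical; the point is where the aggregation work happens:

```python
# Hypothetical detail table: (year, quarter, amount).
detail = [("2001", "Q1", 100), ("2001", "Q2", 120), ("2002", "Q1", 90)]

# Eager/in-advance: the summary is materialized once, at warehouse load time.
summary_by_year = {}
for year, quarter, amount in detail:
    summary_by_year[year] = summary_by_year.get(year, 0) + amount

def query_eager(year):
    return summary_by_year[year]    # a simple lookup at query time

def query_lazy(year):
    # On-demand: every query re-scans (and in practice re-joins) the detail.
    return sum(a for y, q, a in detail if y == year)

print(query_eager("2001"))  # 220 -- both approaches agree on the answer
print(query_lazy("2001"))   # 220 -- but the lazy one pays the cost per query
```

With real history tables the lazy scan is the expensive join-and-aggregate the quotation above describes, which is why the eager approach pays off.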
Not only do analytical queries give frustratingly poor response time for those who are performing them, they are also very demanding on resources needed for adequate transaction processing performance. This problem is not solved simply by installing bigger, faster database servers. The optimization required for one environment is fundamentally different from what is required in the other. A more effective solution to the performance problem for both OLTP and OLAP is to separate the data into independently designed structures, as a minimum step. In the case of very large data warehouses, it becomes necessary to use separate hardware as well as physically separate databases.
"An intensively competitive and continuously changing business climate is driving (educational institutions) to more frequently evaluate business structures and the allocation of resources. In this environment, the once-a-year budget is being replaced by a continuous planning process. The rolling forecast and other ad hoc analyses are becoming the principal vehicles for providing management with the information needed to make fast-paced business decisions and to manage resource allocations." (17)
In the education marketplace, competition is increasing as universities develop more aggressive marketing strategies. Economic and political factors are having a marked effect on the numbers of students enrolling in institutions of higher learning, leading to reduced intakes at many universities. Demographics indicate that smaller family sizes mean that the pool of potential students is no longer large enough to support the growth that many universities are aiming to achieve. This has led to "re-focusing" on core objectives and "re-engineering" of fundamental processes. In addition to these internal pressures, universities are facing increased competition from overseas.
Avondale College is actively pursuing the development of its marketing capacity. To avoid a threatened enrollment crisis, strategic planning exercises have been conducted with the help of consultants. The renewed focus on enrollment planning and management will depend very heavily on the Data Warehouse project described in this report.
The objective is to make planning a continuous process by providing powerful, integrated systems so that rolling forecasts and ad hoc analyses incorporate the level of detail, department head involvement and analytical richness of budgeting. The key to an integrated system for planning, analysis and reporting is the data warehouse and appropriate desktop tools.
In a White Paper entitled "Multidimensional Analysis: Converting Corporate Data into Strategic Information", Arbor Software state:
"Information is a strategic weapon in today's fast-paced global business environment. Companies are realizing that the key to successful competition and growth lies in their abilities to quickly obtain the right information for spotting trends, forecasting market changes, and analyzing performance. And in an effort to better manage the sheer volume of data available, they have invested heavily in information systems and technologies at both the corporate information system level and on individuals' desktops." (18)
| Spreadsheets | Relational Database Management Systems (RDBMS) |
| --- | --- |
| single user tools only | enterprise in scope |
| limited to simple analyses and small data sets | includes central repository, online transaction processing (OLTP) and large data sets |
| slow consolidation | inflexible in design, requiring extensive consulting or programming to obtain satisfactory performance |
| often restricted by proprietary structures and memory shortage requirements | slow for ad hoc analysis; require extensive training |
Through the use of data warehousing and appropriate desktop tools, the information overload can be reduced to meaningful strategic information, and the limitations of traditional tools can be overcome. The many dimensions of typical data can be summarized down to just those of interest. Viewing other dimensions is accomplished simply by clicking a dimension's title button and dragging it into the viewing area.
There are downsides, of course. Data warehouses can be extremely expensive to build and maintain, and have a high failure rate unless there is the right mix of high need, a powerful sponsor, and a reasonably short time scale. The impact they can have on traditional views of data ownership and on organizational structures can also be quite disruptive.
Some of the more noticeable costs associated with data warehousing are listed in the following table, and are described below.
1. Time spent in careful analysis of measurable needs
2. Design and implementation effort
3. Hardware costs
4. Software costs
5. On-going support and maintenance
6. Resulting re-engineering effort
Especially in the initial stages of the first pilot project, it can be very time-consuming and frustrating trying to establish needs that can be measured and that are important in decision making processes. End users will struggle to grasp what it is that they are being asked to do. A typical end user response is "give me what I say I want, then I can tell you what I really want."
Time will need to be spent demonstrating examples, where available, and showing prototypes of what is being developed. This process needs to be done repeatedly, and the system that is developed must be flexible. To be successful, the end user must be able to explore the possibilities. They need to be able to find answers to questions they had not thought of asking.
One of the reasons why many data warehouse projects fail is that they attempt to do too much. The key is to choose a pilot project that has a high profile but is manageable in size. One rule of thumb suggests that a data warehouse should have its first model working within four months or the project has a high likelihood of failure.
Spending too much time on design and implementation can jeopardize the success of the entire project. The most critical time is spent choosing the actual measures to model.
Except in the case of small data warehouses, such as departmental data marts, significant processing power and disk storage are required to deliver acceptable performance to analysts. The data warehouse needs to be on hardware separate from the operational systems, so that transaction processing is not adversely impacted. Large data warehouses require a major investment in hardware, typically for multiple CPU configurations such as Symmetric Multi Processing (SMP) or Massively Parallel Processing (MPP) systems.
Vendors of relational databases and other data related tools are rapidly developing a range of products to support data warehousing. This includes software to automate data extraction, cleanup, manipulation, aggregation and scheduling. It includes several different OLAP server technologies, as well as a variety of business intelligence tools for end user desktops. There are tools to assist with data design, change detection, automated maintenance for structural changes, and automated maintenance and synchronization of metadata.
Whether solutions are purchased from vendors, or the work is done manually or with in-house developed code, it comes only at significant cost.
This cost was alluded to in the previous point. In addition to contracted software support costs, support staff will be needed to install updates, perform maintenance, and reprogram tools as requirements change.
One of the potential benefits of data warehousing is that it can reveal trends early enough to respond effectively. The response could be to take corrective action when a negative trend is detected, or to capitalize on actions that are causing a positive trend. However, the changes will very often involve changes to processes, work procedures and organizational structures. These can be very expensive in human terms because of the disruption that often accompanies these changes. This is one of the effects noted by C.N.G. Dampney when describing the "Growth of Impact" of an Information System on an organization. As Figure 5 shows, this impact can be described as a progression.
Figure 5. Growth of Impact(20)
Data warehousing starts with operational data, and looks at questions asked by management to meet corporate goals and develop strategic initiatives. Data that provides answers to these questions is what is placed in the data warehouse. One of the direct effects of such an approach is to impact organizational structure and business processes.
Figure 6. Data Warehousing - A Process View(21)
This diagram gives a more detailed view of the processes involved in managing and maintaining a data warehouse. The processes run from left to right, with a feedback loop from the users. One of the very clear lessons of data warehousing is that you don't build one in the way you build a house. Iteration and refinement are vital. The key is to start small, and then evolve the data warehouse as the needs develop.
Flexibility and the ability to adapt to changing business needs are essential. Some vendors are beginning to talk about tools for automating maintenance. For this to happen, the management of the metadata needs to become more tightly integrated into the data warehousing process.
However, one of the fundamental assumptions of a data warehouse is that it is scalable. All of the advice I have seen suggests starting small with a pilot project, and then letting it grow. In fact, one consultant predicts that any data warehouse that takes longer than four months to get its first model working has a high likelihood of failure.
The proprietary multi-dimensional databases work well for smaller data warehouses or departmental 'data marts' of up to 4 or 5 gigabytes. Using the more open Relational OLAP database products, data warehouses ranging into the terabytes have been built, running on multi-CPU platforms.
The actual design process for developing a data warehouse runs from right to left in this diagram.
Figure 7. Data Warehousing Functions(22)
Functions that are desired as part of a data warehousing solution are shown in Figure 7. This illustrates the flow of data from originating sources to the user, and includes management and implementation aspects. It starts with access mechanisms for retrieving data from heterogeneous operational data sources. That data is replicated via a transformation model and stored in the data warehouse. The definition of data elements in the data warehouse and in the data sources, and the transformation rules that relate them, are referred to as 'metadata'. Metadata is the means by which the end-user finds and understands the data in the warehouse. The data transformation and movement processes are executed whenever an update to the warehouse data is desired. Different parts of the warehouse may require updates at different times, some at regular intervals such as weekly or monthly, and some on specified dates. There should be a capability to manage and automate the processes required to perform these functions. Particularly in a multi-vendor environment, adopting an architecture with open interfaces would facilitate the integration of the products that implement these functions. Quality consulting services can be an important factor in assuring a successful and cost effective implementation.
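The extract-transform-load flow that Figure 7 describes can be given a minimal sketch in Python. The source formats, field names and transformation rules below are all hypothetical; the point is that the per-source mapping functions play the role of the metadata-driven transformation model:

```python
# Two hypothetical heterogeneous sources: CSV-like flat file lines and
# records from a legacy system with its own field names.
sources = {
    "flat_file": ["1001,SMITH,BSc", "1002,JONES,BA"],
    "legacy":    [{"ID": "1003", "NAME": "LEE", "CRS": "BEd"}],
}

# 'Metadata' in miniature: rules relating source fields to the
# warehouse's uniform data definition.
def from_flat_file(line):
    sid, name, course = line.split(",")
    return {"student_id": sid, "name": name.title(), "course": course}

def from_legacy(rec):
    return {"student_id": rec["ID"], "name": rec["NAME"].title(),
            "course": rec["CRS"]}

transform = {"flat_file": from_flat_file, "legacy": from_legacy}

# The load step: extract from each source, apply its transformation,
# and store the integrated result in the warehouse.
warehouse = []
for name, records in sources.items():
    warehouse.extend(transform[name](r) for r in records)

print(len(warehouse))  # 3 rows, now in one uniform format
```

In a real warehouse these mappings would be recorded as metadata and executed by replication tools on a schedule, rather than hand-coded, but the shape of the process is the same.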
These are just a few of the terms and concepts used when describing data warehouses. More complete glossaries are given in the Appendices. The terms given here relate to storage and processing technologies, and multidimensional analysis (MDA). Some of the definitions are taken from page 2 of an IBM White Paper entitled Multi-Dimensional Analysis: Extending the Information Warehouse Framework.
OLTP is the traditional data processing area, now dominated by relational databases, which have matured into products optimized for transaction throughput.
The case for OLAP is very well put in a white paper by E. F. Codd & Associates. OLAP requires the ability to consolidate, view, pivot and rotate, and analyze data according to its multiple dimensions. This requirement is called "multi-dimensional analysis" or MDA.
An analyst's view of the enterprise's universe is typically multi-dimensional in nature. For instance, an analyst will want to examine a given student population with regard to course, department, school, year in course, gender, age, etc. These could be some of the dimensions for analyzing a retention rate model. The effect the dimensions have on retention rate is observed interactively by an operation known as 'slice and dice'. Consolidation paths can be followed up or down using 'roll-up' or 'drill-down'.
There are a number of proprietary multi-dimensional database formats, one of which was developed by Cognos and is used in their PowerPlay product. The multi-dimensional attributes of this data model - also known as a Hypercube - are designed into the storage technology of the database and the desktop tool that sits on top of the database. All of the various levels of summarization and cross-tabulation are pre-computed and stored using sparse matrix technology.
ROLAP or Relational OLAP is the answer to MDD being proposed by vendors of traditional RDBMS. They argue that the multi-dimensionality of data is merely an attribute of the way the data is viewed and made available to user applications. The actual storage technology used to store the views can be treated separately. However, multi-dimensional analysis (MDA) based on OLTP databases suffers in performance for two reasons. Current optimizing algorithms are inappropriate for the resulting complex joins, often spanning large history tables. Also, aggregate queries are frequent in decision support applications. New optimization strategies are needed, with MDA-supporting index types and aggregation operators.
There are two types of tables in a data warehouse, designated "fact" tables and "dimension" tables. A fact table holds the information that is the subject of the analysis. Facts are metrics that describe the results of business activity. They are numeric, and provide the measurement data on which business decisions are based. They describe the magnitude of business performance resulting from business strategies and operational activities. Examples of facts are: sales revenue, percent of store sales, and cost of goods sold.
Dimensions are different points of view about the facts; they describe what the facts are about. For example, sales revenue is examined for a certain period of time, by a set of markets, and for specific product brands. In this example, period, market, and product are dimensions of the facts.
For different sets of values in the dimension tables, the fact table will hold a different value. A database that is designed for data warehousing will use what is known as a star schema. The star schema supports analysis of facts by any combination of dimensional data.
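A star schema can be sketched concretely with SQLite's in-memory database. The table and column names below are illustrative only, not taken from the Avondale warehouse:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One fact table at the center of the star, with foreign keys into
# each dimension table.
db.executescript("""
    CREATE TABLE dim_period  (period_id INTEGER PRIMARY KEY,
                              year INT, quarter INT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE fact_sales  (period_id INT, product_id INT, revenue REAL);

    INSERT INTO dim_period  VALUES (1, 2001, 1), (2, 2001, 2);
    INSERT INTO dim_product VALUES (1, 'Acme'), (2, 'Zenith');
    INSERT INTO fact_sales  VALUES (1, 1, 500.0), (1, 2, 300.0),
                                   (2, 1, 400.0);
""")

# The fact (revenue) can now be analyzed by any combination of
# dimensional attributes; here it is rolled up to the year level.
row = db.execute("""
    SELECT p.year, SUM(f.revenue)
    FROM fact_sales f JOIN dim_period p ON f.period_id = p.period_id
    GROUP BY p.year
""").fetchone()
print(row)  # (2001, 1200.0)
```

Swapping `p.year` for `p.quarter`, or joining `dim_product` instead, gives the same fact analyzed along a different dimension, which is exactly the flexibility the star schema is designed to provide.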
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization. An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month. Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
Roll-up is the opposite of drill-down. Roll-up is the repetitive selection and analysis of summarized facts with each repetition of data selection occurring at a higher level of summarization.
MDA is performed through software which enables repetitive drill-down and roll-up of facts along varying combinations of dimensions.
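Drill-down and roll-up along the period dimension can be sketched in plain Python. The records and the (year, quarter, month) hierarchy below are hypothetical:

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, month, revenue).
records = [
    (1996, 1, 1, 100.0), (1996, 1, 2, 150.0), (1996, 1, 3, 120.0),
    (1996, 2, 4, 200.0), (1997, 1, 1, 180.0),
]

def roll_up(records, depth):
    """Aggregate revenue along the period dimension, keeping `depth`
    levels of the (year, quarter, month) hierarchy."""
    totals = defaultdict(float)
    for *period, revenue in records:
        totals[tuple(period[:depth])] += revenue
    return dict(totals)

# Drill-down: repeat the selection at successively lower summarization.
by_year    = roll_up(records, 1)   # coarsest level of aggregation
by_quarter = roll_up(records, 2)
by_month   = roll_up(records, 3)   # finest level of aggregation
```

Each call returns the same facts at a different level of aggregation; drilling down moves toward `by_month`, rolling up toward `by_year`.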
Aggregation is a key attribute of a data warehouse; summarization and consolidation are other words used to convey the same meaning. In a multi-dimensional database, the summarizations are pre-computed for all the various combinations of the dimensions. This allows very fast response to slice and dice operations at any level of drill-down, and also allows fast drill-down and roll-up operations.
Granularity describes the level below which no supporting details are stored, only the summaries. Good judgment needs to be exercised to determine granularity. If the granularity is set too fine, unused data will be stored, wasting processing time during replication steps and disk space to hold it. On the other hand, if the granularity is set too coarse, the detailed data will not be available if it is needed at some point in the future.
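The pre-computation described above can be sketched by totalling a measure over every subset of the dimensions, as a multi-dimensional database would do at load time. The dimension names and values here are invented for illustration:

```python
from itertools import combinations
from collections import defaultdict

# Each fact carries its dimension values plus a measure (names invented).
facts = [
    {"gender": "F", "state": "VIC", "course": "Science", "students": 12},
    {"gender": "M", "state": "VIC", "course": "Science", "students": 9},
    {"gender": "F", "state": "NSW", "course": "Arts",    "students": 7},
]
dims = ("gender", "state", "course")

# Pre-compute totals for every subset of dimensions (the full "cube"),
# keyed by (dimension names, dimension values).
cube = defaultdict(float)
for fact in facts:
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            key = (subset, tuple(fact[d] for d in subset))
            cube[key] += fact["students"]
```

Every slice and dice query is then a single lookup. Note that the number of pre-computed combinations grows rapidly with the number of dimensions and their cardinality, which is one reason the granularity decision above matters.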
Traditional data analysis has usually been presented in the form of two-dimensional tables. For example, numbers of students classified by home state and course, or numbers of students by gender and year in course are two-dimensional.
Decision makers analyzing data today formulate more complex queries across multiple dimensions. A three dimensional model, for example, would help answer the question, "how many female students from Victoria are enrolled in a science degree program?" The concept of answering across multiple dimensions can readily be extended, and its value is immediately apparent.
In a three dimensional model, a fact is found at the intersection of the three axes. Some multi-dimensional vendors store facts in proprietary formats, often called multi-dimensional databases. Relational database vendors are adding OLAP capabilities to their products, using relational storage technology to hold multi-dimensional data in a star schema model or cube.(23)
Representing data in a multidimensional structure makes it possible for business professionals to retrieve information and analyze it using terminology that they understand, and in ways that make sense to them.
In Figure 8, a three dimensional data cube is used to model student performance over three dimensions - age group, gender and department/course. The metric to be analyzed is student performance, represented as the number of classes completed with a satisfactory grade as a fraction of the number attempted. The views of the metric are called dimensions. An individual student performance metric is given by the intersection of the three axes, and is referred to as a fact. In this particular representation, the facts are for a given year, aggregated over all students in that age group, gender and course. Cohort year and year in course are typically further dimensions in this model.
Slicing and dicing takes place along any of the three axes. This makes it possible to relate the effect of age group and course on student performance, or age and gender, or all three dimensions. Graphical representations make it easy to spot apparent relationships for further analysis.
The model contains all the precomputed aggregations, making response fast. For instance, Slice A contains values of student performance for both genders and all courses in the 20 to 29 age group, while Slice B contains the values for all age groups and all courses for female students. This is what makes it possible to drill down and roll up: the analysis can start with a department (Business or Education), then drill down to the individual courses.
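Slicing can be sketched as fixing one axis of the cube at a chosen value. The pass-rate values below are invented; only the dimension structure follows the Figure 8 example:

```python
# A tiny three-dimensional cube of pass-rate facts, keyed by
# (age_group, gender, course). All values are invented for illustration.
cube = {
    ("20-29", "F", "Business"):  0.85, ("20-29", "M", "Business"): 0.80,
    ("20-29", "F", "Education"): 0.90, ("30-39", "F", "Business"): 0.88,
}

def slice_cube(cube, axis, value):
    """Fix one axis at a value, returning the remaining 2-D slice."""
    names = ("age_group", "gender", "course")
    i = names.index(axis)
    return {k[:i] + k[i+1:]: v for k, v in cube.items() if k[i] == value}

slice_a = slice_cube(cube, "age_group", "20-29")  # both genders, all courses
slice_b = slice_cube(cube, "gender", "F")         # all age groups, all courses
```

Dicing is simply a further slice of a slice, along a second axis.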
Figure 8. Data Cube Example
Traditional OLTP RDBMSs depend on schema definitions that focus on defining tables that map very efficiently to operational requests while minimizing contention for access to individual records. This maximizes concurrency while optimizing insert/update/delete performance.
The analytical processing performed on data warehouses places very different demands on the RDBMS. OLAP involves queries that are large, complex, ad hoc and data-intensive. A fundamentally different approach to defining the database schema is required.
A Red Brick White Paper entitled STARjoin Technology(24) illustrates this point very well with a simple purchase order example.
Figure 9. Star Schema Example
A query to list company name, cost of goods, source location and destination location would require the analyst to navigate the PO table in an OLTP database non-intuitively. Asking the same business question against the same data represented in a star schema is much more straightforward, because we are looking up specific facts (PURCHASES) through a set of dimensions (SHIP_FROM, SHIP_TO, ITEM). Because star schemas represent data intuitively, they are far better suited to data warehousing than traditional OLTP schemas. When querying an OLTP schema, analysts usually spend inordinate amounts of time navigating a maze of interrelated tables with cryptic names. When they query a star schema, they can go directly to the key facts, using a set of familiar dimensions.(25)
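The directness of the star schema query can be sketched against an in-memory database. The table names follow the Red Brick example; the column names and sample data are assumptions made for the sketch:

```python
import sqlite3

# Sketch of the purchase-order star schema. Table names follow the
# Red Brick White Paper; columns and data are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SHIP_FROM (from_id INTEGER PRIMARY KEY, company TEXT, location TEXT);
CREATE TABLE SHIP_TO   (to_id   INTEGER PRIMARY KEY, location TEXT);
CREATE TABLE ITEM      (item_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE PURCHASES (from_id INTEGER, to_id INTEGER, item_id INTEGER, cost REAL);
INSERT INTO SHIP_FROM VALUES (1, 'Acme Pty Ltd', 'Sydney');
INSERT INTO SHIP_TO   VALUES (1, 'Cooranbong');
INSERT INTO ITEM      VALUES (1, 'Lab equipment');
INSERT INTO PURCHASES VALUES (1, 1, 1, 1200.0);
""")
# The business question maps directly onto facts and dimensions:
rows = con.execute("""
    SELECT sf.company, p.cost, sf.location, st.location
    FROM PURCHASES p
    JOIN SHIP_FROM sf ON p.from_id = sf.from_id
    JOIN SHIP_TO   st ON p.to_id   = st.to_id
    JOIN ITEM      i  ON p.item_id = i.item_id
""").fetchall()
```

Every join runs from the central fact table out to a named dimension, so the query reads like the business question itself.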
In Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate on page 18, Codd & Associates describe 12 rules for evaluating MDA tools. They are summarized in the following table, and described more fully in Appendix C: OLAP Product Evaluation Rules.
|1. Multi-Dimensional Conceptual View|
|2. Transparency|
|3. Accessibility|
|4. Consistent Reporting Performance|
|5. Client-Server Architecture|
|6. Generic Dimensionality|
|7. Dynamic Sparse Matrix Handling|
|8. Multi-User Support|
|9. Unrestricted Cross-dimensional Operations|
|10. Intuitive Data Manipulation|
|11. Flexible Reporting|
|12. Unlimited Dimensions and Aggregation Levels|
The simplest definition of the term metadata is "data about data". A data warehouse unlocks the data held in corporate databases, but only if business users are able to find out about the data and information objects (queries, analyses, reports, etc.) that are stored there. Such a facility is known as a metadata or information directory.
An information directory needs to hold more than the names of the tables with their elements and data types. As Figure 10 shows, it needs to hold information for technical tasks as well as for business tasks.
Much of the data needed by technical users will exist in a variety of places, such as program libraries, DBMS system catalogs, CASE tools, etc. Some of these places will include information of interest to business users as well, such as data descriptions and meanings. As Colin White points out, "one key objective of an information directory is to be able to integrate this diverse set of metadata, and then provide easy access to it for data warehouse developers, administrators, and business users."(27)
Figure 10. Tasks and Metadata(28)
The information directory holds information about the design of the data warehouse, including a history of changes that occur over time. It includes all the information necessary for administering the data warehouse - authorization, archiving, backups, building data collections, etc. It also specifies the data acquisition rules, including scheduling, sources, transformations and cleanups, etc. And it maintains a log of data collection as well as data access operations. An important component of a complete data warehousing solution is one that integrates and synchronizes the various sources of metadata.
The term replication is used to describe the process of managing copies of data, and has traditionally been applied in the field of distributed databases and in client/server environments. Its application in data warehousing is relatively recent and somewhat specialized, due in part to the aggregation and denormalization that occurs. Also, a data warehouse will contain multiple instances of the same set of data elements as snapshots, each with a different timestamp.
The more difficult replication problems in data warehousing occur in tables where the current values are merely to be updated if they have changed. In a research paper written at Stanford University, entitled Efficient Snapshot Differential Algorithms for Data Warehousing, the authors state that "detecting and extracting modifications from information sources is an integral part of data warehousing." They report that "there are essentially three ways to detect and extract modifications:
- The application running on top of the source is altered to send the modifications to the warehouse.
- A system log file is parsed to obtain the relevant modifications to the application.
- The modifications are inferred by comparing a current source snapshot with an earlier one. We call the problem of detecting differences between two source snapshots the snapshot differential problem; it is the problem we address in this paper." (29)
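The third approach can be sketched by treating each snapshot as a mapping from key to record. This is only the naive formulation; the Stanford paper is concerned with computing the differential efficiently, without this simple in-memory comparison. The record contents are invented:

```python
# Naive snapshot differential: compare two keyed snapshots and infer
# the inserts, deletes and updates. Keys and records are invented.
def snapshot_differential(old, new):
    """Return the modifications that turn snapshot `old` into `new`."""
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: v for k, v in new.items() if k in old and old[k] != v}
    return inserts, deletes, updates

old = {101: ("Smith", "Science"), 102: ("Jones", "Arts")}
new = {101: ("Smith", "Business"), 103: ("Brown", "Science")}
inserts, deletes, updates = snapshot_differential(old, new)
```

The inferred modifications are then applied to the warehouse copy, avoiding a full reload of the source table.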
Replication tools are available for use with the major databases. However, Avondale's legacy systems are not accessible by any of these tools. Instead, we have written our own tools to extract, transform and load data into the data warehouse, using the Cognos 4GL PowerHouse. This is described further in Section 4.6.3 Develop the Data Warehouse Load Routines.
An important part of replication is to determine the schedule - what parts of the data warehouse get written or refreshed when? Some parts of the data warehouse will get updated with a regular frequency, such as monthly or weekly. Other parts will get new snapshots written according to a calendar of significant dates, such as around census and registration dates. The replication schedule needs to be automated, and documented in the information directory as well as in the system manual.
One of the benefits of data warehousing is consistency: analyzing the same data in different ways always yields the same totals. This is not possible with operational systems, since real-time updates to the data can take place between one analysis and a later one. In planning the replication schedule, however, special care needs to be taken to ensure that the operational data is in a stable, non-volatile state during the replication process. The various parts of the data warehouse also need to be synchronized, or inconsistencies can still arise.
It is a fact of life that systems change, and as changes are made in the operational systems, they must be reflected in the data warehouse. The Stanford University Database Group is conducting research into making views self-maintainable in data warehouses. "A data warehouse stores materialized views over data from one or more sources in order to provide fast access to the integrated data, regardless of the availability of the data sources. Warehouse views need to be maintained in response to changes in the base data in the sources... A view is a derived relation defined in terms of base relations. A view is said to be materialized when it is stored in the database, rather than computed from the base relations in response to queries."(30) The paper presents an algorithm for determining what are called auxiliary views. The objective is to find a minimal set of auxiliary views sufficient to maintain a view, and ultimately, to maintain a set of views.
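The idea of maintaining a materialized view from changes, rather than recomputing it from the base relations, can be sketched for a simple aggregate view. This is a naive illustration of incremental maintenance, not the auxiliary-view algorithm from the paper; names and data are invented:

```python
from collections import defaultdict

# Base relation: (course, units) enrolment records (invented data).
base = [("Science", 3), ("Science", 2), ("Arts", 4)]

# Materialize the view once: total units per course.
view = defaultdict(int)
for course, units in base:
    view[course] += units

def apply_delta(view, inserted=(), deleted=()):
    """Maintain the materialized view in response to base-table
    changes, without touching the base relation itself."""
    for course, units in inserted:
        view[course] += units
    for course, units in deleted:
        view[course] -= units

apply_delta(view, inserted=[("Arts", 1)], deleted=[("Science", 2)])
```

A sum view like this is self-maintainable from the deltas alone; views involving joins or minimum/maximum generally need the auxiliary views the paper describes.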
In addition to the typical tasks associated with database maintenance, there are two additional factors to take into account when a data warehouse is involved:
One of the benefits of a data warehouse is that it is time-variant, allowing data to be analyzed over time. When a change occurs in the operational database, it will be necessary to propagate the change back through all the history records. This will often mean recomputing aggregations, but the detail will not always be available to do this. Even when it is available, it will not always make sense to recompute. For instance, if academic restructuring leads to a department of one school being moved to a different school, should school-based aggregations prior to the move be recomputed or should they stay the way they are?
Whatever choice is made in this case, it needs to be recorded in the metadata. This is the other complicating factor when maintaining a data warehouse - all changes need to be recorded in the information directory.
The Data Warehousing Institute has created what they call a "New Roadmap To Data Warehousing" in the form of a poster. It is a comprehensive although not exhaustive compilation of data warehousing components and tools organized into categories of use. The different categories are summarized in the following diagram.
Figure 11. "New Roadmap To Data Warehousing"(31)
Larry Greenfield has compiled and organized links on the world wide web to a comprehensive list of data warehousing resources, at URL:
(used to be: http://pwp.starnetinc.com/larryg/index.html)
There are listings for vendors of end user tools as well as for infrastructure technology, as shown in the following list of headings.
End User Tool Vendors
Infrastructure Technology Vendors
(2) Inmon, W.H. Building the Data Warehouse. p33.
(3) Adapted from: IBM. The IBM Information Warehouse Solution: A Data Warehouse Plus! p3.
(4) IBM. The IBM Information Warehouse Solution: A Data Warehouse Plus! p3.
(5) Oracle. Oracle OLAP Products: Adding Value to the Data Warehouse. An Oracle White Paper. p4.
(6) Arbor Software. Relational OLAP: Expectations & Reality. p3.
(7) Arbor Software. The Role of the Multidimensional Database in a Data Warehousing Solution. p2.
(8) Arbor Software. The Role of the Multidimensional Database in a Data Warehousing Solution. p3.
(9) Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p6.
(10) Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p8.
(11) Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p10.
(12) Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p11.
(13) Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p13.
(14) Adapted from Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p12.
(15) Gupta A., Harionarayan V. and Quass D.: Stanford University. Aggregate-Query Processing in Data Warehousing Environments. p1.
(16) Widom J.: Stanford University. Research Problems in Data Warehousing. p1.
(17) Arbor Software. An Enterprise Solution for Financial Planning & Analysis. fpandac1.html p1.
(18) Arbor Software. Multidimensional Analysis: Converting Corporate Data into Strategic Information. multic0.html p1.
(19) Arbor Software. Multidimensional Analysis: Converting Corporate Data into Strategic Information. multic0.html p2.
(20) CNG Dampney COMP820 Notes for Session 3 Harnessing the Information Resource. p23.
(21) IBM. The IBM Information Warehouse Solution: A Data Warehouse Plus! p6.
(22) IBM. The IBM Information Warehouse Solution: A Data Warehouse Plus! p5.
(23) From IBM. Decision Support Solutions: IBM's Strategy. p10.
(24) Red Brick Systems. Star Schemas and STARjoin Technology. p2.
(25) Red Brick Systems. Star Schemas and STARjoin Technology. p2.
(26) Codd E.F., Codd S.B. and Salley C.T.: E. F. Codd & Associates. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. p18.
(27) White C. Data Warehousing: The Role of the Information Directory. p5.
(28) White C. Data Warehousing: The Role of the Information Directory. p8.
(29) Labio W. and Garcia-Molina H.: Stanford University. Efficient Snapshot Differential Algorithms for Data Warehousing. p1.
(30) Quass D., Gupta A., Mumick I. and Widom J.: Stanford University. Making Views Self-Maintainable for Data Warehousing. p1.
(31) The Data Warehousing Institute, 9158 Rothbury Drive, #200 Gaithersburg, MD 20879 USA