4. Data Warehousing At Avondale


Contents:
4.1 Decision Support Stages
4.2 Establishing Needs
4.3 Possible Subject Areas
__4.3.1 Retention Rates and Cohort Analysis
__4.3.2 Average time to graduate
__4.3.3 Student Performance Indicators
__4.3.4 Demographic Data
__4.3.5 Enrollment Analysis
__4.3.6 Other Indicators
4.4 Dimensions for Pilot Project
4.5 Data Warehousing Architectures
__4.5.1 Simplified Data View
__4.5.2 Data for Query and Analysis
4.6 Data Warehouse Construction
__4.6.1 Document Current System
__4.6.2 Design Data Warehouse for Pilot Project
__4.6.3 Develop Data Warehouse Load Routines
__4.6.4 Establish Granularity and a Replication Schedule
__4.6.5 Develop MDD Models
__4.6.6 Security
4.7 Implementation Issues
__4.7.1 Data Ownership and Responsibility
__4.7.2 Data Integration and Cleanup


This section describes what was done at Avondale College, how we went about it, the choices we made and why we made them. Where our approach differed from common practice, the reasons are given. At times, general data warehousing principles are amplified here before describing the particular situation at Avondale College.

Issues or problem areas that we faced are highlighted. Many of these issues relate closely to the actual steps in the process of building a data warehouse. While such a list would vary in detail from one installation to another, there would be some commonality.

4.1 Decision Support Stages

An important element in creating a decision support environment is to break it down into its key components. Different sets of user tools are available for different aspects of decision support. Studies have shown that these tools fall into four categories:

  1. Data Enquiry - basic query and reporting. The premise for ad hoc query is to verify some known or semi-known hypothesis.
  2. Data Interpretation - more advanced analysis. The application provides statistical models for forecasting and trend analysis.
  3. Multi-dimensional Analysis - These applications provide a multi-dimensional view of data presented in an aggregated form to demonstrate the interrelationships of different dimensions.
  4. Information Discovery - more popularly known as data mining, these applications have the capabilities to discover previously unknown information by examining a large amount of data.(32)

This is summarized in the following table:

Sophistication Level           Computer-based tool
data enquiry                   traditional data processing applications
data interpretation            'point-and-click' reporting tools
multi-dimensional analysis     'slice and dice', 'drill-down' analytical tools
information discovery          intelligent agents
Table 10. Levels of Decision Support Tools

Each of the four levels of sophistication depicted in the following diagram corresponds to a different set of computer-based tools.

Figure 12. Decision Support Stages(33)

Avondale has been operating at Stage One in this representation. Canned reports are written by IT to support operational reporting and enquiry. The data warehouse will extend our decision support capabilities to Stages Two and Three.

Users will be able to perform data enquiry using a 'point and click' reporting tool like Impromptu without having to wait for special programs to be written by IT. As the results raise further questions, users will be able to refine their enquiries and so interpret the data more effectively.

Decision makers will be able to perform strategic analysis on multi-dimensional data models using business intelligence tools like PowerPlay.

It is important to note that as an organization moves through the stages of decision support, and achieves higher levels of sophistication in their use of data, the lower levels are not made redundant. There will always be a place for standard operational reports. Knowledge workers will always benefit from having easy-to-use reporting tools, and so on.

4.2 Establishing Needs

This is where a data warehousing project begins, but it proved to be very elusive at Avondale. Administrators found it very difficult to be precise about what they felt would enhance the quality of their decision making. We had to start with very general ideas, and then over a period of time and after several meetings, we refined the needs to a set of key areas for which we could provide or derive the data or dimensions that would be used in the data analysis.

For instance, after several rounds of meetings, we approached the Principal immediately after he returned from two weeks' vacation and asked what information he most wished he had had to assist with the decisions facing him on his first day back. The reality was that he was unable to describe something he had never had and had never seen.

For a Data Warehouse to be successful, it must satisfy the real needs of decision makers. These needs can be found by listing key topics or subject areas for each of the management areas. Key topics are those that have a critical effect on achieving the strategic plan. The strategic plan comes out of the corporate goals and objectives, which are determined by the mission statement.

This progression is summarized in the following checklist.

1. State Mission Statement
2. State Goals and Objectives
3. State Strategic Plan
4. List Management Areas
5. List Key Topics or Subject Areas
6. Choose High Impact Management Areas
7. Identify and Define Measures
Table 11. Checklist for Data Warehousing Success

Arriving at Step 7 (Identify and Define Measures) in the above table is also very important to the success of a data warehousing project. In this regard, the Goal Question Measure (GQM) approach can be very helpful. GQM starts with the goals and strategic plans. Factors critical to the success of those plans are identified. Questions are asked about the critical success factors and these are decomposed iteratively until the questions can be answered with measurements. These measures are described as Key Performance Indicators (KPIs). For example, a goal of improving student retention leads to questions about which cohorts are lapsing, in which courses and when; those questions can ultimately be answered by a cohort retention rate measure.

Representatives from each of the management areas shown in Figure 1 (Management Structure) were invited to separate meetings where presentations were made describing the proposed Enterprise Information System (EIS) and Data Warehouse. At these meetings, the participants were interviewed to determine the most pressing needs of each group. We began by listing the key topics for each management area.

Management Area: Key Topics
Principal:
    numerous ad hoc requests
Assistant Principal:
    class size
    teaching load/teaching capacity
    staffing
    course development
    staff study projections
    management reports
Business Manager:
    budget forecasting and modeling
    management reports
    board proposals
Assistant Business Manager:
    course costing
    enrollment/fee modeling
    financial planning
    cafeteria/residence costing & fees
    teaching load/staffing
    fleet management
Director of Public Relations and College Development:
    recruiting/applications from school leavers
    alumni
Registrar:
    class sizes
    resource scheduling - classes, lecturers, students, rooms, buildings
    staff loads
    retention rates by course and other fields
    pass/fail by course and other fields
    changes in application and enrollment patterns
    distribution patterns of TERs
    various statistical summaries and analyses as required by the Principal
Table 12. Information requirements for specific management areas

On the basis of these meetings, a pilot group was chosen, consisting of the Principal, Assistant Principal and Registrar. Following successful implementation with the pilot group, the project will be extended to include the Business Manager, Marketing, and Public Relations and Development.

Having chosen a pilot group, the project leaders from Avondale's Computer Services Center met with the group to refine the specification of critical needs. A brainstorming technique known as a Joint Application Development (JAD) session was used to identify key attributes that could be measured.

At this JAD session, it became apparent that there were two sets of information access requirements, and later in the project, a third type of access requirement emerged.

User Type           Access Type
decision maker      drill down, slice and dice, multi-dimensional analysis
knowledge worker    point and click enquiry and reporting
casual enquirer     world wide web access
Table 13. Data Warehouse User and Access Types

Following the JAD session, critical factors that could be measured from data already being recorded were selected as dimensions to model for analysis. Some of the requirements were expressed a little vaguely at this point in the project, but meanings and definitions became clearer as the project progressed.

4.3 Possible Subject Areas

4.3.1 Retention Rates and Cohort Analysis

It was felt that it would be very useful to be able to analyze retention rates by a variety of factors, such as course, school and discipline, year in course, Tertiary Entrance Score, academic progress and performance, age, gender and other factors.

The initial attempts to specify how to measure retention rate met with difficulty in finding a precise definition of the term retention. One way to measure retention was to include all students who completed any course, even when the course was different from the one they started. An alternative was to consider only those students who completed the course they started as being retained.(34)

Contact with other Australian universities revealed that some of these measurement issues had been studied elsewhere. A series of indicators of institutional performance had been proposed by the Australian Vice-Chancellors' Committee and the Australian Committee of Directors and Principals in Advanced Education (AVCC/ACDP, 1988), and by Linke (1991). The specific indicators of student performance and progress proposed in these two studies reveal a common thrust:

AVCC/ACDP                          LINKE
Second year retention              Student progress rate
Major sequence retention           Program completion rate
Completion rate in minimum time    Mean completion time
Eventual completion rate
Table 14. Indicators of Student Performance and Progress(35)

The key to finding a measurable definition of retention was to adopt the traditional concept of a cohort. Because of small class sizes, Avondale lecturers are often able to know personally all the students in their teaching area. This means that it is possible to have a reasonably good feel for the number of students dropping out of courses, changing to different courses, or resuming after an absence. However, competitive pressures are focusing much more attention on some of these issues, and more accurate and more responsive measures are being sought. Hence, the concept of a cohort as a group of students commencing a course in any one year has been adopted.

A software package called COHORT was developed in 1991 by Bardsley at the Institutional Research office at Curtin University of Technology. It uses the standard DEET Student Submission files for Enrollment and Past Course Completions to monitor the progress of any commencing cohort of students. Using this terminology, the members of a cohort can be divided, at increments of one year, into three groups - completed, lapsed or continuing.

Figure 13. DEET Files & COHORT Methodology(36)
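
The classification itself is simple enough to sketch in a few lines of code. The following Python fragment is illustrative only; the actual COHORT package works from the DEET submission files, and the field and function names used here are hypothetical:

# Illustrative sketch: classifying cohort members as completed, lapsed or
# continuing at yearly increments, in the spirit of the COHORT methodology.
# Field names are hypothetical, not from the Avondale or DEET files.

def cohort_status(analysis_year, completion_year, last_enrolled_year):
    """Return the status of one student at the end of a given analysis year."""
    if completion_year is not None and completion_year <= analysis_year:
        return "completed"
    if last_enrolled_year >= analysis_year:
        return "continuing"
    return "lapsed"

# Example: a 1994 commencing cohort tracked over three years.
cohort = [
    {"id": "S001", "completion_year": 1996, "last_enrolled_year": 1996},
    {"id": "S002", "completion_year": None, "last_enrolled_year": 1995},
    {"id": "S003", "completion_year": None, "last_enrolled_year": 1997},
]

for year in (1994, 1995, 1996):
    counts = {"completed": 0, "lapsed": 0, "continuing": 0}
    for s in cohort:
        counts[cohort_status(year, s["completion_year"], s["last_enrolled_year"])] += 1
    print(year, counts)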

4.3.2 Average time to graduate

Some of the indicators mentioned in Table 14 (Indicators of Student Performance and Progress) will be useful in gaining an understanding of how to measure average time to graduate. One area to resolve was what date to use as the starting point. There were several alternatives, all of which could be used to reveal useful information:

There are other questions still to be resolved.

4.3.3 Student Performance Indicators

The Student Progress Unit (SPU) was first proposed as a student performance indicator by Linke (1991); the first practical applications of its use were reported by Dobson & Sharma (1992).

Figure 14. Derivation of Student Progress Unit(37)

The element completion status can take one of four values:

In previous work using this indicator of student performance, data elements available in the DEET Student Load file and Student Enrollment file were used. While most analysis would be based on time units of whole academic years, we felt we needed to design the data warehouse to support analysis down to the level of semester or half year. Approximately 20 per cent of all graduates complete courses that start and finish in the middle of an academic year. Since our student enrollment and graduating class size are so small, we felt that this could distort some analysis based on whole academic years. Therefore we are using data from our internally developed student administration system rather than from the DEET files to compute SPU.
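
As a rough illustration of the calculation, the following Python sketch computes an SPU-style figure at semester granularity as the percentage of attempted load that was passed. The status codes and load values are placeholders, not the DEET data elements themselves:

# A minimal sketch, not the DEET/Linke derivation verbatim: SPU computed as the
# percentage of load attempted that was passed, here at semester granularity.
# The completion-status codes below are illustrative placeholders.

def spu(enrolments):
    """enrolments: list of (load, status) pairs for one student in one period."""
    attempted = sum(load for load, status in enrolments)
    passed = sum(load for load, status in enrolments if status == "passed")
    return 100.0 * passed / attempted if attempted else None

# One student's second-semester subjects: (load, completion status)
semester_2 = [(0.125, "passed"), (0.125, "passed"), (0.125, "failed"), (0.125, "withdrawn")]
print(spu(semester_2))   # 50.0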

4.3.4 Demographic Data

For many of the indicators that were being requested, there was a need to perform analyses based on various types of demographic data. One of these was the origin or source of students:

One of the projected uses of this information was for follow-up on the effectiveness of promotional campaigns. However, the data to support this are not held in the required form in the current system. The data held as home and term address are dynamic, and are updated as a student moves from one address to another. A new data item needs to be created, and the data that are captured should be held permanently without further change.

For students coming from within Australia, the postcode could be used as a very fine indicator of location. This data is available in electronic form, and was obtained for use in this project. Maps of Australia showing postcodes have been used to group postcodes into regions of interest, since the analysis is usually by region. There can be multiple mappings of postcodes into regions, depending on the purpose of the analysis.
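
The following Python sketch illustrates the idea of alternative region mappings; the postcode ranges and region names are examples only, not the groupings actually used at Avondale:

# Illustrative only: the same postcodes grouped into different region schemes
# depending on the purpose of the analysis. Ranges are examples, not the
# actual groupings used in this project.

RECRUITING_REGIONS = {
    "Central Coast": range(2250, 2265),
    "Newcastle":     range(2280, 2310),
    "Sydney":        range(2000, 2250),
}

STATE_REGIONS = {
    "NSW/ACT": range(2000, 3000),
    "VIC":     range(3000, 4000),
    "QLD":     range(4000, 5000),
}

def region_of(postcode, scheme):
    for name, codes in scheme.items():
        if postcode in codes:
            return name
    return "Other"

print(region_of(2287, RECRUITING_REGIONS))  # Newcastle
print(region_of(2287, STATE_REGIONS))       # NSW/ACT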

4.3.5 Enrollment Analysis

There are a number of attributes that can be used to yield useful information throughout the enrollment process.

4.3.6 Other Indicators

The Avondale College Mission Statement states:

"Avondale College is a community with a Christian world view committed to excellence in education, the development of the whole person with a love for life and learning, and the preparation of skilled professionals to serve in the workplace and in society."

Since Avondale College is a Christian institution, and spiritual development is an important part of its mission, some indicators of its effectiveness in this area were suggested. Religious affiliation and baptismal status are held in the current database, but they are updated dynamically, as changes occur. For this kind of analysis to be possible, additional data items need to be held to record this information. It needs to be recorded initially at time of enrollment, on leaving Avondale College, and later on as part of a graduate survey.

4.4 Dimensions for Pilot Project

With this background, our next task was to map goals to measurable indicators of performance. The first so-called measures that were proposed were very vague, and in fact were not capable of being measured. As our understanding of what we were trying to do developed, some indicators were proposed that were easy to measure and build into models, but were not particularly useful in assisting decision making. Others were found that were highly desirable and specific, but important parts of the data were not available.

In the end, we settled on measurable indicators of cohort retention rate and student performance. Each of these was developed into PowerPlay models, with a range of relevant dimensions, including course, year in course, department, classes taken, class completion status, age, gender, and others.

The cohort analysis model has the following dimensions:

1. cohort year (year a group of students started the course)
2. analysis year (year being analyzed)
3. AOU (Academic Organizational Unit)
4. Course Type (a subtype of AOU)
5. Gender
6. Age Group
7. Age (a subtype of Age Group)
8. Course Status (Continuing, Completed, Lapsed)
9. SPU Annual (Student Progress Unit - see Section 4.3.3)
10. SPU Cumulative
Table 15. Cohort Analysis Model Dimensions
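
To make the model concrete, the following sketch shows how a single fact row combining the dimensions of Table 15 might look. The values are illustrative only; the actual table layout is given in Appendix F.

# One illustrative fact row for the cohort analysis model (values invented).

cohort_fact_row = {
    "cohort_year":    1994,          # year the student's group started the course
    "analysis_year":  1996,          # year being analyzed
    "aou":            "Education",   # Academic Organizational Unit
    "course_type":    "BEd",         # subtype of AOU
    "gender":         "F",
    "age_group":      "20-24",
    "age":            22,            # subtype of Age Group
    "course_status":  "Continuing",  # Continuing, Completed or Lapsed
    "spu_annual":     80,            # Student Progress Unit, see Section 4.3.3
    "spu_cumulative": 90,
}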

4.5 Data Warehousing Architectures

4.5.1 Simplified Data View

Figure 15. A Simplified Data View

Figure 15 gives a simplified view of how data from the various sources is taken into the data warehouse and then made accessible to end users for reporting and analysis. It corresponds to a model we had proposed early in our project. We later learned that a database to support multi-dimensional analysis is quite different from one supporting knowledge workers writing their own ad hoc reports and enquiries.

A data warehouse suitable for multi-dimensional analysis is denormalized in several ways:

4.5.2 Data for Query and Analysis

The data warehouse we are building consists of two parts:

It is not really important whether the "two parts" of this data warehouse are physically separate or only logically separate. In the pilot project at Avondale, they are physically separate, as in the following diagram.

Figure 16. Relational View of Operational Data, With Data Warehouse

The Operational Data Store is a relational representation of the operational data, which is held in a variety of systems, formats and platforms (see Table 1. Information System Components and Platforms). Procedures are executed every night that refresh the tables in this database from the production sources. In addition, this database holds detailed historical information - all data is indexed by year.

Using Impromptu, users are able to create their own queries and reports, directly accessing the data held in the Operational Data Store. Impromptu uses ODBC (Open DataBase Connectivity) drivers to access both of these relational databases.

The Data Warehouse contains only the data needed to support the multidimensional analysis models that have been developed in PowerPlay. The Transformer utility is used to convert this data into the proprietary multidimensional format used by PowerPlay. It is in this step that the aggregation and indexing occur which allow rapid analysis through slicing and dicing, and drill-down.
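
Conceptually, the aggregation step can be pictured as pre-computing totals for combinations of dimension values, as in the following Python sketch. This is only a loose illustration of the idea, not the Cognos Transformer implementation:

# Conceptual sketch: pre-summarizing fact rows by combinations of dimensions so
# that "slice and dice" queries read pre-computed totals. Data is invented.

from collections import defaultdict

facts = [
    {"course": "BEd", "cohort_year": 1994, "status": "Completed", "students": 12},
    {"course": "BEd", "cohort_year": 1994, "status": "Lapsed",    "students": 3},
    {"course": "BA",  "cohort_year": 1994, "status": "Completed", "students": 8},
]

cube = defaultdict(int)
for row in facts:
    for dims in [("course",), ("cohort_year",), ("course", "status")]:
        key = (dims, tuple(row[d] for d in dims))
        cube[key] += row["students"]

# A "slice": completed students in the BEd course
print(cube[(("course", "status"), ("BEd", "Completed"))])   # 12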

The following diagram is a variation on the previous two diagrams that illustrates the flow of data from the various operational systems to end-user data analysts:

Figure 17. Architecture For Analysis

4.6 Data Warehouse Construction

4.6.1 Document Current System

Quite a bit of work had to be done as preparation for designing the data warehouse. Parts of the existing data structures were defined in one or both of the data dictionaries we were using. These two data dictionaries were the Data Repository in the data modeling tool we were using (Visible Analyst Workbench) and the PowerHouse Data Dictionary for the 4GL we were using. We had developed a number of useful reports based on the PowerHouse Data Dictionary so we decided to make that our primary source for data definitions.

We created definitions for all files, records and elements that were not already defined in the Data Dictionary. In addition, we added descriptive comments to all items defined in the dictionary. A relational database called an Operational Data Store (ODS) was built from these definitions with very few changes being made. Samples of the reports produced from the data dictionary are shown in Appendix E: Data Dictionary - Files & Elements and Appendix G: Data Dictionary - Elements Alphabetically. As the sample reports show, the tables are fully normalized and contain many items that will never be used in data analysis or for decision making purposes. They are included in the ODS since it is a relational image of operational data, with history, for easy reporting via desktop reporting tools.

A high level Entity Relationship Diagram for the ODS is shown in Figure 18. In a typical data modeling project, the main objectives are to establish logical business data objects or entities, and to specify relationships between entities. Entity attributes are chosen that are consistent with the rules of normalization, with foreign keys to support the relationships, and a minimum of alternate keys to reduce processing overheads for add, update and delete transactions. However, since both the ODS and the Data Warehouse itself are used for read-only operations, there is no need for foreign keys or referential integrity constraints, and removing them speeds up the data replication processes. In their place, it is common to find additional alternate keys to improve the performance of enquiries, as well as various levels of summarization.

Figure 18. Entity Relationship Diagram for ODS

4.6.2 Design Data Warehouse for Pilot Project

The important thing to keep in mind when designing a data warehouse is to have a small number of well defined subject areas and to design a database that supports them. There will be numerous other possibilities for data to include in the data warehouse, but one of the most common reasons for failure in building a data warehouse is trying to start too big. The key is to choose a high impact subject area, with a powerful, committed sponsor, and go for a small, fast implementation. As the users become more sophisticated in their queries and analysis, and as the pilot project spreads to other users and subject areas, the database will pass through many iterations as it evolves.

The database that is used as the source for the PowerPlay cube has only three tables:

The simple structure of this database is illustrated in Figure 19. The database definition for the DW database is given in Appendix F: Data Warehouse Schema Definition.

Figure 19. Entity Relationship Diagram for DW

4.6.3 Develop the Data Warehouse Load Routines

We used Oracle Rdb for the data warehouse, and the PowerHouse 4GL QTP from Cognos for extracting, transforming and loading the data. Since most of the administrative software we use has been written in-house, we had already completed the integration of name and address under a single person ID. QTP has proved to be quick and easy to write and maintain, and is powerful and efficient in its operation.

DISPLAY 'Updating SPU_ANNUAL fields'

ACCESS *DW_DATA:RAWSPUA ALIAS RAW_SPU

DEFINE D_SPU NUM = (T_PASS / T_ATTEMPT) * 100
DEFINE D_SPU_GROUP NUM = 10 IF D_SPU EQ 100                            &
                    ELSE  8 IF D_SPU GE  80                            &
                    ELSE  6 IF D_SPU GE  60                            &
                    ELSE  4 IF D_SPU GE  40                            &
                    ELSE  2 IF D_SPU GE  20                            &
                    ELSE  0

OUTPUT DW_PP_COHORT IN DW UPDATE ADD                                   &
  VIA COURSE_CODE, STUDENT_ID, ANALYSIS_YEAR, SEMESTER                 &
  USING SUBJECT_CODE OF RAW_SPU,                                       &
        NAME_NUMBER OF RAW_SPU,                                        &
        YEAR_YYYY OF RAW_SPU,                                          &
        SEMESTER OF RAW_SPU

  ITEM SPU_ANNUAL = D_SPU_GROUP

Figure 20. Sample QTP source code

The above sample of QTP code illustrates some of the features that are useful in a data warehouse load procedure.

The objectives in this step are as follows:

We were initially going to load data and analyze it by year, but because of small class sizes and courses starting in second semester as well as first semester, we decided we needed to load data by semester. The code for doing the load by semester was not completed when I moved away from Avondale College. In its present form, each student has records created unconditionally in both semesters. This creates unwanted records for students who start in second semester, or who finish in first semester.

All of the data in both the Operational Data Store (ODS) and the Data Warehouse (DW) is recoverable from production data and archive directories. The code that loads the semester data needs to be completed so that it has the proper logic for dealing with semesters. Then the ODS and DW can be deleted, regenerated, and the data can be reloaded.
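
The missing logic amounts to a conditional test on each student's commencing and finishing semesters. The following Python sketch (with assumed field names; the production code is written in QTP) illustrates the rule that needs to be applied:

# A sketch of the missing semester logic, using assumed field names: only emit
# a record for a semester in which the student was actually enrolled, so that
# students who start in semester 2 or finish in semester 1 get no spurious rows.

def semesters_to_load(start_sem, finish_sem, analysis_year, start_year, finish_year):
    """Return the semesters (1, 2 or both) to load for one student in one year."""
    sems = []
    for sem in (1, 2):
        if analysis_year == start_year and sem < start_sem:
            continue   # not yet commenced
        if finish_year is not None and analysis_year == finish_year and sem > finish_sem:
            continue   # already completed or lapsed
        sems.append(sem)
    return sems

# A student who commenced in semester 2 of 1995 and finished in semester 1 of 1997
print(semesters_to_load(2, 1, 1995, 1995, 1997))   # [2]
print(semesters_to_load(2, 1, 1996, 1995, 1997))   # [1, 2]
print(semesters_to_load(2, 1, 1997, 1995, 1997))   # [1]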

4.6.4 Establish Granularity and a Replication Schedule

A replication schedule needs to be determined, put in place and automated. We have nothing sophisticated for achieving this at present. We are simply using batch jobs that resubmit themselves appropriately. This is working quite satisfactorily for our size of data warehouse (5-10MB), which is really only in the small data mart size category.

$ submit daily.com -
        /restart -
        /queue=ACVSA_SLOW$BATCH -
        /after="TOMORROW+00:05:00" -
        /log=od_prog:daily.log
$ submit CI_ALL.COM /queue=acvsa$batch /log=od_prog:CI_ALL.log

Figure 21. Sample JCL for scheduling - daily.com

4.6.5 Develop MDD Models

In the debate over MDD versus ROLAP, MDD provides all the functionality and performance we can use for a data warehouse of our size. The data warehouse we have built is used by the Cognos Transformer tool to populate the multidimensional database used by PowerPlay.

Most of the preliminary work to build the cohort analysis and student performance tracking models has been completed. The unwanted semester records, described in the previous section, prevented the models from becoming operational. As soon as the coding and data reload described in the previous section are completed, the models can be completed and put into operation.

The dimensions for the cohort analysis model are as listed in Table 15.

4.6.6 Security

Fortunately, many of today's database and business intelligence desktop tools include comprehensive security provisions. Access rights can be granted down to the level of data items and even data values, with templates for different roles and sets of users.

However, designing a security strategy that all parties will agree to, then implementing and maintaining it, can be quite a task.

4.7 Implementation Issues

4.7.1 Data Ownership and Responsibility

At Avondale, name and address information had traditionally been maintained separately in the finance, the academic and the alumni records. We had actually achieved some integration of this data in the production systems as a preliminary step to setting up the data warehouse. When alumni mail-outs turned up surname or address corrections, there was some initial conflict between the academic office and the alumni office over who owned the data and therefore who had the right to update it.

The reverse problem also occurred in some instances. Once some of the initial problems of data ownership were resolved and users became accustomed to the concept of distributed maintenance of some of the data, it became evident that we needed to identify "data custodians" to be responsible for different portions of the data. This was important for maintaining standards in the operational systems for data entry and update procedures.

4.7.2 Data Integration and Cleanup

A range of problems was found when trying to integrate data from the various sources into the data warehouse, as summarized in the following table.

1. semantics
2. integrating mismatched data types
3. integrating mismatched coded values
4. invalid typed data
5. orphan child data entries
6. duplicated IDs
7. out of date surnames and addresses
8. data restructuring for array items, etc
Table 16. Data Integration & Cleanup Issues

Some fields were known by different names or had different data types in different systems, or were represented with different sets of coded values. Integrating these proved to be quite a challenge. STUDENT_ID is one example of this.

STUDENT_ID                        PIC 9(10).
STUDENT_NUMBER                    PIC 9(5).
NAME_NUMBER                       PIC X(6).
    (NAME_NUMBER replaces STUDENT_NUMBER plus CHECK_DIGIT)
Table 17. Semantic Differences in STUDENT_ID

Other examples of elements with semantic or data type differences include:

ACCOUNT   account, student_account, gl_account
YEAR      subject_year, acad_year, ref_year
SEMESTER  subject_semester, current_semester, 
          requested_semester, submission
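
One way to reconcile such differences is a mapping table from source system and field name to a single warehouse column, with any necessary type coercion applied on the way through. The following Python sketch is illustrative only; the mapping entries and the normalization of the ID are assumptions, not the QTP routines actually used:

# Illustrative sketch: reconciling differently named and typed source fields
# into unified warehouse columns. The mapping table itself is hypothetical.

FIELD_MAP = {
    # (source system, source field)    -> warehouse column
    ("finance",  "STUDENT_ID"):         "student_id",
    ("academic", "NAME_NUMBER"):        "student_id",
    ("academic", "subject_year"):       "year",
    ("finance",  "acad_year"):          "year",
    ("academic", "subject_semester"):   "semester",
    ("academic", "requested_semester"): "semester",
}

def to_warehouse(system, record):
    out = {}
    for field, value in record.items():
        column = FIELD_MAP.get((system, field))
        if column == "student_id":
            value = str(value).strip().zfill(6)   # normalize to a 6-character ID
        if column:
            out[column] = value
    return out

print(to_warehouse("academic", {"NAME_NUMBER": "12345", "subject_year": 1996}))
# {'student_id': '012345', 'year': 1996}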

Some fields contained data that was invalid for the data type, such as date fields with invalid dates or numeric fields containing non-numeric data. These data errors were not detected in the indexed file system, but were rejected by the RDBMS when being loaded into the relational ODS or DW.

In some of the earlier data models that we built, we defined relationships between tables with foreign keys. This highlighted other types of errors in the source data, since the source data did not obey the referential integrity rules we were trying to impose. We corrected these errors in the source data. They occurred, for example, where a student had been enrolled in a class that was subsequently canceled and deleted from the classes-offered file; the student enrollment remained, making it an orphan record.

The most difficult data cleanup problem was when duplicate records occurred for the same person, with different IDs. This happened even within the same system. Correcting all the records with the wrong ID was a time-consuming process, and required special care and concentration to ensure that it was done correctly. It also had to be repeated in each archive directory where the same errors were found, and in the DEET submission files. Even then, in many cases it was not possible to determine whether two possibly duplicate records were really for two different people or not.

Another problem with the validity of the data was out of date surnames and addresses. This is a perpetual problem and has no easy solution.

Some restructuring of the data was necessary in going from the operational systems to the ODS and DW. For instance, the year had to be added to the key where it was not already present to allow for holding of historical information. There were several cases where COBOL-style group items were used. These had to be eliminated. Also, it was necessary to handle information held by semester, which was indicated by a suffix in the field name or by a subscripted array. These fields were restructured into properly normalized form, with the semester made part of the key, as sketched below.
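
The following Python sketch illustrates the kind of restructuring involved: a record holding one value per semester is unpacked into one row per semester, with year and semester carried into the key. The field names are hypothetical:

# Illustrative sketch: a source record with one column per semester (a
# COBOL-style repeating group) unpacked into one row per semester.

source_record = {"student": "012345", "year": 1996, "load_sem1": 0.5, "load_sem2": 0.375}

rows = [
    {"student": source_record["student"],
     "year": source_record["year"],
     "semester": sem,
     "load": source_record[f"load_sem{sem}"]}
    for sem in (1, 2)
]

for row in rows:
    print(row)
# {'student': '012345', 'year': 1996, 'semester': 1, 'load': 0.5}
# {'student': '012345', 'year': 1996, 'semester': 2, 'load': 0.375}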

A major data and program conversion project had been completed prior to the commencement of the data warehouse project. This was to convert the 5-digit NAME_NUMBER and CHECK_DIGIT fields to a single 6-character NAME_NUMBER field.


Section Endnotes

(32) IBM. Decision Support Solutions: IBM's Strategy. p. 5.

(33) IBM. Decision Support Solutions: IBM's Strategy. p. 5.

(34) In the nomenclature used at Avondale College, a unit of study is called a subject. A program of study leading to an award is called a course.

(35) Peter Manass & Robyn Peutherer. Approaches to Student Progression Analysis Based On National Data Collections In Australia. Australasian Association for Institutional Research Conference. Aair_pap.doc, p. 2.

(36) Peter Manass & Robyn Peutherer. Approaches to Student Progression Analysis Based On National Data Collections In Australia. Australasian Association for Institutional Research Conference. Aair_pap.doc, p. 6.

(37) Peter Manass & Robyn Peutherer. Approaches to Student Progression Analysis Based On National Data Collections In Australia. Australasian Association for Institutional Research Conference. Aair_pap.doc, p. 8.



