Part 2: The State of the Art
Every university in the world wants to know the answers to these questions:
How many students should we expect this fall? How many faculty will we need to teach various disciplines in two, three or four years? How can we accurately project our budget now and in future years? How can we predict which students might be getting into trouble so that we can help them at the earliest opportunity? How can we best help first-generation students be successful? How can we target our recruitment efforts to direct our resources to those potential students most likely to yield (enroll at the university)? Is it possible to target not only students most likely to yield, but also those most likely to be successful and, indeed, stay for all four years? And on and on…
These questions might seem simple, but getting to the right answers can be a daunting and complex task. The dirty little secret is that today, many universities simply don’t know the answers. Many make do with partial answers, or they simply guess. However, that can be very dangerous. As those in the statistical sciences will tell you, not having all the data, using the wrong data or simply making up data will often lead to the wrong conclusions. For many institutions, this can lead to terrible decisions with equally awful consequences.
“Big Data” promises to help fix these problems. The idea behind big data is to get as much as possible of an institution’s transactional data, along with other associated information (co-curricular transcripts, admission materials, Learning Management System (LMS) statistics, etc.), into a denormalized data warehouse. To make the data warehouse successful, the data must have a common syntactic and semantic base. In other words, the data in the warehouse must be put together (linked) in such a way that the data points have the same meaning and results derived from the data are computed in a standardized way.
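To make that idea concrete, here is a minimal sketch of what establishing a common semantic base can look like in practice: records from two hypothetical source systems (a student information system and an LMS) are translated onto one shared vocabulary before loading. All of the system names, field names and status codes below are invented for illustration, not taken from any particular product.

```python
# Hypothetical sketch: mapping source-specific codes onto one shared
# vocabulary so every record loaded into the warehouse "means" the
# same thing. All codes and field names here are illustrative.

# Each source system encodes enrollment status differently.
SIS_STATUS_MAP = {"E": "enrolled", "W": "withdrawn", "L": "leave"}
LMS_STATUS_MAP = {"active": "enrolled", "dropped": "withdrawn"}

def to_common_record(source, record):
    """Translate a source record into the warehouse's shared schema."""
    if source == "sis":
        status = SIS_STATUS_MAP[record["stat_cd"]]
        student_id = record["emplid"]
    elif source == "lms":
        status = LMS_STATUS_MAP[record["state"]]
        student_id = record["user_id"]
    else:
        raise ValueError(f"unknown source: {source}")
    return {"student_id": student_id, "status": status}
```

Once every feed passes through a translation like this, a query against the warehouse no longer needs to know which system a record originally came from.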
Doing all this is not as easy as it sounds. In fact, even today, this is still a very complicated process. As seen in Part 1, many data warehouse projects, as recently as ten years ago, turned out to be very expensive, absolute disasters. However, a lot of progress has been made since then. Vendors are providing much better tools and we in higher education have learned a lot. Building a functional data warehouse, while still a complex undertaking, is now a very doable project.
So, what do you need? First, you will want to look at your transactional system. To be blunt, simplicity and commonality are the keys to an easy implementation. While it is certainly possible to build a data warehouse in a “best of breed” environment (an Enterprise Resource Planning (ERP) system comprised of many different components from different vendors, linked together either by live, real-time data transactions, such as those found in web services, or through time-delayed batch processing), it is a much more complex challenge than building a data warehouse off an integrated, single-vendor ERP. The reasons are twofold.

First, in a best of breed environment, it is far more difficult to get the various systems to “agree” on a common syntactic and semantic core. Frankly, this is because the various solutions generally have completely different data structures that must be reconciled for a data warehouse to work correctly. In an integrated ERP, this problem largely goes away because the data structures are already synchronized.

Second, in a best of breed environment, transactions between the various system components are generally limited to summarized data. In other words, only the data that has to be exchanged among the systems is exchanged. The result is that a lot of the richness (detail) of the data is lost as it moves through a best of breed environment. A data warehouse, on the other hand, will be far richer and more useful if it has the full data detail. An integrated ERP solution makes it far easier to get this detail into the data warehouse because there is only one system to deal with, and all the data within that system retains its natural richness, never having been transformed or summarized along the way.
Finally, with only one system, the investment in the extract, transform, load (ETL) tool (the tool that actually takes the data from the transactional system and places it into the data warehouse) will naturally be far smaller. In a best of breed environment, multiple systems may have to be individually linked into the data warehouse, greatly increasing the complexity and the time to launch the system.
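For readers unfamiliar with the term, the three ETL steps can be sketched in a few lines. This is a deliberately tiny, illustrative pipeline, not any vendor’s tool: the table names, the `credits` column and the 12-credit full-time cutoff are all assumptions made up for the example.

```python
# Minimal ETL sketch (illustrative only): extract rows from a
# transactional table, transform them, and load them into a
# denormalized warehouse table.
import sqlite3

def run_etl(src_conn, dw_conn):
    # Extract: pull the raw transactional rows.
    rows = src_conn.execute(
        "SELECT student_id, term, credits FROM enrollments").fetchall()
    # Transform: derive a full-time flag (12+ credits is an assumed cutoff,
    # exactly the kind of definition a Data Standards committee would set).
    transformed = [(sid, term, credits, int(credits >= 12))
                   for sid, term, credits in rows]
    # Load: write the enriched rows into the warehouse table.
    dw_conn.executemany(
        "INSERT INTO dw_enrollments VALUES (?, ?, ?, ?)", transformed)
    dw_conn.commit()
```

A real ETL tool adds scheduling, error handling and change tracking on top of this basic pattern, and in a best of breed environment a pipeline like this must be built and maintained for each source system.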
Indeed, many vendors of integrated solutions provide pre-built data warehouses for their systems. While it is impossible for a vendor to provide a turnkey, “out-of-the-box” solution, these pre-built systems go a long way towards helping a university deploy its warehouse. They often come delivered with the common data already configured for the ETL tool and, even better, with pre-built database schemas linking the transactional data in ways that make it far easier to build reports. Finally, many come with actual, pre-built reports that can be easily modified to meet an institution’s particular needs.
However, even with a delivered solution, no data warehouse project should be considered easy to undertake. The technical aspects can be daunting. For example, getting the ETL tool configured correctly so that it feeds the data warehouse accurate information is absolutely critical, and getting it right often requires many iterations. In addition, good coordination with the university’s Institutional Research department, as well as the formation of a Data Standards committee, will greatly help move things forward. They can provide the resources to make sure that what the data warehouse reports is “true” and accurately matches institutional definitions. Even more importantly, they can resolve disagreements as to what various data points mean. For example, who in the student body is a freshman, sophomore, junior or senior? Believe it or not, that is not always an easy question to answer, and university departments sometimes disagree for a variety of reasons.
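The class-standing question is a good illustration of why a Data Standards committee matters. Below is one possible rule such a committee might agree on; the credit-hour cutoffs are assumptions invented for this sketch, not a universal definition, and that is precisely the point: until the institution writes the rule down once, every department may compute it differently.

```python
# Illustrative only: one possible "class standing" definition a Data
# Standards committee might adopt. The cutoffs are assumed values,
# not an official standard.
STANDING_CUTOFFS = [
    (90, "senior"),
    (60, "junior"),
    (30, "sophomore"),
    (0, "freshman"),
]

def class_standing(earned_credits):
    """Return the standing label for a given number of earned credits."""
    for cutoff, label in STANDING_CUTOFFS:
        if earned_credits >= cutoff:
            return label
```

Once a single definition like this lives in the warehouse, every report that mentions “juniors” means the same students.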
Once the common transactional data is in the data warehouse, other sources of data can be added. This is where the term “big data” comes from. Many institutions have added co-curricular activities, internships, volunteer work, information gathered through the admissions process, surveys, LMS data, employment information and more. The richer the data, the better.
Why is that important? Because one of the powers of a “big data” solution is the ability to look for unexpected data correlations and to ask “what if” questions. For example, a university may believe that working excessively outside the university while going to school harms a student’s success potential. Properly configured, the data warehouse will show whether this is true. Not only that, an institution can also see when outside work statistically harms the student. Is it if they work more than 20 hours? Is it 30? And so on.
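A bare-bones version of that “what if” question can be sketched in a few lines. The student records below are fabricated placeholders, and a real analysis would use proper statistical testing rather than raw rates, but the shape of the question, how does the completion rate change as we move the work-hours threshold, is the same.

```python
# Sketch of the "what if" query described above: how does the
# completion rate change as the outside-work-hours threshold moves?
# All records below are fabricated for illustration.
students = [
    {"work_hours": 10, "completed": True},
    {"work_hours": 15, "completed": True},
    {"work_hours": 25, "completed": True},
    {"work_hours": 30, "completed": False},
    {"work_hours": 35, "completed": False},
]

def success_rate(records, max_hours):
    """Completion rate among students working at most max_hours/week."""
    group = [r for r in records if r["work_hours"] <= max_hours]
    return sum(r["completed"] for r in group) / len(group) if group else None

for threshold in (20, 30, 40):
    print(threshold, success_rate(students, threshold))
```

In a real warehouse this loop becomes a query over thousands of de-identified records, which is what lets an institution locate the point where outside work starts to hurt.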
Of course, in any big data system, privacy will be a major concern. Many big data systems are built with de-identified data. In such a system, while it may be possible to drill down to a particular student, faculty member or staff member, the system is configured in such a way that it is impossible to tell exactly who a particular person is. Further, many vendor-provided solutions use the security that is built into the transactional system for their data warehouse tool. Thus, if someone does not have access to data in the core ERP, they will not have it in the data warehouse. Finally, and most importantly, all the laws that govern a transactional ERP system, particularly those found in the Family Educational Rights and Privacy Act (FERPA), are still in effect for the data warehouse.
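One common way to achieve the de-identification described above is pseudonymization: direct identifiers are replaced with a one-way token so records about the same person can still be linked for drill-down, without revealing who that person is. The sketch below uses a salted SHA-256 hash; this is one widely used approach, not any particular vendor’s implementation, and the salt value and ID format are placeholders.

```python
# Sketch of pseudonymization for a de-identified warehouse: the same
# student always maps to the same token, but the token does not reveal
# the original ID. The salt is a placeholder and would be kept secret,
# outside the warehouse itself.
import hashlib

SECRET_SALT = b"keep-this-out-of-the-warehouse"  # placeholder value

def pseudonymize(student_id):
    """Return a stable, non-reversible token for a student ID."""
    digest = hashlib.sha256(SECRET_SALT + student_id.encode()).hexdigest()
    return digest[:16]

# A warehouse record carries the token, never the real ID.
record = {"student_id": pseudonymize("A0012345"), "gpa": 3.4}
```

Because the mapping is stable, analysts can still follow one (anonymous) student across terms, activities and LMS data, which is what makes the correlations in the previous section possible.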
That said, it is a serious mistake to downplay privacy concerns. The best way to ensure these concerns are addressed is to present the issues transparently, get community feedback and build protections into the system right from the start. If people understand that the purpose of the system is to help students, faculty and staff be more successful, they will be supportive, particularly if they also know what protections are in place. After all, “big data” sounds like a very scary thing. It may take time to build trust. However, with the potential rewards, it is time well spent.