Everyone has become infatuated with the use of data in business. This trend started with Billy Beane and the Oakland A’s, who used simple statistics to guide strategic decisions about how to assemble a team. The idea of using statistics in more aspects of life has been popularized by movies like Moneyball and websites like FiveThirtyEight. The use of analytics, or big data, is now spreading across businesses and industries. If you are a fan of basketball, you have seen NBA franchises begin to invest millions in departments dedicated solely to using math to give their team an advantage. These types of departments are popping up in every type of industry. Last summer, during my internship with American Family Insurance, I noticed that they were creating a new division solely for the purpose of using big data to further their business.
I really think this is a great movement. It is allowing a whole new set of skills (computer science, statistics, econometrics, etc.) to be used in industries that have never had a demand for people with those skills. NBA teams are now in the market to hire computer scientists and data analysts to work in these new departments. I do have a concern, however, that this movement will lead to problems with overfitting of data in these areas.
Overfitting occurs when analysts fit a model so closely to the data they have that it captures random noise rather than a real pattern, and then assume, based on a high mathematical correlation, that the model accurately represents reality. The reason I think this may occur is that so many people are now being paid to find trends in data that will help their organization. These people will be put under pressure from leadership to find trends in the data they are working with, even when trends may not exist. What would happen if a computer engineer was granted access to a $25,000 customer database and found, using good mathematics, that no trends existed that would help the company? I would guess that this news would not go over well with his or her boss. This might lead that engineer to find ways to manipulate the data in order to find trends, or to apply models that may not be accurate. If suggestions are then made based on these findings, the organization would be making decisions based on inaccurate information.
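To make the worry concrete, here is a minimal sketch of how searching hard enough can produce a "trend" out of nothing. Everything in it is an assumption made up for illustration: the data is pure random noise, and the function names (pearson, dredge) and the sample sizes are hypothetical, not taken from any real database. It looks through 200 meaningless columns for the one that correlates best with a meaningless outcome across 30 customers, then checks that same column against 1,000 customers it has never seen.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_train=30, n_holdout=1000, n_features=200, seed=0):
    """Pick the noise column that best 'explains' a noise outcome.

    Returns (in_sample, out_of_sample): the absolute correlation of the
    winning column on the rows used to choose it, and on fresh rows.
    """
    rng = random.Random(seed)
    n_total = n_train + n_holdout

    # Assumption: the outcome and every candidate column are independent
    # random noise, so there is no real trend to find.
    outcome = [rng.gauss(0, 1) for _ in range(n_total)]
    columns = [[rng.gauss(0, 1) for _ in range(n_total)]
               for _ in range(n_features)]

    def train_corr(col):
        return abs(pearson(col[:n_train], outcome[:n_train]))

    # The "analysis": keep whichever column looks best on the small sample.
    best = max(columns, key=train_corr)

    in_sample = train_corr(best)
    out_of_sample = abs(pearson(best[n_train:], outcome[n_train:]))
    return in_sample, out_of_sample


in_sample, out_of_sample = dredge()
print(f"best correlation found in the 30 rows searched: {in_sample:.2f}")
print(f"same column on 1,000 fresh rows:                {out_of_sample:.2f}")
```

The winning column looks meaningful on the rows it was chosen from and falls back toward zero on the fresh rows. Holding back data that the search never touches is one simple check against this, though it is only a check, not a cure.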
Overfitting has always been a problem in forecasting and prediction, but I could see it becoming an even bigger issue with the boom in analysis that businesses will be doing in the coming years. There are many interesting examples of overfitted models causing major issues. Here are a few articles to look at if you are interested.