Data Analytics: Garbage In, Garbage Out

Happy New Year, let's talk garbage.

There is a lot of interest in Big Data, Business Intelligence, Predictive Analytics, and other data-related fields these days.


Whether it be in distinctly non-finance areas like the Internet of Things or in finance areas like attribution analysis, mutual fund manager selection, earnings forecasting, and hedge fund replication, techniques for using data are clearly changing many aspects of the business world.

This set of tools and techniques as a whole can be generically termed "data analytics," and with major increases in computing power and software access coming soon, 2017 may well be the biggest year yet for data analytics advances. But many novices in the field misunderstand what data analytics can and cannot do.

To begin with, all data analytics processes start with a basic truism: Garbage in, garbage out. If the data being analyzed is not accurate and representative of the world, then it’s not useful.

This concept seems simple, but it is often forgotten. For instance, in a risk management function, people often think of data as being useful for extrapolating the likelihood of future events – but that is only true if we have data where the events we are worried about are actually occurring with the same frequency that they do in the world.

Take merger analysis, for example – we can use a statistical model called a probit model to estimate the probability of a particular company receiving a buyout offer in the next 12 months (or 3 months or 6 months with slight changes to the model). In order to model that effectively, we need to have data on the company – size, profitability, industry, assets, etc. Once we have that data, we can estimate the likelihood of a takeover being initiated given the prevailing situation in the industry. Equally importantly, data analysis can tell us statistically how confident we are in that outcome.

In other words, the model might put an 82% probability on company XYZ receiving a buyout offer, and only a 13% probability on firm ABC receiving one. Yet in order to build this type of model, we need to have the right underlying data – that means having the right data on the firm, and having the right data on past buyouts that have occurred over a long period of time. Building a data model therefore requires an investment of time and money – in many cases it is not a simple one-off process.
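To make the merger example concrete, here is a minimal Python sketch of how a probit model turns firm characteristics into a probability. The coefficients, variable names, and firm figures are all made up for illustration; in practice the coefficients would be estimated from a long history of past buyouts, and the numbers below are not meant to reproduce the 82% and 13% figures.

```python
from statistics import NormalDist

# Hypothetical coefficients, chosen only for illustration. A real model
# would estimate them from historical buyout data.
COEFFICIENTS = {
    "intercept": -1.5,
    "log_assets": -0.10,         # firm size
    "return_on_assets": -2.0,    # profitability
    "industry_deal_rate": 8.0,   # recent takeover activity in the industry
}


def buyout_probability(firm):
    """Probit link: P(offer) = Phi(intercept + sum of beta_k * x_k)."""
    index = COEFFICIENTS["intercept"]
    for name, beta in COEFFICIENTS.items():
        if name != "intercept":
            index += beta * firm[name]
    # Phi is the standard normal cumulative distribution function.
    return NormalDist().cdf(index)


# Two made-up firms standing in for XYZ and ABC.
firm_xyz = {"log_assets": 6.0, "return_on_assets": 0.02, "industry_deal_rate": 0.30}
firm_abc = {"log_assets": 9.0, "return_on_assets": 0.15, "industry_deal_rate": 0.05}

p_xyz = buyout_probability(firm_xyz)
p_abc = buyout_probability(firm_abc)
print(f"XYZ: {p_xyz:.1%}  ABC: {p_abc:.1%}")
```

The arithmetic is the easy part. The output is only as good as the firm data and the buyout history behind the coefficients, which is the garbage in, garbage out point.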

Data analytics is powerful, but only if we have the right tool for the job. Many industry insiders say that the single biggest problem holding back effective use of new data-related tools and technologies is the lack of data.

The second major issue with data analytics is that we need data which is properly cleaned and compiled. Most of the time the data used for analysis comes from different sources, some of which are high quality and others of which are low quality. That means that the datasets have to be cleaned and merged together into a single larger database. This is difficult and time-consuming in many cases, especially with large datasets such as those used in investing.

For instance, when trying to replicate hedge funds, one needs to use data on hedge fund returns which come from one source, data on liquid futures and ETF returns which come from a second source, and data on characteristics of those ETFs which comes from a third source. The three sets of data have to all be merged together based on a single unifying factor like date of the returns. Once this is done, the data have to be cleaned to deal with issues like hedge funds that close up shop, or bid-ask bounce in ETF pricing. When you get done with this process, you have a formula that allows you to replicate the performance of any hedge fund category at a much lower cost – but again it requires time and investment to get accurate results.
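The merge step described above can be sketched in a few lines of Python. Everything here is hypothetical: the three dictionaries stand in for the three data sources, the monthly figures are invented, and dropping any month with a missing value is only the crudest possible cleaning rule. It does not address harder issues like funds that close up shop or bid-ask bounce.

```python
# Hypothetical monthly data from three separate sources, keyed by date.
# None marks a month where the source has no usable value.
hedge_fund_returns = {"2016-01": 0.012, "2016-02": -0.004, "2016-03": None, "2016-04": 0.009}
etf_returns = {"2016-01": 0.020, "2016-02": -0.010, "2016-03": 0.015, "2016-05": 0.007}
etf_expense_ratio = {"2016-01": 0.0009, "2016-02": 0.0009, "2016-03": 0.0009, "2016-04": 0.0009}


def merge_on_date(*sources):
    """Inner join: keep only dates that have a non-missing value in every source."""
    common_dates = set(sources[0])
    for source in sources[1:]:
        common_dates &= set(source)

    merged = {}
    for date in sorted(common_dates):
        row = tuple(source[date] for source in sources)
        if all(value is not None for value in row):
            merged[date] = row
    return merged


merged = merge_on_date(hedge_fund_returns, etf_returns, etf_expense_ratio)
for date, row in merged.items():
    print(date, row)
```

Of the five months that appear across the three sources, only two survive the merge. That kind of shrinkage is typical, and it is one reason the cleaning stage takes real time and investment.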

In finance, as in so many other industries, there is often an element of institutional inertia which leads to less interest in new ideas. Those who do embrace new tech like data analytics early on are likely to be the ones that see the most benefit. The key to such efforts is investing in new data analytics capacity as a process rather than thinking of it as a one-time effort.

Best of luck in the New Year.

(Note: For those who have tried data analytics techniques with or without success in their firm, I welcome comments and feedback on how your efforts went.)

Mike McDonald is a PhD in finance and a university professor in the subject at Fairfield University in Connecticut. He also runs a consulting company doing work on quantitative investing, big data, and machine learning for a variety of financial firms, asset managers, institutional investors, and government regulators. Prior to getting his PhD, Mike worked for a major Wall Street bank and one of the top hedge funds. Comments, questions, and concerns are always welcome – email Mike at or visit his firm’s website at