Lies, Damned Lies, and Correlation Analysis

"There are three kinds of lies: lies, damned lies, and statistics." - Benjamin Disraeli

Aruspex uses a number of statistical techniques in its strategic workforce planning software; one of them is correlation. Correlation measures the strength of the relationship between two sets of data - price and sales, for example. It is a fascinating and useful tool when done the right way, but it can be fraught with danger for the uninitiated, particularly when applied to workforce data. Today I'm going to cover a little bit about how we use correlation, and some of the traps.
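For readers who want to see what "strength of a relationship" means in numbers, here is a minimal sketch of the Pearson correlation coefficient, the most common measure. The price and sales figures are made up for illustration, and `pearson_r` is a hypothetical helper name, not something taken from our software.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-movement of the two series around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration data: unit price and units sold over six periods.
prices = [10, 12, 14, 16, 18, 20]
sales = [200, 185, 170, 150, 140, 120]

r = pearson_r(prices, sales)
print(f"r = {r:.3f}")  # close to -1: sales fall as price rises
```

The result always lies between -1 and 1. Values near either end indicate a strong linear relationship, and values near 0 indicate little or none.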

Aruspex has used correlation to forecast the future workforce requirements of its clients. In some cases we do this by finding relationships between the size of a workforce and how much that workforce produces. Once we've identified a relationship, we can apply the formula to estimate the size of the workforce required to meet production forecasts. One of our clients used this technique recently to forecast a 10-year staffing plan in a complex environment. The approach worked well for our client because they had variable production levels, and needed to minimize the inefficiencies of being understaffed (the opportunity cost of not being able to meet demand) and its opposite, the "bench" cost (having a workforce that was too large and could be better utilized in other parts of the business).
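To make the idea concrete, here is a rough sketch of that kind of forecast. It assumes the relationship between production and headcount is a straight line and fits it with ordinary least squares. That linearity, the numbers, and the names `fit_line` and `required_headcount` are all assumptions for illustration; this is not the model we built for the client.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical history: annual production (units) and average headcount.
production = [1000, 1200, 1500, 1700, 2000]
headcount = [52, 60, 74, 82, 96]

intercept, slope = fit_line(production, headcount)

def required_headcount(forecast_production):
    """Headcount implied by the fitted line, rounded up to whole people."""
    return math.ceil(intercept + slope * forecast_production)

# Hypothetical production forecast for the next three years.
for year, units in [("year 1", 2200), ("year 2", 2400), ("year 3", 2100)]:
    print(year, units, "units ->", required_headcount(units), "people")
```

Note that every forecast here sits outside the range of the historical data. A fitted line says nothing about whether the relationship holds out there, so forecasts like these deserve a sanity check from someone who knows the operation.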

Despite its uses, however, correlation does have its limitations and traps. Some of these are illustrated by the latest offering from Google Labs, Google Correlate. (The service builds on the technology used by Flu Trends, which analyses which search terms are being run at any moment and can detect possible flu epidemics as they happen.) By loading data sets into this service, you can find correlations between your data set and what people search for on Google. The results are interesting, but not always useful, for the following reasons:

1. Small data sets lead to a lower degree of confidence in the results.

Intuitively, we know that flipping a coin once and getting a "heads" does not mean that we should expect that coin to always return heads. Similarly, insufficient data points in correlation analysis can lead to useless - or worse, misleading - results.
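A small simulation shows how easily this happens. The sketch below generates pairs of completely unrelated random series and counts how often they nonetheless appear strongly correlated. The 0.7 cut-off for "strong", the series lengths, and the function names are arbitrary choices for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def share_of_strong_correlations(n_points, trials=2000, threshold=0.7, seed=1):
    """Fraction of independent random pairs whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    strong = 0
    for _ in range(trials):
        # Two series drawn independently: any correlation is pure chance.
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson_r(xs, ys)) > threshold:
            strong += 1
    return strong / trials

share_small = share_of_strong_correlations(n_points=5)
share_large = share_of_strong_correlations(n_points=100)
print(f"5 data points:   {share_small:.1%} of unrelated pairs look strongly correlated")
print(f"100 data points: {share_large:.1%} of unrelated pairs look strongly correlated")
```

With only five data points, a noticeable share of unrelated pairs clears the bar. With a hundred, it becomes very rare.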

2. Even with large data sets, correlation can be a coincidence.

It turns out that the workforce participation rate in Australia for the past 10 years has been highly correlated with Google searches for "Portland Craigslist". It's fair to say that this is likely to be coincidental, despite the correlation being very high. Most Australians have not heard of Craigslist, and would wonder what shape an Oregon was. Sometimes correlations will occur that don't meet the common sense test.
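One reason coincidences like this turn up is the sheer number of candidates being searched. The sketch below is a simulation, not the participation-rate data: it compares one random series against 2,000 other unrelated random series and reports the best match. Modelling the series as random walks (each value is the previous one plus noise) is an assumption made purely for illustration, chosen because drifting series are especially prone to matching each other by accident.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def random_walk(rng, length):
    """A drifting series: each value is the previous value plus random noise."""
    level = 0.0
    walk = []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

rng = random.Random(42)
n_months = 120  # ten years of monthly observations

# The series we care about, and 2,000 unrelated "search term" series.
target = random_walk(rng, n_months)
candidates = [random_walk(rng, n_months) for _ in range(2000)]

best_r = max(abs(pearson_r(target, c)) for c in candidates)
print(f"strongest correlation among 2000 unrelated series: {best_r:.2f}")
```

None of the candidates has anything to do with the target, yet the best of them matches it closely. Searching a large catalogue for the highest correlation almost guarantees a find, whether or not anything real is there.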

3. Correlation is not the same as causation.

The US Bureau of Labor Statistics releases a quarterly Employment Cost Index. It turns out that the change in this index since 2000 correlates highly with search terms including "appreciation" - that is, an increase in employment cost coincides with an increase in Google searches for the word "appreciation". What this analysis doesn't tell you is whether employees "appreciate" higher remuneration, or whether the higher remuneration is because employers "appreciate" the efforts of their employees. It certainly doesn't explain why either employees or employers would bother to google "appreciation".

4. Sometimes correlation is due to a third factor you haven't considered.

The Employment Cost Index also correlates highly with terms including "House of Pizza" and "Asian Buffet". You could infer that an increase in discretionary income leads people to have celebratory dinners (in which case, Domino's and Pizza Hut could consider a marketing campaign to the "just got a pay rise" market, because the House of Pizza seems to have that market all sewn up. Maybe an AdWords campaign where their ads are triggered on searches for "appreciation"?). Alternatively, it could mean that there is a third factor that drives both wage rises and discretionary spending on restaurants - such as economic sentiment.
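The third-factor effect is easy to reproduce in a simulation. In the sketch below, a made-up "sentiment" series drives both a "wages" series and a "restaurant searches" series, which never influence each other directly. The partial correlation (the standard formula for correlating two variables while holding a third constant) then shows what is left once sentiment is accounted for. All of the data and names are invented for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(7)
n = 500

# Hypothetical hidden driver, e.g. economic sentiment.
sentiment = [rng.gauss(0, 1) for _ in range(n)]
# Both observed series depend on sentiment plus their own independent noise.
wages = [s + rng.gauss(0, 0.5) for s in sentiment]
restaurant_searches = [s + rng.gauss(0, 0.5) for s in sentiment]

raw = pearson_r(wages, restaurant_searches)
controlled = partial_r(wages, restaurant_searches, sentiment)
print(f"raw correlation:               {raw:.2f}")
print(f"after controlling for sentiment: {controlled:.2f}")
```

The raw correlation is high, but it largely disappears once the shared driver is held constant. The catch in practice is that you can only control for a third factor you have thought of and measured.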

5. Just because you find a correlation that makes sense, it doesn't mean you can or should use it.

You may find highly correlating factors that do make sense, but aren't helpful or usable. For example, increasing wages correlated with searches for dentists and child support. You can draw some conclusions from this, but those conclusions are not very actionable.

6. Finally, you'll get some gems.

Often, once you sort through the factors above, you will find some things that are relevant, and of more than academic interest. One example is the relationship between voluntary turnover and reducing salary increments that we found for one of our clients. The relationship between commute times and divorce is an interesting one for consideration in work design (once again, it's not clear whether the long commute is cause or effect). Peaks in voluntary turnover at certain length-of-service intervals can be identified via correlation, where this relationship exists in your data. Optimizing productivity by aligning work hours to employees' biorhythms is an interesting field of research that uses correlation.

So if you're looking to identify useful patterns in your data, remember that correlation analysis can be useful where:

You have a significant number of data points (a lot of employees, a long history of data, or a combination of the two);
You have one or more variables that you want to explore (what are some of the factors related to voluntary turnover, high performance, etc.); and
You keep front-of-mind that correlation never tells you what is cause and what is effect.

P.S. It turns out that the shape of Oregon is almost an irregular trapezoid, in case you were wondering.

posted by Alex Hagan

Strategic Workforce Planning

4:03 AM
