Smart Beta’s Foundation Problem

March 15, 2018

Richard Wiggins is an academic who has long been involved in the Chartered Financial Analyst program and appears in various financial publications.

We trust our life savings to financial quants who impress us because we don’t understand them. Quantitative research provides a voluptuous cushion of reassurance, but what if it’s all based on bad science?

Factor-beta investors are full of passionate intensity, but we’ve read so many deceptive reports that we get misled into thinking we know something, when in fact we have no more insight into smart beta than a parrot into its profanities.

Little-known concepts like p-hacking and HARKing (hypothesizing after the results are known) raise uncomfortable questions, but few of us have heard of them; they’re neologisms that address some very serious epistemological problems behind the research driving smart-beta products to record sales. There’s strong evidence the whole smart-beta approach is anchored by Nobel prize-winning work that was wrong in the first place.

Alarming Results

The rise in “evidence-based investing” is exponential, but few people seem to be aware that there is an analog known as “evidence-based medicine”—and the results are alarming.

In 2011, researchers at the German drug company Bayer found in an extensive survey that more than 75% of published findings could not be validated. It gets worse. In 2012, scientists at the American drug company Amgen published the results of a study in which they selected 53 key papers deemed to be “landmark” studies and tried to reproduce them. Only six could be confirmed. This is not a trivial problem.

What do both “evidence-based” approaches have in common? 1) Reproducibility concerns; and 2) the backbone of their methodology is hypothesis significance testing and the computation of a p-value.

It is routine to look at a low p-value, like p=.01, and conclude there is only a 1% chance the null hypothesis is true—or that we have 99% confidence that the effect is true. Both of these interpretations are incorrect.

The p-value has always been easily misinterpreted and was never as reliable as many scientists presumed. It is often equated with the strength of a relationship, but the p-value reveals almost nothing about the magnitude and relative importance of an effect.
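A small simulation makes the point concrete. Suppose, purely for illustration, that only 10% of the hypotheses researchers test describe a real effect; then among the results that clear p < 0.05, the share that are false can dwarf the nominal 5% rate. The parameters below (effect size, sample size, prior) are invented for the sketch:

```python
import math
import random

random.seed(42)

def z_test_p(sample, mu0=0.0):
    """Two-sided z-test p-value, assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_hyp, n_obs = 10_000, 25
true_effect = 0.5   # mean shift when an effect really exists (hypothetical)
prior_true = 0.10   # assume only 10% of tested hypotheses are real

false_hits = true_hits = 0
for _ in range(n_hyp):
    is_real = random.random() < prior_true
    mean = true_effect if is_real else 0.0
    sample = [random.gauss(mean, 1.0) for _ in range(n_obs)]
    if z_test_p(sample) < 0.05:
        if is_real:
            true_hits += 1
        else:
            false_hits += 1

fdr = false_hits / (false_hits + true_hits)
print(f"Fraction of 'significant' findings that are false: {fdr:.0%}")
```

Under these assumptions, a substantial fraction of "discoveries" at p < 0.05 are false, which is exactly why a low p-value cannot be read as "99% confidence that the effect is true."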

A Lot Of Published Research Is Wrong

Practically every modern textbook on scientific research methods teaches some version of this “hypothetico-deductive approach”; it is the gold standard of statistical validity (“Ph.D. standard”) and the ticket to getting into many journals. But we’ve been misapplying it to such an extent that legendary Stanford epidemiologist John Ioannidis wrote a famous paper titled, “Why Most Published Research Findings Are False.”

He wrote it almost 13 years ago, and it is the most widely cited paper that ever appeared in the journal that published it—and you’ve probably never heard of it.

There’s a “truthiness” to research, but it’s widely accepted among data analysts that “much of the scientific literature, perhaps half, may simply be untrue.” It’s easy to find a study to support whatever it is you want to believe; but the greater the financial and other interests and prejudices, the less likely the findings are to be true.

When UK statistician Ronald Fisher introduced hypothesis significance testing and the p-value in the 1920s, he did not intend for it to be a definitive test. He was looking for an approach that could objectively separate interesting findings from background noise.


The approach we use today is actually a hybrid that grafts Fisher’s significance testing onto the once-competing—and incompatible—hypothesis-testing framework introduced by mathematician Jerzy Neyman and statistician Egon Pearson. All this makes for a shaky foundation—and then we misapplied it.

Traditional scientific induction, i.e., inferring some relation or principle post hoc from a pattern of data, calls to mind the historic image of Marie Curie sifting pitchblende in a Parisian storage shed, performing basic science, but this approach was designed for a world where you declare what you expect beforehand and then run a test.

You have to posit a theory ahead of time. The data cannot precede the theory; otherwise, you risk reversing cause and effect.

Accidents Of Randomness

The p-value was never intended for a world where you let your computer run overnight and work backward to find your thesis. Presenting a post hoc hypothesis—i.e., one based on or informed by the pattern of results—invalidates the inferential statistical process, because a “happy ending” (confirmation) is guaranteed.

The high rate of nonreplication of research discoveries is a consequence of p-hacking and HARKing. It also explains why market premia generally vanish as soon as they’re discovered—because the premium never existed in the first place. Researchers who fiddle around and keep trying different things until they get the result they are looking for aren’t committing fraud; they just don’t understand how the p-value was intended to work.

This isn’t as obvious as telling you that most of your Facebook friends are not really your friends. It’s subtle: P-values assess the reported data, not the theories they are meant to be testing.

What scientists really want to know is whether their hypothesis is true, and if so, how strong the finding is, but a p-value does not give you that—it can never give you that. A study’s findings can be statistically significant yet have an effect size so weak that, in reality, the results are completely meaningless.
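To see how significance and magnitude come apart, consider an illustrative sketch with invented data: two return series whose true means differ by a mere two-hundredths of a standard deviation. With enough observations, this economically negligible gap produces a vanishingly small p-value:

```python
import math
import random

random.seed(3)

n = 500_000
# Hypothetical series: the true means differ by only 0.02 standard
# deviations, an economically meaningless effect.
a = [random.gauss(0.00, 1.0) for _ in range(n)]
b = [random.gauss(0.02, 1.0) for _ in range(n)]

diff = sum(b) / n - sum(a) / n
z = diff / math.sqrt(2.0 / n)   # known sd = 1 for both series
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"effect size: {diff:.4f} sd   z = {z:.2f}   p = {p:.2g}")
```

The p-value screams "significant" while the effect itself is too small to matter—statistical significance and practical importance are different questions.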

Ersatz Theories

Man’s most valuable trait is a judicious sense of what not to believe, and there is no logical reason “quality” stocks should earn a higher return; in fact, I would expect the opposite. Everybody loves to hold consistent earners with high dividends, like Johnson & Johnson and Procter & Gamble. We are concocting ersatz theories that accommodate the results and pasting them on after the fact.

What’s wrong with all of this? Simply put: regression formulas alone—even if they fit the data well—are not a theory. False discoveries multiply with the number of tests: when multiple comparisons are made or multiple hypotheses tested, the chance of a rare event occurring somewhere increases. The statistical framework that associates a probability value of 0.05 with a t-statistic of 2.0 presumes a single test. But often it wasn’t.
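The arithmetic behind that multiplication is simple. Assuming a 5% false-positive rate per independent test, the chance of at least one spurious "discovery" after k tries is 1 − 0.95^k:

```python
# Chance of at least one spurious p < 0.05 across k independent tests,
# assuming a 5% false-positive rate per test.
fp_chance = {k: 1 - 0.95 ** k for k in (1, 10, 20, 100)}
for k, p in fp_chance.items():
    print(f"{k:4d} tests -> {p:5.1%} chance of at least one false positive")
```

Twenty tries already give you roughly even odds of a false "discovery"; a hundred make one all but certain.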

Wall Street’s “factor factories” work in a multiple-test environment; they’re trying thousands of combinations to see what will work. The best t-ratio found climbs as the number of tests grows; while a t-ratio of 3.0 is highly significant in a single test, it is not once you take multiple tests into account.
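A sketch of that effect, using invented pure-noise "strategies": mine a few thousand random return series, each with no premium by construction, and the best t-ratio found reliably clears 3.0 anyway.

```python
import math
import random

random.seed(1)

n_strategies, n_months = 5_000, 120

best_t = 0.0
for _ in range(n_strategies):
    # A pure-noise monthly return series: no real premium exists.
    r = [random.gauss(0.0, 1.0) for _ in range(n_months)]
    mean = sum(r) / n_months
    sd = math.sqrt(sum((x - mean) ** 2 for x in r) / (n_months - 1))
    t = mean / (sd / math.sqrt(n_months))
    best_t = max(best_t, abs(t))

# A t-ratio near 2 is "significant" for a single test, but the best of
# thousands of random tries routinely exceeds 3 on noise alone.
print(f"best |t| among {n_strategies} random strategies: {best_t:.2f}")
```

This is why a backtested t-ratio of 3.0 proves little once you know how many combinations were tried before it was found.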


Questionable Products

The mistake they are making is failing to adjust thresholds for multiple tests, a correction that would require authors to report all experimental conditions, including failed manipulations. If these adjustments were made, most smart-beta products would likely fail.

The wrong statistic applied the wrong way makes you just plain wrong. It is unacceptably easy to publish “statistically significant” evidence, because the hypothesis is arriving after the results are known. Studies with incorrect findings are not rare anomalies, because most people think you can keep trying things in a regression until you find a few variables that “work,” but the variables must be determined beforehand or the p-value will be worthless.

When you manipulate all those variables (“selection bias”) and tweak things, you are shaping your result by exploiting “research degrees of freedom.” You will be wrong most of the time.

Importance Of Analytic Choices

Smart-beta results are heavily reliant on analytic choices. In the course of collecting and analyzing data, researchers have many decisions to make. There are boundless methodological choices: separating the weighting scheme decision from the stock selection decision, etc.

Ken French—arguably one of the people who got this whole smart-beta thing started—may have p-hacked. Simple ranks and sorts are never really all that simple.

What was the argument as to why Ken French did a two-way sort to establish the small-cap premium but opted for a three-way sort—with an excluded middle—for value? It was, shall we say, a little artful. Maybe he used that because it worked; but such subtle manipulations are not allowed.



The truth is, after all, not specific, and smart-beta inputs are ripe for manipulation: Momentum has a negative statistical correlation with fundamental-to-price ratios, making it a particularly suitable companion for producing attractive backtest results.

A Beautiful Lie

Intuitively, employing several measures of value or quality is better than using one, but in fact, studies using multiple signals (“overfitting bias”) should be viewed skeptically, because you want to limit dimensionality; that is why academics originally used one simple parsimonious measure—book-to-price—to assess value. The more complicated multifactor models are actually less reliable. It’s a beautiful lie.
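One way to see the overfitting problem, using purely hypothetical noise data: mine many worthless candidate signals in-sample, pick the one with the best fit, and watch its apparent edge vanish out-of-sample.

```python
import random

random.seed(7)

n_signals, n_in, n_out = 50, 120, 120

def corr(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Target "returns" and all candidate signals are independent noise:
# no signal has any real predictive power by construction.
ret_in = [random.gauss(0, 1) for _ in range(n_in)]
ret_out = [random.gauss(0, 1) for _ in range(n_out)]
signals = [([random.gauss(0, 1) for _ in range(n_in)],
            [random.gauss(0, 1) for _ in range(n_out)])
           for _ in range(n_signals)]

# Mine the in-sample period for the "best" signal, then check it out-of-sample.
winner = max(signals, key=lambda s: abs(corr(s[0], ret_in)))
best_in = abs(corr(winner[0], ret_in))
best_out = abs(corr(winner[1], ret_out))
print(f"in-sample |corr| of mined winner:   {best_in:.2f}")
print(f"out-of-sample |corr| of winner:     {best_out:.2f}")
```

The mined winner looks impressive in the data that chose it and reverts toward zero everywhere else—the signature of selection, not signal.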

We want to make the uncertain certain, but this scientific approach is broken. If smart-beta researchers pick what worked best (p-hacking) and then come up with a theory (HARKing) to explain things à la “low beta” or “quality,” then historical efficacy is exaggerated and its future efficacy may evaporate entirely.

Previous performance won’t be indicative; in fact, it may even have the wrong sign, as strategies that have become far more expensive than in the past suffer negative revaluation alpha.

Ah, regressions; if only it were that easy. Historical average premia are always in-sample estimates. The past is only one version of history, and it is dreadfully meretricious.

The Victorian fallacy that objective facts assemble themselves into a truth can’t handle the reality that long-lived phenomena are often the byproduct of a single accidental moment in history.

Every truth is temporary. The market doesn’t have to adhere to a prescribed return pattern. Advice like this is a form of nostalgia mixed with bad math.

And for those of you excited about the potential of Big Data, I have two words of advice: Be careful.

Richard Wiggins is a thought leader on APViewpoint, which is an online community maintained by Advisor Perspectives; a past president of the CFA Society of Detroit; and a periodic contributor to Barron's, Institutional Investor, the Journal of Indexes and other practitioner journals. He also is an author/contributor/abstractor to the CFA Digest and past member of the Council of Examiners, which authors the Chartered Financial Analyst Exam.
