10 commandments for judging the merits of an empirical paper: A guide for reviewers

Applied Institute for Research in Economics

Peter Howley is Professor of Economics and Behavioural Science at Leeds University Business School. Much of Peter’s research is at the interface between economics, psychology and sociology.

This article was first published on “Towards Data Science”.

There are a host of pathologies associated with the current peer review system, and these have been the subject of much discussion. One of the most substantive is that the papers published in leading journals are commonly those with the most exciting and newsworthy findings. The problem is that what is novel and newsworthy to some may be overreaching, with questionable validity, to others. The ability to publish ‘sexy’ findings of questionable validity is often facilitated by a variety of problems in the research design, such as small samples (and the associated winner’s curse), multiple comparisons, and the selective reporting of results.
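The multiple-comparisons problem can be made concrete with a short simulation (a minimal sketch using only the Python standard library; the z-test approximation and all numbers are purely illustrative): test enough noise-only hypotheses at the conventional 5% level and some will come up ‘significant’ by chance, and a selectively reported subset of those can look like a string of genuine discoveries.

```python
import math
import random

random.seed(0)

def two_sample_p(a, b):
    """Approximate two-sided p-value for a difference in means (z-test)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 1,000 independent comparisons where the true effect is exactly zero
n_sig = sum(
    two_sample_p([random.gauss(0, 1) for _ in range(50)],
                 [random.gauss(0, 1) for _ in range(50)]) < 0.05
    for _ in range(1000)
)
print(n_sig)  # roughly 50 "discoveries" from pure noise
```

This is why reviewers should ask how many outcomes, sub-groups, and specifications were examined, not just which ones made it into the paper.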

Fortunately, these issues have been the subject of much discussion and self-reflection amongst scientists across all disciplines. While career incentives may lead researchers to be careless with their analysis in order to publish exciting findings, most often these issues stem from misunderstanding, coupled with cognitive biases such as confirmation bias to which we are all susceptible (we tend to see only the evidence we want to see), rather than from any malfeasance. Ultimately, I feel much of this is a problem with statistics education, and more generally a focus on teaching techniques rather than a problem-orientated skillset with the appropriate emphasis on critical thinking. Certainly, upon graduation I could apply all manner of analytical techniques without really understanding what I was doing!

Much of the focus in this area has correctly been on trying to get researchers to change their behaviour by being more reflective and also transparent in the presentation of their methods. Less focus has been placed on the behaviour of reviewers and editors. What should reviewers be on the lookout for? It can be hard to distinguish between novel yet valid results versus those of questionable validity, particularly for those without a great deal of experience working with data. This blog is an attempt to provide some general rules to guide reviewers.

In particular, in the spirit of the insightful and engaging 10 commandments of applied econometrics, I have prepared 10 commandments for reviewers. The list is by no means exhaustive, and it is aimed predominantly at empirical, as opposed to purely theoretical, papers. One of the main motivations, if not the main one, for how researchers structure and write their papers is what they believe editors and reviewers are most likely to deem publishable. Researchers respond to incentives, and I suggest that if reviewers follow these simple steps we can change the ‘rules of the game’ and, ultimately, change submission practices and behaviour.

1. Be more open to uncertainty: Reviewers, at least in my experience, have a preference for strong statements regarding causality, but things are not always so black and white. It should be okay (encouraged even) for authors to present their findings as suggestive, acknowledge limitations and suggest what future work is needed. Increasingly, however, what reviewers demand is unfailing certainty. Researchers often aim at demonstrating ‘proof’ of the estimated effect and, through a series of robustness checks, at ruling out every possible alternative interpretation of the findings. This, in turn, can lead authors to overstate their findings for fear of being punished by reviewers if they were more circumspect.

2. Be more accepting of small/modest effect sizes: Not every study will demonstrate that changes in the key variable of interest lead to big changes in the outcome under examination. Indeed, most will not, or at least should not: research proceeds incrementally, and demanding large effect sizes is unrealistic. The main problem is that effect sizes can often be presented in a number of different ways, and such expectations distort incentives, so that authors find creative ways of presenting their estimated effects as ‘large’. While publication should not depend on demonstrating large effect sizes, neither should effect sizes be so small as to be trivial and unimportant. Statistical significance alone should not be enough. Beyond the misreporting of the actual magnitude of effect sizes, an issue that is just as problematic is that it is relatively common for authors to make little effort to report effect sizes at all. Reviewers should discourage this too.
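One common way the same estimate can be made to look ‘large’ is by reporting relative rather than absolute changes. A toy illustration (the numbers are entirely hypothetical):

```python
# Same hypothetical finding, two framings: relative vs absolute risk.
baseline = 0.001   # 0.1% of the control group experience the outcome
treated = 0.002    # 0.2% of the treated group do

relative_change = (treated - baseline) / baseline   # "a 100% increase!"
absolute_change = treated - baseline                # 0.1 percentage points

print(f"relative: +{relative_change:.0%}, absolute: +{absolute_change:.3f}")
```

Both framings are arithmetically correct, which is exactly why reviewers should ask for the absolute magnitude alongside any relative claim.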

3. Don’t be fooled/impressed by complexity: Econometric complexity should not be mistaken for rigour. Simple analyses are often not only easier to understand and communicate but also less likely to lead to serious errors or lapses. If complicated models are needed, then ensure that the researchers have presented all the necessary detail so that the ‘technical detail’ can be readily understood. Some examples of prudent questions to ask depending on context might include: What does the simple bivariate relationship look like? Do the results hold up even without the somewhat strange-looking functional specification? What do the results look like from a simple comparison of the treatment with the control group before the addition of control variables?

4. As a natural extension of the above, apply the laugh test: Apply what Kennedy (2002) refers to as the ‘laugh’ test, or what Hamermesh (2000, p. 374) calls the ‘sniff’ test: ‘ask oneself whether, if the findings were carefully explained to a thoughtful layperson, that listener could avoid laughing’. If the results appear too good to be true, they often are. This is particularly the case in low-powered studies (those with small sample sizes), where what looks like an unexpected or novel finding is often just a random fluctuation in a noisy dataset and unlikely to be a reproducible effect.
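The low-power point can be demonstrated with a simulation of the significance filter (a stdlib-only sketch; the known-variance z-test and all parameter values are illustrative, not from any real study): when the true effect is modest and samples are small, the estimates that happen to clear p < 0.05 systematically overstate the truth, which is one reason ‘too good to be true’ results from small studies should raise an eyebrow.

```python
import math
import random

random.seed(1)

TRUE_EFFECT = 0.2  # modest true difference in means (in sd units)
N = 20             # small sample per group -> a low-powered study

def p_value(diff, n):
    """Two-sided p-value for a difference in means, assuming unit variance."""
    z = diff / math.sqrt(2 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# run 5,000 simulated studies; keep only the "significant" estimates
sig_estimates = []
for _ in range(5000):
    a = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    b = [random.gauss(0, 1) for _ in range(N)]
    diff = sum(a) / N - sum(b) / N
    if p_value(diff, N) < 0.05:
        sig_estimates.append(abs(diff))

exaggeration = sum(sig_estimates) / len(sig_estimates) / TRUE_EFFECT
print(f"'significant' estimates overstate the true effect {exaggeration:.1f}x")
```

The surviving estimates are several times the true effect, purely as an artefact of selecting on significance in a noisy, underpowered design.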

5. Ask the right questions: Some potentially useful questions that I find myself commonly asking include: Is the explanation for why the results are only observed for a particular sub-group, or in a particular situation, plausible? Related to this, I often find myself asking: is the explanation being derived to fit the results (observational data are inherently noisy!), or is it reasonable for the authors to have held these priors in advance? Are the substantive results greatly affected by adopting different procedures that seem more sensible, such as changes to the functional form or the selection of control variables? I want to emphasise that these questions should not be designed to ‘null hack’ findings away (consider also point 10 here) but rather to get a general sense of how sensible the analysis is and whether the conclusions are warranted.

6. Don’t be fooled by, or ask for too much by way of, robustness checks: Robustness checks can be important, but they should not be what persuades you of the veracity of the main findings. It is also worth noting that, from an author’s perspective, additional robustness checks can add substantial time with often little additional benefit.

7. Don’t discriminate: Judge the research on its merits, not by whether it’s a topic you like or whether it comes from big-name researchers or institutes. Publication bias is such that the people doing careful and considered research may not always be the ones with big ‘reputations’.

8. Related to the point above, and aimed at editors: don’t desk reject based on how newsworthy the reported findings are.

9. Be wary of grandiose statements, convoluted language or abstract theoretical frameworks.

10. Don’t obsess over p-values: As the American Statistical Association’s statement on p-values puts it, “A conclusion does not immediately become ‘true’ on one side of the divide and ‘false’ on the other”. A p-value coupled with an estimated effect size conveys some useful information, but it should not be the be all and end all. As the ASA statement concludes, “No single index should substitute for scientific reasoning.” Personally speaking, there are many things I look for in judging the scientific merits of a paper (many of which are discussed above), but the actual reported p-value, whether it be .01 or .10, is far down the list.
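A small simulation shows why a single p-value deserves less reverence than it gets (a stdlib-only sketch; the z-test approximation and the chosen effect size and sample size are illustrative): replicating the very same study, with the very same true effect, yields p-values scattered from the strongly ‘significant’ to the comfortably ‘non-significant’.

```python
import math
import random

random.seed(2)

def one_study(n=40, effect=0.4):
    """Simulate one two-group study and return its two-sided p-value."""
    a = [random.gauss(effect, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    z = diff / math.sqrt(2 / n)  # known-variance z-test for simplicity
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 100 exact replications of the same underlying study
pvals = sorted(one_study() for _ in range(100))
print(f"p-values range from {pvals[0]:.4f} to {pvals[-1]:.2f}")
```

Treating .05 as a bright line means identical underlying realities get sorted into opposite verdicts by sampling noise alone.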

Contact us

If you would like to get in touch regarding any of these blog entries, or are interested in contributing to the blog, please contact:

Email: research.lubs@leeds.ac.uk
Phone: +44 (0)113 343 8754


The views expressed in this article are those of the author and may not reflect the views of Leeds University Business School or the University of Leeds.