AI is being deployed at speed to review scientific work for conferences, journals, and grant panels. But can AI review? We use submissions to the Utah Winter Finance Conference (UWFC), one of the most selective and longest-running boutique finance conferences, to find out. Each UWFC paper is scored by two members of the program committee on a 1 (best) to 5 (worst) scale; between 2016 and 2025, the committee drew on a rotating pool of 121 leading finance researchers. AI is asked to score each paper the same way, on the same scale. We repeat the study with each new frontier flagship release. Detailed methodology in the companion paper (coming soon).

Findings

  1. AI struggles to identify exceptional papers. Human reviewers use the full 1 to 5 range and give the top score (1) to 12% of papers. AI models give middle, uncontroversial scores of 2 or 3 to 74–80% of papers and almost never give the top score. Prompting AI to be more or less critical does not solve the problem: it just shifts the whole distribution without singling out the exceptional papers.
  2. AI's rankings track human rankings only weakly. The highest correlation between AI scores and human scores is only 0.31, from Opus 4.7 (0 = random ranking, 1 = complete agreement). Who is right when they disagree? Using future citations as a quality proxy, humans clearly beat AI at flagging both exceptional papers and clearly weak papers; AI's predictive advantage is concentrated in the middle of the distribution, where human reviewers tend to give middling scores and the human ranking is largely flat with respect to citations.
  3. The gap is closing fast. From November 2025 to April 2026, OpenAI released GPT‑5.1 → 5.4 → 5.5 and Anthropic released Opus 4.5 → 4.6 → 4.7. Across the two lineages, the distributional distance to the human reference (the L1 distance between the two score histograms) fell by an average of 47%, and the correlation rose by an average of 55%.
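The distributional distance in finding 3 can be made concrete. A minimal sketch, in Python with illustrative made-up scores (the function names and example data are our own, not from the study): normalize each reviewer's scores into a histogram over the 1 to 5 scale, then take the L1 distance between the two histograms.

```python
from collections import Counter

def score_histogram(scores, levels=range(1, 6)):
    # Normalized histogram of review scores over the 1 (best) to 5 (worst) scale.
    counts = Counter(scores)
    n = len(scores)
    return [counts.get(s, 0) / n for s in levels]

def l1_distance(scores_a, scores_b):
    # L1 distance between two score histograms: sum of absolute
    # differences in the share of papers at each score level.
    ha = score_histogram(scores_a)
    hb = score_histogram(scores_b)
    return sum(abs(a - b) for a, b in zip(ha, hb))

# Hypothetical example: humans use the full range, AI clusters on 2-3.
human_scores = [1, 1, 2, 3, 4, 5, 3, 2]
ai_scores    = [2, 3, 3, 2, 3, 2, 3, 2]
print(l1_distance(human_scores, ai_scores))  # → 1.0
```

A falling L1 distance across model releases means the AI score histogram is converging toward the human reference distribution, even though (per finding 2) per-paper agreement can remain weak.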

Score detail

[Figure: Score distributions — all-model average vs. human reference]

[Figure: Per-paper scores, AI vs. human; each dot is one paper]