Publications

2023
T. Lin and Y. Chen, “Sample Complexity of Forecast Aggregation,” in Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. arXiv
2022
S. Feng, F.-Y. Yu, and Y. Chen, “Peer Prediction for Learning Agents,” Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022. arXiv
Peer prediction refers to a collection of mechanisms for eliciting information from human agents when direct verification of the obtained information is unavailable. These mechanisms are designed to have a game-theoretic equilibrium where everyone reveals their private information truthfully. This result holds under the assumption that agents are Bayesian and that each adopts a fixed strategy across all tasks. Human agents, however, are observed in many domains to exhibit learning behavior in sequential settings. In this paper, we explore the dynamics of sequential peer prediction mechanisms when participants are learning agents. We first show that the notion of no regret alone for the agents' learning algorithms cannot guarantee convergence to the truthful strategy. We then focus on a family of learning algorithms where strategy updates only depend on agents' cumulative rewards and prove that agents' strategies in the popular Correlated Agreement (CA) mechanism converge to truthful reporting when they use algorithms from this family. This family of algorithms is not necessarily no-regret, but includes several familiar no-regret learning algorithms (e.g., multiplicative weights update and Follow the Perturbed Leader) as special cases. Simulation of several algorithms in this family, as well as the ϵ-greedy algorithm, which is outside of this family, shows convergence to the truthful strategy in the CA mechanism.
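For concreteness, here is a minimal sketch (ours, not the paper's) of a binary-signal CA score and a cumulative-reward-driven multiplicative weights update; the identity agreement matrix, the task sampling, and all names are simplifying assumptions.

import numpy as np

rng = np.random.default_rng(0)

def ca_scores(reports_i, reports_j):
    # Binary-signal Correlated Agreement score for agent i against peer j:
    # +1 for agreeing on the same ("bonus") task, -1 for agreeing on a
    # randomly drawn distinct ("penalty") task. Assumes the identity
    # agreement matrix of the binary CA mechanism.
    n = len(reports_i)
    scores = np.empty(n)
    for t in range(n):
        u = rng.choice([k for k in range(n) if k != t])
        scores[t] = float(reports_i[t] == reports_j[t]) - float(reports_i[t] == reports_j[u])
    return scores

def mwu_update(weights, cumulative_rewards, eta=0.1):
    # Multiplicative weights over pure reporting strategies; the update
    # depends only on cumulative rewards, as in the family the paper studies.
    w = weights * np.exp(eta * np.asarray(cumulative_rewards))
    return w / w.sum()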
Y. Liu, J. Wang, and Y. Chen, “Surrogate Scoring Rules,” ACM Transactions on Economics and Computation, 2022. Publisher's Version
Strictly proper scoring rules (SPSR) are incentive compatible for eliciting information about random variables from strategic agents when the principal can reward agents after the realization of the random variables. They also quantify the quality of elicited information, with more accurate predictions receiving higher scores in expectation. In this article, we extend such scoring rules to settings in which a principal elicits private probabilistic beliefs but only has access to agents’ reports. We name our solution Surrogate Scoring Rules (SSR). SSR are built on a bias correction step and an error rate estimation procedure for a reference answer defined using agents’ reports. We show that, with a little information about the prior distribution of the random variables, SSR in a multi-task setting recover SPSR in expectation, as if having access to the ground truth. Therefore, a salient feature of SSR is that they quantify the quality of information despite the lack of ground truth, just as SPSR do for the setting with ground truth. As a by-product, SSR induce dominant uniform strategy truthfulness in reporting. Our method is verified both theoretically and empirically using data collected from real human forecasters.
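As a rough illustration of the bias-correction idea (our notation, not the article's): if a reference answer $Z$ is a noisy copy of the ground truth $Y$ with known flip rates $e_0 = \Pr[Z=1 \mid Y=0]$ and $e_1 = \Pr[Z=0 \mid Y=1]$, a proper score $S$ can be debiased as

$$\hat{S}(r, z) = \frac{(1 - e_{1-z})\, S(r, z) - e_z\, S(r, 1-z)}{1 - e_0 - e_1}, \qquad \mathbb{E}_Z\big[\hat{S}(r, Z) \mid Y = y\big] = S(r, y),$$

so scoring against the noisy reference recovers, in expectation, the score against the ground truth.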
M. Gordon, et al., “Forecasting the Publication and Citation Outcomes of COVID-19 Preprints,” Royal Society Open Science (RSOS), 2022. Publisher's Version
Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are and which ones will eventually be published in scientific journals. In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints were published within 1 year of upload on a preprint server (70%), with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10. On average, the preprints received 162 citations within the first year. We found that forecasters can predict if preprints will be published after 1 year and if the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload on a preprint server. For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help to provide a preliminary assessment of preprints at a faster pace than traditional peer-review, it remains to be investigated if such an assessment is suited to identify methodological problems in preprints.
Y. Chen, A. Eden, and J. Wang, “Cursed yet Satisfied Agents,” Conference on Innovations in Theoretical Computer Science (ITCS), vol. 215, pp. 44:1–44:1, 2022. arXiv
In real-life auctions, a widely observed phenomenon is the winner's curse -- the winner's high bid implies that the winner often over-estimates the value of the good for sale, resulting in negative utility. The seminal work of Eyster and Rabin [Econometrica '05] introduced a behavioral model aimed at explaining this observed anomaly. We term agents who display this bias "cursed agents". We adopt their model in the interdependent value setting, and aim to devise mechanisms that prevent the cursed agents from obtaining negative utility. We design mechanisms that are cursed ex-post IC, that is, they incentivize agents to bid their true signal even though they are cursed, while ensuring that the outcome is individually rational -- the price the agents pay is no more than the agents' true value. Since the agents might over-estimate the good's value, such mechanisms might require the seller to make positive transfers to the agents to prevent agents from over-paying. For revenue maximization, we give the optimal deterministic and anonymous mechanism. For welfare maximization, we require ex-post budget balance (EPBB), as positive transfers might lead to negative revenue. We propose a masking operation that takes any deterministic mechanism and imposes that the seller not make positive transfers, enforcing EPBB. We show that in typical settings, EPBB implies that the mechanism cannot make any positive transfers, implying that applying the masking operation on the fully efficient mechanism results in a socially optimal EPBB mechanism. This further implies that if the valuation function is the maximum of agents' signals, the optimal EPBB mechanism obtains zero welfare. In contrast, we show that for sum-concave valuations, which include weighted-sum valuations and l_p-norms, the welfare-optimal EPBB mechanism obtains half of the optimal welfare as the number of agents grows large.
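For readers unfamiliar with the Eyster–Rabin model, a $\chi$-cursed agent's perceived value can be written schematically (our notation) as

$$w_i^{\chi}(s) = (1-\chi)\,\mathbb{E}\big[v_i \mid s\big] + \chi\,\mathbb{E}\big[v_i \mid s_i\big],$$

a mixture of the correct interdependent-value expectation given all signals $s$ and the naive expectation that ignores the informational content of others' signals; $\chi = 1$ is the fully cursed case.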
2021
X. Yan, “Optimal Crowdfunding Design,” Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 1704–1706, 2021. Long version
Crowdfunding has become an effective way of raising capital for developing and producing new products. In a typical crowdfunding campaign, a fundraiser (a seller) sets a product pre-buy price and a campaign success threshold. Consumers can indicate their willingness to pay the pre-buy price now, in exchange for a product in the future. Only when the number of consumers who commit to pre-buy exceeds the threshold is the crowdfunding successful, and only then does the seller receive the corresponding pre-buy payments. A series of works has modeled crowdfunding as imperfect information games, where each player (potential contributor) has a private valuation for the product, and has characterized the equilibrium behavior in some specific settings [1–4]. Other works focus on the effectiveness or the moral hazard of crowdfunding campaigns, regarding them as an option to raise money for either private [5, 9, 10] or public projects [6–8]. However, the strategic aspect of crowdfunding from the seller's perspective has not received the attention it deserves. In this paper, we take a mechanism design perspective and explore how a seller can design crowdfunding campaigns to maximize his profit. In addition to choosing the pre-buy price and the campaign success threshold for a crowdfunding campaign, the seller in our paper can consider two richer design variants: (1) choose two price-threshold pairs where a pre-buy price discount is given when the number of committed buyers exceeds the larger threshold, and (2) offer two differentiated products (simplified vs. standard) and set two success thresholds such that the advanced product will be delivered if the larger threshold is reached. We examine the optimal profit achieved in each scheme. Somewhat surprisingly, the richer design variants may not improve the seller's profit.
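As a toy illustration of the basic (single price, single threshold) scheme, here is a naive revenue computation under a sincere-participation assumption that sidesteps the paper's equilibrium analysis; all names and the uniform-value example are ours.

from math import comb

def expected_revenue(n, price, threshold, q):
    # Expected revenue of one (price, threshold) campaign when each of n
    # i.i.d. consumers commits with probability q = P(value >= price) and
    # payments are collected only if at least `threshold` consumers commit.
    return sum(price * b * comb(n, b) * q**b * (1 - q)**(n - b)
               for b in range(threshold, n + 1))

# Grid search over pre-buy prices for uniform[0, 1] values (so q = 1 - price).
best_price = max((p / 100 for p in range(1, 100)),
                 key=lambda p: expected_revenue(20, p, 8, 1 - p))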
B. Green, “Algorithmic Risk Assessments Can Alter Human Decision-Making Processes in High-Stakes Government Contexts,” Proceedings of the 24th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), vol. 5, no. CSCW2, pp. 1–33, 2021.
Governments are increasingly turning to algorithmic risk assessments when making important decisions, such as whether to release criminal defendants before trial. Policymakers assert that providing public servants with algorithmic advice will improve human risk predictions and thereby lead to better (e.g., fairer) decisions. Yet because many policy decisions require balancing risk-reduction with competing goals, improving the accuracy of predictions may not necessarily improve the quality of decisions. If risk assessments make people more attentive to reducing risk at the expense of other values, these algorithms would diminish the implementation of public policy even as they lead to more accurate predictions. Through an experiment with 2,140 lay participants simulating two high-stakes government contexts, we provide the first direct evidence that risk assessments can systematically alter how people factor risk into their decisions. These shifts counteracted the potential benefits of improved prediction accuracy. In the pretrial setting of our experiment, the risk assessment made participants more sensitive to increases in perceived risk; this shift increased the racial disparity in pretrial detention by 1.9%. In the government loans setting of our experiment, the risk assessment made participants more risk-averse; this shift reduced government aid by 8.3%. These results demonstrate the potential limits and harms of attempts to improve public policy by incorporating predictive algorithms into multifaceted policy decisions. If these observed behaviors occur in practice, presenting risk assessments to public servants would generate unexpected and unjust shifts in public policy without being subject to democratic deliberation or oversight.
Y. Chen, B. Tao, and F.-Y. Yu, “Cooperation in Threshold Public Projects with Binary Actions,” Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), pp. 104–110, 2021. Publisher's Version
When can cooperation arise from self-interested decisions in public goods games? And how can we help agents to act cooperatively? We examine these classical questions in a pivotal participation game, a variant of public goods games, where heterogeneous agents make binary participation decisions on contributing their endowments, and the public project succeeds when it has enough contributions. We prove that it is NP-complete to decide the existence of a cooperative Nash equilibrium in which the project succeeds. We demonstrate that the decision problem becomes easy if agents are sufficiently homogeneous. We then propose two algorithms to help cooperation in the game. Our first algorithm adds an external investment to the public project, and our second algorithm uses matching funds. We show that the cost of inducing a cooperative Nash equilibrium is near-optimal for both algorithms. Finally, the cost of matching funds can always be made no larger than the cost of adding an external investment. Intuitively, matching funds provide a greater incentive for cooperation than adding an external investment does.
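A minimal sketch of the equilibrium check underlying the decision problem, in a simplified model where agent i contributes cost c[i] if participating and gains value v[i] if the project succeeds (names and model details are our assumptions, not the paper's notation):

def is_cooperative_ne(c, v, threshold, participants):
    # Is "everyone in `participants` contributes, all others abstain" a Nash
    # equilibrium in which the project succeeds?
    total = sum(c[i] for i in participants)
    if total < threshold:
        return False  # the project would fail
    for i in participants:
        if total - c[i] >= threshold and c[i] > 0:
            return False  # i is not pivotal, so free-riding is profitable
        if v[i] < c[i]:
            return False  # contributing is not worthwhile even when pivotal
    return True

The hardness result says that searching over subsets of agents for one that passes this check is intractable in general.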
J. Wang and Y. Liu, “Forecast Aggregation via Peer Prediction,” Proceedings of the 9th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2021. Publisher's Version
Crowdsourcing enables the solicitation of forecasts on a variety of prediction tasks from distributed groups of people. How to aggregate the solicited forecasts, which may vary in quality, into an accurate final prediction remains a challenging yet critical question. Studies have found that weighing expert forecasts more in aggregation can improve the accuracy of the aggregated prediction. However, this approach usually requires access to the historical performance data of the forecasters, which are often not available. In this paper, we study the problem of aggregating forecasts without having historical performance data. We propose using peer prediction methods, a family of mechanisms initially designed to truthfully elicit private information in the absence of ground truth verification, to assess the expertise of forecasters, and then using this assessment to improve forecast aggregation. We evaluate our peer-prediction-aided aggregators on a diverse collection of 14 human forecast datasets. Compared with a variety of existing aggregators, our aggregators achieve a significant and consistent improvement on aggregation accuracy measured by the Brier score and the log score. Our results reveal the effectiveness of identifying experts to improve aggregation even without historical data.
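The high-level recipe can be sketched as follows; the softmax weighting is our illustrative choice, not necessarily the aggregator used in the paper.

import numpy as np

def pp_weighted_aggregate(forecasts, pp_scores, temperature=1.0):
    # forecasts: one probabilistic forecast per forecaster for a single event.
    # pp_scores: peer-prediction scores used as a proxy for expertise.
    z = (np.asarray(pp_scores) - np.max(pp_scores)) / temperature
    w = np.exp(z)
    w /= w.sum()
    return float(np.dot(w, forecasts))

def brier_score(p, outcome):
    # Lower is better; one of the two accuracy measures used in the paper.
    return (p - outcome) ** 2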
S. Zheng, F.-Y. Yu, and Y. Chen, “The Limits of Multi-task Peer Prediction,” Proceedings of the 22nd ACM Conference on Economics and Computation (EC), pp. 907–926, 2021. Publisher's Version
Recent advances in multi-task peer prediction have greatly expanded our knowledge about the power of multi-task peer prediction mechanisms. Various mechanisms have been proposed in different settings to elicit different types of information. But we still lack an understanding of when desirable mechanisms exist for a multi-task peer prediction problem. In this work, we study the elicitability of multi-task peer prediction problems. We consider a designer who has certain knowledge about the underlying information structure and wants to elicit certain information from a group of participants. Our goal is to infer the possibility of having a desirable mechanism based on the primitives of the problem. Our contribution is twofold. First, we provide a characterization of the elicitable multi-task peer prediction problems, assuming that the designer only uses scoring mechanisms. Scoring mechanisms are those that reward participants' reports for each task separately. The characterization uses a geometric approach based on the power diagram characterization in the single-task setting (Lambert and Shoham, 2009; Frongillo and Witkowski, 2017). For general mechanisms, we also give a necessary condition for a multi-task problem to be elicitable. Second, we consider the case when the designer aims to elicit some properties that are linear in the participant's posterior about the state of the world. We first show that in some cases, the designer essentially can only elicit the posterior itself. We then look into the case when the designer aims to elicit the participants' posteriors. We give a necessary condition for the posterior to be elicitable. This condition implies that the mechanisms proposed by Kong and Schoenebeck are already the best we can hope for in their setting, in the sense that their mechanisms can solve any problem instance that is elicitable at all.
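For orientation, a scoring mechanism in this sense pays each agent task by task (our notation):

$$p_i(r) = \sum_{t=1}^{T} s_t\big(r_i^{t}, r_{-i}^{t}\big),$$

so the reward attached to a report on task $t$ depends only on the reports submitted for task $t$.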
S. Zheng, “Optimal Advertising for Information Products,” Proceedings of the 22nd ACM Conference on Economics and Computation (EC), pp. 888–906, 2021. Publisher's Version
When selling information products, the seller can provide some free partial information to change people's valuations and thereby potentially increase overall revenue. We study the general problem of advertising information products by revealing partial information. We consider buyers who are decision-makers. The outcomes of the decision problems depend on the state of the world that is unknown to the buyers. The buyers can make their own observations and thus can hold different personal beliefs about the state of the world. There is an information seller who has access to the state of the world. The seller can promote the information by revealing some partial information. We assume that the seller chooses a long-term advertising strategy and then commits to it. The seller's goal is to maximize the expected revenue. We study the problem in two settings. (1) The seller targets buyers of a certain type. In this case, finding the optimal advertising strategy is equivalent to finding the concave closure of a simple function. The function is a product of two quantities, the likelihood ratio and the cost of uncertainty. Based on this observation, we prove some properties of the optimal mechanism, which allow us to solve for the optimal mechanism by a finite-size convex program. The convex program has polynomial size if the state of the world has a constant number of possible realizations or the buyers face a decision problem with a constant number of options. For the general problem, we prove that it is NP-hard to find the optimal mechanism. (2) When the seller faces buyers of different types and only knows the distribution of their types, we provide an approximation algorithm when it is not too hard to predict the possible type of buyers who will make the purchase. For the general problem, we prove that it is NP-hard to find a constant-factor approximation.
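Schematically, and in our notation rather than the paper's, the targeted-type case solves a concavification problem of the form

$$\mathrm{Rev}^{*} = (\mathrm{cav}\, f)(\mu_0), \qquad f(\mu) = L(\mu) \cdot C(\mu),$$

where $\mu$ ranges over the posteriors the advertisement can induce, $\mu_0$ is the prior, and $L$ and $C$ are the likelihood-ratio and cost-of-uncertainty terms mentioned above; the concave closure is what the finite-size convex program computes.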
2020
B. Green and Y. Chen, “Algorithm-in-the-Loop Decision Making,” 34th AAAI Conference on Artificial Intelligence (AAAI), Sister Conference Track, vol. 50, pp. 1–24, 2020. Publisher's Version
The rise of machine learning has fundamentally altered decision making: rather than being made solely by people, many important decisions are now made through an "algorithm-in-the-loop'' process where machine learning models inform people. Yet insufficient research has considered how the interactions between people and models actually influence human decisions. Society lacks both clear normative principles regarding how people should collaborate with algorithms and robust empirical evidence about how people do collaborate with algorithms. Given research suggesting that people struggle to interpret machine learning models and to incorporate them into their decisions---sometimes leading these models to produce unexpected outcomes---it is essential to consider how different ways of presenting models and structuring human-algorithm interactions affect the quality and type of decisions made. This paper contributes to such research in two ways. First, we posited three principles as essential to ethical and responsible algorithm-in-the-loop decision making. Second, through a controlled experimental study on Amazon Mechanical Turk, we evaluated whether people satisfy these principles when making predictions with the aid of a risk assessment. We studied human predictions in two contexts (pretrial release and financial lending) and under several conditions for risk assessment presentation and structure. Although these conditions did influence participant behaviors and in some cases improved performance, only one desideratum was consistently satisfied. Under all conditions, our study participants 1) were unable to effectively evaluate the accuracy of their own or the risk assessment's predictions, 2) did not calibrate their reliance on the risk assessment based on the risk assessment's performance, and 3) exhibited bias in their interactions with the risk assessment. These results highlight the urgent need to expand our analyses of algorithmic decision making aids beyond evaluating the models themselves to investigating the full sociotechnical contexts in which people and algorithms interact.
L. Hu, “Fair Classification and Social Welfare,” Proceedings of the Third ACM Conference on Fairness, Accountability and Transparency (FAT*), Barcelona, Spain, pp. 535–545, 2020 (supersedes the MD4SG’19 paper). Publisher's Version
Now that machine learning algorithms lie at the center of many important resource allocation pipelines, computer scientists have been unwittingly cast as partial social planners. Given this state of affairs, important questions follow. How do leading notions of fairness as defined by computer scientists map onto longer-standing notions of social welfare? In this paper, we present a welfare-based analysis of fair classification regimes. Our main findings assess the welfare impact of fairness-constrained empirical risk minimization programs on the individuals and groups who are subject to their outputs. We fully characterize the ranges of Δε perturbations to a fairness parameter ε in a fair Soft Margin SVM problem that yield better, worse, and neutral outcomes in utility for individuals and by extension, groups. Our method of analysis allows for fast and efficient computation of "fairness-to-welfare" solution paths, thereby allowing practitioners to easily assess whether and which fair learning procedures result in classification outcomes that make groups better off. Our analyses show that applying stricter fairness criteria codified as parity constraints can worsen welfare outcomes for both groups. More generally, always preferring "more fair" classifiers does not abide by the Pareto Principle---a fundamental axiom of social choice theory and welfare economics. Recent work in machine learning has rallied around these notions of fairness as critical to ensuring that algorithmic systems do not have disparate negative impact on disadvantaged social groups. By showing that these constraints often fail to translate into improved outcomes for these groups, we cast doubt on their effectiveness as a means to ensure fairness and justice.
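As a reference point for the program being perturbed, a generic parity-constrained soft-margin SVM has the form (a schematic, not the paper's exact formulation):

$$\min_{w, b, \xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t.}\quad y_i(w^{\top} x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ \big|g_A(w, b) - g_B(w, b)\big| \le \varepsilon,$$

where $g_A$ and $g_B$ are group-level statistics (e.g., positive-classification rates) for groups $A$ and $B$; the welfare analysis tracks how outcomes change as $\varepsilon$ is perturbed by $\Delta\varepsilon$.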
Y. Liu, et al., “Replication Markets: Results, Lessons, Challenges and Opportunities in AI Replication,” AAAI Workshop on Reproducible AI (RAI), 2020. Publisher's Version
The last decade saw the emergence of systematic large-scale replication projects in the social and behavioral sciences (Camerer et al., 2016, 2018; Ebersole et al., 2016; Klein et al., 2014, 2018; Open Science Collaboration, 2015). These projects were driven by theoretical and conceptual concerns about a high fraction of "false positives" in scientific publications (Ioannidis, 2005) and a high prevalence of "questionable research practices" (Simmons, Nelson, and Simonsohn, 2011). Concerns about the credibility of research findings are not unique to the behavioral and social sciences; within Computer Science, Artificial Intelligence (AI) and Machine Learning (ML) are areas of particular concern (Lucic et al., 2018; Freire, Bonnet, and Shasha, 2012; Gundersen and Kjensmo, 2018; Henderson et al., 2018). Given the pioneering role of the behavioral and social sciences in promoting novel methodologies to improve the credibility of research, a promising approach is to analyze the lessons learned in that field and to adapt those strategies for Computer Science, AI, and ML. In this paper, we review approaches used in the behavioral and social sciences and in the DARPA SCORE project. We particularly focus on the role of human forecasting of replication outcomes, and how forecasting can leverage the information gained from relatively labor- and resource-intensive replications. We discuss opportunities and challenges of using these approaches to monitor and improve the credibility of research areas in Computer Science, AI, and ML.
Y. Chen, H. Xu, and S. Zheng, “Selling Information Through Consulting,” Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), Salt Lake City, UT, 2020. Publisher's Version
We consider a monopoly information holder selling information to a budget-constrained decision maker, who may benefit from the seller's information. The decision maker has a utility function that depends on his action and an uncertain state of the world. The seller and the buyer each observe a private signal regarding the state of the world, which may be correlated with each other. The seller's goal is to sell her private information to the buyer and extract maximum possible revenue, subject to the buyer's budget constraints. We consider three different settings with increasing generality, i.e., the seller's signal and the buyer's signal can be independent, correlated, or follow a general distribution accessed through a black-box sampling oracle. For each setting, we design information selling mechanisms which are both optimal and simple in the sense that they can be naturally interpreted, have succinct representations, and can be efficiently computed. Notably, though the optimal mechanism exhibits slightly increasing complexity as the setting becomes more general, all our mechanisms share the same format of acting as a consultant who recommends the best action to the buyer but uses different and carefully designed payment rules for different settings. Each of our optimal mechanisms can be easily computed by solving a single polynomial-size linear program. This significantly simplifies the exponential-size LPs solved by the Ellipsoid method in previous work, which computed the optimal mechanisms in the same setting but without a budget limit. Such simplification is enabled by our new characterizations of the optimal mechanism in the (more realistic) budget-constrained setting.
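At a high level, each consulting mechanism can be read off from a linear program of the schematic form

$$\max_{\pi,\, t}\ \mathbb{E}_{\omega}\Big[\sum_a \pi(a \mid \omega)\, t(a)\Big]$$

subject to obedience constraints (following the recommended action $a$ is optimal for the buyer), individual rationality (the buyer's expected gain is at least the expected payment), and the budget constraint $t(a) \le B$, where $\pi$ is the recommendation policy and $t$ the payment rule. This is our abstraction, with notation and constraint grouping that are ours rather than the paper's; the point is that all constraints are linear in $(\pi, t)$, which is what makes a polynomial-size LP possible.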
Y. Liu, J. Wang, and Y. Chen, “Surrogate Scoring Rules,” Proceedings of the 21st ACM Conference on Economics and Computation (EC), pp. 853–871, 2020. Publisher's Version
Strictly proper scoring rules (SPSR) are incentive compatible for eliciting information about random variables from strategic agents when the principal can reward agents after the realization of the random variables. They also quantify the quality of elicited information, with more accurate predictions receiving higher scores in expectation. In this paper, we extend such scoring rules to settings where a principal elicits private probabilistic beliefs but only has access to agents' reports. We name our solution Surrogate Scoring Rules (SSR). SSR build on a bias correction step and an error rate estimation procedure for a reference answer defined using agents' reports. We show that, with a single bit of information about the prior distribution of the random variables, SSR in a multi-task setting recover SPSR in expectation, as if having access to the ground truth. Therefore, a salient feature of SSR is that they quantify the quality of information despite the lack of ground truth, just as SPSR do for the setting with ground truth. As a by-product, SSR induce dominant truthfulness in reporting. Our method is verified both theoretically and empirically using data collected from real human forecasters.
M. Gordon, et al., “Are Replication Rates the Same across Academic Fields? Community Forecasts from the DARPA SCORE Program,” Royal Society Open Science (RSOS), vol. 7, no. 7, 2020. Publisher's Version
The Defense Advanced Research Projects Agency (DARPA) programme ‘Systematizing Confidence in Open Research and Evidence’ (SCORE) aims to generate confidence scores for a large number of research claims from empirical studies in the social and behavioural sciences. The confidence scores will provide a quantitative assessment of how likely a claim will hold up in an independent replication. To create the scores, we follow earlier approaches and use prediction markets and surveys to forecast replication outcomes. Based on an initial set of forecasts for the overall replication rate in SCORE and its dependence on the academic discipline and the time of publication, we show that participants expect replication rates to increase over time. Moreover, they expect replication rates to differ between fields, with the highest replication rate in economics (average survey response 58%), and the lowest in psychology and in education (average survey response of 42% for both fields). These results reveal insights into the academic community's views of the replication crisis, including for research fields for which no large-scale replication studies have been undertaken yet.
F. Berlinger and L. Xu, “A High-Performance Graph Model for Near-Optimal Payments for Ecosystem Services,” The 4th Workshop on Mechanism Design for Social Good (MD4SG), 2020.
Y. Chen, Y. Liu, and C. Podimata, “Learning Strategy-Aware Linear Classifiers,” Proceedings of the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS), vol. 33, pp. 15265–15276, 2020. Long version: arXiv 1911.04004. Publisher's Version
We address the question of repeatedly learning linear classifiers against agents who are strategically trying to game the deployed classifiers, and we use the Stackelberg regret to measure the performance of our algorithms. First, we show that Stackelberg and external regret for the problem of strategic classification are strongly incompatible: there exist worst-case scenarios where any sequence of actions providing sublinear external regret might result in linear Stackelberg regret and vice versa. Second, we present a strategy-aware algorithm for minimizing the Stackelberg regret for which we prove nearly matching upper and lower regret bounds. Finally, we provide simulations to complement our theoretical analysis. Our results advance the growing literature of learning from revealed preferences, which has so far focused on "smoother" assumptions from the perspective of the learner and the agents, respectively.
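For readers comparing the two benchmarks, the contrast can be written as follows (our notation, with $\mathrm{BR}_t(f)$ the strategic response of the round-$t$ agent to classifier $f$ and $u$ the learner's utility):

$$R_{\mathrm{Stack}}(T) = \max_{f \in \mathcal{F}} \sum_{t=1}^{T} u\big(f, \mathrm{BR}_t(f)\big) - \sum_{t=1}^{T} u\big(f_t, \mathrm{BR}_t(f_t)\big),$$

whereas external regret evaluates the fixed comparator $f$ against the responses $\mathrm{BR}_t(f_t)$ induced by the actually deployed classifiers; the incompatibility result says that no action sequence can guarantee both quantities are sublinear in the worst case.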
C. R. Ebersole, et al., “Many Labs 5: Testing pre-data collection peer review as an intervention to increase replicability,” Advances in Methods and Practices in Psychological Science, 2020. (see complete author list)
Replications in psychological science sometimes fail to reproduce prior findings. If replications use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replications from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) in which the original authors had expressed concerns about the replication designs before data collection and only one of which was “statistically significant” (p < .05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate (Gilbert et al., 2016). We revised the replication protocols and received formal peer review prior to conducting new replications. We administered the RP:P and Revised protocols in multiple laboratories (Median number of laboratories per original study = 6.5; Range 3 to 9; Median total sample = 1279.5; Range 276 to 3512) for high-powered tests of each original finding with both protocols. Overall, Revised protocols produced similar effect sizes as RP:P protocols following the preregistered analysis plan (Δr = .002 or .014, depending on analytic approach). The median effect size for Revised protocols (r = .05) was similar to RP:P protocols (r = .04) and the original RP:P replications (r = .11), and smaller than the original studies (r = .37). The cumulative evidence of original study and three replication attempts suggests that effect sizes for all 10 (median r = .07; range .00 to .15) are 78% smaller on average than original findings (median r = .37; range .19 to .50), with very precisely estimated effects.
