Discussion about this post

Frank:

This criticism misses the mark and seems to come from a place of confirmation bias rather than honest interrogation of the data. GPTurk isn't perfect in its methodology, but the findings clearly support the preprint's stated conclusions.

Your interpretation reflects a concerning unwillingness to generalize findings:

"the authors' findings suggest that some members of the general population [they only ever claimed it was some] of one crowdsourcing platform (MTurk) [why wouldn't this generalize to other crowd-work platforms?] might be inclined to use ChatGPT [provably do use ChatGPT] for relatively complex text summarization tasks [for their assigned tasks that they elected to perform] when offered average (below average?) pay [when offered market rate pay], in the absence of clear instructions disallowing the use of language models [when using previously successful study protocols]"

I summarize your points (without the use of ChatGPT) below and respond:

1. Small sample

The small n of GPTurk is a drawback, to be sure, but not for the reason you state: a higher n would simply be desirable for a more precise estimate of the proportion of workers.

Your issue with the study seems to be that their sample of 44 covers only a tiny slice of Turkers. This ignores the whole idea of sampling: to detect a population-level effect, you take a small sample, estimate the proportion exhibiting that effect, and generalize it to the population (a quick sketch of the arithmetic follows below). Studies of heart attack interventions that capture 0.04% of the global heart attack population are taken as valid.
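To make the sampling point concrete, here is a minimal sketch of the arithmetic: a 95% Wilson score interval for a proportion estimated from the study's n = 44. The count of 15 flagged summaries below is invented purely for illustration, not taken from the paper.

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative only: suppose 15 of the 44 collected summaries were
# flagged as LLM-generated (15 is a made-up count, not the study's figure).
lo, hi = wilson_ci(15, 44)
print(f"point estimate: {15/44:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
# -> point estimate ~0.34, CI roughly (0.22, 0.49): wide, but clearly
#    bounded away from zero, which is the sense in which even a small
#    sample can detect a population-level effect.
```

The interval is wide, which is exactly why a higher n would sharpen the estimate; but "imprecise" is not the same as "uninformative".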

2. Clear instructions/study design

The instructions given reflect typical-use conditions. If anything, the conclusion you could draw from this paper is: "Oh, it seems Turkers frequently use ChatGPT to perform their assigned tasks! It would be wise to add an explicit 'don't use ChatGPT' instruction to future prompts to discourage its use."

Instead, you bizarrely jump to defend Turkers: you claim that only a tiny fraction use ChatGPT, that we don't even really know whether they do, and that if they do, it's because they aren't being paid enough, or because the task was too complex, or because they weren't told they couldn't. It is clear your a priori stance is that Turkers are being maligned by this paper, and that you are emotionally invested in their defense rather than in an objective discussion of the research.

3. Fair pay

Again, the conditions created in GPTurk are the typical-use conditions employed by social scientists. If anything, the point you're making by saying the authors should have paid more is that, to get the same quality of MTurk data as last year, researchers now have to pay more for it. That is a massive finding with huge implications for typical-use MTurk!

Furthermore, you say workers are "underpaid" based on what you feel is fair for an hour of labour. What special economic insight are you blessed with that lets you determine fair prices? In this world, we use markets to determine a fair price. Turkers choose tasks to perform based on whether they think the pay is worth it. The very task and price point used in the study are drawn from past MTurk studies. Whether you personally would elect to be a Turker at market rates is irrelevant; this study did not "underpay" anyone.

In general, the fundamental takeaway from this preprint is that LLMs are being used by Turkers in typical-use scenarios. This degrades the quality and increases the cost (in both labour and money) of collecting this type of data. Platitudes like "of course MTurk isn't reliable!" and "the onus of good data collection is on the researchers" are irrelevant to the question of whether LLM usage is now prevalent on MTurk (it is) and whether that negatively impacts research (it does).
