2 Comments
Frank:

This criticism misses the mark and seems to come from a place of confirmation bias rather than honest interrogation of the data. GPTurk isn't perfect in its methodology, but the findings clearly support the preprint's stated conclusions.

Your interpretation demonstrates a concerning lack of ability to generalize findings:

"the authors' findings suggest that some members of the general population [they only ever claimed it was some] of one crowdsourcing platform (MTurk) [why wouldn't this generalize to other crowd-work platforms?] might be inclined to use ChatGPT [provably do use ChatGPT] for relatively complex text summarization tasks [for their assigned tasks that they elected to perform] when offered average (below average?) pay [when offered market rate pay], in the absence of clear instructions disallowing the use of language models [when using previously successful study protocols]"

I summarize your points (without the use of ChatGPT) below and respond:

1. Small sample

The small n of GPTurk is a drawback for sure, but not for the reason you state. Higher n would be desirable for a more precise estimate of the proportion of workers using ChatGPT.

Your issue with the study seems to be that their sample of 44 includes only a tiny slice of Turkers. This ignores the whole idea of sampling. To detect a population-level effect, you can take a small sample, estimate the proportion exhibiting that effect, and generalize it to the population. Studies investigating heart-attack interventions that capture only 0.04% of the global heart-attack population are still taken as valid.
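The proportion-estimation logic above can be sketched with a standard Wilson score interval. The counts below are hypothetical, chosen only to match the study's sample size of 44; they are not the preprint's actual figures:

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical example: suppose 15 of 44 submissions were flagged as LLM-assisted.
lo, hi = wilson_interval(15, 44)
print(f"{lo:.0%} to {hi:.0%}")  # roughly 22% to 49%
```

The interval is wide, which is exactly why a higher n would sharpen the estimate, but even the lower bound of a sample this size can establish that the effect exists at a non-trivial rate.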

2. Clear instructions/study design

The instructions given were typical-use conditions. If anything, the conclusion you could draw from this paper is "Oh, it seems like Turkers frequently use ChatGPT to perform their assigned tasks! It would be wise to include 'don't use ChatGPT' in future prompts to discourage its use."

Instead, you bizarrely jump to defend Turkers and claim that only a tiny number use LLMs, that we don't even really know if they do, and that if they do it's because they aren't being paid enough, or because the task was too complex, or because they weren't told they couldn't. It is clear your a priori stance is that Turkers are being maligned by this paper and that you are emotionally invested in their defense rather than in an objective discussion of the research.

3. Fair pay

Again, the conditions created in GPTurk are the typical-use conditions by social scientists. If anything, the point you're making by saying they should have paid more is that in order to get the same quality of MTurk data as last year, researchers now have to pay more for it. That is a massive finding with huge implications for typical-use MTurk!!

Furthermore, you say workers are "underpaid" based on what you feel is fair for an hour of labour. What special economic insights are you blessed with that let you determine fair prices? In this world, we use markets to determine a fair price. Turkers choose tasks to perform based on whether they think it's worth it or not. The very task and price point used in the study are based on past MTurk studies. Whether you personally would elect to be a Turker at market rates is irrelevant; this study did not "underpay" anyone.

In general, the fundamental takeaway from this pre-print is that LLMs are being used by Turkers in typical-use scenarios. This degrades the quality and increases the cost (both in labour and money) of collecting this type of data. Platitudes like "of course MTurk isn't reliable!" and "the onus of good data collection is on the researchers" are irrelevant to the question of whether LLM usage is now prevalent on MTurk (it is) and whether that negatively impacts research (it does).

James McCammon:

Thanks for your comment Frank! Appreciate you taking the time to read and critique my article. I've been in touch with the paper's authors and hope to share their response if they approve.

As far as your comments:

Sampling is common, but it requires making some assumptions to generalize from a small sample to an entire population. I don't think the authors were intending to make statements about the entire population though. They simply posted a task, which some Turkers (likely) completed using ChatGPT, and so the authors concluded that researchers should be cautious when using MTurk. (Typically, researchers seek human responses rather than LLM responses for their research questions or surveys).
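One of those assumptions can be made concrete with the standard sample-size formula for estimating a proportion. This is a sketch; the margins and the worst-case p = 0.5 are illustrative choices, not values from the preprint:

```python
import math

def sample_size_for_margin(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Sample size needed for a 95% CI on a proportion with half-width `margin`.

    Uses the worst case p = 0.5 by default, which maximizes p * (1 - p).
    """
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# How many workers would you need for a +/-5% vs. a +/-15% margin of error?
print(sample_size_for_margin(0.05))  # 385
print(sample_size_for_margin(0.15))  # 43
```

Notably, 43 is about the study's n of 44: a sample that size pins the proportion down only to within roughly plus or minus 15 percentage points, which is enough to flag a problem but not to state its prevalence precisely.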

Clear instructions:

The need for clear instructions depends on the research question. If the question is whether Turkers use ChatGPT when not explicitly forbidden, this paper contributes to answering that question. However, if a researcher wants human responses from MTurk and wants to avoid ChatGPT usage, they can include an instruction to not use ChatGPT in their guidelines. I think it's still an open question as to whether Turkers would use ChatGPT despite being instructed not to.

Pay:

There might be some confusion between my comments and those of others cited in the paper. The pay benchmarks I used come from prior research, and I note that the pay in the study aligns with typical hourly rates. However, there is a perception that more cognitively demanding tasks warrant higher pay. As I say in that section's conclusion, I remain genuinely unsure what to make of the pay issue.
