A simple Claude Opus v. GPT-4 structured JSON benchmark
An analysis of 10,000 API requests for a structured JSON object
It can often be useful to make API requests to an AI language model with the expectation that it will return a structured JSON object in a pre-specified format. This is a typical design pattern when, for instance, you want to extract information from unstructured text for later processing.
All major large language model (LLM) providers promote this as a supported interaction pattern. For instance, GPT-4 has a JSON mode, which claims to guarantee that valid JSON is returned. Claude Opus does not have a JSON mode, but Anthropic nonetheless markets JSON as an optional response type.
But how well do large language models adhere to the request for structured JSON, especially when the JSON is large and complex? In this article I’ll share the results of a simple benchmark test that assessed whether two large language models can return valid and correctly structured JSON.
In this test I made 1,000 API calls with each of 5 different instruction sets (for a total of 5,000 API calls) to each of GPT-4 and Claude Opus. Each of the 5 instruction sets requested a complex JSON object with 36 different fields across 3 levels of nesting (see screenshot below). For each API call, I provided a JSON template with pre-specified values for each field and a customer review to which the JSON template was meant to be applied.
The project's GitHub repository is here.
Ways JSON can be malformed
There are several ways an LLM might return a malformed JSON object:
Return a JSON object with missing fields or fields nested in a different way than was requested.
Return a JSON with all fields present, but in an invalid JSON format. For example, missing (or containing unnecessary) commas, curly braces, square brackets, or quotation marks.
Return a JSON with fields that are out of bounds; for example returning the string “other” as a value when the JSON template specifies the valid values for that key are “True” or “False.”
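Only the second of these failure modes is caught by a bare call to Python's json.loads; the other two parse cleanly and have to be checked against the template separately. A minimal sketch of the distinction, using hypothetical field names rather than the actual benchmark fields, is below:

import json

examples = {
    "missing field": '{"sentiment": "Positive"}',                    # parses, but other requested fields are absent
    "invalid syntax": '{"sentiment": "Positive" "rating": "High"}',  # missing comma: json.loads raises an error
    "out-of-bounds value": '{"mentions_pricing": "Other"}',          # parses, but "Other" is not an allowed option
}

for name, raw in examples.items():
    try:
        json.loads(raw)
        print(f"{name}: parses cleanly; needs a separate field/value check")
    except json.JSONDecodeError as err:
        print(f"{name}: rejected by json.loads ({err})")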
The JSON response task and instruction sets
The task for the LLM was to extract specific information from a set of 1,000 iOS user reviews of the AI companion app Replika. These reviews ranged in length from 50 to 900 words and were randomly selected from a larger pool of approximately 8,000 reviews. This task mirrors a real-life project where I used GPT-4 to analyze 18,000 iOS and Android Replika app reviews, each with a minimum length of 50 words. For more details, see the related articles on the main page.
Each review was sent individually to both models five times (once for each instruction set), without batching multiple reviews in a single API call.
Each of the five instruction sets included:
A set of instructions and reminders for the language model.
A template of the desired JSON format, including pre-specified options (i.e., valid values for each key).
The text of an iOS review of the chatbot Replika.
The language model’s task was to respond with a fully populated JSON according to the tone and content of the review. A total of 1,000 reviews were processed with each of the five instruction sets, resulting in 5,000 API calls for both Claude Opus and GPT-4.
The test specifically used claude-3-opus-20240229 and gpt-4-0125-preview. Python scripts were run on an AWS EC2 instance.
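The exact scripts are in the GitHub repository; as a rough sketch of the request pattern (illustrative only, with no JSON mode enabled and an arbitrary token limit), each prompt could be sent to the two models through their official Python SDKs like this:

from openai import OpenAI
import anthropic

openai_client = OpenAI()                # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def call_gpt4(prompt: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def call_claude_opus(prompt: str) -> str:
    message = claude_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=2048,  # illustrative limit
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text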
I chose to use a simplified JSON template (see screenshot of full template below) rather than a JSON schema that includes data types, such as this:
"frequency_of_product_usage": {
"type": "string",
"enum": ["Daily", "Weekly", "Monthly", "Sporadically", "Rarely", "Not Mentioned"]
}
The reasoning is threefold. First, I was curious whether the models would perform well with this template-style format. Second, a JSON template of the type I used is a condensed version of a JSON schema, and I thought a shorter JSON might improve model performance given the complexity of the JSON and the length of the prompt, which included both the model instructions and a user review that could be hundreds of words long. Third, during unstructured, casual usage of LLMs I had tried both approaches and had not noticed a difference in performance. I may systematically compare both types of JSON structures in a future analysis.
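For illustration only, a template-style entry corresponding to the schema fragment above might collapse the type and enum information into a single line listing the valid options (this rendering is hypothetical; the actual template is shown in the screenshot and the GitHub config):

"frequency_of_product_usage": "Daily | Weekly | Monthly | Sporadically | Rarely | Not Mentioned"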
API request content
Details of the five instruction sets are outlined below. The full instruction sets can be found on the project’s Github page in this config file.
All instruction sets included a base set of instructions and reminders.
Each instruction set included a single Replika user review. The review appeared after the JSON template in Instruction Set 2; in the other instruction sets, it appeared before the JSON template. Some research suggests that language models adhere better to instructions placed at the end of the prompt, so the JSON template was positioned there to give the models the highest probability of success.
The JSON template was included with each Instruction set (see screenshot above).
For Instruction Sets 1 and 2, the basic instructions and reminders appeared at the top of the API content, before the Replika user review and JSON template. For Instruction Sets 3, 4, and 5, the basic instructions and reminders also appeared at the bottom, after the JSON template.
Instruction Set 4 requested a malformed JSON template, with two missing commas and two erroneous commas. This was meant to test if the language models are error-resilient and was inspired by a mistake I actually made during a previous analysis of Replika reviews using a similar set of instructions.
Instruction Set 5 repeated the malformed JSON template condition present in Instruction Set 4, but introduced a reminder to return correctly formatted JSON.
The specific wording of these instruction sets was developed with the help of GPT-4 to ensure clarity.
Instruction Set 1 appears below as an example (again, see the GitHub config for full instructions):
Instructions:
- Please rate the following review of an AI companion app based on the aspects of mental health support.
- Use the JSON structure provided below to categorize your evaluation.
- Separate the evaluation into two parts: one focusing on the AI interaction, and another on the company's policies and decisions.
- In the mental_health_related_to_ai section only refer to comments about the AI itself, NOT the company decisions (ex. pricing, access, etc.)
- If a specific aspect is not mentioned in the review, select 'Not Mentioned'.
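As a rough sketch of how a single API request might have been assembled from these pieces (the function and variable names are illustrative; the exact wording lives in the GitHub config file):

def build_prompt(instructions: str, review: str, json_template: str,
                 repeat_instructions: bool = False,
                 review_after_template: bool = False) -> str:
    # Instruction Set 2 places the review after the JSON template;
    # all other instruction sets place it before.
    parts = [instructions]
    if review_after_template:
        parts += [json_template, review]
    else:
        parts += [review, json_template]
    # Instruction Sets 3, 4, and 5 repeat the instructions at the bottom.
    if repeat_instructions:
        parts.append(instructions)
    return "\n\n".join(parts)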
Findings
Results of the JSON task are shown below. Error rates are based on parsing the model's API response using the json.loads function in Python (with no additional parsing logic) and recording the number of errors.
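A minimal version of that check (the response collection and error bookkeeping are simplified here) looks something like this:

import json

def count_json_errors(raw_responses: list[str]) -> int:
    """Count API responses that json.loads cannot parse as-is."""
    errors = 0
    for raw in raw_responses:
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            errors += 1
    return errors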
Detailed findings are now presented.
Overall performance
The overall performance was good. Using Instruction Set 3 as a baseline — which included the full set of instructions and reminders but without any JSON errors added to the template — Claude Opus scored 98% while GPT-4 scored 99.6%. In general, GPT-4 performed better than Claude Opus. Claude Opus had particular trouble with Instruction Sets 1 and 2, but these issues are easily remediable as discussed below.
The suitability of these error rates for large-scale production applications depends on the specific use cases. This benchmark was intentionally challenging, featuring a large, multi-nested JSON and fairly complex instructions. The JSON covered diverse topics that had to be analyzed and then fit within the pre-defined JSON field options. Further considerations that might impact production deployments are discussed throughout the remaining performance debrief.
Placing instructions before the user review and JSON template and then repeating them afterward did not seem to substantially improve performance (compare GPT-4's performance on Instruction Sets 1 and 2, which did not repeat the instructions, with its performance on Instruction Set 3, which did).
Placing the user review after the JSON template rather than before it also did not seem to impact performance (compare the performance of Instruction Sets 1 and 2).
Claude Opus preamble
Claude Opus tended to return a preamble to its JSON content (e.g., “Here are the results of the analysis I conducted:”), which required additional JSON parsing logic to strip away this text. This behavior occurred 44% of the time (52% for Instruction Set 1 and 36% for Instruction Set 2).
This behavior explains the high error rates for Instruction Sets 1 and 2. Including an additional instruction to Claude to only return the JSON reduced the occurrence of this behavior from 44% to 2%.
This preamble is a minor annoyance and can be resolved by adding additional logic to look for the first '{' and the last '}' and stripping away any leading or trailing text. It has also been suggested that pre-populating the first part of the assistant's response can improve performance (see below), but this was not tested.
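A minimal version of that stripping logic, along with a sketch of the untested assistant-prefill idea for the Anthropic messages API, might look like this:

def strip_preamble(text: str) -> str:
    """Keep only the text between the first '{' and the last '}'."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in the response")
    return text[start:end + 1]

# Untested alternative: pre-populate the start of the assistant turn so Claude
# continues directly with the JSON body instead of writing a preamble, e.g.:
# messages = [
#     {"role": "user", "content": prompt},
#     {"role": "assistant", "content": "{"},
# ]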
Both systems are error-resilient
Both GPT-4 and Claude Opus are largely error-resilient with respect to JSON structure. When asked to return a JSON object based on a template that intentionally included two missing and two erroneously added commas, both models returned the JSON object with a corrected structure.
Intentionally requesting a malformed JSON object (Instruction Set 4) reduced the rate of valid JSON for GPT-4 from around 99.8% to 98.4%. Adding an instruction to ensure valid JSON was returned even though the requested template was malformed (Instruction Set 5) increased performance back to 99.4%. The respective numbers for Claude Opus were 97.7% for Instruction Set 4 and 96.8% for Instruction Set 5, indicating that for Claude Opus, requesting valid JSON while providing a malformed template did not improve performance.
Claude Opus had an excess of server errors
The rate of server errors for Claude Opus (1.18%) was higher than for GPT-4 (0%), despite my account being the highest non-enterprise tier in both services. These errors included 32 overloaded server errors (error 529), 16 resource not found errors (error 404), and 11 unexpected API endpoint errors (error 500). I personally find the Claude Opus server error rates higher than I would like.
Anthropic defines the associated errors like this:
404 - not_found_error: The requested resource was not found.
500 - api_error: An unexpected error has occurred internal to Anthropic’s systems.
529 - overloaded_error: Anthropic’s API is temporarily overloaded.
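Server errors like these are typically transient and can be handled with a simple retry loop. A sketch using the Anthropic Python SDK's status-code exceptions (retry counts and sleep times are illustrative) is below:

import time
import anthropic

client = anthropic.Anthropic()

def call_claude_with_retries(prompt: str, max_attempts: int = 4) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            message = client.messages.create(
                model="claude-3-opus-20240229",
                max_tokens=2048,  # illustrative limit
                messages=[{"role": "user", "content": prompt}],
            )
            return message.content[0].text
        except anthropic.APIStatusError as err:
            # Retry the 404/500/529 server errors observed in this benchmark.
            if err.status_code in (404, 500, 529) and attempt < max_attempts:
                time.sleep(2 ** attempt)  # simple exponential backoff
                continue
            raise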
What were the most common JSON parsing errors?
The most common JSON parsing errors are shown in the table below. Both models had rare problems with missing commas in the returned JSON object, but Claude Opus had more issues with property names not properly enclosed in quotes, whereas GPT-4 had more problems with missing keys.
Do LLMs adhere to a request for pre-specified response values?
Both language models adhered well to the request to populate fields with one of the pre-specified response options provided in the JSON template. Excluding six free-form fields (e.g., the instructions asked the LLM to record the name of the user's Replika if mentioned), there were 25 single-value fields and 4 multi-value fields (arrays that allowed multiple values).
While the previous analyses were at the review level, measuring the error rate associated with parsing the JSON for a single review, this analysis focuses on the field level. An API response might have included a valid JSON object, but individual fields could still contain invalid values that were not included in the JSON template.
The denominator of the error percentage is therefore approximately 125,000 for the single-value fields and 20,000 for the multi-value fields, using the formula:
Total Fields = (5,000 total API requests − JSON parsing errors) × Number of Fields (25 single-value or 4 multi-value)
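For example, ignoring the small number of parsing errors, the denominators work out roughly as follows:

total_requests = 5_000        # per model, across all five instruction sets
single_value_fields = 25
multi_value_fields = 4

print(total_requests * single_value_fields)   # ~125,000 single-value field values
print(total_requests * multi_value_fields)    # ~20,000 multi-value field values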
There were three kinds of incorrect values observed:
Made-up values: For instance, if inappropriate behavior by the Replika chatbot was mentioned in the review, the JSON format allowed the LLM to populate the frequency of the inappropriate behavior with options such as “Often,” “Sometimes,” “Rarely,” “Never,” or “Not Mentioned.” In one instance, GPT-4 instead populated the field with the value “Now Often.”
Misapplied values: For example, the value “Sexual Support” was a valid response to the field regarding AI support types provided by Replika. However, Claude Opus incorrectly used this value as a response to the field regarding inappropriate conduct by the Replika chatbot.
A weird “Not Mentioned” bug: As outlined in the instructions description above, if a particular JSON field was not discussed in the review, the language model was meant to return “Not Mentioned.” This was the modal response for most fields since, on average, reviews only discussed a small subset of possible topics. For unknown reasons, both LLMs occasionally responded with “N, o, t, , M, e, n, t, i, o, n, e, d,” with commas between every letter. This comma-between-letters behavior did not occur with any other response option, and it was only observed in the multi-value fields. There were 22 instances of this behavior with GPT-4 and 11 with Claude Opus.
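These field-level checks are separate from the json.loads parsing step. A simplified validator along the following lines could flag all three kinds of invalid values; only frequency_of_product_usage and its options are taken from the template fragment shown earlier, and the rest of the allowed-values map is illustrative:

ALLOWED_VALUES = {
    "frequency_of_product_usage": {
        "Daily", "Weekly", "Monthly", "Sporadically", "Rarely", "Not Mentioned",
    },
    # ...one entry per single-value or multi-value field in the template
}

def invalid_field_values(parsed: dict) -> list[str]:
    """Return 'field=value' strings for any values outside the pre-specified options."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = parsed.get(field)
        if isinstance(value, list):   # multi-value (array) fields
            problems += [f"{field}={item!r}" for item in value if item not in allowed]
        elif value not in allowed:
            problems.append(f"{field}={value!r}")
    return problems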