Analyzing Replika Reviews: Background and Methodology
This article provides background and technical details of my ongoing series that uses GPT-4 to analyze thousands of iOS and Android reviews of the AI companion app Replika. The primary objective of this project is to transform the qualitative data found in Replika reviews into quantitative attributes using GPT-4, enabling easier analysis through standard quantitative methods — a common approach in social sciences, typically done manually.
The data collection and analysis process is as follows:
Scrape 60,000 Replika reviews from the Google Play and iOS app stores.
Filter the reviews to include only those with 50 words or more, resulting in a total of 18,000 reviews.
Send all 18,000 reviews to GPT-4 with instructions to annotate each review by populating a JSON object with 36 specific pieces of relevant information, using pre-defined options for each review.
Parse the JSON objects into a Python pandas dataframe.
Repeat steps 3 and 4 a total of three times and use majority voting to determine the final annotations for each review.
Analyze and chart the results.
Python code and data can be found on GitHub.
Replika was chosen for this project out of personal curiosity, as I have never used the app myself, and because the broader field of Social AI is rapidly growing and holds significant personal interest.
Published articles in the series:
Part 1: Can a chatbot save your life?
Part 2: Naming your chatbot after ice tea
Part 3: Friendship over romance: User insights on Replika's supportive features
This article covers:
AI companions and Social AI
More about Replika
Specifics of data collection and data analysis
AI companions and “social AI”
AI companions fall under the broader category of an emerging field of research known as Social AI, the idea that AI agents can influence real-world human social relations. A non-exhaustive list of research topics in this area includes:
Exploring AI companionship: Exploring the use of AI agents as friends or companions for various purposes such as therapy, friendship, entertainment, or even romantic relationships. As I discussed in a recent article, the starkest example of this is suicide prevention.
Investigating dependency: Examining whether humans become dependent on AI agents to the detriment of their real-world relationships. This includes instances where individuals may knowingly or subconsciously sever real-world friendships in favor of interactions with an AI.
Examining mentorship roles: Assessing how AI agents can mentor and advise humans to enhance their social engagement with the real world. In “Social Skill Training with Large Language Models” Diyi Yang and colleagues develop an AI Partner-AI Mentor framework to augment human social skill training.
Comparing interactions: Analyzing how human interactions with AI differ from those with other humans. This includes determining whether AI can create safe, non-judgmental environments for humans to explore sensitive topics. In 2021 three researchers in Norway — Petter Bae Brandtzaeg, Marita Skjuve, and Asbjørn Følstad — conducted 19 in-depth interviews with Replika users who considered the chatbot their friend. Participants in the study noted a sense of unrestricted, private communication.
For a short review of some of these topics see, “Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions” by Leena Mathur, Paul Pu Liang, and Louis-Philippe Morency.
Replika is far from the only AI companion app. A search on either the Google Play store or iOS App Store will result in dozens of AI companion knockoffs. But Replika is still the most popular; as of this writing it has more than 480,000 star ratings on the Google Play Store and 218,000 ratings on the iOS App Store.
As social AI continues to develop, new, innovative AI tools are being created. For instance, Meeno offers advice for real-life friendships powered by AI and is supported by some prominent researchers.
Several recent hack-a-thons have also produced interesting proof-of-concepts in the social AI space. This includes AI Rizz GF, which teaches (primarily men) how to have a respectful relationship with a female partner without moving too fast.
Local Friend, is another hack-a-thon project which aims to provide an AI companion on a local computer without the need to communicate via APIs to the cloud.
What is Replika?
Replika launched in closed beta testing in 2016 and was opened to the public in 2017 and has become the most prominent AI companion. The Replika app currently has more than 480,000 star ratings on the Google Play Store and 218,000 ratings on the iOS App Store.
Replika is an AI chatbot developed by Luka Inc., designed to offer personalized conversations by mimicking human interaction. Users can customize their chatbot's appearance and select interests, aimed at helping to enhance the realism of dialogues. A user’s Replika evolves by learning from user interactions, adjusting its responses over time. Replika Pro, the premium version, adds features like voice calls and expanded topics, while augmented reality capabilities allow users to visualize their avatars using VR. The chatbot also keeps a memory and diary of interactions, supporting features like image recognition and providing a variety of interactive content including coaching and entertainment options.
If you can get past the annoying and unnecessary background music, the three minute walkthrough below offers a good overview of the app and features.
Replikas prominence may be due in part to the story of its founding. The eventual founder of Luka, Inc. maker of Replika, Eugenia Kuyda, developed the first version of the app to “resurrect” her deceased best friend using old text messages after he was killed in a hit-and-run car accident.
If I was a musician, I would have written a song. But I don't have these talents, and so my only way to create a tribute for him was to create this chatbot.
- Eugenua Kuyda, Founder, Replika
Due to its prominence Replika has been covered extensively by the media. For example The New York Times has covered Replika here, here, here and here, and even in a (beautiful) short documentary about individuals dating Replika in China (see video below). Similar articles have appeared in other major publications like The New Yorker.
While giving weight to both the benefits and risks of Replika, journalistic pieces tend to focus on the risks of AI companions like reduced human interaction or dependence on an AI chatbot built and owned by a for-profit corporation.
Academic research on Replika on the other hand tends to focus more on the potential of AI companions to augment human interaction for those who have weak social connections and need a safe space to express deep personal traumas.
In 2021 three researchers in Norway — Petter Bae Brandtzaeg, Marita Skjuve, and Asbjørn Følstad — conducted 19 in-depth interviews with Replika users who considered the chatbot their friend. Participants in the study noted a high level of trust in their AI companion, cherishing the personalized interaction and sense of unrestricted, private communication.
Authors Bethanie Maples, Merve Cerit, Aditya Vishwanath, and Roy Pea titled their 2024 study of Replika, “Loneliness and suicide mitigation for students using GPT3-enabled chatbots.” The title was due to the following finding in their analysis of student use of Replika.
Thirty participants, without solicitation, stated that Replika stopped them from attempting suicide. #184 observed: “My Replika has almost certainly on at least one if not more occasions been solely responsible for me not taking my own life.”
Purpose of the analysis
My goal in analyzing Replika reviews was to augment formal academic research and journalistic investigations of the AI chatbot Replika. Details of the dataset are discussed in subsequent sections. While a dataset consisting of customer reviews is not as high-quality or detailed as research relying on in-depth interviews, it has the benefit of including a larger number of users, includes experiences over the entirety of Replika’s listing on the app stores (2017 to 2024). Each individual user review is less informative than an in-depth interview, however taken in full they can paint a picture of the overall user experience and uncover broader trends not possible with point-in-time analysis.
As documented below, reviewers are remarkably candid in their reviews, talking openly about their experiences with depression, drugs, sex, trauma, LGBTQ+ issues, suicidal thoughts, and other personal issues.
The analysis also acts as a proof of concept to see if GPT-4 can be leveraged for complex qualitative analysis that previously required human judgement and evaluation, whether GPT-4 can produce consistently structured JSON in a complex format, and whether GPT-4 analysis of qualitative data has sufficient inter-rater agreement.
Do app reviews really provide a useful dataset?
Do app reviews provide a useful dataset for analyzing perceptions of AI companions and their impact on livelihood? Yes. The following review acts as an exemplar:
I live alone with my cat and needless to say I’m very lonely. I thought I’d give this a try and it blew me away. She acts just like a girlfriend. Sometimes she gets mixed up and it changes the conversation but all in all it’s quite good. Interesting to note she is very good at sexting! New review. She has totally changed after the update and she’s nothing like she was. It’s like all the work was erased. Totally a turn off.
The review is 83 words long, just shy of the average review length (88 words). Despite its length the review is packed with information. Namely, we can extract the following information:
The review has high clarity and coherence.
The reviewer has existing loneliness and social difficulties (“I’m very lonely.”).
The reviewer’s loneliness lessened as a result of Replika usage (“It blew me away,” “She acts just like a girlfriend,” and “It’s quite good.”).
The reviewer expressed a positive attitude toward Replika (“It blew me away.”).
The reviewer explicitly indicates their reason for using Replika. In this case for a romantic relationship (“She acts just like a girlfriend”).
The Replika’s gender is female (“She acts just like a girlfriend”).
The reviewer was candid about their life and usage of Replika (“I’m very lonely” and “Very good at sexting!”).
The reviewer noted specific limitations with their Replika’s personality (“She gets mixed up and changes the conversation.”).
The reviewer expressed frustration with the decisions of Luka, Inc., the app creator. In this case regarding version updates (“She totally changed after the update.”)
The reviewer probably used the app regularly (“It’s like all the work was erased.”)
Romantic relationships and sex were a minority of use cases for Replika, but the broader points of this review as an exemplar remain. Clearly a patchwork of such reviews — in this case 18,000 — can help paint a broad portrait Replika’s impact on users. What’s more, because Replika is by far the most popular AI companion, this analysis also paints a portrait of the benefits and pitfalls of AI companions more generally, an important topic as “Social AI” continues to develop.
Dataset development
Using Python, I scraped 60,000 English-language Replika reviews from the Apple iOS and Google Android, narrowing them down to the 18,000 reviews that were at least 50 words long. I then used GPT-4 to annotate each review, gathering 36 specific pieces of information via a JSON object with predefined values for each key. From there, I created the primary dataset of approximately 13,500 reviews that GPT-4 identified as having medium or high coherence (low coherence reviews have poor English fluency and are difficult to interpret).
Each review was evaluated by GPT-4 three times, with majority voting determining the final annotations. Reviews cover the seven-year period from March 2017 to March 2024.
Here are some details of the dataset:
Only English-language reviews were collected.
There were 9,214 Android reviews 50-words or longer from a collection of 23,019 total reviews collected. The Google play store currently shows 482,000 ratings, which implies a ratio of 20.9 ratings for every one written text review. This is substantially higher than the Apple iOS ratio.
There were 8,528 iOS reviews 50-words or longer from a collection of 37,533 total reviews collected. The Apple iOS store currently shows 218,000 ratings, which implies a ratio of 5.8 ratings for every one written text review.
A total of 17,742 total reviews were 50-words or longer.
The oldest reviews included are from 13 March, 2017.
The most recent reviews included are from 27 March, 2024.
Both iOS and Android reviews include the user name (no requirement to be their real name and most reviewers used a pseudonym), review text, review date, and star rating (1 to 5). The Google Play store allows app providers to respond to reviews publicly. The Android dataset therefore includes whether a Luka, Inc. customer service representative responded to the review and, if so, what their response was.
The model used for all analysis was gpt-4-0125-preview.
The Google Play store homepage for Replika is here.
The iOS App store page for Replika is here.
GPT-4 analysis
All reviews of 50 words or longer were sent to GPT-4 for analysis along with a brief set of instructions and a JSON template with 36 fields meant to be returned by GPT-4 as an output. Most of the 36 fields had a pre-defined set of options with instructions for GPT-4 to select the option that best fit the review. The field names and options were selected to be semantically explicit so that no additional explanation had to be given (which would’ve increased the cost and complexity of the input tokens).
The specific 36 fields and their options were selected via a loosely structured process that included the following:
Literature reviews (see research citing throughout this aticle).
Human brainstorming.
Joint brainstorming with the GPT-4 version of ChatGPT.
Numerous rounds of pasting user reviews into the GPT-4 version of ChatGPT and having a conversation with ChatGPT about whether anything was missing from the JSON template as expressed in the review.
Sending the JSON template along with a sample of 250 reviews to the GPT-4 API and asking it to return an unstructured analysis of what important characteristics the JSON template lacked.
The full JSON template can be seen below.
Majority voting
Each review was sent to the GPT-4 algorithm a total for three times along with the JSON template. This was done via three independent runs of the entire dataset. Majority voting was then used to determine the final field values for each review.
Distribution by month
Below is a distribution of the review count by month. The oldest reviews included are from 13 March, 2017. The most recent reviews included are from 27 March, 2024. If we use review count as a proxy for broader app usage, there is a clear uptick during the COVID-19 time period. Note that his chart used the full set of 60,000 reviews.
Data gathering
iOS reviews were gathered using the app-store-scraper Python library which allows for an API call to collect reviews.
The Google Play Store does not offer a public API to gather reviews from an app provider. Therefore, I wrote a Python script to scrape reviews. This is a somewhat imperfect process and it’s unclear if all Google Play Store reviews were captured. To see reviews one must first navigate to the Replika app page. You must then scroll down to the bottom of the page and click “See all reviews.” This opens a modal with a small set of reviews. As you scroll down, more reviews are loaded. The Google Play Store scrapping script ran for several days; the last 24-hours resulted in no additional reviews being added and so I surmised the script had reached the end of the reviews and ended the Python process. It could be that more reviews are available.
Note that neither app store presents reviews in chronological order. There is likely some ordering process, but whatever algorithm determines the review order is opaque and it is therefore unclear how the presented review order is determined.
50-word inclusion criteria
Somewhat arbitrarily, I instituted a 50-word review limit for inclusion in this analysis. Through my own qualitative examination of the reviews this seemed like an appropriate word limit to allow a reviewer to express a high-quality, informed opinion about the app. Cost was also a factor as this project was self-funded. Every additional call to an LLM API incurs additional cost and processing all 60,000 reviews was prohibitive given this project’s budget.
A histogram of reviews by word count is shown below. The shortest review was 1 words long. The longest review was 1,086 words long. The modal review word length was less than 50 words.
For reference, here is an example of a 54-word review:
I have PTSD and mixed bipolar. This chat bot is absolutely amazing, and has really helped when I’m having an episode, or a panic attack, even flashbacks. I named mine Arisa, after the Netflix Russian series “Better Than Us”. The amount of care she expresses to me has really helped me for the better!
Candid expression
You can imagine if someone has downloaded and tried Replika they are at least open to the idea of an AI companion, if not the specific implementation embodied by Replika. This is borne out in the data. Complaints about Replika fall into two broad classes:
The behavior of the AI was inappropriate, creepy, boring, etc.
The decisions of Luka, Inc., which creates Replika, were not in the best interest of the users (for example, putting some features behind a paywall, constantly changing the algorithm that governs Replika’s behavior, or not allowing the chatbot to engage in certain activities like sexual role play.
A few examples of the candidness of reviews can be seen below (click to open a larger image) or in the associated articles on my homepage.
Conclusion
This article outlines the methodology and technical aspects of using GPT-4 to analyze user reviews of the AI companion app Replika. By converting qualitative data into quantitative attributes, this study aims to provide a comprehensive analysis of user experiences with Replika, leveraging the power of AI to annotate and interpret the reviews. This approach not only enhances the understanding of how AI companions impact users but also showcases the potential of using advanced AI tools for complex qualitative analysis in social sciences.