Picking up where we left off in March, Allison and I return armed with a newfound respect for large language models (LLMs). Tremendous progress has been made since our earlier blog post. With the more advanced capabilities of GPT-4 come exciting possibilities, potential hazards, and the accompanying uncertainties of a society struggling to define acceptable uses for LLMs. Recently, I submitted a letter of recommendation to an online portal, an action that, for the first time in my experience, required verifying that the letter was not crafted, nor even edited, by generative AI!
With a healthy degree of AI risk awareness, we are incorporating AI into our work. Since we believe in the capacity of generative AI to accelerate scientific progress, we want to share Four Useful Prompts for Researchers in this blog. Except where specified, we used ChatGPT with GPT-4. We hope these prompts prove beneficial in your work, and we encourage you to share your favorite research prompts in the Comments.
Prompt #1: Use GPT-4 as your “Data Cleanup Crew”
GPT-4’s Code Interpreter is a remarkable tool that can be enabled in GPT-4 settings.
To say it is astounding is not, in our opinion, hyperbolic. It can run code on files you've uploaded, analyze and visualize data, create and sort tables according to mathematical rules, act as a data analyst, and assist with the tedious but essential task of data cleaning.
To keep it fun, we uploaded a survey about Halloween candy that was fielded in 2015, 2016, and 2017 and asked GPT-4 to help us clean it.
The data were as messy as those weird marshmallow candies that did not rate well in any of the surveys. To name a few issues: the 2015 dataset left out a response option (“meh”) that was included in the other years, variables were inconsistent across years, some data were stored as strings, and some of the metadata had dates outside the fielding period. GPT-4 was up to the task, with some non-trivial caveats.

If you have ever analyzed survey data, you may have encountered free-text responses in an “other” field that should have been captured by the pre-set multi-select response options. For example, the survey asks about education, and while “some college” is an option, the respondent selects “other” and writes, “I left college after my junior year”. Recoding such responses is standard data cleaning that takes a lot of time. In our sample analysis of an “other” field, GPT-4 recognized the many free-text variants of “tootsie rolls” (tootsie roll, Tootsie roll, etc.). We asked GPT-4 to find all instances of tootsie roll without case sensitivity and using lemmatization (a cool new word Code Interpreter taught us; it means collapsing different forms of a word into one category).

For such work, it is essential that you instruct GPT-4 on how to handle common misspellings, upper- and lowercase discrepancies, and singular/plural discrepancies. Be specific in your prompts so that you know what you are getting and that it is what you need. And even when you get it, check and check again. Small miscommunications and misunderstandings during our exploration of the 2015 dataset led to some real headaches.
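The kind of normalization we asked Code Interpreter for can be sketched in a few lines of plain Python. This is a deliberately naive stand-in for true lemmatization (a real pipeline would use a lemmatizer such as NLTK's or spaCy's), and the candy strings are invented for illustration:

```python
import re

def normalize_candy(text):
    """Collapse case, extra whitespace, and naive plurals so variants
    like 'Tootsie Rolls' and '  tootsie roll ' fall into one category."""
    cleaned = re.sub(r"\s+", " ", text.strip().lower())
    # Naive singularization: drop a trailing 's' from longer words.
    # A proper lemmatizer handles irregular forms; this sketch does not.
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w
                    for w in cleaned.split())

responses = ["Tootsie roll", "tootsie rolls", "  TOOTSIE ROLLS "]
print({normalize_candy(r) for r in responses})  # one category, not three
```

Spelling out rules like these in your prompt (case, plurals, whitespace) is exactly the specificity that kept GPT-4's matching predictable for us.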
That said, GPT-4 has real potential here, given its natural language processing skills. For fun, we asked it to visualize a free text candy variable, and rather charmingly, it produced a word cloud:
It also (without specific prompting) cleaned the 2015 data of nonsensical values and created a box-and-whisker plot to visualize the age distribution:
We asked for a histogram of the frequency of survey responses by date:
Code Interpreter offered options for handling missing data, and when asked whether it could perform multiple imputation, it offered some Python resources. It also included a slightly patronizing and wholly unsolicited admonition that we had better understand our data and the missing-at-random assumption before even considering multiple imputation. We can take it from a stats professor, but it’s weird to be chided by a bot.
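For readers new to the idea, multiple imputation is easy to sketch. This toy version fills each missing value with a random draw from the observed values, repeats the fill several times, and pools the results; a real analysis would use a model-based method (e.g., MICE) and, as the bot reminded us, a defensible missing-at-random assumption. The ages below are made up:

```python
import random
import statistics

def multiple_impute_mean(values, m=5, seed=0):
    """Toy multiple imputation: fill each missing value (None) with a
    random draw from the observed values, repeat m times, and pool the
    per-imputation means. The spread across imputations reflects the
    extra uncertainty the missing data introduce."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    means = []
    for _ in range(m):
        completed = [v if v is not None else rng.choice(observed)
                     for v in values]
        means.append(statistics.mean(completed))
    return statistics.mean(means), means

ages = [34, None, 41, 29, None, 37, 52]   # hypothetical survey ages
pooled, per_imputation = multiple_impute_mean(ages)
```

The variation among `per_imputation` values is the point: a single fill-in would hide how uncertain the estimate really is.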
In any case, while still experimental, Code Interpreter seems destined to become an invaluable asset for researchers. Initially, however, it may not be a timesaver, given the need to check its work. In our limited experience, Code Interpreter hallucinated relatively infrequently, but miscommunication was not uncommon and sometimes led us to believe that the output we expected was the output we were viewing when it was not, especially when the aforementioned natural language processing went rogue. Bottom line: any use of GPT-4’s Code Interpreter must include checking the output carefully and thoroughly.
Prompts and probes for data cleaning:
Prompt: Review these data and suggest data cleaning steps.
Probe: Visualize these data in meaningful ways.
Probe: Visualize the free text from the field…
Prompt #2: Harnessing Human Insights with ChatGPT
Use ChatGPT to conduct and write up an analysis from qualitative data, such as a focus group or interview*.
LLMs can summarize and even synthesize complex information, including unstructured textual data. While the result might require fine-tuning, it’s a great (and quick!) way to produce a solid first draft.
We link out to this example, which is too wordy for this space (see prompts at the end of this section). Briefly: we used a former Data-Driven blog post to teach the desired writing style and uploaded a partial transcript of an NPR interview with Surgeon General Vivek Murthy. We then prompted GPT-4 to use the blog style to generate a report detailing the interview themes, including direct quotes; later, we pushed for a more active voice and for a fourth paragraph that applies the identified themes in a new way. These prompts yielded a nice draft upon which to iterate. To be fair, the draft needs substantial revision before approaching what we would seek in a final product. However, having ChatGPT extract the initial themes and write a first and second draft in mere minutes constitutes noteworthy assistance in qualitative research writing**.
Prompts and probe for summarizing interview themes:
Note: Modify the prompts below to fit your interview, desired report format, and “training” writing sample.
Prompt: Below, interviewee responses are preceded by "REPLY" and the interview questions are preceded by "QUESTION". Please review and reply with "Read". <<Insert transcript>>
Prompt: See writing sample below. Be sure to note the style, cadence, vocabulary level, voice, and other characteristics of this writing style so you can replicate it. We will call this writing style "Style1". Once you have read the sample below, reply with "Read" if you understand the characteristics of "Style1". Otherwise, reply with a clarifying question. <<Insert writing sample template>>
Prompt: Reconsider the interview with <<Interviewee>>. Identify 3 themes that are present in their interview replies. Write a report no longer than 3 paragraphs long characterizing the themes in the <<Interviewee>> replies. Be sure to include at least two direct quotes from <<Interviewee>>. The report should use the Style1 style of writing.
Probe: Keep the report above, except transition the prose to active voice wherever possible. Also add a fourth paragraph that ties the identified themes to how the concerns discussed may be exacerbated by <<new issue>>.
Prompt #3: Your Very Own Best / Worst Critic
Ask ChatGPT to point out your study’s weaknesses by writing the Limitations section of a manuscript.
After analyzing your data and drawing conclusions, you are ready to publish. So, how do you highlight the importance of your findings while also being upfront about the limitations of your work? This critical part of research papers is easily overlooked, but it matters for transparency and for contributing to a larger conversation, and that means being honest about where our work might fall short. So how can ChatGPT help you with this task?
We asked ChatGPT to identify the limitations of a joint hypermobility study involving patients diagnosed with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS). Preliminary results were first presented by our colleague, Kathleen Mudie, in 2022, and the full study awaits publication, making us confident that ChatGPT’s training data (which ended in September 2021) did not include information from this study. We put a summary of the study methods, results, and conclusions into ChatGPT as the prompt and let it go to work. The initial list was a bit…cookie-cutter for a limitations section (bias from self-reported data? Not exactly a groundbreaking insight), and ChatGPT also made some inaccurate assertions. However, refining the prompt to include the study aims improved the list of limitations. Interestingly, when we repeated an identical prompt in new chat windows, ChatGPT varied the list of limitations, highlighting a hallmark of generative AI: because output is sampled from a probability distribution, identical prompts can yield different responses***. The final output can be found here.
Take home: Whether you were aware of the limitations before the study or they caught you by surprise, your expertise will exceed ChatGPT's - for this use case, the chatbot is best used to reinforce your thinking, ensure you aren’t missing anything important, and make the task of writing a limitations section a bit lighter. However, the details you provide on study aims, design, and methods must be comprehensive enough for ChatGPT to even attempt a meaningful critique. It will likely produce some generic ideas that you can refine with further prompting (e.g., challenge ChatGPT on limitations that don’t seem right to you, or ask it which limitations have the greatest potential impact on the quality of your findings). Most importantly, ChatGPT can produce plausible-sounding but incorrect or misleading responses, so you will also need to check the output closely for accuracy.
Prompts for writing a limitations section:
Prompt: I have just completed a study that <<< brief description of study >>>. Based on the study aims, methods, and findings summarized below, what are the limitations of my research? <<< Summary pasted in >>>
Prompt: Which limitations had the greatest potential impact on: (1) the quality of our findings, and (2) our ability to answer the research question? Name up to two limitations for quality and up to two limitations for ability to answer the research question.
Prompt: Imagine you are the editor of a renowned medical journal and have been asked to offer critique of these study findings in the form of a limitations section. Write up the top limitations of this study in paragraph format. Include one paragraph about how future studies could overcome the limitations of this study. Do not exceed 500 words.
Prompt #4: Keeping Data Organization on the Table
Task ChatGPT with creating, titling, footnoting, and sorting a table for you.
Research tasks and reports often involve tables, so we tested table creation using GPT-3.5 and GPT-4. Complex table formats for grants or journals may be beyond ChatGPT’s capabilities, for now, so we kept the tables simple. Our prompt is too long for this blog and is included at the end of this section. For our initial attempt in GPT-3.5, data had to be pasted in (not uploaded), the table format was fine but a bit bland, the footnote was not formatted as instructed, and the sort order of the first subtable was incorrect (not in ascending caloric order). And that is far from all: the drink selection itself was off, with the wrong drinks chosen for the low-calorie and high-protein lists.
With Code Interpreter engaged, GPT-4 resolved these issues, most notably the mathematical ones. It sometimes surprises people that LLMs are, on their own, not mathematically astute. In a way, this is unsurprising: LLMs generate something new by predicting the next “token” (for example, the next word or, in an image, the next pixel) based on probabilities. Math has one correct answer, not many, and each time you ask a math question, the response should be the same. For any table or data organization task that involves even basic addition or sorting, use GPT-4 with Code Interpreter.
Prompt & dataset for table creation, with Code Interpreter:
Dataset uploaded so you can recreate the table with the prompt below.
Prompt: “Create a table outside of the codeblock that includes the 6 lowest calorie and 6 lowest sodium drinks. Below are table specifications:
title=Starbucks drink choices to minimize calories or maximize protein
First list the 6 low calorie drinks under a table subheading of "Low calorie options"
Start with the lowest calorie drink, end with the highest calorie drink and go alphabetically when calories are equal.
Columns delineated by ";" = Drink name; Calories; Fat(g); Carb(g); Protein; Sodium
rows = table subheaders and drink names
Under the 6 low calorie drinks add a table subheader ="High Protein Options"
Under the "High Protein Options" subheader list the 6 highest protein drinks in descending order from most to least protein, going alphabetically when protein content is identical
Use the same column headers as you created for low calorie drinks
Footnote the table title with an "*"
Add this footnote after the table: *Data from January 2021 and may no longer represent current calorie or protein counts”
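The sort rule the prompt asks for (ascending calories, alphabetical when tied) is exactly the kind of operation best left to real code, which is why Code Interpreter succeeded where plain GPT-3.5 stumbled. A plain-Python sketch with made-up drink data (the real dataset has more drinks and columns):

```python
# Hypothetical drinks; calorie values invented for illustration.
drinks = [
    {"name": "Iced Coffee", "calories": 5},
    {"name": "Brewed Coffee", "calories": 5},
    {"name": "Flat White", "calories": 170},
    {"name": "Cold Brew", "calories": 5},
    {"name": "Latte", "calories": 190},
]

# Ascending calories, then alphabetical on ties -- deterministic,
# unlike asking an LLM to "sort" in prose.
low_cal = sorted(drinks, key=lambda d: (d["calories"], d["name"]))[:3]
print([d["name"] for d in low_cal])
# → ['Brewed Coffee', 'Cold Brew', 'Iced Coffee']
```

When Code Interpreter is engaged, ChatGPT writes and runs code like this behind the scenes instead of predicting the sorted order token by token.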
Bonus task: Use ChatGPT to AI-luminate your research.
Whether it is for a press release or a social media post, researchers seek to reach an audience wider than their fellow scientists, and ChatGPT has skills there, too. We input a preprint abstract from Dr. Bhupesh Prusty’s ME/CFS and Long Covid research and asked it to write tweets for two different audiences, and it did, complete with emojis :-)
The BOT-tom line:
There are a number of ways to use ChatGPT for research, and it was challenging to limit ourselves to highlighting just four (plus a bonus). We believe they demonstrate the versatility and range of ChatGPT, from cleaning data to writing it up and disseminating results to different audiences. With LLMs comes the possibility of increased efficiency and enhanced research quality – a prospect as encouraging as it is intimidating. And one that, at least for the time being, leaves us with many questions:
When and how will researchers gain efficiencies from LLMs when checking for hallucination and/or misunderstanding is a non-negotiable need? Will editors accept LLMs as co-authors? Will AI’s contributions to writing be discernible or even necessary to identify? Will roles in natural language processing, analysis, statistical coding, and technical or research writing transform or become obsolete as LLMs take over these tasks? Did Allison and Leslie even write this blog post?
What do you want to learn about AI for research or business?
Take our one-question survey.
* A note about privacy: Your inputted prompts will be visible to OpenAI.
** For more creative writing needs, your mileage will vary. You might want to start with a human first draft and use ChatGPT as an editorial assistant.
*** A key sampling parameter is called Temperature. It affects how “random” the model’s output is. A value of 0 makes the engine deterministic, which means it generates the same output for a given input text. A value of 1 results in more variation and creativity. The default temperature is 0.7 in ChatGPT. To directly experiment with Temperature and other parameters like Top-P, which limits the selection pool of tokens to the likeliest, you can use OpenAI's GPT API. You can also include a temperature setting directly in a ChatGPT prompt (e.g., "temperature = 0.4") but we experienced some inconsistencies with this; something that others have also observed. Link: Temperature Check: A Guide to the Best ChatGPT Feature You're Probably (Not) Using
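To make the Temperature footnote concrete, here is how temperature reshapes a toy next-token distribution. The three logits are invented for illustration; real models do the same thing over tens of thousands of tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax. Low temperature
    sharpens the distribution toward the top token (approaching greedy,
    deterministic output); high temperature flattens it, so sampling
    produces more varied text. Implementations special-case T=0 as
    pure greedy decoding to avoid dividing by zero."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                    # hypothetical token scores
cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)
# At T=0.1 nearly all probability mass sits on the top token;
# at T=1.0 the other tokens keep a meaningful share.
```

Top-P acts after this step, truncating the distribution to the smallest set of tokens whose probabilities sum to P before sampling.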