
P107: ChatGPT: A Reliable Tool to Identify Regulatory Precedents?

Poster Presenter

Paul Bolot
Global Regulatory Strategist
Bayer Consumer Care AG, Switzerland

Objectives

Assess and compare the performance and reliability of large language models, such as ChatGPT, in identifying regulatory precedents (drugs approved in the US and EU, based on indication, approval type, regulatory designations, etc.) in the field of oncology.

Method

The study was conducted in January 2024 and assessed two large language models: GPT-3.5 (within ChatGPT) and GPT-4 (within Bayer's internal system). We asked each model 50 questions designed to identify regulatory precedents and compared their answers to the precedents retrieved from an internal database.
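
As a rough illustration of this setup (not the actual Bayer tooling), the sketch below shows how such an evaluation loop could be organized in Python: each question is sent to a model and the drug names it lists are collected for later comparison with the internal database. The ask_model and extract_precedents callables are hypothetical placeholders for the model interface and answer parsing, which are not described in the abstract.

    from typing import Callable, Dict, List

    def run_evaluation(
        questions: List[str],
        ask_model: Callable[[str], str],
        extract_precedents: Callable[[str], List[str]],
    ) -> Dict[str, List[str]]:
        """Send each regulatory-precedent question to a model and collect
        the drug names it lists, for later comparison with a reference database."""
        answers: Dict[str, List[str]] = {}
        for question in questions:
            raw_answer = ask_model(question)                    # free-text model response
            answers[question] = extract_precedents(raw_answer)  # drug names found in the response
        return answers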

Results

To assess the models’ performance, we only considered valid answers, i.e., answers providing at least one precedent (correct or not). Although both models provided answers to all fifty questions, the number of valid answers was very different between the two models: GPT-3.5 provided only 6 valid answers (12%), whereas GPT-4 was able to provide at least one precedent for 46 questions (92%). This gap can be explained by the nature of the questions. The fifty questions were split evenly between US and EU regulatory precedents, but GPT-3.5 was not able to provide any European precedent. Likewise, GPT-3.5 failed when asked for precedents based on trial design, primary endpoint, or regulatory designation criteria (e.g., “List all drugs approved for the treatment of prostate cancer in the US based on ORR”). GPT-4 failed specifically on questions about European precedents with regulatory designation criteria (e.g., “List all EU initial approvals of medicines for the treatment of NSCLC with Orphan Drug Designation”).

We then assessed whether the models’ answers were correct by comparing them to our internal cancer drug approval database. To do so, we defined the concordance rate as the ratio between the number of correct precedents listed by the model and the number of correct precedents in our database. On average, GPT-4 had a concordance rate of 60% (min. 0%; max. 100%). Across all precedents listed by the models, we frequently encountered errors; for instance, GPT-4 often could not distinguish between initial and supplemental approvals. Finally, it should be noted that the models’ answers always included a statement that the provided list of drugs may not be exhaustive.
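
For illustration only, the two criteria used above can be restated in code as follows: an answer is valid when it names at least one precedent, and the concordance rate is the share of correct precedents in the reference database that the model also listed. Drug names are compared as plain strings here for simplicity; how answers were actually matched against the internal database is not specified in the abstract.

    from typing import List

    def is_valid(model_precedents: List[str]) -> bool:
        # An answer is valid if it lists at least one precedent, correct or not.
        return len(model_precedents) > 0

    def concordance_rate(model_precedents: List[str], database_precedents: List[str]) -> float:
        # Correct precedents from the model, divided by the correct precedents in the database.
        if not database_precedents:
            return 0.0
        correct = set(model_precedents) & set(database_precedents)
        return len(correct) / len(database_precedents)

    # Example: a model listing 3 of the 5 database precedents has a concordance rate of 60%.
    print(concordance_rate(["drug A", "drug B", "drug C"],
                           ["drug A", "drug B", "drug C", "drug D", "drug E"]))  # 0.6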

Conclusion

In this study, we assessed and compared the performance of two large language models, GPT-3.5 (within ChatGPT) and GPT-4 (within Bayer's internal system), in answering a list of questions on regulatory precedents. The models were not able to identify precedents for all questions, but the more recent model (GPT-4) performed significantly better than its predecessor (GPT-3.5), answering most of the questions. Regarding accuracy, the lists of precedents provided by the models often contained errors.

Nonetheless, these limitations should be weighed against the strengths of large language models. They provided very quick answers, whereas retrieving regulatory precedents from primary sources can be time-consuming, particularly when no tool or database collecting regulatory approvals is available. In such cases, large language models can offer a quick and useful alternative. However, regulatory professionals should treat answers from large language models with extreme caution and should systematically double-check the information provided.

Defining a sound regulatory strategy often means identifying and assessing relevant regulatory precedents, yet finding such precedents is not an easy task, particularly in the absence of dedicated tools or databases. Used with caution and rigor, large language models could help in the search for regulatory precedents: they could serve as a first step, a starting point, to quickly obtain a list of potential precedents, which would then need to be carefully checked and verified. Moreover, with the rapid development of large language models, we can expect them to improve continuously over time and to play a greater role as assistants to regulatory strategists.
