Washington State University professor Mesut Cicek and his team repeatedly evaluated ChatGPT by giving it hypotheses drawn from scientific studies. The AI was asked to decide whether each statement was supported by research — essentially judging if it was true or false.

In total, the researchers tested more than 700 hypotheses and submitted each one 10 times to examine how consistent the responses would be.

In the initial 2024 experiment, ChatGPT answered correctly 76.5% of the time. When the study was repeated in 2025, accuracy rose slightly to 80%. However, once the results were adjusted for random guessing, the performance looked far less reliable. The AI was only about 60% better than chance, which the researchers described as closer to a low D than to a strong performance.
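The article does not say how the chance adjustment was calculated. One common way to make such a correction, assuming a 50% guessing baseline for a two-way true/false judgment, is to measure how much of the gap between guessing and a perfect score was closed. The sketch below uses that assumption; the function name `chance_adjusted` is purely illustrative, and the researchers' actual method may differ.

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Share of the gap between chance and a perfect score that was closed.

    Assumes a 50% guessing baseline by default (two possible answers);
    this baseline is an assumption, not a figure from the study.
    """
    return (accuracy - chance) / (1.0 - chance)


# Raw accuracy figures reported for the two rounds of the study.
for year, accuracy in [(2024, 0.765), (2025, 0.80)]:
    print(year, round(chance_adjusted(accuracy), 2))
```

Under this assumption, the 2025 figure of 80% works out to 0.6, which is in line with the "about 60% better than chance" result quoted above.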

The system had particular difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed inconsistency. When given the exact same prompt 10 times, ChatGPT produced consistent results for only about 73% of the cases.
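The article does not define what counted as a "consistent" result. A minimal sketch of one plausible reading, in which a hypothesis counts as consistent only if all of its repeated runs returned the same answer, might look like the following. The function name and the sample data are hypothetical and are not taken from the study.

```python
def consistency_rate(runs_per_hypothesis: list[list[str]]) -> float:
    """Fraction of hypotheses whose repeated runs all gave the same answer.

    Assumes "consistent" means every run agreed; the study may have
    used a different definition.
    """
    consistent = sum(1 for runs in runs_per_hypothesis if len(set(runs)) == 1)
    return consistent / len(runs_per_hypothesis)


# Hypothetical data: four hypotheses, each submitted 10 times.
sample = [
    ["true"] * 10,               # all runs agree
    ["true"] * 9 + ["false"],    # one run disagrees
    ["false"] * 10,              # all runs agree
    ["true"] * 10,               # all runs agree
]
print(consistency_rate(sample))
```

With this made-up sample, three of the four hypotheses receive identical answers across all 10 runs.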
