Artificial intelligence (AI) chatbots are worse at retrieving accurate information and at reasoning when trained on large amounts of low-quality content, particularly if that content is popular on social media, finds a preprint posted on arXiv on 15 October.
In data science, good-quality data need to meet certain criteria, such as being grammatically correct and understandable, says co-author Zhangyang Wang, who studies generative AI at the University of Texas at Austin. But these criteria fail to capture differences in content quality, he says.
Wang and his colleagues wanted to see the effects of training large language models (LLMs) on low-quality data, defined as short, popular social-media posts or posts containing superficial or sensationalist content. They looked at how these data affected model reasoning, retrieval of information from long inputs, the ethics of responses and model personality traits.
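Notably, that definition leans on engagement rather than on conventional quality criteria such as grammaticality. As a rough illustration only (the field names, thresholds and logic below are invented for this sketch, not taken from the preprint), a heuristic filter in that spirit might look like this:

```python
# Illustrative sketch only; not the authors' actual pipeline.
# Assumes each post is a dict with hypothetical 'text', 'likes'
# and 'retweets' fields; the thresholds are invented.

def is_junk(post: dict, max_words: int = 30, min_engagement: int = 500) -> bool:
    """Flag short but highly popular posts as low-quality training data."""
    n_words = len(post["text"].split())
    engagement = post["likes"] + post["retweets"]
    return n_words <= max_words and engagement >= min_engagement

posts = [
    {"text": "you will NOT believe what this chatbot just did!!!",
     "likes": 900, "retweets": 400},
    {"text": "A step-by-step walkthrough of how attention weights are "
             "computed in a transformer, with worked examples.",
     "likes": 12, "retweets": 3},
]
junk = [p for p in posts if is_junk(p)]
print(f"{len(junk)} of {len(posts)} posts flagged as junk")
```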
The team reports that models fed low-quality data skip steps in their reasoning process, or forgo reasoning altogether, so that a model provides incorrect information about a topic or, when the authors posed a multiple-choice question, picks the wrong answer. In data sets mixing junk with high-quality data, the negative effect on reasoning grew as the proportion of junk data increased. The work has not yet been peer-reviewed.
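That dose-response design lends itself to a simple sketch: build training corpora whose junk share rises in steps, then train and score a model on each, so any drop in reasoning accuracy can be traced to the junk proportion. The snippet below only constructs the mixtures; the corpora, fractions and helper are invented for illustration, and the paper's training and evaluation steps are left as a comment.

```python
import random

# Illustrative sketch of a dose-response setup; the documents and
# junk fractions here are placeholders, not the paper's data.

def make_mixture(junk: list[str], clean: list[str],
                 junk_fraction: float, size: int) -> list[str]:
    """Sample a corpus of `size` documents with the given junk share."""
    n_junk = round(size * junk_fraction)
    sample = (random.choices(junk, k=n_junk)
              + random.choices(clean, k=size - n_junk))
    random.shuffle(sample)
    return sample

junk_docs = ["wow this is CRAZY, like and share!!!"] * 100
clean_docs = ["A step-by-step derivation of the attention update."] * 100

for junk_fraction in (0.0, 0.2, 0.5, 0.8, 1.0):
    corpus = make_mixture(junk_docs, clean_docs, junk_fraction, size=1_000)
    # Each corpus would then be used to train a model and score its
    # reasoning (e.g. multiple-choice accuracy) -- omitted here.
    print(f"junk share {junk_fraction:.0%}: {len(corpus)} documents")
```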
The findings support a long-held tenet of AI: the importance of data quality, says Mehwish Nasim, an AI researcher at the University of Western Australia in Perth. “Even before people started to work on large language models, we used to say that, if you give garbage to an AI model, it’s going to produce garbage,” she adds.