Government websites are loaded with misinformation, and that’s a big problem for AI
Sometimes information that looks reliable because it’s from a federal agency website has not in fact been vetted, and AI systems are only as trustworthy as the data they’re fed.
As artificial intelligence proliferates, so do concerns about AI-grown misinformation. AI systems like ChatGPT and other algorithms learn from the text and data that they’re fed. If the input data is bad, so is the output, hence the aphorism “garbage in, garbage out.”
The Washington Post recently published a report analyzing the sources of Google’s C4 data set, a large collection of information used to train many AI models. There was bad news and some supposed good news. The bad news: many sources in Google’s data set ranked low on trustworthiness scales and promoted conspiracy theories, feeding misinformation and propaganda into AI models. The supposed good news: the top source of information in the data set, patents published by governments across the world, is more trustworthy. Many other government websites are also major sources of information for Google’s data set.
Is this good news? Government-based sources like patents are probably more reliable than, say, 4chan.org, an anonymous message board that was also incorporated into the training data. But government sources, including those from the U.S. government, are surprisingly unreliable and contain copious information that is outright false.
I’m a law professor who researches the reliability of government information, and I’ve found that we too often blindly assume that information we find on a U.S. government website is correct. I’m not talking about propaganda but about more mundane misinformation—from patents, reports, lists and databases. It’s crucial to recognize the flaws of government information both because it is an important input into AI systems and because government information is a source that we interact with in many other ways—and are predisposed to trust.
Heuristics, or mental shortcuts, make you and me (assuming you are someone who generally trusts the government) inclined to assume that information published by the government is reviewed or generated by experts and therefore somewhat trustworthy. But that isn’t necessarily true. A huge amount of information published by the government is generated by third parties and isn’t reviewed at all for accuracy before publication.
Take U.S. patents, which, as a reminder, are the number one source of data in Google’s training set. They are routinely published with fictional, fraudulent and incorrect data. For example, the U.S. Patent Office granted patents to Theranos on its now-discredited medical technology, even after the falsity of the company’s claims was highly publicized. Almost one quarter of U.S. life sciences patents include fictional experiments, but they’re often interpreted as factual because readers tend to trust patents.
And it’s not just patents. The Environmental Protection Agency publishes data on industrial pollution and suggests that, before you buy a house, you check pollution levels in the area. Where does the pollution data come from? Companies self-report it. Does the EPA check it? No, and a report from the Government Accountability Office, a nonpartisan watchdog, found frequent errors.
Another example comes from the National Institutes of Health, which publishes a list of clinical trials to help patients find new medical treatments. Many trials are reviewed for safety by the Food and Drug Administration before they’re posted, but not all, and the NIH does not review the information it posts for either safety or accuracy. Companies peddling unapproved treatments have taken advantage of this loophole and listed procedures with the NIH in an attempt to enhance their legitimacy. Several patients who were tragically blinded after undergoing an unapproved stem-cell treatment reported that they had believed the treatment was a government-reviewed clinical trial because it was posted on the NIH’s clinical trials site.
In yet another example, our trust in government-published information is exploited by opponents of vaccination. When former Fox News host Tucker Carlson claimed that data from the Centers for Disease Control and Prevention showed that thousands of people had died after taking the Covid vaccine, he wasn’t exactly wrong. CDC data do say that. But the CDC database is an aggregation of reports that can be submitted by anyone and are not checked for accuracy by the CDC (not to mention, Carlson’s claim seriously confuses correlation and causation). Further, there’s evidence that opponents of vaccination deliberately submit reports of vaccine side effects to the CDC’s database so that they can later cite the CDC’s authority to back up their claims that vaccines are dangerous—essentially laundering information through the government to make it look more legitimate.
When we think of misinformation from the government, we often think about deliberately false propaganda. But in the United States, a much more widespread source of misinformation is information generated by third parties and published by the government without vetting. Because it appears on a government website, it looks trustworthy, but sometimes it isn’t. Government agencies should dedicate more resources to vetting this data. The rise of AI makes this effort more important now than ever. But in the meantime, dear reader, be a cautious consumer of information, whether from an AI model or a government website.
Janet Freilich is a professor at Fordham Law School who writes and teaches in the areas of patent law, intellectual property and civil procedure. She is the author of a paper titled “Government Misinformation Platforms,” which is scheduled to be published in the spring of 2024 in the Pennsylvania Law Review.