Google DeepMind unveils ‘superhuman’ AI system that excels in fact-checking, saving costs and improving accuracy




A new study from Google's DeepMind research unit has found that an artificial intelligence system can outperform human fact-checkers when evaluating the accuracy of information generated by large language models.

The paper, titled "Long-form factuality in large language models" and published on the preprint server arXiv, introduces a method called Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts, then uses Google Search results to determine the accuracy of each claim.

"SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results," the authors explained.
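To make the pipeline concrete, here is a minimal Python sketch of that decompose-search-judge loop. Everything in it is a hypothetical stand-in: the function names (`split_into_facts`, `check_fact`, `safe_evaluate`), the prompts, and the `llm` and `search` callables are illustrative assumptions, not DeepMind's actual implementation, which is available in the authors' open-sourced repository.

```python
# Minimal sketch of a SAFE-style pipeline. All names and prompts are
# hypothetical stand-ins for illustration; see DeepMind's open-sourced
# code for the real implementation.

from dataclasses import dataclass


@dataclass
class FactVerdict:
    fact: str
    supported: bool
    evidence: list[str]


def split_into_facts(response: str, llm) -> list[str]:
    """Step 1: use an LLM to decompose a long-form response into
    individual factual claims (assumes llm is a str -> str callable)."""
    prompt = (
        "Break the following text into a list of individual, "
        "self-contained factual claims, one per line:\n\n" + response
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def check_fact(fact: str, llm, search) -> FactVerdict:
    """Steps 2-3: issue a search query for the fact, then ask the LLM
    whether the returned results support it (assumes search is a
    str -> list[str] callable returning result snippets)."""
    query = llm("Write a Google Search query to verify this claim: " + fact)
    results = search(query)
    verdict = llm(
        "Fact: " + fact
        + "\nSearch results:\n" + "\n".join(results)
        + "\nIs the fact supported by the results? Answer yes or no."
    )
    return FactVerdict(fact, verdict.strip().lower().startswith("yes"), results)


def safe_evaluate(response: str, llm, search) -> list[FactVerdict]:
    """Run the full pipeline: decompose the response, verify each fact."""
    return [check_fact(f, llm, search) for f in split_into_facts(response, llm)]
```

The sketch conveys only the structure described in the quote above; a real system would also need to make each extracted fact self-contained, handle irrelevant claims, and manage search API rate limits.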

'Superhuman' performance sparks debate

The researchers pitted SAFE against human annotators on a dataset of roughly 16,000 facts, finding that SAFE's assessments matched the human ratings 72% of the time. Even more notably, in a sample of 100 disagreements between SAFE and the human raters, SAFE's judgment was found to be correct in 76% of cases.


While the paper asserts that "LLM agents can achieve superhuman rating performance," some experts are questioning what "superhuman" really means here.

Gary Marcus, a well-known AI researcher and frequent critic of overhyped claims, suggested on Twitter that in this case, "superhuman" may simply mean "better than an underpaid crowd worker, rather than a true human fact checker."

"That makes the characterization misleading," he said. "Like saying that 1985 chess software was superhuman."

Marcus raises a valid point. To truly demonstrate superhuman performance, SAFE would need to be benchmarked against expert human fact-checkers, not just crowdsourced workers. The specific details of the human raters, such as their qualifications, compensation, and fact-checking process, are crucial for properly contextualizing the results.

Cost savings and benchmarking top models

One clear advantage of SAFE is cost: the researchers found that using the AI system was about 20 times cheaper than human fact-checkers. As the volume of information generated by language models continues to explode, having a cost-effective and scalable way to verify claims will be increasingly vital.

The DeepMind team used SAFE to evaluate the factual accuracy of 13 top language models across four families (Gemini, GPT, Claude, and PaLM-2) on a new benchmark called LongFact. Their results indicate that larger models generally produced fewer factual errors.


However, even the best-performing models generated a significant number of false claims. This underscores the risks of over-relying on language models that can fluently express inaccurate information. Automatic fact-checking tools like SAFE could play a key role in mitigating these risks.

Transparency and human baselines are crucial

While the SAFE code and LongFact dataset have been open-sourced on GitHub, allowing other researchers to scrutinize and build upon the work, more transparency is still needed around the human baselines used in the study. Understanding the specifics of the crowdworkers' backgrounds and process is essential for assessing SAFE's capabilities in proper context.

As the tech giants race to develop ever more powerful language models for applications ranging from search to virtual assistants, the ability to automatically fact-check the outputs of these systems could prove pivotal. Tools like SAFE represent an important step toward building a new layer of trust and accountability.

However, it is critical that the development of such consequential technologies happens in the open, with input from a broad range of stakeholders beyond the walls of any one company. Rigorous, transparent benchmarking against human experts, not just crowdworkers, will be essential to measure true progress. Only then can we gauge the real-world impact of automated fact-checking on the fight against misinformation.


