MIT, Cohere for AI, others launch platform to track and filter audited AI datasets


Contents

Dataset collections don’t acknowledge lineage
Training datasets have been under scrutiny in 2023

Researchers from MIT, Cohere for AI and 11 other institutions launched the Data Provenance Platform today in order to “tackle the data transparency crisis in the AI space.”

They audited and traced nearly 2,000 of the most widely used fine-tuning datasets, which collectively have been downloaded tens of millions of times and are the “backbone of many published NLP breakthroughs,” according to a message from authors Shayne Longpre, a Ph.D. candidate at MIT Media Lab, and Sara Hooker, head of Cohere for AI.

“The result of this multidisciplinary initiative is the single largest audit to date of AI dataset,” they said. “For the first time, these datasets include tags to the original data sources, numerous re-licensings, creators, and other data properties.”

To make this information practical and accessible, an interactive platform, the Data Provenance Explorer, allows developers to track and filter thousands of datasets for legal and ethical considerations, and enables scholars and journalists to explore the composition and data lineage of popular AI datasets.
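To give a rough sense of what filtering datasets on legal criteria can look like in practice, here is a minimal Python sketch. It is not the Data Provenance Explorer's interface: the `DatasetRecord` fields, the `filter_datasets` helper and the sample entries are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    name: str
    license: str
    sources: tuple[str, ...] = ()
    commercial_use: bool = False


def filter_datasets(records, allowed_licenses, require_commercial=False):
    """Keep records whose license is allowed and, if requested,
    whose terms permit commercial use."""
    allowed = {lic.lower() for lic in allowed_licenses}
    return [
        r for r in records
        if r.license.lower() in allowed
        and (r.commercial_use or not require_commercial)
    ]


# Made-up entries for the example, not real audit results.
records = [
    DatasetRecord("example-qa", "CC-BY-4.0", ("example-wiki",), commercial_use=True),
    DatasetRecord("example-chat", "CC-BY-NC-4.0", ("example-forum",)),
    DatasetRecord("example-summaries", "Unspecified", ("example-news",)),
]

usable = filter_datasets(
    records, ["CC-BY-4.0", "CC-BY-NC-4.0"], require_commercial=True
)
print([r.name for r in usable])
```

With these sample records, only the entry tagged as permitting commercial use passes the filter. A real audit would carry far richer metadata, such as the re-licensing history and creators mentioned above.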

Dataset collections don’t acknowledge lineage

The group released a paper, The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, which says:

“Increasingly, widely used dataset collections are treated as monolithic, instead of a lineage of data sources, scraped (or model generated), curated, and annotated, often with multiple rounds of re-packaging (and re-licensing) by successive practitioners. The disincentives to acknowledge this lineage stem both from the scale of modern data collection (the effort to properly attribute it), and the increased copyright scrutiny. Together, these factors have seen fewer Datasheets, non-disclosure of training sources and ultimately a decline in understanding training data.

This lack of understanding can lead to data leakages between training and test data; expose personally identifiable information (PII), present unintended biases or behaviours; and generally result in lower quality models than anticipated. Beyond these practical challenges, information gaps and documentation debt incur substantial ethical and legal risks. For instance, model releases appear to contradict data terms of use. As training models on data is both expensive and largely irreversible, these risks and challenges are not easily remedied.”

Training datasets have been under scrutiny in 2023

VentureBeat has extensively covered issues related to data provenance and transparency of training datasets: Back in March, Lightning AI CEO William Falcon slammed OpenAI’s GPT-4 paper as “masquerading as research.”

Many said the report was notable mostly for what it did not include. In a section called Scope and Limitations of this Technical Report, it says: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

And in September, we published a deep dive into the copyright issues looming in generative AI training data.

The explosion of generative AI over the past year has become an “oh, shit!” moment when it comes to dealing with the data that trained large language and diffusion models, including mass amounts of copyrighted content gathered without consent, Dr. Alex Hanna, director of research at the Distributed AI Research Institute (DAIR), told VentureBeat.
