AI models fed AI-generated data quickly spew nonsense

ArcticDagger@feddit.dk · 2 months ago

AI models fed AI-generated data quickly spew nonsense

Fleur__@lemmy.world · 2 months ago

And screenshotting a jpeg over and over again reduces the quality?

Evotech@lemmy.world · 2 months ago

As long as you verify the output to be correct before feeding it back is probably not bad.

eleitl@lemm.ee · 2 months ago

How do you verify novel content generated by AI? How do you verify content harvested from the Internet to “be correct”?

Evotech@lemmy.world · 2 months ago

Same way you verified the input to begin with. Human labor

Pennomi@lemmy.world · 2 months ago

That’s correct, and the paper supports this. But people don’t want to believe it’s true so they keep propagating this myth.

Training on AI outputs is fine as long as you filter the outputs to only things you want to see.

Binette@lemmy.ml · 2 months ago

The issue is that A.I. always does a certain amount of mistakes when outputting something. It may even be the tiniest, most insignificant mistake. But if it internalizes it, it’ll make another mistake including the one it internalized. So on and so forth.

Also this is more with scraping in mind. So like, the A.I. goes on the internet, scrapes other A.I. images because there’s a lot of them now, and becomes worse.

ArcticDagger@feddit.dk · 2 months ago

From the article:

To demonstrate model collapse, the researchers took a pre-trained LLM and fine-tuned it by training it using a data set based on Wikipedia entries. They then asked the resulting model to generate its own Wikipedia-style articles. To train the next generation of the model, they started with the same pre-trained LLM, but fine-tuned it on the articles created by its predecessor. They judged the performance of each model by giving it an opening paragraph and asking it to predict the next few sentences, then comparing the output to that of the model trained on real data. The team expected to see errors crop up, says Shumaylov, but were surprised to see “things go wrong very quickly”, he says.

Hamartiogonic@sopuli.xyz · edit-2 2 months ago

A few years ago, people assumed that these AIs will continue to get better every year. Seems that we are already hitting some limits, and improving the models keeps getting harder and harder. It’s like the linewidth limits we have with CPU design.

ArcticDagger@feddit.dk · 2 months ago

I think that hypothesis still holds as it has always assumed training data of sufficient quality. This study is more saying that the places where we’ve traditionally harvested training data from are beginning to be polluted by low-quality training data

HowManyNimons@lemmy.world · 2 months ago

It’s almost like we need some kind of flag on AI-generated content to prevent it from ruining things.

Hamartiogonic@sopuli.xyz · 2 months ago

If that gets implemented, it would help AI devs and common people hanging online.

HowManyNimons@lemmy.world · edit-2 2 months ago

File it under “too good to happen”. Most writing jobs are proofreading AI-generated shit these days. We’ll need to wait until there’s real money in writing scripts to de-pollute content.

KeenFlame@feddit.nu · 2 months ago

No they are increasingly getting better, mostly they fit in a bigger context of other discoveries

0laura@lemmy.world · edit-2 2 months ago

no, not really. the improvement gets less noticeable as it approaches the limit, but I’d say the speed at which it improves is still the same. especially smaller models and context window size. there’s now models comparable to chatgpt or maybe even gpt 4.0 (I don’t remember, one or the other) with context window size of 128k tokens, that you can run on a GPU with 16gb of vram. 128k tokens is around 90k words I think. that’s more than 4 bee movie scripts. it can “comprehend” all of that at once.

metaStatic@kbin.earth · 2 months ago

we have to be very careful about what ends up in our training data

Don’t worry, the big tech companies took a snapshot of the internet before it was poisoned so they can easily profit from LLMs without allowing competitors into the market. That’s who “We” is right?

VeganPizza69 Ⓥ@lemmy.world · 2 months ago

The retroactive enclosure of the digital commons.

WhatAmLemmy@lemmy.world · edit-2 2 months ago

It’s impossible for any of them to have taken a sufficient snapshot. A snapshot of all unique data on the clearnet would have probably been in the scale of hundreds to thousands of exabytes, which is (apparently) more storage than any single cloud provider.

That’s forgetting the prohibitively expensive cost to process all that data for any single model.

The reality is that, like what we’ve done to the natural world, they’re polluting and corrupting the internet without taking a sufficient snapshot — just like the natural world, everything that’s lost is lost FOREVER… all in the name of short term profit!

m3t00🌎@lemmy.world · 2 months ago

interesting refinement of the old GIGO effect.

Whirling_Cloudburst@lemmy.world · 2 months ago

When you fed your AI too much mescaline.

reattach@lemmy.world · edit-2 2 months ago

Serious salvia flashbacks from that headline image.

RIPandTERROR@sh.itjust.works · 2 months ago

Deadpool spoilers

MarauderIIC@dormi.zone · 2 months ago

I wonder if the speed at which it degrades can be used to detect AI-generated content.

maniclucky@lemmy.world · 2 months ago

I wouldn’t be surprised if someone is working on that as a PhD thesis right now.

Ferris@infosec.pub · 2 months ago

how are you going to write a thesis on writing a FLAC to disc and ripping it over and over?

maniclucky@lemmy.world · 2 months ago

By measuring how it does with real images vs generated ones to start. The goal would be to show a method to reliably detect ai images. Gotta prove that it works.

KeenFlame@feddit.nu · 2 months ago

How would it detect, you would need the model and if you do you can already detect

maniclucky@lemmy.world · 2 months ago

It’s an issue with the machine learning technique, not the specific model. The hypothetical thesis would be how to use this knowledge in general.

Why are you so agitated by my off hand comment?

AllNewTypeFace@leminal.space · 2 months ago

The Habsburg Singularity

Willy@sh.itjust.works · 2 months ago

Huh. Who would have thought talking mostly or only to yourself would drive you mad?

Boxscape@lemmy.sdf.org · 2 months ago

AI like:

umbrella@lemmy.ml · 2 months ago

that shit will pave the way for new age horror movies i swear

VeganPizza69 Ⓥ@lemmy.world · edit-2 2 months ago

GOOD.

This “informational incest” is present in many aspects of society and needs to be stopped (one of the worst places is in the Intelligence sector).

Etterra@lemmy.world · 2 months ago

Informational Incest is my least favorite IT company.

runeko@programming.dev · 2 months ago

Damn. I just bought 200 shares of ININ.

funkless_eck@sh.itjust.works · 2 months ago

they’ll be acquired by McKinsey soon enough

credit crazy@lemmy.world · 2 months ago

Too bad they only operate in Alabama

RIPandTERROR@sh.itjust.works · edit-2 2 months ago

WHAT ARE YOU DOING STEP SYS ADMIN?

Etterra@lemmy.world · 2 months ago

How many times did you say this went through a copy machine?