• 2 Posts
  • 136 Comments
Joined 1 year ago
Cake day: June 16th, 2023









  • Speaking for LLMs: because they generate text one token at a time, there will always be some statistical chance of spitting out original training data, and that can’t be avoided. The usual counter-argument is that, in theory, the odds of a particular piece of training data coming back out intact for more than a handful of words should be extremely low.

    Of course, in this case, Google’s researchers took advantage of the repeat discouragement mechanism to make that unlikely outcome occur reliably, showing that there are indeed flaws that can be exploited to make it happen.
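
To make the next-token point concrete, here is a minimal sketch in Python. It is a toy n-gram counter, not a real LLM, and the corpus, the `train` and `generate` helpers, and the `order` parameter are all made up for illustration. It only shows the mechanism: once the context lands on a sequence that appeared exactly once in training, always picking the likeliest next token walks straight through the memorized text.

```python
from collections import Counter, defaultdict

def train(corpus, order=2):
    """Count which token follows each `order`-token context."""
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.split()
        for i in range(len(tokens) - order):
            context = tuple(tokens[i:i + order])
            counts[context][tokens[i + order]] += 1
    return counts

def generate(counts, prompt, max_new_tokens=20, order=2):
    """Greedy decoding: always append the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        context = tuple(tokens[-order:])
        if context not in counts:
            break  # context never seen in training, nothing to predict
        tokens.append(counts[context].most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical training data; the last line plays the "rare document".
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick red car stops at the light",
    "call me ishmael some years ago never mind how long",
]
model = train(corpus)

# Two words of the rare document are enough to pull out the rest verbatim.
print(generate(model, "call me"))
```

A real model has vastly more competing continuations for any context, which is why verbatim runs are normally rare. The toy just shows why they are not impossible.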


  • I’m not an expert, but I would say that a diffusion model is less likely to spit out training data completely intact. LLMs and diffusion models work in very different ways.

    LLMs work by predicting the next statistically likely token: they take all of the previous text, then predict what the next token will be based on that. So, if you can trick the model into a state where the most likely next tokens are something verbatim from training data, then that’s what you get.

    Diffusion models work by taking a randomly generated latent and combining it with the CLIP encoding of the user’s prompt, then trying to turn that random information into a new latent. The VAE then decodes that latent into something a human can see, because the latents the model is dealing with are meaningless numbers to humans.

    In other words, there’s a lot more randomness to deal with in a diffusion model. You could probably get a specific source image back if you specially crafted a latent and a prompt, which one guy did do by basically running img2img on a specific image that was in the training set and giving it a prompt to spit the same image out again. But that required having the original image in the first place, so it’s not really a weakness in the same way this was for GPT.
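
The pipeline described above can be sketched roughly like this in Python. Everything here is a stand-in and not how any real library does it: `embed_prompt` fakes the CLIP step, `denoise` just nudges the latent toward the prompt vector instead of running a trained network, and `decode` fakes the VAE. The only thing it is meant to show is the shape of the process, and that the random starting latent changes the result even when the prompt stays the same.

```python
import random

def embed_prompt(prompt, dim=4):
    """Hypothetical stand-in for a CLIP text encoder: prompt -> fixed vector."""
    rng = random.Random(prompt)  # same prompt always gives the same vector
    return [rng.uniform(-1, 1) for _ in range(dim)]

def denoise(latent, cond, steps=10, strength=0.2):
    """Toy denoiser: each step pulls the latent a bit toward the conditioning.

    A real diffusion model uses a trained network here; this is only a
    placeholder that keeps some trace of the starting noise in the result.
    """
    for _ in range(steps):
        latent = [x + strength * (c - x) for x, c in zip(latent, cond)]
    return latent

def decode(latent):
    """Stand-in for the VAE decoder: latent numbers -> viewable output."""
    return [round(x, 3) for x in latent]

def txt2img(prompt, seed, dim=4):
    """Random latent + prompt conditioning -> denoise -> decode."""
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in range(dim)]
    return decode(denoise(latent, embed_prompt(prompt, dim)))

a = txt2img("a cat on a sofa", seed=1)
b = txt2img("a cat on a sofa", seed=2)
print(a)
print(b)
print(a != b)  # same prompt, different starting noise, different output
```

Compare that with the LLM case, where the only input is the text so far: here you would need to control both the prompt and the starting latent to steer toward one specific image.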









  • For those in the US: Learn how to file your own taxes. It’s really simple for the large majority of people, and usually just consists of copying numbers into boxes off a sheet your employer made for you. After you’ve done it once, you’ll probably get it done in less than half an hour each time after that.

    You can do it for free on a ton of sites unless you make significant income. FreeTaxUSA is typically the most highly recommended one.