• lunarul@lemmy.world
    link
    fedilink
    English
    arrow-up
    11
    ·
    3 months ago

    Instead of writing captions, the team asked annotators to record 60- to 90-second verbal descriptions answering a list of questions about each image. They then transcribed the descriptions—which often stretched across several pages—and used other large language models to clean up, crunch down, and standardize them.

    So those other LLMs are needed to train this one?