• fubarx@lemmy.world
    link
    fedilink
    English
    arrow-up
    11
    arrow-down
    2
    ·
    9 days ago

    Was working on a simulator and needed random interaction data. Statistical randomness didn’t capture likely scenarios (bell curves and all that). Switched to LLM synthetic data generation. Seemed better, but wait… seemed off 🤔. Checked it for clustering and entropy vs human data. JFC. Waaaaaay off.

    Lesson: synthetic data for training is a Bad Idea. There are no shortcuts. Humans are lovely and messy.