Was working on a simulator and needed random interaction data. Statistical randomness didn’t capture likely scenarios (bell curves and all that). Switched to LLM synthetic data generation. Seemed better, but wait… seemed off 🤔. Checked it for clustering and entropy vs human data. JFC. Waaaaaay off.
Lesson: synthetic data for training is a Bad Idea. There are no shortcuts. Humans are lovely and messy.
Was working on a simulator and needed random interaction data. Statistical randomness didn’t capture likely scenarios (bell curves and all that). Switched to LLM synthetic data generation. Seemed better, but wait… seemed off 🤔. Checked it for clustering and entropy vs human data. JFC. Waaaaaay off.
Lesson: synthetic data for training is a Bad Idea. There are no shortcuts. Humans are lovely and messy.