Can you give me an example? I'm still dumb on this.
Well, of course.
In an earlier post you used the prompt "(((full body visible))) 20 yr old woman with pink dyed hair dressed as an assassin in a burning factory, intricate, highly detailed, 8k ultra-realistic, colorful, painting burst, beautiful symmetrical face, a nonchalant kind look, realistic round eyes, tone mapped, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, dreamy magical atmosphere, 4k, looking at the viewer"
This style of prompt generates tokens that the model uses, but the model doesn't know the relationships between them. That's why earlier models either performed better when they had a single subject (every keyword pointed at the only subject, or at the image in general) or suffered from concept bleed. If you needed two women, adding "wearing a yellow tote bag" would leave the AI to randomly decide which woman should wear it. You could improve the odds by removing commas and putting related keywords close together, but the limitations of the "text encoder" part of the AI model showed quickly. The original text-encoding model (CLIP) was around 500 MB in size and could only do limited encoding of text into tokens usable by the image-generating model.
Newer models use a much larger text encoder, T5-XXL in the case of Flux, which is around 45 GB unpruned. It can understand much, much more natural text, and the relationships between words, so it can generate tokens that represent more complex things. However, if you feed this newer encoder a prompt in the old style, it still doesn't know which woman is holding the yellow tote bag, because that wasn't in the prompt. Also, to make sure the large encoder is used well, the image-generation part of the model is trained on images with much, much longer captions, so it can learn concepts more easily, and it responds better when prompted in natural language. I'll post a few images and prompts to illustrate later today.
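To make the tote bag example concrete, here's a small illustrative sketch (my own hypothetical prompts, not from the original post) contrasting the two styles. The exact wording is invented; the point is that the keyword style carries no syntax binding the bag to either woman, while the natural-language style does:

```python
# Hypothetical illustration: the same two-woman scene in both prompt styles.

# Old keyword style: a flat bag of fragments. Nothing links
# "yellow tote bag" to either woman, so the model picks at random
# (concept bleed).
old_prompt = (
    "2 women, pink hair, blonde hair, yellow tote bag, "
    "city street, highly detailed, 8k"
)

# Natural-language style for a large encoder like T5-XXL: the grammar
# itself binds each attribute to a specific subject.
natural_prompt = (
    "Two women standing on a city street. The woman on the left has "
    "pink dyed hair and carries a yellow tote bag; the woman on the "
    "right is blonde and carries nothing."
)

# Splitting the keyword prompt shows the isolated fragments the old
# encoder had to work with:
print([k.strip() for k in old_prompt.split(",")])
```

In the first prompt, "yellow tote bag" is just one fragment among seven; in the second, "carries a yellow tote bag" is grammatically attached to "the woman on the left", which is exactly the relationship a larger text encoder can exploit.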