Toward verifiable and reproducible human evaluation for text-to-image generation