Enhancing Diffusion Models with 3D Perspective Geometry Constraints

Rishi Upadhyay1 Howard Zhang1 Yunhao Ba12 Ethan Yang1 Blake Gella1 Sicheng Jiang1 Alex Wong3 Achuta Kadambi1

University of California, Los Angeles1 Sony2 Yale University3

SIGGRAPH Asia 2023, Sydney

Images from our model preserve straight lines and perspective. Traditional diffusion models place no constraints on physical accuracy and rely entirely on large datasets to generate realistic images. Our proposed geometric loss explicitly encodes perspective constraints and results in improved image generation and downstream task performance.

While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Because these methods can output a wide gamut of possible images, their synthesized images often fail to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve the performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.


An overview of our proposed loss function. Given an input image with vanishing points, we sweep lines (shown in red and green) across the image and calculate the sum of image gradients along each line. The resulting distribution of summed gradients is then compared with that of the original image.
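The line-sweep idea in the caption can be sketched as follows. This is a minimal illustrative version, not the paper's implementation: it assumes a grayscale image, a known vanishing point, lines sampled at evenly spaced angles through that point, and an L1 distance between the two normalized gradient distributions as a stand-in for the paper's comparison.

```python
import numpy as np

def line_gradient_profile(img, vp, n_angles=32, n_samples=256):
    """Sum image-gradient magnitude along lines swept through a vanishing point.

    img: (H, W) grayscale array; vp: (x, y) vanishing point (may lie outside
    the image). Returns an (n_angles,) vector: one summed gradient per line.
    """
    gy, gx = np.gradient(img.astype(np.float64))
    mag = np.hypot(gx, gy)
    h, w = img.shape
    radius = np.hypot(h, w)  # long enough for each line to cross the image
    profile = np.zeros(n_angles)
    for i, theta in enumerate(np.linspace(0.0, np.pi, n_angles, endpoint=False)):
        # Sample points along the line through vp at angle theta.
        t = np.linspace(-radius, radius, n_samples)
        xs = vp[0] + t * np.cos(theta)
        ys = vp[1] + t * np.sin(theta)
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        profile[i] = mag[ys[inside].astype(int), xs[inside].astype(int)].sum()
    return profile

def perspective_loss(gen_img, ref_img, vp):
    """Compare normalized line-gradient distributions of two images (L1 here,
    as a simple placeholder for the paper's distribution comparison)."""
    p = line_gradient_profile(gen_img, vp)
    q = line_gradient_profile(ref_img, vp)
    p = p / (p.sum() + 1e-8)
    q = q / (q.sum() + 1e-8)
    return np.abs(p - q).sum()
```

In a training loop this scalar would be added, suitably weighted, to the usual diffusion objective; here it only serves to make the sweep-and-compare mechanism concrete.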



We ask users around the world to rank sets of images in terms of photo-realism:


Images generated by the baseline model (StableDiffusion v2) and our enhanced model. Images from our model are more consistent and have straighter lines and more accurate perspective.


To evaluate the quality of our images as synthetic data, we generate synthetic datasets using our model and the baseline model and fine-tune SOTA depth estimation models on them. Depth estimation models trained on our images capture more high-frequency detail and consistently achieve lower RMSE.
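For reference, the two error metrics quoted above (RMSE and SqRel) are the standard KITTI depth-evaluation metrics, computed over pixels with valid ground truth. A minimal sketch, assuming dense numpy depth maps with zeros marking invalid ground-truth pixels:

```python
import numpy as np

def depth_metrics(pred, gt):
    """KITTI-style depth errors over valid (gt > 0) pixels:
    RMSE  = sqrt(mean((pred - gt)^2))
    SqRel = mean((pred - gt)^2 / gt)
    """
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    sq_rel = np.mean(((p - g) ** 2) / g)
    return rmse, sq_rel
```

Lower is better for both; SqRel penalizes errors more heavily at close range, where the ground-truth depth is small.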

@article{10.1145/3618389,
  author = {Upadhyay, Rishi and Zhang, Howard and Ba, Yunhao and Yang, Ethan and Gella, Blake and Jiang, Sicheng and Wong, Alex and Kadambi, Achuta},
  title = {Enhancing Diffusion Models with 3D Perspective Geometry Constraints},
  year = {2023},
  issue_date = {December 2023},
  volume = {42},
  number = {6},
  doi = {10.1145/3618389},
  journal = {ACM Trans. Graph.},
  month = {dec},
  articleno = {237},
  numpages = {15}
}

Rishi Upadhyay
Computer Science Department