Detailed Syllabus and Lectures

Lecture 13: Multimodal Pretraining (slides)

feature representations for vision and language, model architectures, pre-training tasks, downstream tasks, what's next

Please study the following material in preparation for the class:

Required Reading:

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee, NeurIPS 2019.
What Does BERT with Vision Look At?, Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang, ACL 2020.
Fusion of Detected Objects in Text for Visual Question Answering, Chris Alberti, Jeffrey Ling, Michael Collins, David Reitter, EMNLP 2019.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, Hao Tan, Mohit Bansal, EMNLP-IJCNLP 2019.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou, AAAI 2020.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai, ICLR 2020.
Unified Vision-Language Pre-Training for Image Captioning and VQA, Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao, AAAI 2020.
12-in-1: Multi-Task Vision and Language Representation Learning, Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee, CVPR 2020.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao, ECCV 2020.
Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020.
VideoBERT: A Joint Model for Video and Language Representation Learning, Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid, ICCV 2019.
Learning Video Representations using Contrastive Bidirectional Transformer, Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid, arXiv preprint arXiv:1906.05743, 2019.
End-to-End Learning of Visual Representations from Uncurated Instructional Videos, Antoine Miech, Ivan Laptev, Jean-Baptiste Alayrac, Lucas Smaira, Josef Sivic, Andrew Zisserman, CVPR 2020.
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou, arXiv preprint arXiv:2002.06353.
Cross-lingual Visual Pre-training for Multimodal Machine Translation, Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, Lucia Specia, EACL 2021.

Additional Resources:

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu, ECCV 2020.
Are we pretraining it right? Digging deeper into visio-linguistic pretraining, Amanpreet Singh, Vedanuj Goswami, Devi Parikh, arXiv preprint arXiv:2004.08744
Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott, arXiv preprint arXiv:2011.15124, 2020.

Lecture 12: Pretraining Language Models (slides)

RNN-based language models, contextualized word embeddings, scaling up generative pretraining (GPT-1, GPT-2, GPT-3) models, masked language modeling and BERT-based models

Please study the following material in preparation for the class:

Required Reading (more s denote higher priority):

Learned in Translation: Contextualized Word Vectors, Bryan McCann, James Bradbury, Caiming Xiong, Richard Socher, NIPS 2016.
A Neural Probabilistic Language Model, Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin, JMLR, Vol 3., 2003.
Recurrent neural network based language model, Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan “Honza” Černocky, Sanjeev Khudanpur, Interspeech 2010.
Generating Text with Recurrent Neural Networks, Ilya Sutskever, James Martens, Geoffrey Hinton, ICML 2011.
Generating Sequence with Recurrent Neural Networks, A. Graves, ArXiV
Skip-Thought Vectors, Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler, NIPS 2015.
Semi-supervised Sequence Learning, Andrew M. Dai, Quoc V. Le, NIPS 2015.
Exploring the Limits of Language Modeling, Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu, ArXiv preprint arXiv:1602.02410, 2016.
Learning to Generate Reviews and Discovering Sentiment, Alec Radford, Rafal Jozefowicz, Ilya Sutskever, arXiv preprint 1704.01444, 2017.
Deep Contextualized Word Representations, Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, NAACL 2018./li>
Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, OpenAI Report, 2018.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, NAACL 2019.
, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, ArXiv preprint ArXiv:1907.11692, 2019.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning, ICLR 2020.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, JMLR 21(140), 2020.
Language Models are Unsupervised Multitask Learners, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, OpenAI Report, 2019.
Language Models are Few-Shot Learners, Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah et al., NeurIPS 2020.

Additional Resources:

[Blog post] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning), Jay Alammar.
[Blog post] Generalized Language Models, Lilian Weng.
A Primer in BERTology: What we know about how BERT works, Anna Rogers, Olga Kovaleva, Anna Rumshisky, TACL, Vol. 8, 2020.
Unifying Language Learning Paradigms, Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler, arXiv preprint, 2024.

Lecture 11: Self-Supervised Learning (slides)

denoising autoencoder, in-painting, colorization, split-brain autoencoder, proxy tasks in computer vision: relative patch prediction, jigjaw puzzles, rotations, contrastive learning: word2vec, contrastive predictive coding, instance discrimination, current instance discrimination models

Please study the following material in preparation for the class:

Key Readings:

Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, Journal of Machine Learning Research 11, 2010.
Context Encoders: Feature Learning by Inpainting, Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, Alexei A. Efros, CVPR 2016.
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction, Richard Zhang, Phillip Isola, Alexei A. Efros, CVPR 2017.
Unsupervised Visual Representation Learning by Context Prediction, Carl Doersch, Abhinav Gupta, Alexei A. Efros, ICCV 2015.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, Mehdi Noroozi, Paolo Favaro, ECCV 2016.
Unsupervised representation learning by predicting image rotations, Spyros Gidaris, Praveer Singh, Nikos Komodakis, ICLR 2018.
Tracking Emerges by Colorizing Videos, Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, Kevin Murphy, ECCV 2018.
Efficient Estimation of Word Representations in Vector Space, Tomás Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, ICLR Workshop Poster, 2013.
Learning Deep Representations by Mutual Information Estimation and Maximization, R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, Yoshua Bengio, ICLR 2019.
Representation Learning with Contrastive Predictive Coding, Aaron van den Oord, Yazhe Li, Oriol Vinyals, arXiv Preprint arXiv:1807:1807.03748v2, 2019
Data-Efficient Image Recognition with Contrastive Predictive Coding, Olivier Henaff et al., ICML 2020.
Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick, CVPR 2020.
A Simple Framework for Contrastive Learning of Visual Representations, Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton, ICML 2020.
Improved Baselines with Momentum Contrastive Learning, Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He, arXiv preprint arXiv:2003.04297, 2020
Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning, Jean-Bastien Grill et al., NeurIPS 2020.
Emerging Properties in Self-Supervised Vision Transformers, Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, Armand Joulin, ICCV 2021.
Barlow Twins: Self-Supervised Learning via Redundancy Reduction, Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stephane Deny, ICML 2021.
Learning Transferable Visual Models From Natural Language Supervision, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, ICML 2021.

Additional Resources:

[Blog post] Self-supervised learning: The dark matter of intelligence, Yann LeCun and Ishan Misra.
[Blog post] Self-Supervised Representation Learning, Lilian Weng.
[Blog post] Contrastive Representation Learning, Lilian Weng
[Blog post] Contrastive Self-Supervised Learning, Ankesh Anand.
SSL Demos, Meta AI.
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey, Longlong Jing, Yingli Tian, IEEE Transactions on Pattern Analysis and Machine Intelligence, Nov. 2021.

Lecture 10: Strengths and Weaknesses of Current Models (slides)

a critique of autoregressive models, flow-based models, latent variable models, implicit models, and diffusion models.

Please study the following material in preparation for the class:

Additional Resources:

Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models, Sam Bond-Taylor, Adam Leach, Yang Long, Chris G. Willcocks, arXiv Preprint arXiv:2103.04922, 2021.
Musings on typicality, Sander Dieleman

Lecture 9: Diffusion Models (slides)

denoising diffusion models, latent diffusion models, classifier-free guidance, video diffusion models, diffusion GANs

Please study the following material in preparation for the class

Key Readings:

Denoising Diffusion Probabilistic Models, Jonathan Ho, Ajay Jain, Pieter Abbeel, NeurIPS 2020.
Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli, ICML, 2015
Generative Modeling by Estimating Gradients of the Data Distribution, Yang Song, Stefano Ermon, NeurIPS 2019.
Diffusion Models Beat GANs on Image Synthesis, Prafulla Dhariwal, Alex Nichol, NeurIPS 2021.
Classifier-free diffusion guidance, Jonathan Ho and Tim Salimans, NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen, arXiv preprint arXiv:2112.10741, 2021.
Zero-Shot Text-to-Image Generation, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever, ICML 2021
Hierarchical Text-Conditional Image Generation with CLIP Latents, Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen, arXiv:2204.06125, 2022.
Improving Image Generation with Better Captions, James Betker et al., OpenAI Technical Report, 2023.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Chitwan Saharia et al., NeurIPS 2022.
Denoising Diffusion Implicit Models, Jiaming Song, Chenlin Meng, Stefano Ermonm ICLR 2021.
Progressive distillation for fast sampling of diffusion models., Tim Salimans and Jonathan Ho, ICLR 2022.
Common Diffusion Noise Schedules and Sample Steps are Flawed, Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang, WACV 2024.
High-resolution image synthesis with latent diffusion models., Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Bjorn Ommer, CVPR 2022.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets, Andreas Blattmann et al., arXiv:2311.15127, 2023.
Cascaded diffusion models for high fidelity image generation, Jonathan Ho et al., JMLR 23, 2022.
Imagen Video: High Definition Video Generation with Diffusion Models, Jonathan Ho et al., arXiv:2210.02303, 2022.
Scalable Diffusion Models with Transformers, William Peebles, Saining Xie, ICCV 2023.
Photorealistic Video Generation with Diffusion Models, Agrim Gupta, arXiv:2312.06662, 2023.
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon, ICLR 2024.
Prompt-to-prompt image editing with cross attention control, Amir Hertz, et al., ICLR 2023.
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, Nataniel Ruiz et al., CVPR 2023.
Adding conditional control to text-to-image diffusion models, Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, ICCV 2023.
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, Jay Zhangjie Wu, et al., ICCV 2023.
Dreamix: Video diffusion models are general video editors, Eyal Modal et al., arXiv:2302.01329, 2023.
DreamFusion: Text-to-3D using 2D Diffusion, Ben Poole et al., ICLR 2023.
Probabilistic Adaptation of Text-to-Video Models, Sherry Yang et al., ICLR 2023.
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs, Zhisheng Xiao, Karsten Kreis, Arash Vahda, ICLR 2024.
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices, Yang Zhao et al., arXiv preprint arXiv:2311.16567, 2023.
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers, Katherine Crowson et al., arXiv preprint arXiv:2401.11605, 2024.

Additional Resources:

[Blog post] Generative Modeling by Estimating Gradients of the Data Distribution, Yang Song.
[Blog post] What are Diffusion Models?, Lilian Weng.
[Blog post] Diffusion Models as a kind of VAE, Angus Turner.
Understanding Diffusion Models: A Unified Perspective, Calvin Luo, arXiv preprint arXiv:2208.11970, 2022.

Lecture 8: Generative Adversarial Networks (slides)

implicit models, generative adversarial networks (GANs), evaluation metrics, theory behind GANs, GAN architectures, conditional GANs, cycle-consistent adversarial networks, representation learning in GANs, applications

Please study the following material in preparation for the class:

Key Readings:

Sections 20.10.4 of the Deep Learning textbook.
Generative Adversarial Networks, Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, NIPS 2014.
Unrolled Generative Adversarial Networks, Luke Metz, Ben Poole, David Pfau, Jascha Sohl-Dickstein, ICLR 2017.
A note on the evaluation of generative models, Lucas Theis, Aäron van den Oord, Matthias Bethge, ICLR 2016.
On the Robustness of Quality Measures for GANs, Motasem Alfarra, Juan C. Pérez, Anna Frühstück, Philip H. S. Torr, Peter Wonka, Bernard Ghanem, arXiv Preprint arXiv:2201.13019, 2024.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, Alec Radford, Luke Metz, Soumith Chintala, ICLR 2016.
Improved Techniques for Training GANs, Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, Xi Chen, NIPS 2016.
Projected GANs Converge Faster, Axel Sauer, Kashyap Chitta, Jens Müller, Andreas Geiger, NeurIPS 2021.
Wasserstein Generative Adversarial Networks, Martin Arjovsky, Soumith Chintala, Léon Bottou, ICML 2017.
Improved Training of Wasserstein GANs, Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron C. Courville, NIPS 2017.
Progressive Growing of GANs for Improved Quality, Stability, and Variation, Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen, ICLR 2018.
Spectral Normalization for Generative Adversarial Networks , Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, ICLR 2018.
Self-Attention Generative Adversarial Networks, Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena, ICML 2019.
Large Scale GAN Training for High Fidelity Natural Image Synthesis, Andrew Brock, Jeff Donahue, Karen Simonyan, ICLR 2019.
A Style-Based Generator Architecture for Generative Adversarial Networks, Tero Karras, Samuli Laine, Timo Aila, CVPR 2019.
Analyzing and Improving the Image Quality of StyleGAN, Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila, CVPR 2020.
Alias-Free Generative Adversarial Networks , Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, Timo Aila, NeurIPS 2021.
StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets, Axel Sauer, Katja Schwarz, Andreas Geiger, arXiv preprint arXiv:2202.00273, 2024.
Self-Distilled StyleGAN: Towards Generation from Internet Photos, Ron Mokady, Michal Yarom, Omer Tov, Oran Lang, Daniel Cohen-Or, Tali Dekel, Michal Irani and Inbar Mosseri, arXiv preprint arXiv:2202.12211, 2024.
Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow, Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine, ICLR 2019.
Image-to-Image Translation with Conditional Adversarial Networks, Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, CVPR 2017
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, ICCV 2017
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel, NIPS 2016.
Adversarially Learned Inference, Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, Aaron Courville, ICLR 2017.
Large Scale Adversarial Representation Learning, Jeff Donahue, Karen Simonyan, NeurIPS 2019.

Additional Resources:

GAN Lab, Minsuk Kahng, Nikhil Thorat, Polo Chau, Fernanda Viégas, and Martin Wattenberg, 2019.
[Blog post] A Gentle Introduction to BigGAN the Big Generative Adversarial Network , Jason Brownlee

[Blog post] GANs and Divergence Minimization, Colin Raffel.
[Blog post] From GAN to WGAN, Lilian Weng
[Blog post] An Alternative Update Rule for Generative Adversarial Networks, Ferenc Huszár
Open Questions about Generative Adversarial Networks, Distill, 2019.
Generating Videos with Scene Dynamics, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba, NIPS 2016.
Adversarial Video Generation on Complex Datasets, Aidan Clark, Jeff Donahue, Karen Simonyan, arXiv preprint arXiv:1907.06571, 2019.
Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling, Jiajun Wu. Chengkai Zhang, Tianfan Xue, William T. Freeman, Joshua B. Tenenbaum, NIPS 2016.
HoloGAN: Unsupervised Learning of 3D Representations From Natural Images, Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, Yong-Liang Yang, ICCV 2019.
Video-to-Video Synthesis, Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro, NeurIPS 2018.
Everybody Dance Now, Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros, ICCV 2019.
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, ICCV 2017.
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi, CVPR 2017.
Context Encoders: Feature Learning by Inpainting, Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros, CVPR 2016.
Domain Separation Networks, Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, Dumitru Erhan, NIPS 2016.
Semantic Image Synthesis with Spatially-Adaptive Normalization, Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu, CVPR 2019.
Manipulating Attributes of Natural Scenes via Hallucination, Levent Karacan, Zeynep Akata, Aykut Erdem, Erkut Erdem, ACM Transactions on Graphics, November 2019, Article No: 7.
Image Synthesis in Multi-Contrast MRI with Conditional Generative Adversarial Networks, Salman Ul Hassan Dar, Mahmut Yurt, Levent Karacan, Aykut Erdem, Erkut Erdem, Tolga Çukur, IEEE Trans. Med. Imag., Vol. 38, Issue 10, pp. 2375-2388, October 2019.
Adversarial Audio Synthesis, Chris Donahue, Julian McAuley, Miller Puckette, ICLR 2019.
MaskGAN: Better Text Generation via Filling in the _______ , William Fedus, Ian Goodfellow, Andrew M. Dai, ICLR 2018.

Lecture 7: Variational Autoencoders (slides)

latent variable models, variational autoencoders, importance weighted autoencoders, variational lower bound/evidence lower bound, likelihood ratio gradients vs. reparameterization trick gradients, Beta-VAE, variational dequantization

Please study the following material in preparation for the class:

Ket Readings:

Sections 20.10.3 of the Deep Learning textbook.
Chapter 2 of An Introduction to Variational Autoencoders, Kingma and Welling.
Importance Weighted Autoencoders, Yuri Burda, Roger B. Grosse, Ruslan Salakhutdinov
Auto-Encoding Variational Bayes, Diederik P. Kingma, Max Welling, ICLR 2014.
Neural Discrete Representation Learning, Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu, NIPS 2017.
Generating Diverse High-Fidelity Images with VQ-VAE-2, Ali Razavi, Aäron van den Oord, Oriol Vinyals, NeurIPS 2019.
beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner, ICLR 2017.

Additional Resources:

Variational Inference lecture notes by David Blei.
[Blog post] How I learned to stop worrying and write ELBO (and its gradients) in a billion ways, Yuge Shi.
[Blog post] Intuitively Understanding Variational Autoencoders, Irhum Shafkat.
[Blog post] A Beginner's Guide to Variational Methods: Mean-Field Approximation, Eric Jang.
[Blog post] Tutorial - What is a variational autoencoder?, Jaan Altosaar
[Blog post] MusicVAE: Creating a palette for musical scores with machine learning, Adam Roberts, Jesse Engel, Colin Raffel, Ian Simon, Curtis Hawthorne
Discrete VAE’s, John Thickstun.
Jukebox: A Generative Model for Music, Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever, arXiv preprint arXiv:2005.00341, 2019.
Improved Variational Inference with Inverse Autoregressive Flow, Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling, NIPS 2016.
PixelVAE: A Latent Variable Model for Natural Images, Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, Aaron Courville, ICLR 2017.
Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design, Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, Pieter Abbeel, ICML 2019.

Lecture 6: Normalizing Flow Models (slides)

1-D flows, change of variables, autoregressive flows, inverse autoregressive flows, affine flows, RealNVP, Glow, Flow++, FFJORD, multi-scale flows, dequantization

Please study the following material in preparation for the class:

Key Readings:

NICE: NICE: Non-linear Independent Components Estimation, Laurent Dinh, David Krueger, and Yoshua Bengio, ICLR 2015.
IAF: Improved variational inference with inverse autoregressive flow, Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling, NIPS 2016.
RealNVP: Density estimation using Real NVP, Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio, ICLR 2017.
Masked Autoregressive Flow for Density Estimation, George Papamakarios, Theo Pavlakou, Iain Murray, NIPS 2017.
Neural autoregressive flows, Chin-Wei Huang, David Krueger, Alexandre Lacoste, Aaron Courville, ICML 2018.
Glow: Generative Flow with Invertible 1×1 Convolutions, Diederik P. Kingma, Prafulla Dhariwal, NeurIPS 2018.
Flow++L Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design, Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, Pieter Abbeel, ICML 2019.
Neural Importance Sampling, Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, Jan Novák, SIGGRAPH 2019.
Ffjord: Ffjord: Free-form continuous dynamics for scalable reversible generative models, Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud, ICLR 2019.
Residual Flows for Invertible Generative Modeling, Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, Jörn-Henrik Jacobsen, NeurIPS 2019.
MintNet: MintNet: Building Invertible Neural Networks with Masked Convolutions, Yang Song, Chenlin Meng, Stefano Ermon, NeurIPS 2019.
SRFLow: SRFlow: Learning the Super-Resolution Space with Normalizing Flow, Andreas Lugmayr, Martin Danelljan, Luc Van Gool, Radu Timofte, ECCV 2020.
.
Continuous Language Generative Flow, Zineng Tang, Shiyue Zhang, Hyounghun Kim, Mohit Bansal. ACL 2021.
FloWaveNet: FloWaveNet : A Generative Flow for Raw Audio, Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, Sungroh Yoon, ICML 2019.
Go with the Flows: Mixtures of Normalizing Flows for Point Cloud Generation and Reconstruction, Janis Postels, Mengya Liu, Riccardo Spezialetti, Luc Van Gool, Federico Tombari, 3DV 2021.

Additional Resources:

Normalizing Flows: An Introduction and Review of Current Methods, Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker, IEEE PAMI, 2021.
Normalizing Flows for Probabilistic Modeling and Inference, George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, Balaji Lakshminarayanan, JMLR, 2021.
[Blog post] Glow: Better Reversible Generative Models, OpenAI
[Blog post] Normalizing Flows Tutorial, Part 1: Distributions and Determinants, Eric Jang
[Blog post] Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows, Eric Jang
[Blog post] Flow-based Deep Generative Models, Lilian Weng

Lecture 5: Autoregressive Models (slides)

histograms as simple generative models, parameterized distributions and maximum likelihood, Bayes’ Nets, MADE, Causal Masked Neural Models, RNN-based autoregressive models, masking-based autoregressive models

Please study the following material in preparation for the class:

Key Readings:

Sections 20.10.5-20.10.10 of the Deep Learning textbook.
char-rnn: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
MADE: MADE: Masked Autoencoder for Distribution Estimation, Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle. ICML 2015.
WaveNet: WaveNet: A Generative Model for Raw Audio, Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu. arXiv preprint arXiv:1609.03499, 2016.
PixelCNN: Pixel Recurrent Neural Networks, Aaron Van Oord, Nal Kalchbrenner, Koray Kavukcuoglu. ICML 2016.
Gated PixelCNN: Conditional Image Generation with PixelCNN Decoders, Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray kavukcuoglu, Oriol Vinyals, Alex Graves, NIPS 2016.
PixelCNN++: PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications, Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P. Kingma, ICLR 2017.
PixelSNAIL: PixelSNAIL: An Improved Autoregressive Generative Model, XI Chen, Nikhil Mishra, Mostafa Rohaninejad, Pieter Abbeel. ICML 2018.
Fast PixelCNN++: Fast Generation for Convolutional Autoregressive Models, Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A. Hasegawa-Johnson, Roy H. Campbell, Thomas S. Huang. ICLR 2017 Workshop.
Multiscale PixelCNN: Parallel Multiscale Autoregressive Density Estimation, Scott Reed, Aäron Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, Nando Freitas. ICML 2017.
Grayscale PixelCNN: PixelCNN Models with Auxiliary Variables for Natural Image Modeling, Alexander Kolesnikov, Christoph H. Lampert. ICML 2017.
Subscale Pixel Network: Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling, Jacob Menick, Nal Kalchbrenner. ICLR 2019.
Scaling Autoregressive Video Models, Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit. ICLR 2020.
Sparse Attention: Generating Long Sequences with Sparse Transformers, Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. arXiv preprint arXiv:1904.10509, 2019.
PixelCNN Super Resolution: Pixel Recursive Super Resolution, Ryan Dahl, Mohammad Norouzi, Jonathon Shlens. ICCV 2017.
Colorization Transformer: Colorization Transformer, Manoj Kumar, Dirk Weissenborn, Nal Kalchbrenner, ICLR 2021.
PixelTransformer: PixelTransformer: Sample Conditioned Signal Generation, Shubham Tulsiani. Abhinav Gupta, ICML 2021.
GPT-1: Improving Language Understanding by Generative Pre-Training, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, OpenAI Report, 2018.
GPT-2: Language Models are Unsupervised Multitask Learners, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, OpenAI Report, 2019.
GPT-3: Language Models are Few-Shot Learners, Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah et al., NeurIPS 2020.
iGPT: , Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever, ICML 2020.
VQ-VAE: Neural Discrete Representation Learning, Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu, NIPS 2017.
VQ-GAN: Taming transformers for high-resolution image synthesis, Patrick Esser, Robin Rombach, Björn Ommer, CVPR 2021.
MAGVIT-v2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation, Lijun Yu, José Lezama, Nitesh B. Gundavarapu et al., ICLR 2024.
VideoPoet: VideoPoet: A Large Language Model for Zero-Shot Video Generation, Dan Kondratyuk, Lijun Yu, Xiuye Gu et al., arXiv Preprint arXiv:2312.14125, 2023.
S4: Efficiently Modeling Long Sequences with Structured State Spaces, Albert Gu, Karan Goel, Christopher Ré, ICLR 2022.
Linear Attention: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret, ICML 2020.
FSQ: Finite Scalar Quantization: VQ-VAE Made Simple, Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen, ICLR 2024.
Gumbel-Softmax: Categorical reparameterization with gumbel-softmax, Eric Jang, Shixiang Gu, Ben Poole, ICLR 2017.
Concrete Distribution: The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, Chris J. Maddison, Andriy Mnih, Yee Whye Teh, ICLR 2017.
Image Transformer: Image transformerNiki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran, ICML 2018.
Sparse Transformer: Generating Long Sequences with Sparse Transformers, Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever,arXiv preprint arXiv:1904.10509, 2019.
LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models, Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros, arXiv Preprint arXiv:2312.00785, 2023.

Additional Resources:

[Blog post] Auto-Regressive Generative Models (PixelRNN, PixelCNN++), Harshit Sharma, Saurabh Mishra

Lecture 4: Neural Building Blocks III: Attention and Transformers (slides)

content-based attention, location-based attention, soft vs. hard attention, self-attention, attention for image captioning, transformer networks

Please study the following material in preparation for the class:

Key Readings:

Neural Machine Translation by Jointly Learning to Align and Translate, D. Bahdanau, K. Cho, Y. Bengio, ICLR 2015
Section 5 of Generating Sequence with Recurrent Neural Networks, A. Graves, ArXiV
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, NIPS 2017
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. ICLR 2021.

Additional Resources:

Attention and Augmented Recurrent Neural Networks, Chris Olah and Shan Carter. Distill, 2016
[Blog post] The Illustrated Transformer, Jay Alammar
[Blog post] The Transformer Family, Lilian Weng
Do Transformer Modifications Transfer Across Implementations and Applications?, Sharan Narang et al., arXiv preprint arXiv:2102.11972, 2021.
Transformers in Vision: A Survey, Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah, arXiv preprint arXiv:2101.01169, 2021

Lecture 3: Neural Building Blocks II: Sequential Processing with Recurrent Neural Networks (slides)

sequence modeling, recurrent neural networks (RNNs), RNN applications, vanilla RNN, training RNNs, long short-term memory (LSTM), LSTM variants, gated recurrent unit (GRU)

Please study the following material in preparation for the class:

Key Readings:

Chapter #10 of the Deep Learning text book.
Section 1-3 of Generating Sequence with Recurrent Neural Networks, A. Graves, ArXiV

Additional Resources:

[Blog post] Understanding LSTM Networks, Chris Olah.
[Blog post] The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy.
Learning Long-Term Dependencies with Gradient Descest is Difficult, Yoshua Bengio, Patrice Simard, and Paolo Frasconi.
Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber.

Lecture 2: Neural Building Blocks I: Spatial Processing with CNNs (slides)

deep learning, computation in a neural net, optimization, backpropagation, convolutional neural networks, residual connections, training tricks

Please study the following material in preparation for the class:

Key Readings:

Chapter #8 and Chapter #9 of the Deep Learning text book.

Additional Resources:

Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review, Waseem Rawat and Zenghui Wang. Neural Computation, Vol. 29 , No. 9, 2017
Why Momentum Really Works, Gabrial Goh. Distill.
A guide to convolution arithmetic for deep learning, Vincent Dumoulin and Francesco Visin.
Multi-Scale Context Aggregation by Dilated Convolutions, Fisher Yu and Vladlen Koltun. ICLR 2016
High-Performance Large-Scale Image Recognition Without Normalization, Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan
[Blog post] In-layer normalization techniques for training very deep neural networks, Nikolas Adaloglou
[Blog post] Understanding Convolutions, Christopher Olah.
[Blog post] Deconvolution and Checkerboard Artifacts, Augustus Odena, Vincent Dumoulin, Chris Olah.

Lecture 1: Introduction to the course (slides)

course information, unsupervised learning

Please study the following material in preparation for the class:

Key Readings:

[Blog post] Why generative modeling?, Jakub Tomczak.
The Bandwagon, Claude E. Shannon. IRE Transactions on Information Theory, Vol. 2, Issue 3, 1956.
The Bitter Lesson, Rich Sutton, March 13, 2019.

COMP547

COMP547: Deep Unsupervised Learning

Detailed Syllabus and Lectures

Lecture 13: Multimodal Pretraining (slides)

Required Reading:

Suggested Video Material:

Additional Resources:

Lecture 12: Pretraining Language Models (slides)

Required Reading (more s denote higher priority):

Suggested Video Material:

Additional Resources:

Lecture 11: Self-Supervised Learning (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 10: Strengths and Weaknesses of Current Models (slides)

Suggested Video Material:

Additional Resources:

Lecture 9: Diffusion Models (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 8: Generative Adversarial Networks (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 7: Variational Autoencoders (slides)

Ket Readings:

Suggested Video Material:

Additional Resources:

Lecture 6: Normalizing Flow Models (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 5: Autoregressive Models (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 4: Neural Building Blocks III: Attention and Transformers (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 3: Neural Building Blocks II: Sequential Processing with Recurrent Neural Networks (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 2: Neural Building Blocks I: Spatial Processing with CNNs (slides)

Key Readings:

Suggested Video Material:

Additional Resources:

Lecture 1: Introduction to the course (slides)

Key Readings: