# Image GPT

> 2020 Transformer image model

**Wikidata**: [Q96369916](https://www.wikidata.org/wiki/Q96369916)  
**Source**: https://4ort.xyz/entity/image-gpt

## Summary
Image GPT (iGPT) is a transformer-based generative model developed by OpenAI in 2020 to process and generate images using autoregressive techniques. It adapts the language modeling approach of GPT-2 to vision tasks, demonstrating the potential of self-supervised learning for image generation and understanding.

## Key Facts
- **Inception Date**: June 17, 2020
- **Developer**: OpenAI
- **Model Type**: Transformer, autoregressive model, generative model
- **Versions**: iGPT-S (76M parameters), iGPT-M (455M), iGPT-L (1.4B), iGPT-XL (6.8B)
- **Based On**: GPT-2 architecture
- **Training Approach**: Unsupervised pre-training on the ImageNet dataset, with labels not used
- **License**: MIT License
- **Source Code**: Hosted on GitHub at [https://github.com/openai/image-gpt](https://github.com/openai/image-gpt)

## FAQs
### Q: Who created Image GPT?
A: Image GPT was developed by OpenAI, a U.S.-based artificial intelligence research organization.

### Q: How does Image GPT relate to GPT-2?
A: Image GPT directly adapts the GPT-2 language modeling framework to image data, applying transformer-based autoregressive techniques to pixel prediction tasks.

### Q: What makes Image GPT notable?
A: It was one of the first models to demonstrate the effectiveness of transformer architectures for image generation, and its self-supervised features achieved performance competitive with supervised models.

## Why It Matters
Image GPT was an early step in applying transformer models to visual data, showing that techniques originally designed for text could be extended to images. By training on pixel prediction without labeled data, it highlighted the potential of self-supervised learning to reduce reliance on large annotated datasets. Its direct applications were limited by computational cost and low resolution, but it is often seen as a precursor to later OpenAI models that combine images and text, such as DALL-E and CLIP. The work underscored the importance of scale and unsupervised learning in AI research, contributing to the broader shift toward general-purpose vision-language systems.

## Notable For
- An early, prominent demonstration of large-scale transformer models for image generation (2020)
- Autoregressive pixel prediction approach, treating images as sequences of pixels
- Scaled model variants (up to 6.8B parameters) to study performance improvements with size
- Competitive results with supervised models on CIFAR-10/CIFAR-100 benchmarks using self-supervised training

## Body
### Development Context
Image GPT was released by OpenAI in June 2020 as part of research exploring the application of transformer architectures to vision tasks. The model built on the success of GPT-2, repurposing its language modeling framework for pixel data.

### Technical Foundation
- **Architecture**: Transformer-based decoder with autoregressive sampling
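- **Input Representation**: Images downsampled to a low resolution, with each pixel's color mapped to an entry in a reduced color palette, then flattened in raster order into a sequence of pixel tokens (not split into patches)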
- **Training Objective**: Predict pixel values sequentially, similar to language token prediction
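
The pixel-sequence idea can be illustrated with a minimal sketch. It assumes raster-order flattening and a tiny fixed color palette. The palette, the toy image, and the function names are hypothetical and chosen only for illustration. They are not taken from the released code.

```python
def quantize(pixel, palette):
    """Return the index of the palette color closest to an (r, g, b) pixel."""
    return min(
        range(len(palette)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(palette[i], pixel)),
    )


def image_to_tokens(image, palette):
    """Flatten a 2D grid of RGB pixels into a raster-order list of palette indices."""
    return [quantize(pixel, palette) for row in image for pixel in row]


def next_token_pairs(tokens):
    """Build (context, target) pairs: each token is predicted from the ones before it."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]


# Hypothetical 4-color palette (assumption: a real model would use a far larger one).
PALETTE = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]

# Toy 2x2 image, listed row by row.
IMAGE = [
    [(10, 5, 0), (250, 10, 10)],
    [(5, 240, 20), (245, 250, 255)],
]

tokens = image_to_tokens(IMAGE, PALETTE)
pairs = next_token_pairs(tokens)
print(tokens)    # [0, 1, 2, 3]
print(pairs[2])  # ([0, 1], 2)
```

Each pair plays the same role as a (prefix, next word) example in language modeling. The first pair has an empty context, which in practice would be a start-of-sequence token.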

### Model Variants
- **iGPT-S**: 76 million parameters
- **iGPT-M**: 455 million parameters
- **iGPT-L**: 1.4 billion parameters
- **iGPT-XL**: 6.8 billion parameters (largest variant)
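
As a rough sanity check on these sizes, the common rule of thumb of about `12 * layers * width^2` parameters for a GPT-style transformer can be applied to each variant. The layer counts and embedding widths below are not stated in this article. They are assumed from the iGPT paper and should be verified against it. The estimate also ignores embeddings and biases.

```python
# Assumed (layers, embedding width) per variant; verify against the paper.
CONFIGS = {
    "iGPT-S": (24, 512),
    "iGPT-M": (36, 1024),
    "iGPT-L": (48, 1536),
    "iGPT-XL": (60, 3072),
}


def approx_params(layers, width):
    """Rule of thumb: 4*width^2 for attention plus 8*width^2 for the MLP, per layer."""
    return 12 * layers * width ** 2


estimates = {name: approx_params(*cfg) for name, cfg in CONFIGS.items()}
for name, value in estimates.items():
    print(f"{name}: ~{value / 1e6:,.0f}M parameters")
```

Under these assumptions the estimates come out within a few percent of the listed sizes, which suggests the parameter counts are dominated by the transformer blocks.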

### Training Approach
- **Dataset**: Trained on the ImageNet ILSVRC 2012 training set (roughly 1.3 million images) without labels
- **Method**: Unsupervised learning via pixel prediction
- **Compute**: Required significant resources, since every pixel is a separate sequence position and the cost of self-attention grows quadratically with sequence length
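
The compute cost can be made concrete with some simple arithmetic. The sketch below assumes one token per pixel and full self-attention. The 48x48 and 64x64 sizes are illustrative additions, and the function names are hypothetical.

```python
def sequence_length(side):
    """Number of tokens when a side x side image is flattened one token per pixel."""
    return side * side


def attention_pairs(side):
    """Entries in a full self-attention matrix, which is quadratic in sequence length."""
    n = sequence_length(side)
    return n * n


for side in (32, 48, 64):
    print(f"{side}x{side}: {sequence_length(side)} tokens, "
          f"{attention_pairs(side):,} attention entries per head per layer")
```

Doubling the side length from 32 to 64 quadruples the sequence length and multiplies the attention matrix size by 16. Sampling is also sequential, so generating an image takes one model evaluation per pixel.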

### Applications and Legacy
- **Performance**: The paper reports 96.3% accuracy on CIFAR-10 using a linear probe on features learned without labels
- **Limitations**: High computational cost and low working resolution (images downsampled to between 32x32 and 64x64 pixels)
- **Influence**: Helped motivate later transformer-based vision models, such as the Vision Transformer (ViT), and later work on large generative image models
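
Benchmark accuracies for self-supervised models are commonly measured with a linear probe: the pre-trained network is frozen and only a linear classifier is trained on its features. The sketch below shows the idea with a binary logistic regression on made-up two-dimensional features. The features, labels, and hyperparameters are hypothetical, and this is not the evaluation code used for Image GPT.

```python
import math


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit binary logistic regression on frozen features with batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Classify a feature vector by the sign of the linear score."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Hypothetical frozen features standing in for activations of a pre-trained model.
FEATURES = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.8, 1.1), (0.9, 0.7)]
LABELS = [0, 0, 0, 1, 1, 1]

weights, bias = train_linear_probe(FEATURES, LABELS)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(FEATURES, LABELS)) / len(LABELS)
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

Because only the linear layer is trained, probe accuracy reflects how linearly separable the frozen features already are, which is why it is used as a measure of representation quality.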

## References

1. Generative Pretraining from Pixels