# Document Understanding Transformer

> OCR-free end-to-end Transformer model

**Wikidata**: [Q128801599](https://www.wikidata.org/wiki/Q128801599)  
**Source**: https://4ort.xyz/entity/document-understanding-transformer

## Summary
The Document Understanding Transformer (Donut) is an OCR-free, end-to-end Transformer model for document understanding tasks. Unlike traditional pipelines that rely on an external OCR engine, it maps document images directly to text or structured output, with no separate text recognition module. Its source code is released under the MIT License.

## Key Facts
- **Aliases:** Donut
- **Architecture Class:** Transformer (machine-learning model architecture)
- **Primary Function:** Document understanding from images (catalogued under optical character recognition, OCR)
- **Methodology:** OCR-free end-to-end modeling
- **Source Code:** Available at [github.com/clovaai/donut](https://github.com/clovaai/donut)
- **License:** MIT License
- **Latest Stable Version:** 1.0.9 (Released November 14, 2022)
- **Documentation:** Available via [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/model_doc/donut)
- **Academic Paper:** Titled "OCR-free Document Understanding Transformer" ([arXiv:2111.15664](https://arxiv.org/abs/2111.15664))

## FAQs
### Q: What makes the Document Understanding Transformer different from traditional OCR tools?
A: The Document Understanding Transformer is "OCR-free": it does not require a separate optical character recognition engine to extract text before analysis. Instead, a single end-to-end Transformer model reads and interprets the document image directly.

### Q: Is the Document Understanding Transformer free to use?
A: Yes. The source code is released under the MIT License, which permits use, modification, and redistribution provided the copyright and license notice are retained.

### Q: Where can the technical paper for Document Understanding Transformer be found?
A: The model is described in the paper "OCR-free Document Understanding Transformer," which is accessible via arXiv (ID: 2111.15664).

## Why It Matters
The Document Understanding Transformer marks a methodological shift in document analysis. Traditional document understanding systems typically operate in stages: an OCR engine first extracts text, and a subsequent model then interprets the layout or meaning. This multi-stage approach suffers from error propagation: mistakes made in the OCR step carry over into the understanding step. With its OCR-free, end-to-end architecture, the Document Understanding Transformer simplifies the pipeline and can reduce the cascading errors associated with separate recognition modules.
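
The error-propagation argument can be made concrete with a little arithmetic. The following is a minimal sketch under a simplifying assumption that is not from the source: the downstream model can only succeed when the OCR stage was correct, so the stage accuracies multiply. The function name `pipeline_accuracy` and the numbers are illustrative, not measured results.

```python
def pipeline_accuracy(ocr_accuracy: float, downstream_accuracy: float) -> float:
    """Accuracy of a two-stage OCR-then-understanding pipeline.

    Assumption (illustrative only): the downstream stage is correct only if
    the OCR stage was correct, and `downstream_accuracy` is measured on
    correctly recognized text. Under that assumption the accuracies multiply.
    """
    for p in (ocr_accuracy, downstream_accuracy):
        if not 0.0 <= p <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    return ocr_accuracy * downstream_accuracy


if __name__ == "__main__":
    # Hypothetical stage accuracies, chosen only to show the compounding effect.
    print(f"{pipeline_accuracy(0.95, 0.95):.4f}")
```

With both stages at 95%, the pipeline as a whole lands near 90%, which is the kind of compounding loss an end-to-end model aims to avoid.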

Donut also serves as a foundation for later specialized models such as Nougat, which applies similar principles to scientific papers. As a Transformer model, it uses the attention mechanisms popularized by the Transformer architecture, which researchers at Google introduced in 2017, to relate visual and textual content within a document. A run of releases in 2022 (versions 1.0.3 through 1.0.9) and its integration into the Hugging Face Transformers library indicate its adoption by the machine learning community.

## Notable For
- **OCR-Free Architecture:** Eliminates the need for separate, modular optical character recognition engines.
- **End-to-End Modeling:** Performs document understanding in a unified process rather than a sequence of distinct tasks.
- **Open Source Accessibility:** Released under the permissive MIT License with a public GitHub repository.
- **Integration:** Officially documented within the Hugging Face Transformers library.
- **Relation to Nougat:** Serves as the architectural basis for Nougat, a related model that targets OCR of scientific papers.

## Body

### Architecture and Methodology
The Document Understanding Transformer, often referred to by its alias **Donut**, is a machine learning model built on the **Transformer** architecture. The Transformer, introduced in 2017 by researchers at Google (including the Google Brain team), relies on self-attention and has become standard in modern deep learning.

Unlike standard Optical Character Recognition (OCR) systems, which identify characters visually and convert them to text, Donut is an **OCR-free end-to-end model**. The model takes a document image as input and generates text or structured data as output, without an intermediate step dedicated solely to character recognition. As described in the paper, a Transformer-based visual encoder reads the image and a Transformer-based text decoder generates the output sequence.
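
To illustrate what "structured data" output can look like, the sketch below converts a tagged output sequence into a nested dictionary. It assumes an `<s_key>value</s_key>` tag format like the one used by publicly released Donut checkpoints. The function name `tagged_sequence_to_dict` and the sample string are hypothetical, and this is a simplified stand-in rather than the library's own parser: it does not handle repeated items or a field nested inside a field of the same name.

```python
import re

# An opening tag, its content (non-greedy), and the closing tag with the same key.
_FIELD = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tagged_sequence_to_dict(sequence: str) -> dict:
    """Convert '<s_key>value</s_key>' markup into a nested dict (simplified)."""
    result = {}
    for match in _FIELD.finditer(sequence):
        key = match.group("key")
        body = match.group("body").strip()
        # A body that contains further tags is parsed recursively;
        # otherwise it is kept as a plain string value.
        result[key] = tagged_sequence_to_dict(body) if "<s_" in body else body
    return result


if __name__ == "__main__":
    # Hypothetical decoder output for a receipt image.
    sample = "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu><s_total>4.50</s_total>"
    print(tagged_sequence_to_dict(sample))
```

The point of the sketch is that the field structure comes straight from the generated sequence, so no separate OCR result has to be aligned with a layout model afterwards.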

### Development and Version History
The source code for the model is maintained by Clova AI in a public GitHub repository. The recorded stable releases span July to November 2022:

*   **1.0.3** (July 20, 2022)
*   **1.0.4** (July 29, 2022)
*   **1.0.5** (August 4, 2022)
*   **1.0.7** (August 23, 2022)
*   **1.0.8** (October 5, 2022)
*   **1.0.9** (November 14, 2022), the latest stable version.

### Resources and Related Entities
The theoretical foundation of the model is detailed in the publication "OCR-free Document Understanding Transformer." Technical implementation details are available through the Hugging Face documentation. The model is related to **Nougat**, a Transformer-based model designed for OCR of scientific papers.

- **Repository:** [https://github.com/clovaai/donut](https://github.com/clovaai/donut)
- **Paper:** [https://arxiv.org/abs/2111.15664](https://arxiv.org/abs/2111.15664)
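
For readers who want to try the model through the Hugging Face integration mentioned above, the following is a minimal inference sketch. It follows the usage pattern in the Hugging Face Transformers documentation for Donut, but several parts are assumptions rather than facts from this page: the checkpoint name `naver-clova-ix/donut-base-finetuned-cord-v2`, the task prompt `<s_cord-v2>`, the wrapper name `run_donut`, and the availability of `transformers`, `torch`, `Pillow`, and `sentencepiece` in the environment.

```python
import re


def run_donut(image_path: str,
              checkpoint: str = "naver-clova-ix/donut-base-finetuned-cord-v2",
              task_prompt: str = "<s_cord-v2>") -> dict:
    """Run a Donut checkpoint on one image and return the parsed output."""
    # Heavy dependencies are imported lazily so defining the function stays cheap.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    model.eval()

    # The image is the only document input; no OCR text is supplied.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # The task prompt tells the decoder which kind of output to generate.
    decoder_input_ids = processor.tokenizer(
        task_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    # Strip special tokens and the leading task prompt, then parse to a dict.
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
    return processor.token2json(sequence)
```

A call such as `run_donut("receipt.png")`, where `receipt.png` is a hypothetical local file, downloads the checkpoint on first use and returns a dictionary of extracted fields. Other fine-tuned checkpoints expect different task prompts, so both defaults should be treated as placeholders.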

## References

1. [clovaai/donut repository metadata (GitHub API)](https://api.github.com/repos/clovaai/donut)
2. [Release 1.0.3. 2022](https://github.com/clovaai/donut/releases/tag/1.0.3)
3. [Release 1.0.4. 2022](https://github.com/clovaai/donut/releases/tag/1.0.4)
4. [Release 1.0.5. 2022](https://github.com/clovaai/donut/releases/tag/1.0.5)
5. [Release 1.0.7. 2022](https://github.com/clovaai/donut/releases/tag/1.0.7)
6. [Release 1.0.8. 2022](https://github.com/clovaai/donut/releases/tag/1.0.8)
7. [Release 1.0.9. 2022](https://github.com/clovaai/donut/releases/tag/1.0.9)