# RoBERTa

> deep learning neural network for natural language processing

**Wikidata**: [Q85124095](https://www.wikidata.org/wiki/Q85124095)  
**Source**: https://4ort.xyz/entity/roberta

## Summary
RoBERTa is a robustly optimized variant of the BERT language model that improves self-supervised pre-training for natural-language processing tasks. Released by Facebook AI in 2019, it belongs to the “bidirectional encoder representations from transformers” (BERT) family and is implemented in the open-source fairseq toolkit.

## Key Facts
- Subclass of: bidirectional encoder representations from transformers (BERT family)
- Instance of: language model (confirmed 30 May 2021, source: Facebook AI blog)
- Described at: https://pytorch.org/hub/pytorch_fairseq_roberta/ (English documentation)
- Introduced in: 2019 by Facebook AI Research
- Parent class (BERT) inception: 2018 (24 sitelinks on Wikidata)

## FAQs
### Q: How is RoBERTa different from BERT?
A: RoBERTa keeps BERT’s architecture but trains longer on more data, uses larger mini-batches, removes the next-sentence-prediction objective, and dynamically changes the masking pattern, yielding higher downstream-task scores.

### Q: Where can I download a ready-to-use RoBERTa model?
A: The official PyTorch Hub page hosts fairseq checkpoints; links and usage snippets are at https://pytorch.org/hub/pytorch_fairseq_roberta/.

### Q: Is RoBERTa open source?
A: Yes. Facebook AI released both the pre-trained weights and the fairseq training code; fairseq is distributed under the MIT license.

## Why It Matters
RoBERTa showed that BERT, as originally trained, was significantly undertrained. By systematically tuning hyperparameters and training-data size without altering the core architecture, Facebook AI achieved state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks at the time of publication. The result encouraged practitioners to treat careful hyperparameter tuning and longer training as part of the baseline rather than as optional refinements. Because the fairseq checkpoints are freely available and the architecture matches BERT’s, RoBERTa could replace BERT in many pipelines with little change beyond swapping the tokenizer and weights. Its release also underscored the value of reproducible, large-scale pre-training and helped motivate research into larger transformer models.

## Notable For
- Reached a score of 88.5 on the public GLUE leaderboard at publication, matching the best previously reported result
- Demonstrated that removing the next-sentence-prediction task matches or slightly improves downstream accuracy
- Released alongside documented fairseq training scripts to support replication
- Served as the backbone for many winning entries in 2019–2020 NLP competitions

## Body
### Background and Family
RoBERTa is classified as a subclass of “bidirectional encoder representations from transformers” (BERT), the transformer-based language-model family that Google introduced in 2018. It therefore inherits BERT’s encoder-only, masked-language-modeling approach.

### Training Improvements
Facebook AI kept the original 12- and 24-layer architectures (BERT-base and BERT-large) but changed the training recipe:
- Trained on 160 GB of text (versus 16 GB for BERT)
- Increased the batch size to 8K sequences (versus 256 for BERT)
- Removed the next-sentence-prediction auxiliary task
- Applied dynamic masking, generating a new masking pattern each time a sequence is fed to the model instead of fixing it once during preprocessing
- Used byte-level BPE tokenization with a 50K vocabulary
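
The difference between static and dynamic masking can be illustrated with a minimal sketch. It is not the fairseq implementation: the toy vocabulary, the function names, and the whitespace tokenization are hypothetical, and the 15% selection rate with 80/10/10 replacement is assumed from BERT's masked-language-modeling scheme.

```python
import random

MASK = "<mask>"
# Toy vocabulary used for random replacement (hypothetical, for illustration only).
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    Assumes BERT-style corruption: of the selected positions, roughly 80%
    become <mask>, 10% become a random token, and 10% are left unchanged.
    """
    n_select = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_select)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not predicted
    for pos in positions:
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK
        elif roll < 0.9:
            masked[pos] = rng.choice(VOCAB)
        # otherwise keep the original token in place
    return masked, labels


def static_masking(tokens, epochs, seed=0):
    """Mask once during preprocessing and reuse the same pattern every epoch."""
    fixed = mask_tokens(tokens, random.Random(seed))
    return [fixed for _ in range(epochs)]


def dynamic_masking(tokens, epochs, seed=0):
    """Draw a fresh masking pattern every time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]


if __name__ == "__main__":
    sentence = "the cat sat on the mat and the dog ran off".split()
    for epoch, (masked, _) in enumerate(dynamic_masking(sentence, epochs=3)):
        print(epoch, " ".join(masked))
```

With static masking the model sees an identical corrupted copy of the sentence in every epoch; with dynamic masking the selected positions are redrawn on each pass, so longer training keeps producing new prediction targets from the same text.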

### Availability
Pre-trained base and large checkpoints are hosted on PyTorch Hub under the fairseq namespace. Documentation, example code, and evaluation scripts are provided at https://pytorch.org/hub/pytorch_fairseq_roberta/. The accompanying research paper is titled “RoBERTa: A Robustly Optimized BERT Pretraining Approach.”
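
Loading a checkpoint from PyTorch Hub might look like the sketch below. It follows the usage pattern shown on the hub page (`torch.hub.load`, `encode`, `extract_features`), but the helper names `load_roberta` and `sentence_features` are hypothetical, the checkpoint names are assumed to match those listed on that page, and the call downloads the weights on first use and may require additional fairseq dependencies.

```python
# Checkpoint names assumed from the PyTorch Hub page; verify them there before use.
CHECKPOINTS = ("roberta.base", "roberta.large")


def load_roberta(name="roberta.base"):
    """Download (on first use) and return a pre-trained RoBERTa model in eval mode."""
    if name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint {name!r}; expected one of {CHECKPOINTS}")
    import torch  # imported lazily so the name check works without PyTorch installed

    model = torch.hub.load("pytorch/fairseq", name)
    model.eval()  # disable dropout for inference
    return model


def sentence_features(model, text):
    """Encode text with RoBERTa's byte-level BPE and return last-layer features."""
    tokens = model.encode(text)
    return model.extract_features(tokens)


# Example usage (requires PyTorch, network access, and fairseq's dependencies):
#   roberta = load_roberta("roberta.base")
#   features = sentence_features(roberta, "Hello world!")
```

The returned features have one vector per BPE token, which can be pooled or passed to a task-specific head.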

## References

1. [Facebook AI blog: RoBERTa: An optimized method for pretraining self-supervised NLP systems](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/)