# document structuring

> computational linguistics

**Wikidata**: [Q5287648](https://www.wikidata.org/wiki/Q5287648)  
**Wikipedia**: [English](https://en.wikipedia.org/wiki/Document_structuring)  
**Source**: https://4ort.xyz/entity/document-structuring

## Summary
Document structuring is a sub-field of computational linguistics that focuses on the automatic organization and arrangement of textual content. It provides the algorithms and models needed to turn raw information into coherent, logically ordered documents.

## Key Facts
- Document structuring is classified as an instance of computational linguistics.
- It is also known as “text structuring” in English and “structuration de texte” in French.
- The topic has one Wikidata sitelink, which points to its only Wikipedia article (English).
- Freebase ID /m/0fq2t_h and discontinued Microsoft Academic ID 2780113422 map to this concept.
- Computational linguistics, the parent discipline, holds 68 Wikipedia sitelinks, underscoring its broader scope.

## FAQs
### Q: What does document structuring actually do?
A: It automatically decides the order and grouping of content so that a text reads coherently to a human. In natural-language generation pipelines, this step comes after content selection and before microplanning and surface realization.

### Q: Is document structuring the same as text summarization?
A: No. Summarization decides what to keep; document structuring decides how to arrange what has been chosen, often producing an outline or paragraph order rather than a condensed text.

### Q: Why is it considered part of computational linguistics?
A: Because it relies on linguistic theories of discourse relations, coherence, and rhetoric, and it requires algorithmic methods to operationalize those theories on real texts.

## Why It Matters
Without document structuring, data-to-text systems would deliver accurate but jumbled paragraphs. By providing a principled way to sequence information, the field underpins applications ranging from automated journalism and medical report generation to chatbot responses and compliance documentation. It bridges the gap between selecting relevant facts and generating readable language, ensuring that the final text follows human expectations of flow, emphasis, and coherence. As organizations amass ever-larger datasets, automatic structuring becomes critical to scalable content creation, reducing editorial labor while maintaining clarity and consistency.

## Notable For
- An explicitly named task within the document-planning stage of the canonical natural-language generation architecture (Document Planner → Micro-Planner → Surface Realizer).
- Treats ordering as a computational optimization problem, often using metrics such as coherence scores or rhetorical similarity.
- Provides cross-lingual benefits because rhetorical structures show stability across many languages, letting the same algorithms serve multilingual generation.
- Acts as the “architect” of text: unlike lexical choice or grammar, it determines the high-level blueprint readers use for navigation.

## Body
### Position in NLG Pipelines
Document structuring is conventionally positioned after content selection and before linguistic realization. It receives a set of messages or facts and outputs an ordered tree or sequence that respects discourse constraints such as given-new, theme-rheme, and rhetorical relations.
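The input and output described above can be pictured with a small data structure. The following is a minimal sketch, not a standard representation: the names `Message`, `PlanNode`, `linearize`, and `example_plan`, as well as the relation labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Message:
    """One unit of selected content (hypothetical representation)."""
    topic: str
    text: str


@dataclass
class PlanNode:
    """Internal node of a document plan: a relation over ordered children."""
    relation: str  # illustrative label such as "sequence" or "elaboration"
    children: List[Union["PlanNode", Message]] = field(default_factory=list)


def linearize(node: Union[PlanNode, Message]) -> List[Message]:
    """Read the message order off the tree, depth-first and left to right."""
    if isinstance(node, Message):
        return [node]
    ordered: List[Message] = []
    for child in node.children:
        ordered.extend(linearize(child))
    return ordered


def example_plan() -> PlanNode:
    """Build a tiny plan that groups two temperature facts before a rain fact."""
    high = Message("temperature", "The high was 21 degrees.")
    low = Message("temperature", "The low was 9 degrees.")
    rain = Message("rain", "Light rain fell in the evening.")
    temperature_group = PlanNode("elaboration", [high, low])
    return PlanNode("sequence", [temperature_group, rain])
```

Calling `linearize(example_plan())` yields the three messages in reading order, while the tree itself keeps the grouping information that later stages could use for paragraphing.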

### Core Techniques
Early systems relied on handcrafted schemas, fixed templates that prescribe the order in which types of content appear. Later work frames the task as search or optimization over candidate orderings, using techniques such as:
- Genetic algorithms
- Integer linear programming
- Reinforcement learning with coherence rewards
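To make the "search over candidate orderings" framing concrete, here is a minimal sketch that uses plain exhaustive search rather than any of the techniques listed above. It assumes each message comes with a pre-extracted set of entities, and it uses a toy coherence score (entities shared by adjacent messages) that is an illustrative assumption, not a published metric. The names `coherence` and `best_ordering` are hypothetical.

```python
from itertools import permutations
from typing import FrozenSet, Sequence, Tuple

# A message is modeled as (text, entities it mentions); this shape is assumed.
Message = Tuple[str, FrozenSet[str]]


def coherence(order: Sequence[Message]) -> int:
    """Toy score: count entities shared by each pair of adjacent messages."""
    return sum(len(a[1] & b[1]) for a, b in zip(order, order[1:]))


def best_ordering(messages: Sequence[Message]) -> Tuple[Message, ...]:
    """Score every permutation and return one with the highest toy score."""
    return max(permutations(messages), key=coherence)


if __name__ == "__main__":
    facts = [
        ("Damage was reported after the flood.", frozenset({"flood", "damage"})),
        ("A storm hit the coast.", frozenset({"storm", "coast"})),
        ("The coast saw a flood.", frozenset({"coast", "flood"})),
    ]
    for text, _ in best_ordering(facts):
        print(text)
```

Exhaustive search examines n! orderings, so it is only feasible for a handful of messages. That cost is the reason approximate or constrained methods such as the ones listed above are used in practice; they would replace `best_ordering` while keeping some scoring function in the role of `coherence`.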

### Evaluation
Because human judgments of coherence are subjective and costly to collect, automatic metrics are often used as proxies. These include:
- Entity-grid coherence scores
- Rhetorical Structure Theory (RST) parse similarity
- Task-specific downstream utility (e.g., reading-comprehension accuracy)
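As an illustration of the first metric, the sketch below builds a heavily simplified entity grid. It assumes entities have already been extracted per sentence and collapses grammatical roles into mere presence (`X`) or absence (`-`); full entity-grid models distinguish syntactic roles and feed the resulting transition features to a trained model. The function names `entity_grid` and `transition_probs` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def entity_grid(sentences: List[Set[str]]) -> Dict[str, List[str]]:
    """Map each entity to a column: 'X' if a sentence mentions it, else '-'."""
    entities = sorted(set().union(*sentences))
    return {e: ["X" if e in s else "-" for s in sentences] for e in entities}


def transition_probs(grid: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Relative frequency of each length-2 transition over all entity columns."""
    counts: Counter = Counter()
    for column in grid.values():
        # Pair each cell with the cell for the next sentence.
        counts.update(zip(column, column[1:]))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


if __name__ == "__main__":
    doc = [{"storm", "coast"}, {"coast", "flood"}, {"flood", "damage"}]
    grid = entity_grid(doc)
    for transition, prob in sorted(transition_probs(grid).items()):
        print(transition, round(prob, 2))
```

In this simplified view, a text whose entities persist across adjacent sentences has a larger share of `('X', 'X')` transitions. Comparing these distributions across candidate orderings is one rough way to rank them, under the assumptions stated above.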

### Relation to Discourse Theory
The field borrows heavily from:
- RST
- Penn Discourse Treebank relations
- Segmented Discourse Representation Theory (SDRT)

These supply the relation inventories that guide ordering decisions.

### Multilingual Aspect
Although most published work targets English, document structuring algorithms often transfer to new languages with minimal adaptation because rhetorical relations exhibit cross-linguistic regularities.

### Data Scarcity
Large corpora annotated with acceptable orderings are rare; consequently, unsupervised or weakly supervised methods remain active research areas.