The perfect structure for a 2,000-word prompt
Jan 12, 2025
Long prompts as configuration, not prose
Most long prompts fail because they are written as essays instead of configuration. From the model’s perspective, 2,000 words of loosely structured text are just a long, ambiguous prefix. Small phrasing changes in the middle can have outsized effects on the continuation.
We treat long prompts as hierarchical configuration files. Every section has a narrow purpose, clear delimiters, and a predictable ordering. This makes the prompt easier to reason about, easier to diff, and easier to optimize over time.
The high-level outline
A robust 2,000-word prompt typically decomposes into:
- System role and safety envelope. What the model is for and explicit boundaries around what it must not do.
- Task declaration. A concise description of the request in domain-specific language.
- Operational context. The environment, data shapes, and any persistent constraints (latency, cost, tooling).
- Definitions and ontology. Local meanings of key terms, metrics, and entities so the model aligns with your vocabulary.
- Examples and counterexamples. A small number of carefully chosen input/output pairs that bracket the desired behavior.
- Output contract. The exact schema, formatting, and failure-handling expectations.
Example skeleton
Below is an abbreviated skeleton; in practice each section can be expanded with highly specific domain detail:
[ROLE]
You are a model assisting with...
[SAFETY ENVELOPE]
You must not...
[TASK]
You will receive...
[CONTEXT]
The organization operates...
[DEFINITIONS]
"Critical error" means...
[EXAMPLES]
Example 1: ...
Counterexample A: ...
[OUTPUT CONTRACT]
Return a JSON object with fields...
The use of explicit section headers, brackets, and schemas keeps the prompt inspectable even as it grows beyond 2,000 words.
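One way to keep that ordering predictable is to assemble the prompt from named sections in code rather than editing one long string. The following is a minimal sketch, not a prescribed implementation: the section names mirror the skeleton above, while `SECTION_ORDER` and `build_prompt` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: renders a prompt from named sections in a fixed order.
# The section names follow the skeleton; everything else is an assumption.
SECTION_ORDER = [
    "ROLE",
    "SAFETY ENVELOPE",
    "TASK",
    "CONTEXT",
    "DEFINITIONS",
    "EXAMPLES",
    "OUTPUT CONTRACT",
]


def build_prompt(sections: dict[str, str]) -> str:
    """Return the prompt text with one [HEADER] line per section."""
    missing = [name for name in SECTION_ORDER if name not in sections]
    unknown = [name for name in sections if name not in SECTION_ORDER]
    if missing or unknown:
        # Fail loudly so a dropped or misspelled section never ships silently.
        raise ValueError(f"missing: {missing}; unknown: {unknown}")
    parts = []
    for name in SECTION_ORDER:
        parts.append(f"[{name}]\n{sections[name].strip()}")
    return "\n".join(parts)
```

Because the order lives in one list, reordering or adding a section is a one-line change, and a missing section raises an error instead of producing a silently shorter prompt.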
Measuring structural quality
We score long prompts along three dimensions:
- Determinism. Small, benign edits, such as rewording a sentence or reordering two examples, do not flip the overall behavior.
- Diffability. Changes show up as obvious additions/removals rather than scattered wording tweaks.
- Transferability. Operators new to the system can understand the prompt’s intent in minutes, not hours.
Prompts that score well on these axes are easier to A/B test, easier to localize, and easier to hand off between teams.
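Diffability in particular can be checked mechanically. The sketch below assumes the prompt uses bracketed, upper-case headers like the skeleton; `split_sections` and `changed_sections` are hypothetical helpers, and the header pattern is an assumption that would misread any content line consisting only of a bracketed upper-case phrase.

```python
import re

# Assumed header format: a line that is exactly "[UPPER CASE NAME]".
HEADER = re.compile(r"^\[([A-Z][A-Z ]*)\]$")


def split_sections(prompt: str) -> dict[str, str]:
    """Map each section header to the text that follows it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in prompt.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def changed_sections(old: str, new: str) -> list[str]:
    """List sections that were added, removed, or edited between versions."""
    a, b = split_sections(old), split_sections(new)
    names = list(a) + [name for name in b if name not in a]
    return [name for name in names if a.get(name) != b.get(name)]
```

A revision that touches one or two sections is easy to review and to attribute in an A/B test; one that touches nearly every section is a sign the change was a scattered wording tweak.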
Operator checklist
- Re-run the same task 5–10 times before drawing conclusions.
- Change one variable at a time (prompt, model, tool, or retrieval).
- Record failures explicitly, along with the raw output that produced them; they are the fastest route to signal.
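The first and third items can be combined into a small harness. This is a sketch under stated assumptions: `run_model` stands in for whatever client call you use (it is not a real library function), `is_valid_json` is only an example validator, and agreement is measured crudely as the share of runs that returned the most common output verbatim.

```python
import json
from collections import Counter
from typing import Callable


def is_valid_json(output: str) -> bool:
    """Example validator: accept any output that parses as JSON."""
    try:
        json.loads(output)
    except ValueError:
        return False
    return True


def repeat_task(
    run_model: Callable[[str], str],  # hypothetical: prompt in, text out
    prompt: str,
    validate: Callable[[str], bool] = is_valid_json,
    runs: int = 10,
) -> dict:
    """Run the same prompt several times and record every failure."""
    outputs = []
    failures = []
    for i in range(runs):
        try:
            out = run_model(prompt)
        except Exception as exc:  # record the error instead of aborting the batch
            failures.append({"run": i, "error": repr(exc), "output": None})
            continue
        outputs.append(out)
        if not validate(out):
            failures.append({"run": i, "error": "validation failed", "output": out})
    most_common = Counter(outputs).most_common(1)
    agreement = most_common[0][1] / runs if most_common else 0.0
    return {"runs": runs, "agreement": agreement, "failures": failures}
```

Keeping the prompt, model, tools, and retrieval fixed across a batch, then changing exactly one of them for the next batch, makes the two reports directly comparable.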