Markup spec

From MediaWiki.org

EBNF grammar project

Markup spec

ANTLR
BNF

MediaWiki markup spec project:

1 Goals
2 Compatibility
3 Difficulties
4 Resources
5 The Markup Language

Goals

Produce a specification of MediaWiki's markup format that is sufficiently complete and consistent that multiple compatible parser implementations can be built from it.
- Spec may or may not use EBNF etc. Might have to just use lots of words. ;) ANTLR is looking very promising.
Define a data model for a parse tree
- The data model should be representable in XML, though an official XML schema for such a representation may or may not be defined.
- Round-trip conversion between source code and the data model must be possible. There may be a many-to-one relationship between source code and parse trees, but the canonical transformation from parse tree to source code should always parse back to the same parse tree.
A parser built from this spec will replace MediaWiki's current parser in the future.
- This will potentially allow lots of nifty new features that are currently either Not Possible or Very Hard, e.g. WYSIWYG editing.
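The round-trip requirement above can be illustrated with a toy example. This is only a sketch under stated assumptions: the `Node` type, the `parse` and `serialize` functions, and the two-token grammar (bold and italic only) are hypothetical stand-ins for the real data model, which this page does not yet define.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str   # "text", "italic" or "bold" (hypothetical node kinds)
    value: str


# Toy grammar: '''bold''' and ''italic'' only, no nesting.
TOKEN = re.compile(r"'''(.+?)'''|''(.+?)''", re.S)


def parse(source):
    """Turn source text into a flat list of nodes."""
    nodes, pos = [], 0
    for m in TOKEN.finditer(source):
        if m.start() > pos:
            nodes.append(Node("text", source[pos:m.start()]))
        if m.group(1) is not None:
            nodes.append(Node("bold", m.group(1)))
        else:
            nodes.append(Node("italic", m.group(2)))
        pos = m.end()
    if pos < len(source):
        nodes.append(Node("text", source[pos:]))
    return nodes


def serialize(nodes):
    """Canonical transformation from the node list back to source text."""
    wrap = {"text": "", "italic": "''", "bold": "'''"}
    return "".join(wrap[n.kind] + n.value + wrap[n.kind] for n in nodes)


def round_trips(source):
    """Check that the canonical source parses back to the same tree."""
    tree = parse(source)
    return parse(serialize(tree)) == tree
```

A property check of this shape could be run against every page in a test corpus once a real parser and serializer exist.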

Compatibility

In general, the spec will avoid deviating from present behavior where that behavior is reasonable and well-defined, and will avoid adding new behavior without first considering whether it may break existing pages.
Where the current parser's behavior is undefined or obviously buggy, the spec may define new, different behavior.

Difficulties

Some of the syntax is kind of hairy. Bleah!
Language-sensitive and otherwise customizable keywords.
Extensions...
Integrated HTML and HTML-like tags.
Lots of scary context-sensitivity (see e.g. m:MediaWiki lexer).

Resources

Raid Magnus's wiki2xml work for some starting points; examine how his parser works (and how it differs from the main one) and the intermediate XML format he uses
http://www.mediawiki.org/wiki/User:HappyDog/WikiText_parsing - some observations based on 1.3.10 by HappyDog
meta:Help:Editing - it's a start
An attempt to describe the markup in BNF form: Markup spec/BNF
http://jamwiki.org/wiki/en/StartingPoints - aims to have MediaWiki-compatible syntax; see http://svn.sourceforge.net/viewvc/jamwiki/wiki/trunk/src/lex/ for an attempt to write a parser.

MetaWiki: MediaWiki lexer

The Markup Language

The MediaWiki markup language (commonly called wikitext within the MediaWiki community, though the term is ambiguous in the larger wiki community) uses non-alphanumeric ASCII characters, sometimes in pairs, to tell the parser how the editor wishes an item or section of text to be displayed. The parser translates these tokens into (X)HTML, preserving the semantics as closely as possible.

v1.6 markup tokens

The markup tokens fall into two broad categories: unary tokens (like : or * used at the beginning of a line), which stand alone, and binary tokens (like those for italic or boldface) which must be used in matched pairs. Unary tokens may only be preceded by comments or whitespace; otherwise, they will not be interpreted.

Unary

Can be used anywhere

"Magic words", e.g. __FORCETOC__, __NOEDITSECTION__ (see m:Help:Magic words)
Signatures:
- ~~~ Replaced with your username
- ~~~~ Replaced with your username and the date
- ~~~~~ Replaced with the date.
Notes:
- These tags are replaced at the point the edit is saved.
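The signature replacement could be sketched as follows. The function name `expand_signatures`, the `[[User:...]]` link form, and the timestamp format are assumptions made for illustration; the page only says that the tildes are replaced with the username and/or date when the edit is saved.

```python
import re
from datetime import datetime, timezone


def expand_signatures(text, username, now=None):
    """Replace ~~~, ~~~~ and ~~~~~ at save time (illustrative sketch)."""
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp format; the real format is not given on this page.
    stamp = now.strftime("%H:%M, %d %B %Y (UTC)")
    # Assumed rendering of "your username" as a user-page link.
    user_link = f"[[User:{username}|{username}]]"
    by_length = {
        3: user_link,                # ~~~   username only
        4: f"{user_link} {stamp}",   # ~~~~  username and date
        5: stamp,                    # ~~~~~ date only
    }
    # The greedy quantifier takes the longest run first, so ~~~~~ is
    # never misread as ~~~ followed by two stray tildes.
    return re.sub(r"~{3,5}", lambda m: by_length[len(m.group(0))], text)
```

Matching the longest run first is the important design point: the three forms share a prefix, so they cannot be replaced independently in arbitrary order.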
Magic links: ISBN ..., RFC ..., PMID ... (see BNF/Magic links)

Binary

The ellipses (...) are used to indicate where the content goes and are not part of the markup.

Beginning of a line

Equals signs are used for headings (must be at start of line)
- 1st level heading: = ... =
- 2nd level heading: == ... ==
- 3rd level heading: === ... ===
- 4th level heading: ==== ... ====
- 5th level heading: ===== ... =====
- 6th level heading: ====== ... ======
- Specified in /BNF/Article#Heading
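A heading line could be recognized with a single regular expression, as in the sketch below. The name `parse_heading` is hypothetical, and the treatment of unbalanced equals signs (surplus signs become part of the title) is a choice made for this sketch, not a statement about the current parser.

```python
import re

# One to six equals signs, the title, then the same run of equals signs.
HEADING = re.compile(r"^(={1,6})\s*(.+?)\s*\1\s*$")


def parse_heading(line):
    """Return (level, title) for a heading line, or None otherwise."""
    m = HEADING.match(line)
    if m is None:
        return None
    return len(m.group(1)), m.group(2)
```

For example, `parse_heading("== Goals ==")` yields level 2 with the title `Goals`, while a line without matching equals signs at both ends yields `None`.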

Anywhere

Square brackets are used for links:
- Internal/interwiki link + language links + category links + images: [[ ... ]] (see also Namespaces below)
  vertical bars separate optional parameters, which are:
  - link: first parameter: display text (also defaulted using "pipe trick") (also trailing concatenated text included in display, e.g. s for plural)
  - image: many parameters; see w:Wikipedia:Extended image syntax
  - category: first parameter: sort order in category list
  link contents have to be parsed for whether they're dates if $wgUseDynamicDates is on
- External link: [ ... ]
  a space separates the URL from the optional display text
- undecorated URLs are also recognized and hotlinked
- Specified in /BNF/Links
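The bracket syntax described above could be split into its parts roughly like this. It is a minimal sketch: the function names are hypothetical, nested links (for example inside image captions) are not handled, and the computed display text is only meaningful for plain links, not for images or categories.

```python
import re

# [[target|param|param...]]trail, with no nested brackets.
INTERNAL = re.compile(r"\[\[([^\[\]|]+)((?:\|[^\[\]|]*)*)\]\]([a-z]*)")
# [url optional display text]
EXTERNAL = re.compile(r"\[([^\s\[\]]+)(?: +([^\]]*))?\]")


def parse_internal_link(text):
    """Split an internal link into target, parameters and display text."""
    m = INTERNAL.match(text)
    if m is None:
        return None
    target = m.group(1).strip()
    params = m.group(2).split("|")[1:]   # drop the empty piece before the first |
    trail = m.group(3)                   # trailing letters, e.g. "s" for a plural
    display = (params[0] if params else target) + trail
    return {"target": target, "params": params, "display": display}


def parse_external_link(text):
    """Split an external link into URL and optional display text."""
    m = EXTERNAL.match(text)
    if m is None:
        return None
    return {"url": m.group(1), "display": m.group(2)}
```

Here `parse_internal_link("[[apple]]s")` gives the display text `apples`, which shows how the trailing concatenated text is folded into the link.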
Apostrophes are used for formatting:
- Italic: '' ... ''
- Bold: ''' ... '''
- Bold + Italic: ''''' ... '''''
- Note that improper nesting of bold and italics is currently permitted.
Curly braces are used for transclusion:
- Include template: {{ ... }} (see also Namespaces below)
  - Unlimited number of optional pipe-delimited parameters, each of which may optionally start with a parameter name preceding an equals sign
- Include template parameter: {{{ ... }}}
  - Optionally including a pipe followed by the parameter default
- Interpolate built-in variable: {{PAGENAME}} (see m:Help:Variable)
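The two brace forms could be handled along the following lines. This is a sketch that assumes no nesting (no templates or parameters inside parameter values); `parse_template_call` and `expand_parameters` are hypothetical names, and stripping whitespace around named parameters is an assumption of the sketch.

```python
import re


def parse_template_call(source):
    """Split {{name|a|key=value}} into name, positional and named params."""
    if not (source.startswith("{{") and source.endswith("}}")):
        raise ValueError("not a template call")
    name, *raw_params = source[2:-2].split("|")
    positional, named = [], {}
    for raw in raw_params:
        key, sep, value = raw.partition("=")
        if sep:   # "key=value" form
            named[key.strip()] = value.strip()
        else:     # plain positional parameter
            positional.append(raw)
    return name.strip(), positional, named


# {{{name}}} or {{{name|default}}}, with no nested braces.
PARAM = re.compile(r"\{\{\{([^{}|]+)(?:\|([^{}]*))?\}\}\}")


def expand_parameters(body, args):
    """Substitute template parameters, falling back to their defaults."""
    def repl(m):
        name, default = m.group(1).strip(), m.group(2)
        if name in args:
            return args[name]
        # Without a default, the sketch leaves the parameter untouched.
        return default if default is not None else m.group(0)
    return PARAM.sub(repl, body)
```

A real implementation would need a proper brace-matching tokenizer, since `{{` and `{{{` can be nested and are ambiguous when adjacent.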
Various HTML style tags:
- <nowiki> do not interpret wiki markup, do allow newline in list and indent elements (but still flow text, still allow SGML entities)
- <pre> do not interpret wiki markup, do not flow text (but still allow SGML entities)
- <math> if $wgUseTeX is set
- <html> if $wgRawHtml is set
- <gallery>
- <onlyinclude> <noinclude> <includeonly>
- Parser extension tags, like <ref> (using Cite.php)
- Plus most 'non-dangerous' HTML tags: 'b', 'del', 'i', 'ins', 'u', 'font', 'big', 'small', 'sub', 'sup', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'cite', 'code', 'em', 's', 'strike', 'strong', 'tt', 'var', 'div', 'center', 'blockquote', 'ol', 'ul', 'dl', 'table', 'caption', 'pre', 'ruby', 'rt', 'rb', 'rp', 'p', 'span', 'br', 'hr', 'li', 'dt', 'dd', 'td', 'th', 'tr'
HTML-style comments
SGML entities: &...;

Namespaces

In wikilinks and template inclusions, colons set off namespaces and other modifiers:

proper namespaces: Talk:, User:, project, etc.
"special" namespaces: Image:, Category:, Template:
pseudo-namespaces: Special:, Media:
lone/leading :
- lone : forces main namespace
- leading : allows link to image page rather than inline image, or similarly to category or template page
interwiki links:
- same project, different language: code of two or more letters
- different project, same language: w: for Wikipedia, wikt: for Wiktionary, m: for Meta, etc. -- see m:Help:Interwiki_linking for more information (especially when using in templates; transwiki transclusion, iw_trans)
subst: force one-time template substitution upon edit, rather than dynamic expansion on each view
int:, msg:, msgnw:, raw: -- see m:Help:Magic words#Template modifiers
MediaWiki: magically access mediawiki formatting and boilerplate text (e.g. MediaWiki:copyrightwarning)
Colon functions: UC:, LC:, etc. (see m:Help:Colon function)
Parser functions: #expr:, #if:, #switch:, etc. (see m:ParserFunctions)
other extensions?

Several combinations of the above are possible, e.g. m:Help:Variable -- help namespace within Meta project.
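Peeling the colon-separated prefixes off a link target could look like the sketch below. The `NAMESPACES` and `INTERWIKI` sets are small illustrative samples taken from the names on this page, not the real configuration, and the sketch matches prefixes case-sensitively for simplicity.

```python
# Illustrative samples only; a real wiki defines these in its configuration.
NAMESPACES = {"Talk", "User", "Help", "Image", "Category",
              "Template", "Special", "Media", "MediaWiki"}
INTERWIKI = {"w", "wikt", "m"}


def split_target(target):
    """Return (leading_colon, prefixes, remaining title) for a link target."""
    leading_colon = target.startswith(":")
    rest = target[1:] if leading_colon else target
    prefixes = []
    while ":" in rest:
        head, tail = rest.split(":", 1)
        if head in INTERWIKI:
            prefixes.append(("interwiki", head))
        elif head in NAMESPACES:
            prefixes.append(("namespace", head))
        else:
            break   # an ordinary colon inside the page title
        rest = tail
    return leading_colon, prefixes, rest
```

With these sample sets, `split_target("m:Help:Variable")` reports an interwiki prefix `m` followed by the namespace `Help`, matching the combination described above.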

From MetaWiki

The following text was at m:Wikitext Metasyntax and needs to be merged in here.

Document element declaration

AnyText = InlineText | BlockText ;
InlineText = Line | InlineTextNormal | InlineTextExtra | EitherText ;
InlineTextNormal = Line | Bold | Italic | BoldItalic ;
InlineTextExtra = Line | InternalLink | ExternalLink | InlineHTML ;
BlockText = Text | Image | Media | Table | Heading | Separator | Gallery | List | BlockHTML | EitherText ;
EitherText = Extension | Template | NoWiki | Parameter | Comment ;

Basic Markup

Define markups

Either

Template = "{{" [ "msg:" | "msgnw:" ] PageName { "|" [ ParameterName "=" AnyText | AnyText ] } "}}" ;
Extension = "<" ? extension ? ">" AnyText "</" ? extension ? ">" ;
NoWiki = "<nowiki />" | "<nowiki>" ( InlineText | BlockText ) "</nowiki>" ;
Parameter = "{{{" ParameterName { Parameter } [ "|" { AnyText | Parameter } ] "}}}" ;
Comment = "<!--" InlineText "-->" | "<!--" BlockText "//-->" ;

ParameterName = ? uppercase, lowercase, numbers, no spaces, some special chars ? ;

Parser outline

Another way to check whether we've covered everything in the grammar is to look at the steps the parser actually goes through:

The preprocessor does

Strip (hooks before/after)
Remove HTML comments
Replace variables
1. Subst
2. MSG, MSGNW, RAW
3. Parser functions
4. Templates

The parser does

Strip (hooks before, after)
1. Treats nowiki, pre, math, and possibly other tags registered through "userfunc tag hooks" (e.g. hiero)
2. Removes HTML comments
  - HTML comments are removed. (this text by HappyDog)
  - Any tags that are not allowed by the software (e.g. <script> tags) are replaced by HTML entities, so they display as literals and are not treated as HTML by the browser.
  - Any badly formed tags (e.g. nested tags that shouldn't be nested, <tr> tags outside a <table> tag, etc.) are also replaced by HTML entities so they are not treated as HTML.
  - Any attributes that are not allowed by the software (e.g. onMouseOver) are removed from otherwise valid tags.
  - A small amount of minor source formatting is applied (basically, the removal of unnecessary whitespace).
  - A closing tag is added at the end for all tags that are not closed properly. Note that some tags (e.g. <br>) don't need to be closed.

Internal parse
1. Noinclude/onlyinclude/includeonly sections
2. Remove HTML tags
3. Replace variables
  1. Hooks: Internalparsebeforelinks
4. Tables
5. Magic words
  1. Strip TOC (__NOTOC__, __TOC__)
  2. Strip no gallery (__NOGALLERY__)
6. Do headings
7. Do dynamic dates
8. Do quotes ('' and ''')
9. Replace internal links
  1. Process images (do the caption recursively as it might contain links, or even other images...)
  2. Process categories
10. Replace external links
11. Re-replace masked internal links
12. Do magic links (ISBN, RFC...)
13. Format headings (__NEWSECTIONLINK__, __FORCETOC__...)
Unstrip general
Fix tags (french spaces, guillemet)
Blocks (lists etc)
Replace link holders
Language converter:
1. Normal text converted on a word by word basis(?) if autoconvert is enabled
2. Text in -{code1:text1;code2:text2;...}- blocks converted manually
3. Text in -{...}- not converted at all.
Unstrip nowiki
Extra tags and params
User funcs?
Unstrip general
Normalise char references
Tidy + hook

The save parser does

Convert newlines
Strips
Pass 2
1. Substs
2. Strip again? gallery something.
3. Signatures
4. Pipe tricks
5. Trim trailing whitespace
Unstrips
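As one example of the save-time steps above, the pipe trick could be sketched like this. It is a simplified illustration: `apply_pipe_trick` is a hypothetical name, and the sketch only drops a single leading prefix and a trailing parenthetical when filling in the empty display text.

```python
import re

# A link whose display text is empty: [[target|]]
PIPE_TRICK = re.compile(r"\[\[([^\[\]|]+)\|\]\]")


def apply_pipe_trick(text):
    """Fill in the display text of [[target|]] links at save time."""
    def repl(m):
        target = m.group(1)
        # Drop a leading colon and one prefix such as "Help:".
        label = target.lstrip(":").split(":", 1)[-1]
        # Drop a trailing parenthetical such as " (band)".
        label = re.sub(r"\s*\([^()]*\)$", "", label)
        return f"[[{target}|{label}]]"
    return PIPE_TRICK.sub(repl, text)
```

Because the expansion is written back into the saved source, later parses see an ordinary piped link and need no special handling.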

Retrieved from "http://www.mediawiki.org/wiki/Markup_spec"

Category: Parser
