Manual:PAGENAMEE encoding
MediaWiki page name encoding is a complicated topic. The MediaWiki magic words PAGENAME, PAGENAMEE and urlencode: have distinct implementations, each with its own peculiarities.
A MediaWiki page name can have a leading space but not a trailing space. The characters that are not allowed in MediaWiki page names are the three types of brackets, pound sign, underscore and vertical bar.
- # < > [ ] _ { | }
This article shall refer to these as the "not-allowed pagename characters". For clarity, we will present other ASCII 7-bit values for characters as the URL-style encoding of percent-hex-hex form known as percent-encoding.
PAGENAME
Some characters returned by PAGENAME are HTML-style encoded:
- " (double quote %22) is converted to &quot;
- ' (single quote %27) is converted to &#39; (39 is the decimal value of hexadecimal 27)
- & (ampersand %26) is converted to &amp;
We will refer to these as the "three special pagename characters".
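As a rough illustration of the three conversions listed above, the following Python sketch mimics them. The function name `pagename_html_encode` is hypothetical, and this is not MediaWiki's implementation; it only reproduces the substitutions described in this section.

```python
def pagename_html_encode(title: str) -> str:
    """Apply the three HTML-style substitutions this section attributes to PAGENAME."""
    # Replace the ampersand first so that the "&" introduced by the
    # other two entities is not encoded a second time.
    return (
        title.replace("&", "&amp;")
        .replace('"', "&quot;")
        .replace("'", "&#39;")
    )


print(pagename_html_encode("Barnes & Noble"))  # Barnes &amp; Noble
print(pagename_html_encode("Rock 'n' Roll"))   # Rock &#39;n&#39; Roll
```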
PAGENAMEE
PAGENAMEE converts spaces to underscore and percent-encodes a set of characters:
- It converts " % & ' + = ? \ ^ ` ~ to %22 %25 %26 %27 %2B %3D %3F %5C %5E %60 %7E
- It does not convert alphanumerics and the characters: ! $ ( ) * , - . / : ; @
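The two lists above can be approximated in Python. This is only a sketch under stated assumptions: `pagenamee_sketch` and `PAGENAMEE_SAFE` are hypothetical names, Python's `urllib.parse.quote` stands in for whatever MediaWiki does internally, and non-ASCII characters are assumed to be percent-encoded as UTF-8 bytes.

```python
from urllib.parse import quote

# Characters the list above says PAGENAMEE leaves unconverted.
PAGENAMEE_SAFE = "!$()*,-./:;@"


def pagenamee_sketch(title: str) -> str:
    """Approximate PAGENAMEE as described above (not MediaWiki's own code)."""
    # Spaces become underscores before any percent-encoding happens.
    with_underscores = title.replace(" ", "_")
    # quote() always leaves alphanumerics and "_ . - ~" unencoded.
    encoded = quote(with_underscores, safe=PAGENAMEE_SAFE)
    # The list above says "~" is converted to %7E, so handle it explicitly.
    return encoded.replace("~", "%7E")


print(pagenamee_sketch("A = B"))       # A_%3D_B
print(pagenamee_sketch("Q&A: why?"))   # Q%26A:_why%3F
```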
When preparing a pagename for embedding in the "searchpart" of a URL (see RFC1738 and/or RFC3986), it might have to be both percent-encoded and have all space characters converted to %20 or the plus sign, which we will call "searchpart-encoded". This avoids the problematic coding of the three special pagename characters by encoding, for instance, ampersand as %26; the typical searchpart-encoding of space is the plus sign (or sometimes %20). PAGENAMEE instead converts spaces to underscores, so if no MediaWiki extensions for string manipulation are installed, PAGENAMEE might only be useful for constructing a URL back into one's own wiki, to other wikis, or to other sites whose pages use the same names and use underscores.
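To make the two spellings of a space concrete, here is a minimal Python sketch of searchpart-encoding. The helper name `searchpart_encode` and its `space_as_plus` parameter are hypothetical; the functions `quote` and `quote_plus` are from Python's standard library and are used only as a stand-in, so a few punctuation characters may be treated differently by other encoders.

```python
from urllib.parse import quote, quote_plus


def searchpart_encode(name: str, space_as_plus: bool = True) -> str:
    """Percent-encode a raw name for the query part of a URL."""
    if space_as_plus:
        # quote_plus() writes each space as "+".
        return quote_plus(name, safe="")
    # quote() writes each space as "%20".
    return quote(name, safe="")


print(searchpart_encode("AT&T Park"))                       # AT%26T+Park
print(searchpart_encode("AT&T Park", space_as_plus=False))  # AT%26T%20Park
```

Note that the input here is the raw name with a plain ampersand, not the HTML-style form returned by PAGENAME.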
urlencode
The urlencode function percent-encodes many more characters than PAGENAMEE. It can also be used to allow editors of the wiki source to work with the multilingual characters they are accustomed to rather than deal with the more opaque percent-encoded characters. When considering using urlencode to construct an external link URL, especially within a template, there are two design styles that might be appropriate, plus a compromise between them. Which one is appropriate is a matter of the trade-offs between generality and ease of use.
- For maximum generality, there is no simple combination of PAGENAME and other default wiki magic words that provides a general solution and handles names that include all possible characters in pagenames. The not-allowed pagename characters and the three special pagename characters both present issues. If a desired name uses any of those characters, then the actual pagename would have to be different. The most general design would be a template with two parameters: a URL-style searchpart-encoded parameter for the URL link and an HTML-style parameter for the link label. The URL-style parameter would be added to a search or lookup URL and the HTML-style parameter would be used to label the link. For instance, a template called OrgName that looks up an organization by name, given the unusual 10-character organization name of
a%23b> {c}
would be called as
{{OrgName|a%2523b%3E+%7Bc%7D|a%23b> {c}}}
Variations on this might use %20 instead of + in the URL-style parameter for space and/or &gt; instead of > for the greater-than character in the HTML-style parameter, or just the plain characters when they work. To be rigorous, one might argue that having two mandatory arguments is the best style for long-term stability in case the page is moved or translated to some other wiki where the naming style of pages is different, such as where a different alphabet is used for naming pages.
- The urlencode parser function can be used to create a template that might be easy to use but not perfectly general. The urlencode function converts almost all characters except alphanumerics and three of the RFC1738 URL "safe" characters: - _ . (dash, underscore, period), and it converts blank to plus. The technique of embedding the code fragment
{{urlencode:{{{userparam|{{PAGENAME}}}}}}}
into a template to create an external link URL can be useful (i.e. treating simple pagenames as data). A pagename with any of the three special pagename characters might be a problem. For example, for a pagename with an ampersand, this would result in an HTML-style ampersand (&amp;) being converted to the URL-style %26amp%3B, which most remote web sites would not handle successfully. For names with the problematic characters, one could simply not use the template and provide a direct link in the wikisource, or add appropriate templates or extensions to the wiki to support string manipulations.
- A compromise between these two styles is a variation on the above code fragment, such as
{{{userparam|{{urlencode:{{PAGENAME}}}}}}}
where the userparam is optional but, when explicitly supplied, would have to be searchpart-encoded.
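The trade-off between these styles can be illustrated outside the wiki. The Python sketch below is an assumption-laden model, not MediaWiki code: `quote_plus` stands in for urlencode, the string `AT&amp;T` is a hypothetical value of the kind PAGENAME would return for a page named AT&T, and `html.unescape` plays the role of a string manipulation extension that a default wiki does not have.

```python
import html
from urllib.parse import quote_plus

# Style 1: the caller supplies the searchpart-encoded parameter by hand.
# Encoding the raw OrgName example reproduces the first template parameter.
raw_name = "a%23b> {c}"
url_param = quote_plus(raw_name, safe="")
print(url_param)  # a%2523b%3E+%7Bc%7D

# Style 2: urlencode applied to PAGENAME output. The HTML-style ampersand
# is encoded as if it were literal text, which is the pitfall described above.
pagename_output = "AT&amp;T"  # hypothetical PAGENAME result for "AT&T"
double_encoded = quote_plus(pagename_output, safe="")
print(double_encoded)  # AT%26amp%3BT

# With string manipulation available, undoing the HTML-style encoding
# first gives the URL a remote site would normally expect.
repaired = quote_plus(html.unescape(pagename_output), safe="")
print(repaired)  # AT%26T
```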
Web browser URL and wiki web server HTTP interface
The URL you type or paste into your web browser's address bar is similar to, but not exactly the same as, PAGENAMEE.
- In order to type in a pagename as a URL in your web browser that will go directly to the page, the following two characters must be URL-style encoded while being typed in: % ? as %25 %3F. A typical example is a pagename that ends in a question mark, where the wiki editors will create a wiki redirect without the question mark so that it works anyway. If you type a space in the middle of a URL, your browser will convert it to %20 before sending it to any sort of web server. The same goes for the double-quote character " which is converted to %22. Depending on your browser, it may also encode some of the "unsafe" characters such as % & ' `. See RFC1738 for details, but note that this behavior is browser-dependent. Compared to browsers that support only http, browsers that support schemes other than http, such as ftp, tend to convert more of these characters.
- How a URL with percent-encoding is displayed in a web browser's address box depends on whether the wiki web server has used URL redirection. The characters of the PAGENAMEE character set will be converted only if they are adjacent to a space. For instance, if you type a URL into your web browser ending in A_=_B or A=B, then it will send that URL directly and you will get to the wiki page if it exists. If you enter a URL into your web browser ending in
A = B
(with spaces around the equals sign), then your web browser encodes the spaces as %20 and thus sends A%20=%20B to the wiki web server. The wiki web server then converts the string to A_%3D_B and sends that back to the web browser via URL redirection. This is why, on a slow Internet link, you might see the spaces in a pagename change first to %20 and then to an underscore: your browser does the first conversion and the wiki web server does the second. You can try to see the real URL by copying the URL in the browser and pasting it as text into a simple text editor, but you may find that even this technique produces browser-dependent results.
- While not specific to the wiki web server, for wide characters, the browser performs a partial urldecode action on the real URL. This urldecoding is essential for the usability of wide characters in URLs. As an example, for an otherwise simple URL ending in a UTF-8 string percent-encoded as
%E6%9D%B1%E4%BA%AC
your browser will usually urldecode that part and display it as 東京 (Unicode U+6771 U+4EAC), which are the two Kanji characters for Tokyo. This result can apply to both 7-bit and wide characters but is browser-dependent. For instance, if you visit the eight-character pagename of
A!*-. ~A
as
http://en.wikipedia.org/wiki/A%21%2A%2D%2E%5F%7EA
you may find that your web browser then displays a URL that has urldecoded none, some or all of the percent-encoded characters, and that a cut-and-paste of the browser URL into simple text will include none, some or all of this urldecoding. How much of this urldecoding occurs during cut-and-paste is browser-dependent.
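The conversions described in this section can be traced step by step in Python. This is a simplified model rather than a description of any particular browser or of MediaWiki's code: `browser_encode_spaces` and `server_canonical_path` are hypothetical helpers, the server step models only the redirect case described above, and it assumes the server re-encodes the title along the lines of the PAGENAMEE character lists.

```python
from urllib.parse import quote, unquote


def browser_encode_spaces(typed: str) -> str:
    """Model the browser step: spaces typed into a URL are sent as %20."""
    return typed.replace(" ", "%20")


def server_canonical_path(requested: str) -> str:
    """Model the server step: decode, use underscores, then re-encode."""
    title = unquote(requested).replace(" ", "_")
    encoded = quote(title, safe="!$()*,-./:;@")
    # quote() leaves "~" alone, so encode it explicitly.
    return encoded.replace("~", "%7E")


sent = browser_encode_spaces("A = B")
redirected = server_canonical_path(sent)
print(sent)        # A%20=%20B
print(redirected)  # A_%3D_B

# What a browser might display after urldecoding a UTF-8 sequence,
# together with the code point of each decoded character.
tokyo = unquote("%E6%9D%B1%E4%BA%AC")
code_points = [f"U+{ord(ch):04X}" for ch in tokyo]
print(tokyo, code_points)

# Fully decoding the 7-bit example; a real browser may decode only part of it.
print(unquote("A%21%2A%2D%2E%5F%7EA"))  # A!*-._~A
```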