Wikidot Normal Form
Wikidot normalizes page names in its URLs to fit a specific form. Internally it calls these “UNIX names”; SCUTTLE and other projects have called them “slugs”. (See the glossary.)
The https://github.com/scpwiki/wikidot-normalize codebase is presently the most complete normalization utility outside of Wikidot itself, and this page will document its findings.
Introduction
Wikidot URLs have a number of restrictions. Slugs are lowercase only and do not permit spaces or any punctuation other than dashes, with specific exceptions for underscores (`_`) and colons (`:`). The basic transformation is to lowercase all alphanumeric characters and convert all other characters into dashes.
Below is a table listing some basic transformations:
Original | Result |
---|---|
|
|
|
|
|
|
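As a rough illustration of the basic transformation described above, here is a minimal Python sketch. The function name `basic_slug` is hypothetical and not taken from Wikidot or wikidot-normalize. It only lowercases and substitutes dashes; merging and trimming of dashes, and the special rules for underscores, are left out on purpose.

```python
def basic_slug(name: str) -> str:
    """Lowercase ASCII alphanumerics; replace every other character with a dash.

    Underscores, colons, and dashes are passed through unchanged here.
    This is an illustrative sketch, not Wikidot's actual code.
    """
    out = []
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch in "_:-"):
            out.append(ch.lower())
        else:
            out.append("-")
    return "".join(out)


print(basic_slug("Hello World"))
```

Because every disallowed character maps to exactly one dash, the output can still contain runs of dashes or leading and trailing dashes, which later steps clean up.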
However, Wikidot URLs can be more complicated. For instance, they may specify any number of categories that the page is in, in a specific order:
Original | Result |
---|---|
|
|
|
|
|
|
Multiple colons are merged into one, and any trailing or leading colons are stripped.
The same applies to dashes: multiple dashes are merged into one, and any leading or trailing dashes are stripped. Because spaces and extraneous characters are converted to dashes, leading and trailing ones are effectively removed entirely. This also occurs at category boundaries:
Original | Result |
---|---|
|
|
|
|
|
|
|
|
|
|
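The merging and stripping rules above can be sketched in Python as follows. This is an assumption-laden illustration: `tidy_separators` is a hypothetical name, and instead of mirroring Wikidot's internal steps it splits the slug on colons and cleans each section separately. Dropping sections that end up empty is a choice made for this sketch, not confirmed Wikidot behavior.

```python
import re


def tidy_separators(slug: str) -> str:
    """Merge repeated dashes/colons and strip them at the ends and at category boundaries."""
    sections = []
    for section in slug.split(":"):
        # Collapse runs of dashes, then drop dashes at either end of the section.
        section = re.sub(r"-+", "-", section).strip("-")
        # Empty sections come from leading, trailing, or doubled colons.
        # (Assumption of this sketch: they are simply discarded.)
        if section:
            sections.append(section)
    return ":".join(sections)


print(tidy_separators("::fragment::page-"))
```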
Underscores
Underscores are treated specially. In general they are handled like any other non-normal character and converted into dashes. However, a single underscore is permitted at the start of any given section of a name, that is, at the start of the slug or immediately after a colon. This allows special pages like `_template` or `_404` to exist, even in categories.
For instance, the following slugs are considered already normalized:
_template
fragment:_template
fragment:_category:_template
And the following are not. Their conversions are also shown:
Original | Intermediate | Result |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
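A small Python sketch of the underscore rule, under the assumption that a "section" is whatever sits between colons. The name `normalize_underscores` is hypothetical. The output corresponds to the intermediate form only: any dash left directly after a leading underscore would still need to be removed by the later dash cleanup.

```python
def normalize_underscores(slug: str) -> str:
    """Keep one leading underscore per section; turn all other underscores into dashes."""
    sections = []
    for section in slug.split(":"):
        lead = "_" if section.startswith("_") else ""
        # Everything after the optional leading underscore loses its underscores.
        rest = section[len(lead):].replace("_", "-")
        sections.append(lead + rest)
    return ":".join(sections)


print(normalize_underscores("fragment:_template"))
print(normalize_underscores("some_page"))
```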
Character Transformations
In addition to the transformations above, Wikidot also converts several Latin Unicode characters to their simplified ASCII variants, removing diacritics and other modifiers. For instance, `ě` becomes `e` and `À` becomes `A`. There are some notable cases, such as `Ö` becoming `Oe` and `Ü` becoming `Ue`.
This step also converts various punctuation like spaces, commas, and slashes to dashes. This is unusual given that a later conversion step achieves the same result.
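To make the character step concrete, here is a minimal Python sketch built on a lookup table. The table `ASCII_MAP` and the function `to_ascii` are hypothetical, and the table contains only the handful of mappings mentioned on this page; the full list used by Wikidot is much longer and is not reproduced here.

```python
# Illustrative subset only: the four letters and three punctuation marks named above.
ASCII_MAP = str.maketrans({
    "ě": "e",
    "À": "A",
    "Ö": "Oe",   # one character may expand to two
    "Ü": "Ue",
    " ": "-",
    ",": "-",
    "/": "-",
})


def to_ascii(text: str) -> str:
    """Replace known characters by their ASCII equivalents; leave everything else as-is."""
    return text.translate(ASCII_MAP)


print(to_ascii("Über Öl"))
```

Case is preserved at this stage, so `Ö` yields `Oe` rather than `oe`; lowercasing happens in a separate step.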
Full Procedure
All of the steps performed by the normalization process are as follows:
Trim leading and trailing whitespace
Transform characters to their ASCII equivalents (see above)
Lowercase all ASCII characters
Convert all non-normal characters (anything other than alphanumerics, dashes, underscores, and colons) to dashes
Remove all leading and trailing dashes
Merge multiple dashes into a single dash
Merge multiple colons into a single colon
Remove all leading and trailing dashes next to colons (e.g. `fragment:-test` → `fragment:test`)
Remove all leading and trailing dashes next to underscores (e.g. `_-template-` → `_template`)
Remove all leading and trailing colons
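The steps listed above can be strung together roughly as in the following Python sketch. It is not the wikidot-normalize implementation: `normalize` and `ASCII_MAP` are hypothetical names, the transliteration table holds only the examples mentioned on this page, and the conversion of non-leading underscores is placed after the non-normal character step as an assumption based on the Underscores section.

```python
import re

# Tiny stand-in for the real transliteration table (assumption: examples only).
ASCII_MAP = str.maketrans({"ě": "e", "À": "A", "Ö": "Oe", "Ü": "Ue"})


def normalize(name: str) -> str:
    text = name.strip()                                   # trim whitespace
    text = text.translate(ASCII_MAP)                      # to ASCII equivalents
    text = "".join(c.lower() if c.isascii() else c for c in text)  # lowercase ASCII
    text = re.sub(r"[^a-z0-9_:-]", "-", text)             # non-normal -> dash
    # Assumed placement: underscores not at the start of a section become dashes.
    text = re.sub(r"(?<!^)(?<!:)_", "-", text)
    text = text.strip("-")                                # leading/trailing dashes
    text = re.sub(r"-{2,}", "-", text)                    # merge dashes
    text = re.sub(r":{2,}", ":", text)                    # merge colons
    text = re.sub(r"-?:-?", ":", text)                    # dashes next to colons
    text = re.sub(r"(^|:)_-", r"\1_", text)               # dash after a leading underscore
    return text.strip(":")                                # leading/trailing colons


print(normalize("  Tale -- Hub  "))
print(normalize("_-template-"))
```

The sketch follows the listed order literally, so unusual inputs (for example, a section consisting only of dashes) may not come out the same way as on Wikidot itself.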
Modifications by Wikijump
Because the current normalization strategy is very ASCII- and Latin-centric, languages written in other scripts, such as CJK languages, have no valid characters in page slugs at all, which makes it difficult to create pages with native names. As part of WJ-792, Wikijump added further characters that are supported by normal form, while still retaining the general properties of normalization.
See Wikijump Normal Form for more information on internationalized normal form.