Wikidot Normal Form

Wikidot normalizes page names in its URLs to fit a specific form. Internally it calls these “UNIX names”; SCUTTLE and other projects have called them “slugs”. (See glossary)

The scpwiki/wikidot-normalize codebase on GitHub (a simple library to provide Wikidot-compatible string normalization) is presently the most complete normalization utility outside of Wikidot itself, and this page documents its findings.

Introduction

Wikidot URLs have a number of restrictions. Slugs are lowercase only and do not permit spaces or punctuation other than dashes (-), with specific exceptions for underscores (_) and colons (:). The basic transformation is to lowercase all alphanumeric characters and convert every other character into a dash.

Below is a table listing some basic transformations:

Original                  | Result
SCP-001                   | scp-001
User Curated Lists        | user-curated-lists
Kate McTiriss's Proposal  | kate-mctiriss-s-proposal
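As a rough illustration of the basic transformation only, the following Python sketch lowercases a name and replaces every other character with a dash. The function name `basic_slug` is hypothetical and is not taken from Wikidot or wikidot-normalize; the sketch ignores categories, underscores, and dash merging.

```python
import re


def basic_slug(name: str) -> str:
    """Lowercase a name and replace each non-alphanumeric character with a dash."""
    lowered = name.lower()
    # Every character outside a-z and 0-9 becomes one dash (dashes stay dashes).
    return re.sub(r"[^a-z0-9]", "-", lowered)


print(basic_slug("Kate McTiriss's Proposal"))  # kate-mctiriss-s-proposal
```

Note that the apostrophe and both spaces each turn into their own dash, which is why the result contains the isolated "s".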

However, Wikidot URLs can be more complicated. For instance, they may specify any number of categories that the page is in, in a specific order:

Original                | Result
FRAGMENT:some page (1)  | fragment:some-page-1
deleted:Spc 1059        | deleted:spc-1059
:fragment::page:        | fragment:page

Multiple colons are merged into one, and any trailing or leading colons are stripped.
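The colon rule on its own can be sketched in a couple of lines of Python. This is only an illustration under the assumption that the rest of the name is already normalized; `tidy_colons` is a hypothetical helper name.

```python
import re


def tidy_colons(slug: str) -> str:
    """Merge runs of colons into one and drop colons at either end."""
    merged = re.sub(r":+", ":", slug)
    return merged.strip(":")


print(tidy_colons(":fragment::page:"))  # fragment:page
```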

The same applies to dashes: multiple dashes are merged into one, and any leading or trailing dashes are stripped. Because spaces and other extraneous characters are converted to dashes first, those at the start or end of a name, or repeated in a row, effectively disappear. This also occurs at category boundaries:

Original                    | Result
some--page                  | some-page
-spaghetti                  | spaghetti
(TOP SECRET) Special File!  | top-secret-special-file
fragment: !Page             | fragment:page
-category-:-page-           | category:page
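The dash behaviour, including at category boundaries, might be sketched as follows. This is an assumption-laden illustration rather than Wikidot's actual code: `collapse_dashes` is a hypothetical name, and the sketch deliberately ignores the special underscore rule by treating underscores like any other punctuation.

```python
import re


def collapse_dashes(name: str) -> str:
    """Convert stray characters to dashes, then merge and trim dashes and colons."""
    # Everything except lowercase letters, digits and colons becomes a dash.
    slug = re.sub(r"[^a-z0-9:]", "-", name.lower())
    # Any run of dashes and colons that contains a colon collapses to one colon,
    # which removes dashes sitting on a category boundary.
    slug = re.sub(r"[-:]*:[-:]*", ":", slug)
    # Remaining runs of dashes collapse to a single dash.
    slug = re.sub(r"-+", "-", slug)
    # Leading and trailing dashes or colons are dropped.
    return slug.strip("-:")


print(collapse_dashes("(TOP SECRET) Special File!"))  # top-secret-special-file
```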

Underscores

Underscores, unlike dashes, are treated specially. In most positions an underscore is handled like any other non-normal character and converted into a dash. However, a single underscore is permitted at the start of any given section of a name (the parts separated by colons). This allows for special pages like _template or _404 to exist, even in categories.

For instance, the following slugs are considered already normalized:

  • _template

  • fragment:_template

  • fragment:_category:_template

The following are not normalized. Their conversions are shown, with an intermediate form in which the extra underscores have become dashes:

Original                     | Intermediate                 | Result
__template                   | _-template                   | _template
apple_                       | apple-                       | apple
fragment_:page               | fragment-:page               | fragment:page
category__:_fragment_:page_  | category--:_fragment-:page-  | category:_fragment:page
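One way to picture the underscore rule is to handle each colon-separated section separately. The Python sketch below is an illustration only: `normalize_underscores` is a hypothetical name, and it assumes the input is already lowercase and free of other punctuation.

```python
import re


def normalize_underscores(slug: str) -> str:
    """Keep one leading underscore per section; turn all others into dashes."""
    sections = []
    for section in slug.split(":"):
        keep = section.startswith("_")
        body = section[1:] if keep else section
        # Any remaining underscore is treated like other punctuation.
        body = body.replace("_", "-")
        # Merge dashes and trim them from both ends of the section.
        body = re.sub(r"-+", "-", body).strip("-")
        sections.append(("_" if keep else "") + body)
    # Empty sections (from stray colons) are dropped.
    return ":".join(s for s in sections if s)


print(normalize_underscores("category__:_fragment_:page_"))  # category:_fragment:page
```

Splitting on colons first keeps the "start of a section" test simple, at the cost of repeating the dash cleanup for every section.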

Character Transformations

In addition to the transformations above, Wikidot also converts several Latin Unicode characters to their simplified ASCII variants, removing diacritics and other modifiers: for instance, ě becomes e and À becomes A. Some characters expand to more than one letter, such as Ö becoming Oe and Ü becoming Ue.

This step also converts various punctuation like spaces, commas, and slashes to dashes. This is unusual given that a later conversion step achieves the same result.
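Wikidot uses its own conversion table, which is not reproduced here. As an approximation only, the sketch below strips diacritics with Python's standard `unicodedata` module and applies a tiny override table for the two multi-letter cases mentioned above. The names `SPECIAL_CASES` and `simplify_latin` are hypothetical, and the results may differ from Wikidot's for characters not listed on this page.

```python
import unicodedata

# Assumed overrides, limited to the two examples given in the text.
SPECIAL_CASES = {
    "\u00d6": "Oe",  # O with diaeresis
    "\u00dc": "Ue",  # U with diaeresis
}


def simplify_latin(text: str) -> str:
    """Approximate the ASCII simplification by removing combining marks."""
    out = []
    for ch in text:
        if ch in SPECIAL_CASES:
            out.append(SPECIAL_CASES[ch])
            continue
        # NFKD splits an accented letter into a base letter plus combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(simplify_latin("\u011b"))  # e
```

Characters with no decomposition pass through unchanged in this sketch, so it does not guarantee ASCII output.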

Full Procedure

All of the steps performed by the normalization process are as follows:

  • Trim leading and trailing whitespace

  • Transform characters to their ASCII equivalents (see above)

  • Lowercase all ASCII characters

  • Convert all non-normal characters to dashes (normal characters are alphanumerics, dashes, underscores, and colons)

  • Remove all leading and trailing dashes

  • Merge multiple dashes into a single dash

  • Merge multiple colons into a single colon

  • Remove all leading and trailing dashes next to colons (e.g. fragment:-test → fragment:test)

  • Remove all leading and trailing dashes next to underscores (e.g. _-template → _template)

  • Remove all leading and trailing colons
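The steps above can be strung together in a short Python sketch. It is not the wikidot-normalize implementation: `normalize_slug` is a hypothetical name, the ASCII step is approximated with `unicodedata` plus two assumed overrides, the colon steps are folded into one regular expression, and the conversion of non-leading underscores to dashes is inferred from the underscore examples on this page rather than from the step list.

```python
import re
import unicodedata

# Assumed overrides for the two multi-letter cases mentioned on this page.
OVERRIDES = {0xD6: "Oe", 0xDC: "Ue"}


def normalize_slug(name: str) -> str:
    """Approximate Wikidot normal form for a page name."""
    slug = name.strip()                                   # trim whitespace
    slug = slug.translate(OVERRIDES)                      # multi-letter cases
    slug = "".join(                                       # strip diacritics
        c for c in unicodedata.normalize("NFKD", slug)
        if not unicodedata.combining(c)
    )
    slug = slug.lower()                                   # lowercase
    slug = re.sub(r"[^a-z0-9\-_:]", "-", slug)            # non-normal to dashes
    # Assumption: an underscore that is not at the start of a section
    # (start of the string or right after a colon) becomes a dash.
    slug = re.sub(r"(?<!^)(?<!:)_", "-", slug)
    slug = slug.strip("-")                                # trim outer dashes
    slug = re.sub(r"-+", "-", slug)                       # merge dashes
    # Merge colons and drop dashes around them in one pass.
    slug = re.sub(r"[-:]*:[-:]*", ":", slug)
    slug = slug.replace("_-", "_")                        # dash after underscore
    return slug.strip(":")                                # trim outer colons


print(normalize_slug("category__:_fragment_:page_"))  # category:_fragment:page
```

Folding the two colon steps into one pattern avoids leaving a doubled colon behind when a lone dash sits between two colons.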

Modifications by Wikijump

Because the current normalization strategy is very ASCII- and Latin-centric, scripts such as Chinese, Japanese, and Korean (CJK) have no valid characters in page slugs, which makes it difficult to create pages with native-language names. As part of WJ-792, Wikijump added further characters that normal form supports, while still retaining the general properties of normalization.

See Wikijump Normal Form for more information on internationalized normal form.