Talk:Canonical JSON
JSON specification
Thanks for the effort, but there are several flaws in this specification: some JSON texts cannot be encoded as "canonical JSON", and vice versa. Here is a copy of the JSON specification (RFC 4627), sections 2 and 3. Any canonical JSON should be a refinement of it. -- User:JakobVoss
JSON Grammar
A JSON text is a sequence of tokens. The set of tokens includes six structural characters, strings, numbers, and three literal names.
A JSON text is a serialized object or array.
JSON-text = object / array
These are the six structural characters:
begin-array     = ws %x5B ws  ; [ left square bracket
begin-object    = ws %x7B ws  ; { left curly bracket
end-array       = ws %x5D ws  ; ] right square bracket
end-object      = ws %x7D ws  ; } right curly bracket
name-separator  = ws %x3A ws  ; : colon
value-separator = ws %x2C ws  ; , comma
Insignificant whitespace is allowed before or after any of the six structural characters.
ws = *(
          %x20 /              ; Space
          %x09 /              ; Horizontal tab
          %x0A /              ; Line feed or New line
          %x0D                ; Carriage return
      )
Values
A JSON value MUST be an object, array, number, or string, or one of the following three literal names:
false null true
The literal names MUST be lowercase. No other literal names are allowed.
value = false / null / true / object / array / number / string
false = %x66.61.6c.73.65   ; false
null  = %x6e.75.6c.6c      ; null
true  = %x74.72.75.65      ; true
Objects
An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string. A single colon comes after each name, separating the name from the value. A single comma separates a value from a following name. The names within an object SHOULD be unique.
object = begin-object [ member *( value-separator member ) ] end-object
member = string name-separator value
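Because names only SHOULD be unique, a parser may be handed an object with repeated names. As a rough illustration (the helper name `load_members` is made up for this example), Python's standard `json` module can expose the members as ordered pairs instead of silently keeping only the last value:

```python
import json

def load_members(text):
    # object_pairs_hook receives each object's members as a list of
    # (name, value) tuples, in document order, duplicates included.
    return json.loads(text, object_pairs_hook=list)

members = load_members('{"a": 1, "a": 2, "b": 3}')
names = [name for name, _ in members]
has_duplicates = len(names) != len(set(names))
```

With a plain `json.loads(text)` the same input would produce a dict in which the second "a" overwrites the first, so the duplicate would go unnoticed.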
Arrays
An array structure is represented as square brackets surrounding zero or more values (or elements). Elements are separated by commas.
array = begin-array [ value *( value-separator value ) ] end-array
Numbers
The representation of numbers is similar to that used in most programming languages. A number contains an integer component that may be prefixed with an optional minus sign, which may be followed by a fraction part and/or an exponent part.
Octal and hex forms are not allowed. Leading zeros are not allowed.
A fraction part is a decimal point followed by one or more digits.
An exponent part begins with the letter E in upper or lowercase, which may be followed by a plus or minus sign. The E and optional sign are followed by one or more digits.
Numeric values that cannot be represented as sequences of digits (such as Infinity and NaN) are not permitted.
number        = [ minus ] int [ frac ] [ exp ]
decimal-point = %x2E                    ; .
digit1-9      = %x31-39                 ; 1-9
e             = %x65 / %x45             ; e E
exp           = e [ minus / plus ] 1*DIGIT
frac          = decimal-point 1*DIGIT
int           = zero / ( digit1-9 *DIGIT )
minus         = %x2D                    ; -
plus          = %x2B                    ; +
zero          = %x30                    ; 0
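For checking candidate encodings against this grammar, the number rule can be transcribed into a regular expression. This is only a sketch; the names `JSON_NUMBER` and `is_json_number` are mine, not from the RFC:

```python
import re

# number = [ minus ] int [ frac ] [ exp ]
JSON_NUMBER = re.compile(
    r"-?"                  # [ minus ]
    r"(?:0|[1-9][0-9]*)"   # int: zero / ( digit1-9 *DIGIT )
    r"(?:\.[0-9]+)?"       # [ frac ]: decimal-point 1*DIGIT
    r"(?:[eE][+-]?[0-9]+)?"  # [ exp ]: e [ minus / plus ] 1*DIGIT
)

def is_json_number(text: str) -> bool:
    # fullmatch ensures the whole string is a number, not just a prefix.
    return JSON_NUMBER.fullmatch(text) is not None
```

Inputs such as `01`, `.5`, `1.`, `+1`, `0x1F` and `NaN` are all rejected, which matches the prose rules above.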
Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A through F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
string = quotation-mark *char quotation-mark
char = unescaped /
       escape (
           %x22 /          ; "    quotation mark  U+0022
           %x5C /          ; \    reverse solidus U+005C
           %x2F /          ; /    solidus         U+002F
           %x62 /          ; b    backspace       U+0008
           %x66 /          ; f    form feed       U+000C
           %x6E /          ; n    line feed       U+000A
           %x72 /          ; r    carriage return U+000D
           %x74 /          ; t    tab             U+0009
           %x75 4HEXDIG )  ; uXXXX                U+XXXX
escape = %x5C              ; \
quotation-mark = %x22      ; "
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
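The six-character and twelve-character escape forms can be computed from a code point with the usual UTF-16 surrogate arithmetic. A small sketch follows; the function name `escape_code_point` and the choice of uppercase hex digits are my own assumptions, since the RFC allows either case:

```python
def escape_code_point(cp: int) -> str:
    # Reject values that are not Unicode scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % cp)
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: a single six-character escape.
        return "\\u%04X" % cp
    # Outside the BMP: split into a UTF-16 surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return "\\u%04X\\u%04X" % (high, low)
```

For the two examples in the text, `escape_code_point(0x5C)` gives `\u005C` and `escape_code_point(0x1D11E)` gives `\uD834\uDD1E`.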
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
00 00 00 xx  UTF-32BE
00 xx 00 xx  UTF-16BE
xx 00 00 00  UTF-32LE
xx 00 xx 00  UTF-16LE
xx xx xx xx  UTF-8
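The null-pattern table translates directly into a lookup. The sketch below is mine, and `detect_json_encoding` is a made-up name. It assumes the input has no byte order mark and treats anything shorter than four octets as UTF-8, since the shortest JSON text (two characters) already takes four octets in UTF-16:

```python
def detect_json_encoding(data: bytes) -> str:
    head = data[:4]
    if len(head) < 4:
        # Too short for UTF-16 or UTF-32, so it can only be UTF-8.
        return "UTF-8"
    # True where the octet is 00, False where it is non-zero ("xx").
    nulls = tuple(octet == 0 for octet in head)
    patterns = {
        (True, True, True, False): "UTF-32BE",   # 00 00 00 xx
        (True, False, True, False): "UTF-16BE",  # 00 xx 00 xx
        (False, True, True, True): "UTF-32LE",   # xx 00 00 00
        (False, True, False, True): "UTF-16LE",  # xx 00 xx 00
    }
    # Any other pattern (normally xx xx xx xx) is taken to be UTF-8.
    return patterns.get(nulls, "UTF-8")
```

For example, `detect_json_encoding('{"a":1}'.encode("utf-16-le"))` returns `"UTF-16LE"`.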
Alternative Canonical JSON
Changes to RFC4627
Whitespace outside of strings is not allowed.
begin-array = %x5B ; [ left square bracket
begin-object = %x7B ; { left curly bracket
end-array = %x5D ; ] right square bracket
end-object = %x7D ; } right curly bracket
name-separator = %x3A ; : colon
value-separator = %x2C ; , comma
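For what it's worth, Python's standard `json` module can already produce whitespace-free output by overriding the default separators. A minimal sketch, with `dumps_compact` being a name I made up:

```python
import json

def dumps_compact(value) -> str:
    # The defaults are (", ", ": "); passing bare separators
    # removes all whitespace outside of strings.
    return json.dumps(value, separators=(",", ":"))

compact = dumps_compact({"a": [1, 2], "b": None})
```

This only covers the whitespace rule. The other rules of the proposal (member ordering, number form, string escaping) are not addressed by it.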
Objects
Members are ordered lexicographically by the Unicode code points of their keys. Members with the same key are ordered lexicographically by the canonical JSON of their values.
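The ordering rule could be sketched like this. I am assuming members are given as (key, canonical value text) pairs, so that duplicate keys survive, and the name `order_members` is hypothetical. Python compares `str` values code point by code point, which is the comparison this rule asks for:

```python
def order_members(members):
    # members: list of (key, canonical_value_text) pairs.
    # Sort by key first; ties between equal keys are broken by the
    # canonical JSON text of the value. Both comparisons are by
    # Unicode code point, which is how Python orders str values.
    return sorted(members, key=lambda member: (member[0], member[1]))

ordered = order_members([("b", "1"), ("a", "2"), ("Z", "0"), ("a", "1")])
```

Note that code point order puts "Z" (U+005A) before "a" (U+0061), which is not what a locale-aware or case-insensitive sort would give.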
Numbers
- Trailing zeros of the digits are eliminated
- Exponent and decimal point are shifted until the number starts with '0.X', where X is not zero
- There is always an exponent, written with a lowercase 'e', unless the exponent is zero, in which case it is omitted
- Leading zeros of the exponent are removed
- Zero or minus zero is encoded as '0'
- There may be a better or easier way, but with these rules every number can be normalized
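To see what these rules produce in practice, here is one possible reading of them as code. It is a sketch, not a reference implementation: `normalize_number` is a made-up name, the input is assumed to be a valid JSON number, and the exponent is simply printed as a plain integer:

```python
from decimal import Decimal

def normalize_number(text: str) -> str:
    d = Decimal(text)  # exact; no rounding happens on construction
    if not d.is_finite():
        raise ValueError("not a JSON number: %r" % text)
    if d.is_zero():
        return "0"  # covers 0, -0, 0.0, 0e5, ...
    sign, digits, exponent = d.as_tuple()
    all_digits = "".join(str(digit) for digit in digits)
    significant = all_digits.rstrip("0")
    # Each trailing zero dropped raises the power of ten by one.
    exponent += len(all_digits) - len(significant)
    # Value is now significant * 10**exponent; rewrite as
    # 0.significant * 10**(exponent + len(significant)).
    shifted = exponent + len(significant)
    result = ("-" if sign else "") + "0." + significant
    if shifted != 0:
        result += "e" + str(shifted)
    return result
```

Under this reading `100` and `1E+02` both become `0.1e3`, `-12.50` becomes `-0.125e2`, `0.05` becomes `0.5e-1`, and `0.5` stays `0.5`.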
Strings
All character code points not allowed in Unicode MUST be replaced by the Unicode replacement character (U+FFFD). All strings MUST be normalized to Unicode Normalization Form C (UAX #15). Characters within the Basic Multilingual Plane MUST NOT be escaped, except for the quotation mark, the reverse solidus, and the control characters (U+0000 through U+001F).
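A rough sketch of these string rules is below. Several points are my assumptions and not part of the proposal: "not allowed in Unicode" is read as lone surrogates only, control characters are written as \u00xx with lowercase hex, characters outside the BMP are emitted unescaped, and the name `canonical_string` is made up:

```python
import unicodedata

def canonical_string(s: str) -> str:
    # Assumption: only lone surrogates count as "not allowed".
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s
    )
    normalized = unicodedata.normalize("NFC", cleaned)
    out = ['"']
    for ch in normalized:
        if ch == '"':
            out.append('\\"')
        elif ch == "\\":
            out.append("\\\\")
        elif ord(ch) < 0x20:
            # Assumption: controls always use the \u00xx form.
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)  # everything else stays unescaped
    out.append('"')
    return "".join(out)
```

For instance, "e" followed by a combining acute accent (U+0301) comes out as the single precomposed character U+00E9 between quotation marks.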
Encoding
Canonical JSON text MUST be encoded in Unicode UTF-8.