Talk:Canonical JSON

From OLPC
Jump to: navigation, search

JSON specification

Thanks for the effort, but there are several flaws in this specification: There are instances of JSON that cannot be encoded as "canonical JSON" and vice versa. Here is a copy of the JSON specification (RFC 4627), section 2 and 3. Any canonical JSON should refine this. -- User:JakobVoss

JSON Grammar

A JSON text is a sequence of tokens. The set of tokens includes six structural characters, strings, numbers, and three literal names.

A JSON text is a serialized object or array.

   JSON-text = object / array

These are the six structural characters:

   begin-array     = ws %x5B ws  ; [ left square bracket

   begin-object    = ws %x7B ws  ; { left curly bracket

   end-array       = ws %x5D ws  ; ] right square bracket

   end-object      = ws %x7D ws  ; } right curly bracket

   name-separator  = ws %x3A ws  ; : colon

   value-separator = ws %x2C ws  ; , comma

Insignificant whitespace is allowed before or after any of the six structural characters.

   ws = *(
      %x20 /       ; Space
      %x09 /       ; Horizontal tab
      %x0A /       ; Line feed or New line
      %x0D         ; Carriage return
  )

Values

A JSON value MUST be an object, array, number, or string, or one of the following three literal names:

 false null true

The literal names MUST be lowercase. No other literal names are allowed.

 value = false / null / true / object / array / number / string

 false = %x66.61.6c.73.65   ; false

 null  = %x6e.75.6c.6c      ; null

 true  = %x74.72.75.65      ; true

Objects

An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string. A single colon comes after each name, separating the name from the value. A single comma separates a value from a following name. The names within an object SHOULD be unique.

  object = begin-object [ member *( value-separator member ) ] end-object
  member = string name-separator value

Arrays

An array structure is represented as square brackets surrounding zero or more values (or elements). Elements are separated by commas.

  array = begin-array [ value *( value-separator value ) ] end-array

Numbers

The representation of numbers is similar to that used in most programming languages. A number contains an integer component that may be prefixed with an optional minus sign, which may be followed by a fraction part and/or an exponent part.

Octal and hex forms are not allowed. Leading zeros are not allowed.

A fraction part is a decimal point followed by one or more digits.

An exponent part begins with the letter E in upper or lowercase, which may be followed by a plus or minus sign. The E and optional sign are followed by one or more digits.

Numeric values that cannot be represented as sequences of digits (such as Infinity and NaN) are not permitted.

  number = [ minus ] int [ frac ] [ exp ]

  decimal-point = %x2E       ; .
  digit1-9 = %x31-39  ; 1-9

  e = %x65 / %x45     ; e E

  exp = e [ minus / plus ] 1*DIGIT

  frac = decimal-point 1*DIGIT

  int = zero / ( digit1-9 *DIGIT )

  minus = %x2D        ; -

  plus = %x2B         ; +

  zero = %x30         ; 0

Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

   string = quotation-mark *char quotation-mark

   char = unescaped /
   escape (
       %x22 /   ; "    quotation mark  U+0022
       %x5C /   ; \    reverse solidus U+005C
       %x2F /   ; /    solidus  U+002F
       %x62 /   ; b    backspace       U+0008
       %x66 /   ; f    form feed       U+000C
       %x6E /   ; n    line feed       U+000A
       %x72 /   ; r    carriage return U+000D
       %x74 /   ; t    tab      U+0009
       %x75 4HEXDIG )  ; uXXXX         U+XXXX

   escape = %x5C       ; \

   quotation-mark = %x22      ; "

   unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

Encoding

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 xx 00  UTF-16LE
   xx xx xx xx  UTF-8

Alternative Canonical JSON

Changes to RFC4627

Whitespace outside of strings is not allowed.

  begin-array     = %x5B  ; [ left square bracket
  begin-object    = %x7B  ; { left curly bracket
  end-array       = %x5D  ; ] right square bracket
  end-object      = %x7D  ; } right curly bracket
  name-separator  = %x3A  ; : colon
  value-separator = %x2C  ; , comma

Objects

Members are ordered lexicographically by unicode code points of member keys. Members with same key are sorted lexicographically by the canonical JSON of their member values.

Numbers

  • Trailing zeroes are eliminated
  • Exponent and decimal point are shifted until the number starts with '0.X' where X is is not zero
  • There is always an exponent and it is always the small 'e' unless the exponent is zero, then it is removed
  • Leading or trailing zeros of the exponent are removed
  • Zero or minus zero is encoded as '0'
Maybe there is a better/easier way but you can normalize all numbers

Strings

All character code points not allowed in Unicode MUST be replaced by the Unicode replacement character (U+FFFD). All Strings MUST be normalized to Unicode Normalization Form C (UAX #15). All characters within the Basic Multilingual Plane beside quotation mark, reverse solidus, and the control characters (U+0000 through U+001F) MUST NOT be escaped.

Encoding

Canonical JSON text MUST be encoded in Unicode UTF-8.