Canonical JSON

Latest revision as of 06:01, 27 February 2015

  This page is monitored by the OLPC team.

The "canonical JSON" format is defined so that JSON-encoded data yields meaningful and repeatable hashes. Canonical JSON can be parsed by any full JSON parser, but security-conscious applications should verify that input is in canonical form before authenticating any hash or signature on that input.
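
To illustrate why a canonical form matters for hashing, the following sketch compares two byte strings that decode to the same JSON value but hash differently. The helper name sha256_hex and the choice of SHA-256 are assumptions made for this example; the specification does not prescribe a hash function.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Two encodings of the same JSON value: one compact with sorted keys,
# one with extra whitespace and a different key order.
compact = b'{"a":1,"b":2}'
spaced = b'{ "b": 2, "a": 1 }'

same_value = json.loads(compact) == json.loads(spaced)
same_hash = sha256_hex(compact) == sha256_hex(spaced)

print(same_value)  # the decoded values are equal
print(same_hash)   # the digests of the raw bytes are not
```

Because a hash or signature covers the raw bytes, only one of these encodings can be accepted if the hash is to identify the value unambiguously.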

The grammar for canonical JSON closely matches the grammar presented at json.org:

object:
   {}
   { members }
members:
   pair
   pair , members
pair:
   string : value
array:
   []
   [ elements ]
elements:
   value
   value , elements
value:
   string
   number
   object
   array
   true
   false
   null
string:
   ""
   " chars "
chars:
   char
   char chars
char:
   any byte except hex 22 (") or hex 5C (\)
   \\
   \"
number:
   int
int:
   digit
   digit1-9 digits
   - digit1-9
   - digit1-9 digits
digits:
   digit
   digit digits
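
As a small illustration of the int production above, the sketch below expresses it as a regular expression over bytes. The names CANONICAL_INT and is_canonical_int are hypothetical and chosen only for this example.

```python
import re

# Mirrors the int production: a single digit, or an optional "-" followed
# by a nonzero digit and any further digits. "-0" and leading zeros fail.
CANONICAL_INT = re.compile(rb"0|-?[1-9][0-9]*")


def is_canonical_int(token: bytes) -> bool:
    """Return True if token matches the int production exactly."""
    return CANONICAL_INT.fullmatch(token) is not None


for token in (b"0", b"7", b"-12", b"-0", b"007", b"1.5", b"1e3"):
    print(token, is_canonical_int(token))
```

The first three tokens are accepted; the remaining four are rejected because the grammar has no production for negative zero, leading zeros, fractions, or exponents.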

Whitespace is not permitted between tokens, and leading and trailing whitespace is likewise disallowed. The members production in object must list its keys in lexicographically sorted order. Strings are uninterpreted bytes, and only two byte values are escaped: the quote (hex 22) and the backslash (hex 5C). As a consequence, canonical JSON data may contain embedded control characters and nulls. It is suggested that Unicode strings be represented as the UTF-8 encoding of Unicode Normalization Form C (UAX #15). However, arbitrary content may be represented as a string, so string contents are not guaranteed to parse meaningfully as UTF-8.
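
The rules above can be sketched as a small encoder. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and sorting keys by their raw (unescaped) byte values is an assumption, since the text says only "lexicographically sorted order". Text strings are converted to NFC and UTF-8 as suggested; bytes objects are passed through uninterpreted.

```python
import unicodedata


def _to_bytes(text) -> bytes:
    """Convert str to NFC-normalized UTF-8; leave bytes untouched."""
    if isinstance(text, str):
        return unicodedata.normalize("NFC", text).encode("utf-8")
    if isinstance(text, bytes):
        return text
    raise TypeError("strings and keys must be str or bytes")


def _quote(raw: bytes) -> bytes:
    """Escape only backslash and quote, then wrap in quotes."""
    escaped = raw.replace(b"\\", b"\\\\").replace(b'"', b'\\"')
    return b'"' + escaped + b'"'


def encode_canonical(value) -> bytes:
    """Encode a Python value as canonical JSON bytes (illustrative sketch)."""
    if value is None:
        return b"null"
    if value is True:
        return b"true"
    if value is False:
        return b"false"
    if isinstance(value, int):
        return str(value).encode("ascii")
    if isinstance(value, (str, bytes)):
        return _quote(_to_bytes(value))
    if isinstance(value, (list, tuple)):
        return b"[" + b",".join(encode_canonical(v) for v in value) + b"]"
    if isinstance(value, dict):
        # Assumption: keys are ordered by their raw byte values.
        items = sorted(((_to_bytes(k), v) for k, v in value.items()),
                       key=lambda item: item[0])
        pairs = (_quote(k) + b":" + encode_canonical(v) for k, v in items)
        return b"{" + b",".join(pairs) + b"}"
    # Floats and other types have no canonical representation.
    raise TypeError(f"cannot encode {type(value).__name__}")


print(encode_canonical({"b": 1, "a": [True, None]}))
```

No whitespace is emitted anywhere, and a float raises TypeError because the grammar has no production for it.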

Notes

  • Floating point numbers are not allowed in canonical JSON. Neither are leading zeros or "minus 0" for integers.
  • All map keys must be quoted, and must appear in sorted order.
  • All whitespace is eliminated.
  • Trailing commas in members and elements are not allowed.
  • Only one escape mechanism, the backslash prefix, is defined for strings, and its use is mandatory for quote and backslash.
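
The notes above can be turned into a verifier that checks whether a byte string is already in canonical form, as security-conscious applications are advised to do. The sketch below is one possible approach under stated assumptions: all names are hypothetical, any value is accepted at the top level, keys are compared by their unescaped byte values, and duplicate keys are rejected. Very deeply nested input would exceed Python's recursion limit, which this sketch does not handle.

```python
import re

_INT = re.compile(rb"0|-?[1-9][0-9]*")
# A string: quotes around bytes that are neither quote nor backslash,
# or one of the two permitted escapes.
_STRING = re.compile(rb'"(?:[^"\\]|\\["\\])*"')
_ESCAPE = re.compile(rb'\\(["\\])')
_LITERALS = (b"true", b"false", b"null")


def _string(data: bytes, pos: int) -> int:
    match = _STRING.match(data, pos)
    if match is None:
        raise ValueError("bad string")
    return match.end()


def _array(data: bytes, pos: int) -> int:
    pos += 1  # skip "["
    if data[pos:pos + 1] == b"]":
        return pos + 1
    while True:
        pos = _value(data, pos)
        sep = data[pos:pos + 1]
        if sep == b"]":
            return pos + 1
        if sep != b",":
            raise ValueError("expected , or ]")
        pos += 1


def _object(data: bytes, pos: int) -> int:
    pos += 1  # skip "{"
    if data[pos:pos + 1] == b"}":
        return pos + 1
    previous = None
    while True:
        end = _string(data, pos)
        key = _ESCAPE.sub(rb"\1", data[pos + 1:end - 1])  # unescaped key bytes
        if previous is not None and key <= previous:
            raise ValueError("keys not in strictly sorted order")
        previous = key
        if data[end:end + 1] != b":":
            raise ValueError("expected :")
        pos = _value(data, end + 1)
        sep = data[pos:pos + 1]
        if sep == b"}":
            return pos + 1
        if sep != b",":
            raise ValueError("expected , or }")
        pos += 1


def _value(data: bytes, pos: int) -> int:
    first = data[pos:pos + 1]
    if first == b'"':
        return _string(data, pos)
    if first == b"{":
        return _object(data, pos)
    if first == b"[":
        return _array(data, pos)
    for literal in _LITERALS:
        if data.startswith(literal, pos):
            return pos + len(literal)
    match = _INT.match(data, pos)
    if match is None:
        raise ValueError("not a canonical value")
    return match.end()


def is_canonical(data: bytes) -> bool:
    """Return True if data is exactly one value in canonical form."""
    try:
        return _value(data, 0) == len(data)
    except ValueError:
        return False
```

For instance, is_canonical(b'{"a":[true,null],"b":1}') holds, while the same data with a space after the colon, with keys in the order "b", "a", or with a trailing comma is rejected.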

Compatibility

  • A previous version of this specification required strings to be valid Unicode, and relied on JSON's \u escape. This was abandoned because it does not allow arbitrary binary data to be represented in a string, and it does not preserve the identity of non-canonical Unicode strings.
  • A previous version of this specification suggested Normalization Form D for Unicode strings. This was changed to Normalization Form C to better comply with the recommendations of RFC 3987.
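
To make the difference between the two normalization forms concrete, the following sketch uses Python's standard unicodedata module to show that the same visible character produces different UTF-8 bytes, and therefore different hashes, depending on the form. The helper name nfc_utf8 is hypothetical.

```python
import unicodedata


def nfc_utf8(text: str) -> bytes:
    """Return the UTF-8 encoding of text after NFC normalization."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


composed = "\u00e9"      # "é" as a single code point (NFC)
decomposed = "e\u0301"   # "e" followed by a combining acute accent (NFD)

print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
print(nfc_utf8(decomposed))        # b'\xc3\xa9'
```

Since strings are treated as uninterpreted bytes, a canonical JSON processor will not perform this normalization itself; an application that follows the suggestion would normalize before encoding.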

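Implementations

  • canonical-json (JavaScript): https://github.com/mirkokiefer/canonical-json
      ◦ This does not appear to be a faithful implementation of this specification. CScott 06:01, 27 February 2015 (UTC)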