In the Minefield of JSON Parsing

[2016-10-28] Presentation at Soft-Shake Conference, Geneva

Session Description

JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming. But how well do you really know JSON? We'll read the specifications and write test cases together. We'll test common JSON libraries against our test cases. I'll show that JSON is not the easy, idealised format many believe it to be. Indeed, I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of service, mainly because JSON libraries rely on specifications that have evolved over time and that leave many details loosely specified or not specified at all.

Table of Contents

  1. JSON Specifications
  2. The Tests
    2.1 Structure
    2.2 Numbers
    2.3 Arrays
    2.4 Objects
    2.5 Strings
  3. Testing Architecture
  4. Results and Comments
    4.1 Reaching application limits
  5. STJSON
  6. Conclusion

1. JSON Specifications

JSON is the de facto serialization standard when it comes to sending data over HTTP, both in modern web sites and mobile applications.

"Discovered" in 2001 by Douglas Crockford, the JSON specification is so short and simple that Crockford printed business cards that hold the whole grammar on their back.

[Image: the JSON grammar on the back of Crockford's business card]

Pretty much all Internet users and programmers use JSON, yet few actually agree on how JSON should work. The conciseness of the grammar leaves many aspects undefined, and several specifications exist, let alone their diverging interpretations.

Crockford chose not to version the JSON definition:

Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.

Yet JSON is defined in four different specifications:

  1. 2002 - json.org, and the business card
  2. 2006 - IETF RFC 4627 https://tools.ietf.org/html/rfc4627
  3. 2013 - ECMA 404 http://www.ecma-international.org/publications/standards/Ecma-404.htm and 262 http://www.ecma-international.org/publications/standards/Ecma-262.htm
  4. 2014 - IETF RFC 7159 https://tools.ietf.org/html/rfc7159

In fact, IETF RFC 7159 was preceded by RFC 7158, but RFC 7158 was erroneously dated "March 2013" instead of "March 2014", and RFC 7159 was released to fix the typo.

The main difference between these specifications is that RFC 4627 required a JSON document to consist of only an object or an array. Later specifications lifted this restriction and allow JSON documents to hold single values such as a number or a string.

Despite clarifying several points, such as allowing any JSON value at the top level of a document, RFC 7159 contains several approximations and leaves many details loosely specified.

RFC 7159 mentions that a design goal of JSON was to be "a subset of JavaScript"; it turns out that it is actually not. Specifically, JSON allows the Unicode line terminators U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR to appear unescaped. But JavaScript specifies that string literals cannot contain line terminators (ECMA-262, 7.8.4 String Literals), and line terminators include U+2028 and U+2029 (7.3 Line Terminators). The single fact that these two characters are allowed unescaped in JSON strings while they are not in JavaScript implies that JSON is not a subset of JavaScript, despite the JSON design goals.
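
As a quick illustration of this discrepancy, here is the behaviour of Python's `json` module (used throughout this article's examples as a convenient reference parser; observed on CPython 3.x), which happily accepts an unescaped U+2028, exactly as the JSON grammar allows:

```python
import json

# U+2028 LINE SEPARATOR, unescaped inside a JSON string: legal JSON,
# but illegal inside a (pre-ES2019) JavaScript string literal.
s = json.loads('"\u2028"')
assert s == '\u2028'
```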

Also, RFC 7159 is unclear about how a JSON parser should treat extreme number values, malformed Unicode strings, or objects with similar keys and/or values, or how it should handle recursion depth. Some corner cases are explicitly left free to implementations, while others suffer from contradictory statements.

To illustrate the poor precision of RFC 7159, I wrote a corpus of JSON test files and documented how selected JSON parsers handle these files. You'll see that deciding whether a test should pass or not is far from trivial, and that I did not find two parsers that exhibit the very same behaviour.

2. The Tests

In this section, I discuss some interesting tests, and the rationale for deciding whether they should be accepted or rejected by RFC 7159 compliant parsers, or whether RFC 7159 leaves parsers free to accept them or not.

2.1 Structure

Lonely values - Clearly, lonely values such as 123 or "asd" must pass. In practice, many popular parsers still implement RFC 4627 and won't parse lonely values.
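
Python's `json` module, for instance, implements RFC 7159 and accepts lonely values (shown here purely as an illustration, not as part of the original test harness):

```python
import json

# RFC 7159 allows any JSON value as a top-level document,
# not only objects and arrays as RFC 4627 did.
assert json.loads('123') == 123
assert json.loads('"asd"') == "asd"
assert json.loads('true') is True
```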

Trailing commas - Trailing commas such as in [123,] or {"a":1,} are not part of the grammar, so these files should not pass, right? The thing is that RFC 7159 allows parsers to support "extensions" (section 9), although it does not define these extensions. In practice, trailing commas are a common extension.

Comments - Comments are not part of the grammar. Crockford removed them from early specifications. Yet, they are still another common extension. Some parsers allow trailing comments [1]//xxx, or even inline comments [1,/*xxx*/2].
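
Neither trailing commas nor comments are implemented as extensions by Python's `json` module, for example, which sticks to the grammar on both counts (a sketch; behaviour observed on CPython 3.x):

```python
import json

def parses(doc):
    """Return True if `doc` is accepted by the parser."""
    try:
        json.loads(doc)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

assert not parses('[123,]')        # trailing comma in an array
assert not parses('{"a":1,}')      # trailing comma in an object
assert not parses('[1]//xxx')      # trailing comment
assert not parses('[1,/*xxx*/2]')  # inline comment
```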

Unclosed Structures - These tests cover everything that is opened and not closed, such as [ or [1,{,3]. They are clearly invalid and must fail.

Nested Structures - Structures may contain other structures. An array may contain other arrays. The first element can be an array, whose first element is also an array, etc., like Russian dolls [[[[[]]]]]. RFC 7159 allows parsers to set limits on the maximum depth of nesting (section 9).

In practice, several parsers set no such limit and crash when they meet such files. For example, Xcode itself will crash when you drag in a .json file containing 10000 times the character [, most probably because of its JSON syntax highlighter.

// TODO: illustrate
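
As one illustration, Python's `json` module recurses once per nesting level and gives up with a RecursionError rather than crashing the process, which is one way of setting the limit RFC 7159 allows (observed on CPython 3.x):

```python
import json

deep = '[' * 100000 + ']' * 100000  # 100000 levels of nesting

try:
    json.loads(deep)
    raised = False
except RecursionError:
    # The parser aborts when the interpreter's recursion limit
    # is reached, instead of overflowing the native stack.
    raised = True

assert raised
```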

White Spaces - The RFC 7159 grammar defines white space as 0x20 (space), 0x09 (tab), 0x0A (line feed) and 0x0D (carriage return). It allows white space before and after "structural characters" []{}:,. So, we'll write passing tests like 20[090A]0D and failing ones including all kinds of white space that are not explicitly allowed, such as 0x0C form feed or [E281A0], which is the UTF-8 encoding of U+2060 WORD JOINER.

Note that the underlined values represent the hex values of the bytes.
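
Python's `json` module, for example, accepts exactly these four white space characters and nothing else (an illustration, not the original test suite):

```python
import json

# 0x20, 0x09, 0x0A, 0x0D around structural characters: fine.
assert json.loads(' [ 1 ,\t2\n]\r') == [1, 2]

# 0x0C FORM FEED is not JSON white space, so this must fail.
try:
    json.loads('[\x0c1]')
    assert False, "form feed was accepted"
except ValueError:
    pass
```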

2.2 Numbers

NaN and Infinity - Strings describing special numbers such as NaN or Infinity are not part of the JSON grammar. However, several parsers accept them, which is allowed since it is an "extension" (section 9). Test files will also test -NaN and -Infinity.
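
Python's `json` module is one such parser: it accepts NaN and Infinity by default, and a custom `parse_constant` hook is the caller's way to reject the extension (a sketch, observed on CPython 3.x):

```python
import json
import math

# NaN and Infinity are not in the grammar, but are accepted
# by default as an RFC 7159 "extension".
v = json.loads('[NaN, Infinity, -Infinity]')
assert math.isnan(v[0])
assert v[1] == math.inf and v[2] == -math.inf

# A parse_constant hook lets the caller reject these tokens.
def no_constants(name):
    raise ValueError("special number not allowed: " + name)

try:
    json.loads('[NaN]', parse_constant=no_constants)
    assert False
except ValueError:
    pass
```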

Hex Numbers - RFC 7159 doesn't allow hex numbers. Tests will include numbers such as 0xFF, and these tests must fail.

Range and Precision - What about numbers with a huge amount of digits? According to RFC 7159, "A JSON parser MUST accept all texts that conform to the JSON grammar" (section 9). However, according to the same paragraph, "An implementation may set limits on the range and precision of numbers.". So, it is unclear to me whether parsers are allowed to raise errors when they meet 1e9999 or 0.0000000000000000000000000000001.

Exponential format - Parsing exponential notation can be surprisingly hard. Here are some valid contents: [0E0], [0e+1]; and invalid ones: [1.0e+], [0E] and [1eE2].
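
The three number cases above can all be observed on Python's `json` module, whose answer to the range question is to silently overflow huge literals to IEEE 754 infinity rather than raise, one legal choice among several (illustration only):

```python
import json

# Hex numbers are not in the grammar and must fail.
try:
    json.loads('[0xFF]')
    assert False
except ValueError:
    pass

# Huge values overflow to infinity, tiny values lose precision silently.
assert json.loads('[1e9999]') == [float('inf')]
assert json.loads('[0.0000000000000000000000000000001]') == \
    [float('0.0000000000000000000000000000001')]

# Exponential notation: valid vs. invalid forms.
assert json.loads('[0E0]') == [0.0]
assert json.loads('[0e+1]') == [0.0]
for bad in ('[1.0e+]', '[0E]', '[1eE2]'):
    try:
        json.loads(bad)
        assert False, bad
    except ValueError:
        pass
```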

2.3 Arrays

Most edge cases regarding arrays are opening/closing issues and nesting limit. These cases were discussed in section 2.1 Structure. Passing tests will include [[],[[]]], while failing tests will be like ] or [[]]].

2.4 Objects

Duplicated Keys - RFC 7159 says that "The names within an object should be unique." (section 4). It does not prevent parsing objects where the same key appears several times, as in {"a":1,"a":2}, but lets parsers decide what to do in this case. The same section 4 even mentions that "(some) implementations report an error or fail to parse the object", without saying clearly whether failing to parse such objects is compliant with the RFC, especially given section 9: "A JSON parser MUST accept all texts that conform to the JSON grammar.".

Variants of this special case include same key - same value {"a":1,"a":1}, and similar keys or values, where the similarity depends on how you compare strings. For example, the keys may be binary different but equivalent according to Unicode NFC normalization, such as in {"C3A9":"NFC","65CC81":"NFD"}, where both keys read "é". Tests will also include {"a":0,"a":-0}.
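
Python's `json` module, for instance, silently keeps the last value, but its `object_pairs_hook` parameter lets the caller implement any other policy, such as rejecting duplicates outright (a sketch, not mandated by RFC 7159):

```python
import json

# Default policy: last value wins, earlier duplicates are dropped.
assert json.loads('{"a":1,"a":2}') == {"a": 2}

# object_pairs_hook sees every key/value pair in order, so a caller
# can detect duplicates and report an error instead.
def reject_duplicates(pairs):
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError("duplicate key: %r" % key)
        obj[key] = value
    return obj

try:
    json.loads('{"a":1,"a":2}', object_pairs_hook=reject_duplicates)
    assert False
except ValueError:
    pass
```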

2.5 Strings

File Encoding - "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8" section 8.1.

Passing tests will include text encoded in these three encodings. UTF-16 and UTF-32 texts will also include both their big-endian and little-endian variants.

Failing tests will include a string encoded in ISO-Latin-1.
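
Python's `json` module, for example, auto-detects the three allowed encodings when given bytes, and chokes on ISO-Latin-1 input, whose bytes are mistaken for UTF-8 (illustration only, observed on CPython 3.6+):

```python
import json

doc = '["é"]'

# UTF-8, UTF-16 and UTF-32 input is detected and decoded transparently.
for encoding in ('utf-8', 'utf-16', 'utf-32'):
    assert json.loads(doc.encode(encoding)) == ["é"]

# ISO-Latin-1 is not an allowed JSON encoding: decoding fails.
try:
    json.loads(doc.encode('latin-1'))
    assert False
except UnicodeDecodeError:
    pass
```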

// TODO: add these files

Byte Order Mark - While "Implementations MUST NOT add a byte order mark to the beginning of a JSON text." section 8.1, "implementations (...) MAY ignore the presence of a byte order mark rather than treating it as an error".

Failing tests will include a plain UTF-8 BOM with no other content. Tests with implementation defined results will include a UTF-8 BOM with a UTF-8 text, but also a UTF-8 BOM with a UTF-16 text, and a UTF-16 BOM with a UTF-8 text.
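
Python's `json` module shows how inconsistent this can get even within one library: it ignores a UTF-8 BOM on bytes input but rejects the same BOM on str input (a sketch, observed on CPython 3.x):

```python
import json

# A UTF-8 BOM in front of a bytes payload is quietly ignored...
assert json.loads(b'\xef\xbb\xbf{"a": 1}') == {"a": 1}

# ...but the same BOM in a str payload is treated as an error.
try:
    json.loads('\ufeff{"a": 1}')
    assert False
except ValueError:
    pass

# A lone BOM with no other content must fail in any case.
try:
    json.loads(b'\xef\xbb\xbf')
    assert False
except ValueError:
    pass
```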

// TODO: add these files

Control Characters - Control characters must be escaped, and are defined as U+0000 through U+001F (section 7). This range does not include 0x7F DEL, which may be part of other definitions of control characters. That is why passing tests include ["7F"].
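
Python's `json` module follows the RFC 7159 range exactly, and exposes a `strict` flag to relax it (illustration only):

```python
import json

# 0x7F DEL is outside U+0000..U+001F, so it may appear unescaped.
assert json.loads('["\x7f"]') == ["\x7f"]

# 0x1F is in the control range and must be escaped.
try:
    json.loads('["\x1f"]')
    assert False
except ValueError:
    pass

# strict=False turns the requirement off (a non-conforming extension).
assert json.loads('["\x1f"]', strict=False) == ["\x1f"]
```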

Escape - "All characters may be escaped" (section 7), but some MUST be escaped: quotation mark, reverse solidus and control characters. Failing tests will include the escape character without the escaped value, or with an incomplete escaped value. Examples: ["\"], ["\, [\.

The escape character can be used to represent codepoints in the Basic Multilingual Plane (\u005C). Passing tests will include the zero character \u0000, which may cause issues in C-based parsers. Failing tests will include capital U \U005C, non-hexadecimal escaped values \u123Z and incomplete escaped values \u123.
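
Here is how Python's `json` module handles these cases, escaped NUL included (illustration only):

```python
import json

# \u0000 is a valid escape and yields a NUL character in the string,
# which can confuse C-based parsers that treat NUL as a terminator.
assert json.loads('["\\u0000"]') == ["\x00"]

# Truncated or malformed escapes must fail.
for bad in ('["\\"]',        # escape character with nothing valid after it
            '["\\U005C"]',   # capital U is not a valid escape
            '["\\u123Z"]',   # non-hexadecimal digit
            '["\\u123"]'):   # incomplete escaped value
    try:
        json.loads(bad)
        assert False, bad
    except ValueError:
        pass
```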

// TODO: Discuss malformed Unicode

Codepoints outside of the BMP are represented with their escaped UTF-16 surrogates: U+1D11E becomes \uD834\uDD1E. Passing tests will include single surrogates. Single surrogates are valid JSON according to the grammar, yet the string to be produced is not defined. According to the Unicode standard, invalid codepoints should be replaced by U+FFFD REPLACEMENT CHARACTER.
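
Python's `json` module illustrates one possible answer: paired surrogates are combined into the intended codepoint, while a lone surrogate is passed through unchanged, neither rejected nor replaced by U+FFFD (observed on CPython 3.x):

```python
import json

# A valid surrogate pair decodes to the intended astral codepoint.
assert json.loads('"\\uD834\\uDD1E"') == '\U0001D11E'  # U+1D11E

# A lone surrogate is accepted and kept as-is in the resulting str,
# even though it is not a valid Unicode scalar value.
s = json.loads('"\\uD800"')
assert len(s) == 1 and ord(s) == 0xD800
```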

Single Unpaired UTF-16 Surrogate - impl. defined

mention errata

Malformed Unicode

   - JSON text SHALL be encoded in Unicode.
        -> bytes not encoding Unicode (e.g. U+D800 and noncharacters U+FDD0 and U+10FFFE) -> invalid
        -> possible alternative: replace each maximal subpart of an ill-formed subsequence with U+FFFD
   - An implementation may set limits on the length and character contents of strings.
   - While the Unicode Standard permits the mutation of the original JSON (i.e., substituting U+FFFD for ill-formed Unicode), RFC 4627 is silent on this issue.

Other Undefined Behaviours

3. Testing Architecture

4. Results and Comments

aaa

  1. aa
  2. aa
  3. aa

5. STJSON

6. Conclusion

Specifications are ambiguous.

JSON parsing is a minefield.

Next: investigate ProtoBuf https://github.com/apple/swift-protobuf-plugin