Json vs. simplejson vs. ujson

jmoiron · on April 6, 2015

When I wrote the same kind of article in Nov 2011 [1], I came to similar conclusions; ujson was blowing everyone away.

However, after swapping a fairly large and json-intensive production spider over to ujson, we noticed a large increase in memory use.

When I investigated, I discovered that simplejson reused allocated string objects, so when parsing/loading you basically got string compression for repeated string keys.

The effects were pretty large for our dataset, which was all API results from various popular websites and featured lots of lists of things with repeating keys; on a lot of large documents, the loaded mem object was sometimes 100M for ujson and 50M for simplejson. We ended up switching back because of this.

[1] http://jmoiron.net/blog/python-serialization/

jyotiska · on April 6, 2015

Hey Jason. That's pretty interesting. I have also noticed similar things, but in my case we needed faster loading/unloading for some cases, hence ujson.

thethimble · on April 6, 2015

I would love to see benchmarks in PyPy as well! I wonder how well a JIT would handle de/serialization.

JoshTriplett · on April 6, 2015

Seems like there should be a standard Python mechanism for constructing "atoms" or "symbols" that automatically get commoned up.

tlb · on April 6, 2015

Check out intern(): https://docs.python.org/2/library/functions.html#intern

JoshTriplett · on April 6, 2015

Appears to be deprecated though.

icebraining · on April 6, 2015

Not really, it was just moved to the sys module: https://docs.python.org/3.4/library/sys.html#sys.intern
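A rough sketch of how sys.intern could recover the key sharing described at the top of the thread, for a parser that hands back a fresh string object for every key. The helper name intern_keys and the sample keys are made up for illustration, and whether this pays off depends on the parser and the data:

```python
import sys


def intern_keys(obj):
    """Return a copy of parsed JSON data whose dict keys are interned."""
    if isinstance(obj, dict):
        # sys.intern returns one shared object per distinct key string.
        return {sys.intern(k): intern_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [intern_keys(v) for v in obj]
    return obj


# Two equal keys built at runtime, so they start out as separate objects.
k1 = "".join(["user", "_id"])
k2 = "".join(["user_", "id"])
docs = intern_keys([{k1: 1}, {k2: 2}])

first_key = next(iter(docs[0]))
second_key = next(iter(docs[1]))
print(first_key is second_key)
```

Only the keys are interned here; values are left alone, since repeated keys are where the duplication was reported.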

rat87 · on April 6, 2015

I'm pretty sure symbols are not meant to be created from "user" input where the user is untrusted; can't this lead to DDoS attacks? Same thing for interning. De-duping doesn't have that risk.

munificent · on April 6, 2015

Lua has an interesting approach here. In Lua, all strings are interned. If you have "two" strings that consist of the same bytes, you are guaranteed that they have the same address and are the same object. Basically, every time a string is created from some operation, it's looked up in a hash table of the existing strings and if an identical one is found, that gets reused.

However, that hash table stores weak references to those strings. If nothing else refers to a string, the GC can and will remove it from the string table.

This gives you great memory use for strings and optimally fast string comparisons. The cost is that creating a string is probably a bit slower because you have to check the string table for the existing one first.

It's an interesting set of trade-offs. I think it makes a lot of sense for Lua which uses hash tables for everything, including method dispatch and where string comparison must be fast. I'm not sure how much sense it would make for other languages.

TheLoneWolfling · on April 6, 2015

A problem with that approach:

You can discover what internal strings are held in a web application via a timing attack.

Better hope you never hold onto a reference to internal credentials inside the application! (Say... DB username / password? Passwords before they're hashed? Etc.)

jarman · on April 6, 2015

Depends on symbol implementations and intended usage.

For example, Erlang symbols are deeply ingrained in the language, and the VM doesn't even garbage collect them, so creating symbols from user data is basically giving the user a 'crash VM' button.

On the other hand, if symbols are treated as just another data type, like a string with some optimizations, no such problems should arise.

michaelmior · on April 6, 2015

I think most JSON structures are unlikely to have user input be used as keys. This is also likely where there would be the most benefit from interning since keys are often repeated many times.

borman · on April 6, 2015

The problem with all the (widely known) non-standard JSON packages is that they all have their gotchas.

cjson's way of handling unicode is just plain wrong: it uses utf-8 bytes as unicode code points. ujson cannot handle large numbers (somewhat larger than 2^63; I've seen a service that encodes unsigned 64-bit hash values in JSON this way: ujson fails to parse its payloads). With simplejson (when using the speedups module), a string's type depends on its value, i.e. it decodes strings as 'str' type if their characters are ascii-only, but as 'unicode' otherwise; strangely enough, it always decodes strings as unicode (like the standard json module) when speedups are disabled.
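For comparison with the large-number gotcha, a minimal sketch showing that the standard json module parses integers into Python's arbitrary-precision int, so an unsigned 64-bit value round-trips. The payload and the "hash" field name are invented for the example:

```python
import json

# Hypothetical payload: an unsigned 64-bit hash, larger than 2**63.
payload = '{"hash": 18446744073709551615}'

doc = json.loads(payload)
print(doc["hash"] == 2**64 - 1)

# Encoding the parsed document gives back the same text.
round_tripped = json.dumps(doc)
print(round_tripped == payload)
```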

smerritt · on April 6, 2015

Agreed, especially about simplejson. I work on a project that uses simplejson, and it leads to ugly type checking all over the place because you never know what your JSON string got turned into. For example:

https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...

and https://github.com/openstack/swift/blob/39c1362a4f5a7df75730...

and many more just like those.

The worst part is the bugs that appear or disappear depending on whether simplejson's speedups module is in use or not.

TazeTSchnitzel · on April 6, 2015

There are so many poorly-written JSON decoders out there. I've had the misfortune of fixing two of PHP's to follow JSON's case-sensitivity and whitespace rules properly.

Drdrdrq · on April 6, 2015

I disagree with the conclusion. How about this: you should use the tool that most of your coworkers already know and which has large community support and adequate performance. In other words, stop fooling around and use the json library. If (IF!!!) you find performance inadequate, try the other libraries. And most of all, if optimization is your goal: measure, measure and measure! </rant>
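A minimal sketch of the "measure" step using only the standard library. The record shape, the sizes, and the bench helper are assumptions for illustration; a real measurement should use your own payloads and whichever libraries you are comparing:

```python
import json
import timeit

# Made-up sample data; substitute a representative production payload.
records = [{"id": i, "name": "user%d" % i, "tags": ["a", "b"]} for i in range(1000)]
encoded = json.dumps(records)


def bench(func, arg, number=20):
    """Best per-call time in seconds over three runs of `number` calls each."""
    runs = timeit.repeat(lambda: func(arg), number=number, repeat=3)
    return min(runs) / number


dumps_time = bench(json.dumps, records)
loads_time = bench(json.loads, encoded)
print("dumps: %.6fs  loads: %.6fs" % (dumps_time, loads_time))
```

Taking the minimum of several runs reduces noise from other processes; swapping in another library only requires passing its dumps and loads functions to bench.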

jbergstroem · on April 6, 2015

I just want to add another library in here which, at least in my world, is replacing json as the number one configuration and serialisation format. It's called libucl and its main consumer is probably the new package tool in FreeBSD: `pkg`

Its syntax is nginx-like but can also parse strict json. It's pretty fast too.

More info here: https://github.com/vstakhov/libucl

twic · on April 6, 2015

The automatic array creation feature [1] seems misguided. It means that as a programmer consuming a configuration file, i can't know whether a given field will be a scalar or an array. I recently worked on a JavaScript API that had that behaviour, and it was a pain.

Apart from that, though, this looks like a really good format.

[1] https://github.com/vstakhov/libucl#automatic-arrays-creation

michaelmior · on April 6, 2015

Out of curiosity, what do you mean by "my world"? Is this a particular domain you're working in, or just your personal usage?

leojg · on April 6, 2015

He is obviously an alien and he is trying to introduce the tools he uses in his home world.

OnTopic: I think it is an unusual way of saying "the environment I use"

jbergstroem · on April 6, 2015

Was just referring to the set of tools and libraries I surround myself with.

stock_toaster · on April 6, 2015

At one point I looked into using it, but there weren't any Python bindings at the time, and I didn't have the time (for the project) to write any. Are there any good language libs for it these days?

jbergstroem · on April 6, 2015

Conveniently enough the first python extension landed in head 10 days ago: https://github.com/vstakhov/libucl/pull/68

stock_toaster · on April 6, 2015

oh nice! thanks for the link. :D

crdoconnor · on April 6, 2015

Any particular reason to use this over YAML for configuration?

TillE · on April 6, 2015

Macros seem to be the key distinguishing feature. It's a good idea, sort of borrowing from template engines.

crdoconnor · on April 7, 2015

Sure, but why not use a well known and solid templating engine if you're going to do templating?

Ansible combines YAML with jinja2 to do this type of stuff, for instance.

I'm not totally certain, but I think that might end up being simpler, more expressive and more powerful than this.

wodenokoto · on April 6, 2015

How hard is it to draw a bar graph? I'd imagine it is easier than creating an ASCII table and then turning that into an image, but I've never experimented with the latter.

jyotiska · on April 6, 2015

You are right. I had to create the ASCII tables, because I was not able to draw tables on Medium.

icebraining · on April 6, 2015

What I found interesting is that the original version had plain HTML tables, not ascii in images: http://blog.dataweave.in/post/87589606893/json-vs-simplejson...

fiatjaf · on April 6, 2015

https://chartspree.io/

azinman2 · on April 6, 2015

Not helpful.

gpvos · on April 6, 2015

Actually sounds like an honest question to me.

azinman2 · on April 6, 2015

"How hard is it" is sarcastic here.

metaphorm · on April 6, 2015

I didn't read it that way. Try to be more charitable in your interpretation of things other people say on the internet.

gpvos · on April 6, 2015

Yes, I considered that interpretation and found it the less probable one. Duh.

chojeen · on April 6, 2015

Maybe this is a dumb question, but is json (de)serialization really a bottleneck for python web apps in the real world?

MagicWishMonkey · on April 6, 2015

Depends on the app. My previous job required processing thousands of address book contact records uploaded to the server in a massive list. It was not unusual for some of these objects to exceed 10mb (when serialized to disk).

The default json module took close to 5 seconds to deserialize the payload once it hit the server, while ujson could do the same work in a fraction of the time (less than a second). 5 seconds might not seem like a whole lot when the import process as a whole could take 30 seconds or so, but when the user is stuck staring at their device it makes sense to cut down the response time any way you can.

est · on April 6, 2015

For some really large JSONs out there in the wild, yes, it's a big bottleneck.

We ended up using ijson.

metaphorm · on April 6, 2015

depends on what you're doing.

for the typical AJAX call for some rows of data selected from a datastore and JSON encoded, then no the JSON encoding is not the bottleneck, the network latency and database io time dominate the time it takes to JSON encode the data.

however, consider an alternative kind of task that might, for example, produce a big JSON dump of thousands of records. this is fairly typical of a data export of some kind. the network and database time for this request is the same as for the smaller one, but now instead of JSON encoding 50 records you're encoding 50000 records. it can start to add up. a poorly optimized JSON library will add multiple full seconds to your response time here.

michaelmior · on April 6, 2015

> ultrajson ... will not work for un-serializable collections

So I can't serialize things with ultrajson that aren't serializable? I must be missing something in this statement.

> The verdict is pretty clear. Use simplejson instead of stock json in any case...

The verdict seems clear (based solely on the data in the post) that ultrajson is the winner.

apendleton · on April 6, 2015

> So I can't serialize things with ultrajson that aren't serializable? I must be missing something in this statement.

This might not be what they're talking about, but I did run into what might be the same issue when looking at ujson before. The builtin JSON module lets you define custom serializations for types that aren't natively JSON-serializable; we had an application that did that with datetime objects, encoding them as ISO 8601 date strings. ujson doesn't support anything like that; you have to make sure everything is one of the JSON types already before encoding.
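A small sketch of the custom-serialization hook described above, using the standard json module's default parameter. The function name encode_extra and the sample event are invented for the example:

```python
import json
from datetime import datetime


def encode_extra(obj):
    """Called by json.dumps for objects it cannot serialize natively."""
    if isinstance(obj, datetime):
        return obj.isoformat()  # ISO 8601 string
    raise TypeError("not JSON serializable: %r" % (obj,))


event = {"name": "deploy", "when": datetime(2015, 4, 6, 12, 30)}
encoded = json.dumps(event, default=encode_extra)
print(encoded)
```

With a library that has no such hook, the same conversion would have to be applied to the data before encoding.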

kelseyfrancis · on April 6, 2015

> The verdict seems clear (based solely on the data in the post) that ultrajson is the winner.

ultrajson isn't a drop-in replacement, though, because it doesn't support sort_keys.

michaelmior · on April 6, 2015

Fair enough. Although I'm not sure why one would want that behaviour given that there is no guarantee of ordering when a particular JSON file is processed with any other library.

mdaniel · on April 6, 2015

I don't know what they do with it, but it's handy for writing tests against an expected JSON file: assert json.dumps(expected, sort_keys=True) == json.dumps(obj, sort_keys=True) # where expected was json.load()-ed and obj was produced by the function
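The one-liner above, expanded into a runnable sketch with the standard json module. The sample data is made up; the point is only that sort_keys=True makes the serialized text independent of key insertion order:

```python
import json

# `expected` stands in for a fixture loaded from a JSON file.
expected = json.loads('{"b": 1, "a": [1, 2]}')
# `obj` stands in for the value produced by the function under test.
obj = {"a": [1, 2], "b": 1}

canonical_expected = json.dumps(expected, sort_keys=True)
canonical_obj = json.dumps(obj, sort_keys=True)
assert canonical_expected == canonical_obj
print(canonical_obj)
```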

michaelmior · on April 6, 2015

I don't understand your example and why you wouldn't just do assert expected == obj.

jroseattle · on April 6, 2015

> keep in mind that ultrajson only works with well defined collections and will not work for un-serializable collections. But if you are dealing with texts, this should not be a problem.

Well-defined collections? As in, serializable? Well sure, that's requisite for the native json package as well as simplejson (as far as I can recall -- haven't used simplejson in some time.)

But does "texts" refer to strings? As in, only one data type? The source code certainly supports other types, so I wonder what this statement refers to.

random567 · on April 6, 2015

ujson doesn't error out if you have a collection that isn't serializable so you can lose individual keys. It also has issues with ints and floats that are too big (just fails out)

foota · on April 6, 2015

I disagree with the verdict at the end of the article, it seems like json would be better if you were doing a lot of dumping? And also for the added maintenance guarantee of being an official package.

jkire · on April 6, 2015

> We have a dictionary with 3 keys

What about larger dictionaries? With such a small one I would be worried that a significant proportion of the time would be simple overhead.

[Warning: Anecdote] When we were testing out the various JSON libraries we found simplejson much faster than json for dumps. We used large dictionaries.

Was the simplejson package using its optimized C library?

jkire · on April 6, 2015

> In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps()

I completely failed to read this the first time I went through. I guess this is equivalent to dumping bigger dictionaries.

> [Warning: Anecdote] When we were testing out the various JSON libraries we found simplejson much faster than json for dumps.

Turns out we were using sort_keys=True option, which apparently makes simplejson much faster than json.

ktzar · on April 6, 2015

The usage of percentages in the article is wrong. 6 is not 150% faster than 4.

alexhill · on April 6, 2015

Yeah, this error is made consistently throughout the article and the author should fix it. It serves to inflate the project's performance by as much as two-thirds, and some people will see this as intentionally misleading and write your project off because of it. ultrajson looks to be way faster; don't alienate people by fudging the numbers.

jyotiska · on April 6, 2015

Author here. Thanks for pointing this out. I wrote this article long back and forgot about it. I have made the changes.

financequoll · on April 6, 2015

50% would be a bit closer.

anon4 · on April 6, 2015

6 = 1.5 * 4. I'm not seeing the problem.

icebraining · on April 6, 2015

"150% faster" implies the speed is 2.5 times, not 1.5.

L-four · on April 6, 2015

speed = speed * 1.5 Is not the same as speed += speed * 1.5
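The arithmetic in this subthread, written out as a short sketch. The helper name percent_faster is made up; the numbers 6 and 4 are the ones from the comment that started it:

```python
def percent_faster(new_speed, old_speed):
    """Percentage by which new_speed exceeds old_speed."""
    return (new_speed - old_speed) / old_speed * 100


# 6 is 1.5 times 4, which makes it 50% faster, not 150% faster.
ratio = 6 / 4
gain = percent_faster(6, 4)
print(ratio, gain)

# "150% faster" than 4 would mean 4 + 1.5 * 4.
claimed = 4 + 1.5 * 4
print(claimed)
```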

stared · on April 6, 2015

But ujson comes at a price of slightly reduced functionality. For example, you cannot set indent. (And I typically set indent for files <100MB, when working with third-party data, often manual inspection is necessary).

(BTW: I got tempted to try ujson exactly for the original blog post, i.e. http://blog.dataweave.in/post/87589606893/json-vs-simplejson...)

Plus, AFAIK, at least in Python 3 json IS simplejson (but a few versions older). So every comparison of these libraries is going to give different results over time (likely, with the difference getting smaller). Of course, simplejson is the newer version of the same thing, so it's likely to be better.
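For reference, a tiny sketch of the indent option mentioned above, as exposed by the standard json module. The sample document is invented:

```python
import json

doc = {"a": 1}

compact = json.dumps(doc)
pretty = json.dumps(doc, indent=2)  # easier to inspect by hand

print(compact)
print(pretty)
```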

willvarfar · on April 6, 2015

(My own due diligence when working with serialisation: http://stackoverflow.com/questions/9884080/fastest-packing-o...

I leave this here in case it helps others.

We had other focus such as good for both python and java.

At the time we went msgpack. As msgpack is doing much the same work as json, it just shows that the magic is in the code not the format..)

apu · on April 6, 2015

Also weird crashes with ultra json, lack of nice formatting in outputs, and high memory usage in some situations

dbenhur · on April 6, 2015

> Without argument, one of the most common used data model is JSON

JSON is a data representation, not a data model.

js2 · on April 6, 2015

I'll have to try ultrajson for my use case, but when I benchmarked pickle, simplejson and msgpack, msgpack came out the fastest. I also tried combining all three formats with gzip, but that did not help. Primarily I care about speed when deserializing from disk.

velox_io · on April 6, 2015

I know it goes against the grain, but I wish that binary json (UBJSON) had much more widespread usage. There's no reason tools can't convert it back to json for us old humans.

The speed difference between working with binary streams and parsing text is night and day.

zapov · on April 6, 2015

No it's not.

http://hperadin.github.io/jvm-serializers-report/report.html

akoumjian · on April 6, 2015

We took a look at ujson about a year ago and found that it failed loading even json structures that went 3 layers deep. I also recall issues handling unicode data.

It was a big disappointment after seeing these kinds of performance improvements.

MagicWishMonkey · on April 6, 2015

It kills me that the default JSON module is so slow, if you're working with large JSON objects you really have no choice but to use a 3rd party module because the default won't cut it.

bpicolo · on April 6, 2015

Python version? Library version? Results are meaningless without that info

fijal · on April 6, 2015

The standard JSON has an optimized version in PyPy (that does not beat ujson, but is a lot faster than the stdlib one in cpython)

UUMMUU · on April 6, 2015

was aware of simplejson but had not seen ultra json. This is awesome to see. Thanks for the writeup.

aaronem · on April 6, 2015

*(Python)