wren

a classy little scripting language

Performance

Even though most benchmarks aren’t worth the pixels they’re printed on, people seem to like them, so here’s a few:

Method Call

wren
0.12s 
luajit (-joff)
0.16s 
ruby
0.20s 
lua
0.35s 
python3
0.78s 
python
0.85s 

DeltaBlue

wren
0.13s 
python3
0.48s 
python
0.57s 

Binary Trees

luajit (-joff)
0.11s 
wren
0.22s 
ruby
0.24s 
python
0.37s 
python3
0.38s 
lua
0.52s 

Recursive Fibonacci

luajit (-joff)
0.10s 
wren
0.20s 
ruby
0.22s 
lua
0.28s 
python
0.51s 
python3
0.57s 

Shorter bars are better. Each benchmark is run ten times and the best time is kept. It only measures the time taken to execute the benchmarked code itself, not interpreter startup.

These were run on my MacBook Pro 2.3 GHz Intel Core i7 with 16 GB of 1,600 MHz DDR3 RAM. Tested against Lua 5.2.3, LuaJIT 2.0.2, Python 2.7.5, Python 3.3.4, ruby 2.0.0p247. LuaJIT is run with the JIT disabled (i.e. in bytecode interpreter mode) since I want to support platforms where JIT-compilation is disallowed. LuaJIT with the JIT enabled is much faster than all of the other languages benchmarked, including Wren, because Mike Pall is a robot from the future.

The benchmark harness and programs are here.

Why is Wren fast? #

Languages come in four rough performance buckets, from slowest to fastest:

  1. Tree-walk interpreters: Ruby 1.8.7 and earlier, Io, that interpreter you wrote for a class in college.

  2. Bytecode interpreters: CPython, Ruby 1.9 and later, Lua, early JavaScript VMs.

  3. JIT compiled dynamically typed languages: Modern JavaScript VMs, LuaJIT, PyPy, some Lisp/Scheme implementations.

  4. Statically typed languages: C, C++, Java, C#, Haskell, etc.

Most languages in the first bucket aren’t suitable for production use. (Servers are one exception, because you can throw more hardware at a slow language there.) Languages in the second bucket are fast enough for many use cases, even on client hardware, as the success of the listed languages shows. Languages in the third bucket are quite fast, but their implementations are breathtakingly complex, often rivaling that of compilers for statically-typed languages.

Wren is in the second bucket. If you want a simple implementation that’s fast enough for real use, this is the sweet spot. In addition, Wren has a few tricks up its sleeve:

A compact value representation #

A core piece of a dynamic language implementation is the data structure used for variables. It needs to be able to store (or reference) a value of any type, while also being as compact as possible. Wren uses a technique called NaN tagging for this.

All values are stored internally in Wren as small, eight-byte double-precision floats. Since that is also Wren’s number type, in order to do arithmetic, no conversion is needed before the “raw” number can be accessed: a value holding a number is a valid double. This keeps arithmetic fast.

To store values of other types, it turns out there’s a ton of unused bits in a NaN double. You can stuff a pointer for heap-allocated objects, with room left over for special values like true, false, and null. This means numbers, bools, and null are unboxed. It also means an entire value is only eight bytes, the native word size on 64-bit machines. Smaller = faster when you take into account CPU caching and the cost of passing values around.

Fixed object layout #

Most dynamic languages treat objects as loose bags of named properties. You can freely add and remove properties from an object after you’ve created it. Languages like Lua and JavaScript don’t even have a well-defined concept of a “type” of object.

Wren is strictly class-based. Every object is an instance of a class. Classes in turn have a well-defined declarative syntax, and cannot be imperatively modified. In addition, fields in Wren are private to the class—they can only be accessed from methods defined directly on that class.

Put all of that together and it means you can determine at compile time exactly how many fields an object has and what they are. In other languages, when you create an object, you allocate some initial memory for it, but that may have to be reallocated multiple times as fields are added and the object grows. Wren just does a single allocation up front for exactly the right number of fields.

Likewise, when you access a field in other languages, the interpreter has to look it up by name in a hash table in the object, and then maybe walk its inheritance chain if it can’t find it. It must do this every time since fields may be added freely. In Wren, field access is just accessing a slot in the instance by an offset known at compile time: it’s just adding a few pointers.

Copy-down inheritance #

When you call a method on an object, the method must be located. It could be defined directly on the object’s class, or it may be inheriting it from some superclass. This means that in the worst case, you may have to walk the inheritance chain to find it.

Advanced implementations do very smart things to optimize this, but it’s made more difficult by the mutable nature of the underlying language: if you can add new methods to existing classes freely or change the inheritance hierarchy, the lookup for a given method may actually change over time. You have to check for that which costs CPU cycles.

Wren’s inheritance hierarchy is static and fixed at class definition time. This means that we can copy down all inherited methods in the subclass when it’s created since we know those will never change. Method dispatch then just requires locating the method in the class of the receiver.

Method signatures #

Wren supports overloading by arity using its concept of signatures. This makes the language more expressive, but also faster. When a method is called, we look it up on the receiver’s class. If we succeed in finding it, we also know it has the right number of parameters.

This lets Wren avoid the extra checking most languages need to do at runtime to handle too few or too many arguments being passed to a method. In Wren, it’s not syntactically possible to call a method with the wrong number of arguments.

Computed gotos #

On compilers that support it, Wren’s core bytecode interpreter loop uses something called computed gotos. The hot core of a bytecode interpreter is effectively a giant switch on the instruction being executed.

Doing that using an actual switch confounds the CPU’s branch predictor—there is basically a single branch point for the entire interpreter. That quickly saturates the predictor and it just gets confused and fails to predict anything, which leads to more CPU stalls and pipeline flushes.

Using computed gotos gives you a separate branch point at the end of each instruction. Each gets its own branch prediction, which often succeeds since some instruction pairs are more common than others. In my rough testing, this makes a 5-10% performance difference.

A single-pass compiler #

Compile time is a relatively small component of a language’s performance: code only has to be compiled once but a given line of code may be run many times. However, fast compilation helps with startup speed—the time it takes to get anything up and running. For that, Wren’s compiler is quite fast.

It’s modeled after Lua’s compiler. Instead of tokenizing and then parsing to create a bunch of AST structures which are then consumed and deallocated by later phases, it emits code directly during parsing. This means it does minimal memory allocation during a parse and has very little overhead.

Why don’t other languages do this? #

Most of Wren’s performance comes from language design decisions. While it’s dynamically typed and dispatched, classes are relatively statically defined. That makes a lot of things much easier. Other languages have a much more mutable object model, and cannot change that without breaking lots of existing code.

Wren’s closest sibling, by far, is Lua. Lua is more dynamic than Wren which makes its job harder. Lua also tries very hard to be compatible across a wide range of hardware and compilers. If you have a C89 compiler for it, odds are very good that you can run Lua on it.

Wren cares about compatibility, but it requires C99 or C++98 and IEEE double precision floats. That may exclude some edge case hardware, but makes things like NaN tagging, computed gotos, and some other tricks possible.