Thursday, December 20, 2012

Just a regular expression

It's taken me a little time to really get my head around Javascript regular expressions. You have to be pretty on the ball when you're writing a parser.

In fact I managed to double the decoder speed with a couple of deft new functions, by thinking a little more carefully about the Javascript RegEx objects and their peculiarities.

At first, Javascript's RegEx object seems particularly bad for the task of generalized parsing, because parsing generally requires checking the next token, and the Javascript function calls seem to do either first-match or global searches, but not searches from a given index.

Then they seem like a great idea when you discover the 'sticky' option in the MDN documentation, which lets you do exactly this. Then they seem like a terrible idea again when you discover that only Firefox implements it.
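For reference, here's roughly what the sticky flag buys you when you can rely on it (just a sketch, assuming s is the input string and i is the current parse position):

// the 'y' flag anchors the match at lastIndex exactly
var rex = /[0-9]+/y;
rex.lastIndex = i;      // start matching at position i...
var m = rex.exec(s);    // ...and return null rather than go hunting further along the string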

Most people get disheartened here and go back to parsing the string themselves character by character. But regular expressions execute as compiled, highly optimized code. You're never going to beat that with 'case' statements.

So, back to the code mines we go, and some more digging turns up the "lastIndex" property on the RegEx objects, which is one of the more badly named Javascript properties (and there are a lot, trust me). It has some quite unexpected behavior, and it's got at least two purposes, but this property, in combination with the standard .exec() call, turns out to be exactly what we need.

exec() seems inappropriate (or at least very inefficient), because a LALR parser wants to check if the next chunk of string is the token it wants, and standard exec() with a global option can't be stopped from zooming away into the string, looking for it anywhere. Right to the end if need be.

But that's OK, because if we let it, and then remember where (or if) it found a match, then we don't have to ask again until we get past that point. It's still very inefficient at the point we first call it, but it gives us enough information not to have to call it again for a while.

It's obvious when you think about what happens near the end of the file. By then, many of the match expressions will have worked out they never appear again. They can just hang up their coats and say 'nope' if they're asked if there's anything left to do. It's called having a functional memory.

Yes yes, standard RegEx optimization tricks. Nothing new. That's not the point.

Once I put the optimization in, the code doubled in speed. (whoo!) But only on Safari (iOS) and Firefox (Windows). Chrome (Windows) and IE (Windows) continued running at exactly the same speed as before.

Now that's interesting.

What it suggests is that the Javascript engines in Chrome and IE had already implemented the identical optimization, but at a language interpreter level. I assume they detect the exact same RegEx object being applied to the exact same string object and they just re-start the matching from internally stored state.

But Safari and Firefox clearly haven't implemented this "same regex, same string" optimization, so when I explicitly do it myself it saves an enormous amount of time.

Here's the relevant bit of code. Don't worry about the lambdas.


function decode_rex(type, empty, rex) {
    // note: rex is expected to carry the global ('g') flag, so exec() honors lastIndex
    rex_registry.push(rex);
    return function(s, i) {
        var r = {ok: empty ? true : false, pos: i, len: 0, type: type};
        // are we past the point where we need to look for a new match?
        if(rex.lastIndex <= i) {
            // match the expression from here
            rex.lastIndex = i;
            var m = rex.exec(s);
            if(m instanceof Array) {
                // found it
                rex.matchIndex = m.index;
                rex.matchLength = m[0].length;
                rex.lastIndex = rex.matchIndex + 1; // safe consumer
            } else {
                // no more matches anywhere in the string
                rex.lastIndex = s.length;
                rex.matchIndex = -1;
                rex.matchLength = 0;
            }
        }
        // is the next match supposed to be now?
        if(rex.matchIndex === i) {
            // consume the match
            r.ok = true;
            r.len = rex.matchLength;
        }
        return r;
    };
}
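And a quick usage sketch, to show the shape of the thing (the rex_registry array and the 'number' token here are mine for illustration, not lifted from the real decoder):

// hypothetical setup: the registry used to reset all matchers between runs
var rex_registry = [];

// build a matcher for runs of digits; note the 'g' flag so lastIndex is honored
var match_digits = decode_rex('number', false, /[0-9]+/g);

// probe the string at position 3
var tok = match_digits("abc123", 3);
// tok -> { ok: true, pos: 3, len: 3, type: 'number' }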



Finite Loop

I was going to post more details about JSON-U yesterday, but right about then the entire concept was in pieces around my feet after discovering a few things about how the colon and semicolon are handled in reality. Don't ask.

But that's OK, because I reworked the grammar and managed to remove both special characters, plus two more, and added new capabilities besides. To the point of making URLs nearly Turing-complete. Ignore that for the moment.

There are a few last things I'm trying to figure out. I've got my encoding to reliably survive most kinds of common mangling, and it really takes some determined effort to make the parser fail, but there's always the lure of handling one more case now so you never have to worry about it again.

Oddly, by being so free to rip things out of the parser, I'm discovering ways of putting them back using only the things remaining. For example, I had an array constructor syntax, and a function call syntax. Then I removed arrays as an explicit data type, which is unheard of, but I kept the named function space (now called 'localizers') and defined the "unnamed function" to "return an array consisting of the parameters passed to the function".

Boom, explicit arrays are gone. Replaced with a "pseudo-call" to a function that creates them. So functions are more primal than arrays, in a data encoding. And since functions use the round brackets, the square brackets never need to appear at all, saving two more characters from ever being used.
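In plain Javascript terms, the 'unnamed function' behaves something like this (a sketch of the idea only, not the actual JSON-U decoder):

// the "unnamed function": called with any number of parameters,
// it simply hands them back as an array
function unnamed() {
    return Array.prototype.slice.call(arguments);
}
unnamed(1, "two", 3);   // -> [1, "two", 3]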

I've gone round and round, pulling things out and replacing them again (quoted strings are back, but only to read in user-hacked values, never as output), and it's like an infinite Rubik's cube where a couple more twists leave a scrambled mess, but then a few more moves and everything is harmonious and better matched than ever.

I'm down to the point where I have a test page that generates hundreds of random data structures, encodes them all, checks they go backwards again properly, and offers up a list for clicking. I can type random crap into the parse-as-you-type text boxes, and JSON-U tends to cope with this better (spends more time in a 'parsed' state that correctly represents the data) than JSON does.

Along the road I've made some funny choices. But they hopefully combine their powers in a good way. For example, here's one consequence:

If you repetitively 'unescape' canonical JSON-U, you get the same string.
If you encodeURI() canonical JSON-U, you get the same string.

That's actually a major ability, and can't be said for any encoding that has a backslash, percent sign, or space in it. Ever.

("Canonical" just means "the one true blessed way, not the way we do it when we're in a rush.")

The single remaining annoying issue turns up when someone uses encodeURIComponent() on something that ends up in the hash fragment. From what I can tell of the standard, the fragment requires only the equivalent of encodeURI() and all possible component characters, including the hash, are allowed from that point on.

Therefore, doing a blanket encodeURIComponent() or escape() on anything destined for the hash fragment is de facto wrong. But that won't stop people from doing it, because who really knows the standards that well? How far do I go to accept the incorrect behavior? I think the answer is actually "not at all". But then, it might be easy to make it work with just a few more twists of the cube.
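To make the problem concrete, here's roughly what the over-eager caller does (the payload is invented, not real JSON-U syntax):

// encodeURIComponent() escapes far more than the fragment actually needs
encodeURIComponent("!view(12,34)");   // -> "!view(12%2C34)"
// the comma has become '%2C', so a decoder that tolerates this has to treat
// '%2C' and ',' as the same token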

At the moment my encoding survives most mangling. Can I make it survive all? Perhaps.

Why do I care so much? Because shortly I'll be handing out URLs in the new format that I'll be expecting my server to honor for months. Years even. All while the code goes through development cycles, and I'm sure I'll want to change the parameters every few damn months. I need something like JSON-U up and running before I even hand out my first (versioned, metadata-enhanced, expiry-timed) links.

I have to be able to accurately predict all possible futures for the data I'll want to transmit and consume. As they say, prediction is very hard, especially about the future.

Fortunately, I am a computer scientist. And predictions of those kinds are one of my superpowers.

Sunday, December 16, 2012

A digression to the heart of the matter

I "invented" a new micro-language over the weekend. I put the word in air quotes, because I was actually trying very hard not to do any such thing, but was forced by dreadful necessity to do some inventing anyway.

Why the odd programmer self-loathing? Because there are already too many microformats in use, and I'm not sure that adding "yet another unifying standard" to the mix will help. But I need it.

Here's the problem: URLs. That's it in a nutshell. Bloody URLs.

There is a spec for them, but hardly anyone reads it. Even when they have, they usually just chop URL strings up with Regular Expressions intended to get the bit they want, and fail utterly if given anything that doesn't start with http:// or https://. So it's a minefield.

The latest crazy thing to do is to make extensive use of what they call the 'hash fragment': everything that comes after the '#' sign in the URL. It's special for two reasons:

  • It is never sent to the server during page requests; it's only available to the client browser and scripts.
  • Clicking on a page link that differs only in its hash fragment from the current page does not reload the page, and in fact does nothing at all (except notify your scripts) if the fragment id doesn't match up with a real element id.

Remember, making the browser focus on an element half-way down the page was the original intention of the hash fragment, so all browsers support it and are 'tolerant' of this abuse (ie: they don't throw errors or 'correct' the URL just because a weird fragment ID doesn't actually exist in the page). We are just extending that metaphor, which happens to work consistently on almost every browser ever made.

A classic modern use of the hash fragment is to put database article identifiers in it that correspond to AJAX calls to obtain that 'page fragment'. When your script detects the user has clicked on a new 'fragment link', the corresponding article is loaded. Sites like Twitter and Facebook have reached the point where there is only one 'page' for the entire site, and everything you see is dynamically loaded into that container.
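The client side of that pattern is tiny; something like this (loadArticle here is a made-up stand-in for whatever AJAX call fetches the content):

// react to the user clicking a 'fragment link' by loading the matching article
window.onhashchange = function() {
    var articleId = location.hash.replace(/^#!?/, '');   // strip the '#' (or '#!')
    if (articleId) {
        loadArticle(articleId);   // hypothetical: fetch the article and drop it into the page container
    }
};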

A consequence has been that such AJAX-driven sites were difficult for search engines (ie: Google) to index properly, as none of the 'content' was on any real pages anymore. So they came up with a scheme: when the search engine (which acts like a big automatic browser) sees URLs with hash fragments, why not call a 'special' URL with that fragment as a normal server GET parameter (basically, juggle the URL around a bit to put the client part in the server part), and then the site can tell the search engine all about the text and paragraphs that the 'hash fragment' corresponds to.

The search engine can even publish those links, and since they still contain the article id, your site will load up that content and the user will see what they expect!

So long as everyone does their jobs right.

Just to be sure that they don't accidentally index sites which can't cope with this behavior, Google (and therefore all search engines) uses the presence of the "!" symbol at the start of the fragment to indicate "this is an indexable hash link". Since the 'bang' character must be the first one following the 'hash', it is informally referred to as a "hashbang". (For the sound made by primitive parsers mangling your parameter strings just before they crash.)
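The URL juggling itself looks roughly like this (a simplified sketch; the real crawler scheme has its own precise escaping rules, and the example URL is made up):

// move the client-side state into a server-visible GET parameter, e.g.
//   http://example.com/map#!article=42
//   http://example.com/map?_escaped_fragment_=article%3D42
function escapedFragmentUrl(url) {
    var parts = url.split('#!');
    if (parts.length < 2) return url;                     // no hashbang, nothing to do
    var sep = parts[0].indexOf('?') === -1 ? '?' : '&';   // append to any existing query string
    return parts[0] + sep + '_escaped_fragment_=' + encodeURIComponent(parts[1]);
}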

Why does this matter? Well, let's say I have a map application (hey, I do!) and I want to record the co-ordinates of the user's current view in the URL, in such a way that if they copy the link and send it to someone, or just bookmark it, then going back to that link reloads the map at that same geographic location.

These cannot be ordinary server GET parameters, because we can't change the browser's URL to that without causing a page reload. It has to stay in the hash fragment, which means the client has to do all the work of decoding its own parameters.
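The mechanics, stripped down to nothing (the parameter names and layout are invented for illustration; this isn't JSON-U):

// writing the view into the fragment never triggers a reload
function saveView(lat, lon, zoom) {
    location.hash = '!lat=' + lat + '&lon=' + lon + '&z=' + zoom;
}

// reading it back when the page (re)loads
function restoreView() {
    var m = /!lat=([-\d.]+)&lon=([-\d.]+)&z=(\d+)/.exec(location.hash);
    return m ? { lat: +m[1], lon: +m[2], zoom: +m[3] } : null;
}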

In fact, wouldn't it be nice if we could just append an entire serialized Javascript object up there in the URL? And then deserialize it again with the same ease as JSON? Hey, why don't you just turn Javascript objects into JSON strings, then URLEncode them? Well:
  • The standard javascript encode/escape routines don't quite do all the edge-case characters right.
  • JSON transforms really badly when URLEncoded. As in, every bracket, quote, and comma turning into three or six characters bad (see the sketch after this list).
  • Let's not even get into the unicode/utf8 conversion issues.
  • The result is usually an unreadable mess.
  • A lot of people do exactly that anyway.
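Here's the second point in action, with a made-up object:

// naively URL-encoding JSON: every structural character balloons into three
var obj = { name: "Brisbane", pos: [153.03, -27.47] };
encodeURIComponent(JSON.stringify(obj));
// -> "%7B%22name%22%3A%22Brisbane%22%2C%22pos%22%3A%5B153.03%2C-27.47%5D%7D"
// and if something 'helpfully' encodes it a second time, each '%' becomes '%25'
// and every one of those escapes turns into six characters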
Well, if the standard JSON format encodes badly because of its syntax character choices, why not just make some different choices that are URL-friendly? And then deal with the other character encoding issues?

...oh, and it would be nice if the data didn't change format when in the hash fragment compared to when it gets used as a server parameter, since different characters are allowed...
... and it would be nice if we could save some bytes by leaving out some of the JSON 'syntactic sugar' (like quoted identifiers) that aren't totally necessary...
... and if there were a more efficient way of encoding binary strings than 'percent' escaping, that would be good....
... and it would be nice if the user (ie, the developer) could still hand-edit some parameters, but then hide them away without affecting anything else...

That's pretty much what I have done. I'm calling the result JSON-U.


So, that's the why. Next time, the how.

Tuesday, December 11, 2012

Australian Fire Maps released!

It's fire season in Australia. My state in particular, Queensland, tends to catch fire quite a lot. I know this because I recently worked for the state government and built them a bushfire tracking system.

One almost as good as the system I built for you.


This app integrates data from the following sources:


You will note that the Northern Territory is absent from this list. There are reasons.