Monday, September 7, 2015

Apache Spark Overview

Let me tell you a secret.

Well, it shouldn't be a secret, but given how few people seem to know about it, we may as well be talking about Higgs-Englert physics.

It's Apache Spark.

"WTF is Spark?" you will probably say. You'll have heard about Apache, the venerable webserver software. You might not have really being paying much attention to their background projects which have been merging together over the last year like the arms and legs of some kind of Voltron. ("And I'll form the Head!")

Apache is making the transition from being the kind of software you run your old website on, to being the kind of software you run Twitter, or Facebook, or eBay on. Or Netflix, which is possibly the best case study for the software we'll be talking about, although that's more Apache Cassandra, a topic for another time.

At that level, there are new problems that are oddly different from the old ones. All these guys use cloud computing resources; they don't really depend on physical machines. They rent them 'out of the air' for as many hours (or minutes) as they need. This is so they can increase the size of their clusters from, say, 20-50 machines to a few hundred for a couple of hours in order to handle peak loads.

For example: they don't repair servers. For most of those machines, no human will ever ssh into the box. In fact, they are usually put on a countdown for a rolling 'refresh' which shuts down the most ancient servers and replaces them with fresh ones, in the equivalent of a slow A/B testing transfer. Really clever systems stop the rollout of the new software and clone copies of the old as automated reliability statistics come in.

But if a box is giving you trouble, you don't spend any of your time on it whatsoever. You mercilessly send it back to the cloud after getting a replacement.

At that level, it's all about "devops". Specifically what they call either orchestration or choreography. Depending, I suppose, on whether you listen to chamber music, or prefer dancing the samba.

Here's the Netflix devops problem: In each country, there are daily viewing peaks. There are weekly viewing peaks. These peaks are 10x the baseline, and last a couple of hours. Then most people go to bed. This is the predictable part.

Then there's the unpredictable side. When a beloved actor like Leonard Nimoy dies, there is a tendency for millions of people to go home via the bottle-shop and queue up every movie he's ever done as a kind of binge-tribute. I've heard.

And that's the kind of situation that your scalable internet service has to handle, if you're going to serve movies on demand to 100 million people. Very rarely, you have to be able to service everyone at once. And you cannot FailWhale. That was funny once, when it was Twitter. Once.

The most amusing thing about Twitter is that once they got past the FailWhale, their company value went from merely silly to completely ludicrous. We are talking a jump from 10 million to 200 million, because they proved they were finally in the big league. What was the technical miracle which banished the White Whale? It was Scala... the primary language behind Spark.

So, what's Spark? In a nutshell, it's next-gen Hadoop.

"What was hadoop again?" you probably ask, since you probably never used it. Well, it was a giant hack that allowed hundreds of computers to be ganged together and carry out various file-processing tasks across the entire cluster.

What for? Logfile processing, mainly. The daily re-indexing of Wikipedia. Places like eBay and Amazon used it for their first recommender systems ("other people also bought this!") and all because of the simple necessity of churning through more gigabytes of text than any single computer can manage.
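For a sense of what that kind of whole-cluster churning looks like, here's a minimal sketch of the classic distributed word count in Spark's Scala API. The HDFS paths are placeholders, and this assumes a Spark 1.x cluster; the point is that three one-line transformations fan out across however many machines you have.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogWordCount")
        val sc   = new SparkContext(conf)

        // Read a directory of logs; Spark splits the work across the cluster.
        val lines = sc.textFile("hdfs:///logs/2015-09-07/*")  // placeholder path

        val counts = lines
          .flatMap(_.split("\\s+"))   // break each line into words
          .map(word => (word, 1))     // pair each word with a count of one
          .reduceByKey(_ + _)         // sum the counts per word, cluster-wide

        counts.saveAsTextFile("hdfs:///output/wordcounts")  // placeholder path
        sc.stop()
      }
    }

The same program runs unchanged on one laptop or five hundred rented machines, which is exactly the Hadoop promise, minus most of the Hadoop pain.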

You have to realize that, to a large extent, the billions of dollars that eBay and Amazon are worth are because of their "people also bought" recommender systems. That list of five other things (five is the optimum psychological number) absolutely must be the best possible, where "best" is defined as "most likely to buy next". This is not advertising, this is lead generation. There are metrics.

The point of lead generation is to turn each sale into an opportunity for another sale. "Accessorize, accessorize, accessorize!" And when those systems break, or just degrade, the bottom-line impact is direct and palpable. Companies live and die by their ability to snowball those sales.

Netflix had this happen, and they offered a million dollars to anyone who could improve their recommendation engine by ten percent. This was the famous "Netflix Prize". The matrix-factorization workhorse that era popularized is known as "Alternating Least Squares" (ALS), and the details are a topic for another day.

Spark implements the ALS algorithm in its standard MLlib library. It's core. It's yours. You can have a million-dollar algorithm, gratis. If you want to run ALS at large scale and, most importantly, in real time, then Spark is the only option.

The only option, unless you want to spend about a man-century implementing the equivalent of fine-grained distributed transaction control and data storage, and that's just the infrastructure your math needs to sit on top of.
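To show just how little of that infrastructure you have to touch, here's a hedged sketch of ALS via MLlib as it stood in the Spark 1.x era. The ratings path, file format, user ID, and parameter values are all invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object AlsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("AlsDemo"))

        // Parse "userId,productId,rating" lines; path and format are invented.
        val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
          val Array(user, product, rating) = line.split(',')
          Rating(user.toInt, product.toInt, rating.toDouble)
        }

        // rank = 10 latent factors, 10 iterations, 0.01 regularization (lambda).
        val model = ALS.train(ratings, 10, 10, 0.01)

        // The "people also bought" list: the top 5 products for user 42.
        model.recommendProducts(42, 5).foreach(println)

        sc.stop()
      }
    }

That's the million-dollar algorithm in about twenty lines, with the distributed transaction control and data storage hidden underneath.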

If you want to grow into the size of one of these services, you need to start with a seed capable of growing that large. Fortunately, in this analogy, those seeds are happily falling off the existing trees, and being blown about by the winds of open source software to the fertile fields of... some poetic metaphor I probably should have shut down a while ago.

That means Scala, and Cassandra. That means ZooKeeper, and message queues. HAProxy to spread the load. Graphite to chart the rise and fall of resources. Ansible to spin up the servers. It means dozens of support tools you've never heard of, and would never run by choice if you didn't have a pressing need to get the job done.

And these are all sub-components needed to support an overarching system like Spark, which schedules "parallel programs" across the entire cluster, programs that are tolerant to the various slings and arrows of the internet.

There is a level above Spark which is still in formation, exemplified by the Mesos project. That one seeks to be a kind of "distributed hypervisor" that can manage a cluster of machines and run many flavors of Spark, Hadoop, Cassandra or whatever within a single cluster. Otherwise we tend to get 'clusters of clusters' syndrome, where each 'machine' is effectively only running one program.

You have the dev cluster and the testing cluster, as well as the production cluster, of course. Oh, and that's one each for the database cluster, the webserver/app cluster, and the small front-end routing and logging clusters that hang off the big ones...

Yeah. Fire up the music, let the dance of clusters begin. Oh, and once you put on those magic dancing shoes, you can never ever take them off again until the company dies. This is that fairy-tale.

Spark is the answer to questions you haven't asked yet. Literally: ad-hoc, interactive queries over data you collected before you knew what you would want to ask of it are exactly the workload it is specialized for. And it scales all the way. That's its value. That's what Apache is doing these days, trying to close the conceptual gap so both ends, big and small, are using the same base code. I love it.
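That "questions you haven't asked yet" trick comes from caching: in the spark-shell you pin a dataset into the cluster's memory once, then probe it interactively. A small sketch, assuming a spark-shell session and a made-up log path and format:

    // In the spark-shell, the SparkContext `sc` already exists.
    val logs = sc.textFile("hdfs:///logs/*").cache()  // pin the data in cluster RAM

    // The first question forces the read and populates the cache.
    logs.filter(_.contains("ERROR")).count()

    // Questions you think of later run against the cached copy, at memory speed.
    logs.filter(_.contains("timeout")).count()
    logs.map(_.split(' ')(0)).distinct().count()  // e.g. distinct first fields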

But no-one sells it, and the people who do use it in anger are too busy making billions of dollars to spend much time explaining exactly how, or writing documentation. You really gotta tease the information out of them, and watch a lot of their talks at Big Data conferences to see where all the pieces actually fit. There is an enormous learning curve.

And that's why it's still such a secret.

And given how many people read my blog, I'm sure it will remain so.