Apache Spark has been on a tear, but big data innovation is moving so fast that Spark may not last.
Apache Spark keeps defying gravity. Though Spark has yet to truly hit mainstream big data adoption, all signs point to an incredibly bright future, according to research from RedMonk's Donnie Berkholz.
Just as Hadoop was starting to find a home within the enterprise, Spark threatens to disrupt its momentum. Given that Spark offers better performance and is easier to use, this shouldn't surprise us.
The real question, however, is whether Spark will live long enough to realize its promise. Given the frenetic pace of open-source innovation in big data, it's very likely that Spark will give way to an even better system before it finds widespread adoption.
Hadoop poses as big data poster child
Hadoop, despite supporting one IPO (Hortonworks) and promising at least two more (Cloudera and MapR), has barely moved the needle on enterprise adoption, according to 451 Research estimates (Figure A):
Figure A: Enterprise adoption of big data.
This is despite the now broad adoption of big data, which often translates to "Hadoop" within the enterprise. As Gartner surveys show, when enterprise decision-makers are asked how their big data projects are going, their answers show a decided shift from dreaming to doing (Figure B):
Figure B: Percentage of big data projects.
All of this translates, as a recent Deutsche Bank research note finds, into real comfort with Hadoop: "CIOs are now broadly comfortable with the technology and see it as a significant part of the future data architecture. We would expect significant $ commitments in [fiscal year 2015]."
And yet... there's Spark.
As I've written, Spark is giving Hadoop a serious run for its money. With Spark running up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk, Hadoop is going to lose that race.
It may lose more than performance drag races, though, as Indeed jobs data suggests (Figure C):
Figure C: Hadoop vs. Spark job trends.
Keep in mind that Spark wasn't open sourced until 2010, so any earlier mentions in this data are clearly noise. But the post-2010 rise seems right, particularly in light of Berkholz's analysis (Figure D).
Figure D: Interest in Spark has skyrocketed.
For example, Berkholz digs into Stack Overflow mentions and finds that "interest in Spark has skyrocketed from minimal to far above every other technology on the chart."
As he points out, this rise coincides with Spark moving to the Apache Software Foundation (in 2013) and the founding of Databricks, a company set up to commercialize Spark adoption.
This sharp rise jibes well with Google Trends data, though, as he notes, Hacker News mentions show a less pronounced spike. Even there, however, all roads seem to point "up and to the right" for Spark.
Open source eats itself
The question is whether this trend can continue. After all, open source tends to cannibalize itself. While Linux and a few other projects have managed to stick around for over a decade, the blistering pace of big data innovation may mean that no technology can keep up.
Proprietary software has already learned this lesson. Remember when proprietary software vendors gave us enterprise data warehouses, Business Intelligence, and more? That was then, this is now. As Cloudera co-founder Mike Olson posits, "No dominant platform-level software infrastructure has emerged in the last 10 years in closed-source, proprietary form."
The corollary, however, may be just as hard on open source: "No dominant platform-level software infrastructure that emerges in open source form may last 10 years without being disrupted by other open-source software."
There is reason to believe that Spark may stick around for a while, though.
As Basho's Dave McCrory told me, "All of the other solutions/frameworks that are out there seem to be more complex or only solve part of the problem. Quite a few of our customers are already investigating or are using Spark now (especially with streaming and such)." To which RedMonk's James Governor adds, "Clustering is the new key value."
DataStax's Scott Hirleman weighs in, too, arguing that while "Hadoop lets you analyse all sorts of data together without normalizing," Spark goes one step further, allowing users to "do that plus is easy/fast."
And while Ant Stanley "can't think of what the next thing to overtake Spark is," he's equally sure "there is something brewing in a Google or Facebook lab."
And therein lies the problem... and the promise.
While software vendors have incentives to slow the pace of change long enough to capitalize on their innovations, companies like Facebook and Google do not. Much of the best big data technology has emerged from their labs and will continue to do so, with no regard for software market share or profit margins.
So, will Spark stick around? Definitely maybe. But one thing is clear: if it doesn't, we benefit anyway.