IBM ignites Spark

On June 15, 2015, IBM announced that it would back Apache Spark with massive resources. This should help what is already the most active Apache project make its final breakthrough to an even broader developer community.

As a reminder: Spark is a general-purpose cluster computing engine for large-scale data processing, aimed in particular at big data and complex analytics. While the MapReduce technology used by Hadoop is largely disk-based, Spark is optimized for in-memory computing. As a result, it can perform much better (up to 100 times faster for workloads that fit in memory, according to the project). Spark can nevertheless interface with the Hadoop Distributed File System (HDFS).

IBM's commitment to supporting and contributing to Spark is vast: Spark is to become a core part of IBM's analytics and commerce platforms as well as of the Bluemix cloud platform. IBM will put 3,500 researchers and developers to work on Spark-related projects and will include Spark technology in its Watson-based offerings. Additionally, IBM will open-source its SystemML machine learning technology. Spark itself originated as a UC Berkeley project.

Through this massive support for the Spark project and its commitment to contributing to the community process, IBM is consistently delivering on its open-source strategy in combination with its analytics commitment. This also shows how important massively parallel and in-memory computing are to IBM in handling the digital universe, which is expected to grow at least tenfold from 2013 to 2020. Enterprises need tools and solutions to manage exploding data volumes, to distinguish between data of differing value, to filter out the noise and, finally, to put the data to use in their digital transformation projects.

With its Spark support, IBM is positioning itself against MapReduce, for which Google holds a patent. IBM and Google compete more and more in the data management and analytics space, so IBM needs to remain a member of the MapReduce community without adding to MapReduce's popularity. Spark is an ideal tool for this because it can work on HDFS but does not depend on it. It also supports a more advanced in-memory computing paradigm, which counters the success of SAP's HANA as well.

Bottom line: IBM's decision to back the Apache Spark project with a significant staff commitment, broad integration into its own offerings and the open-sourcing of its SystemML machine learning technology positions IBM at the forefront of large-scale, massively parallel in-memory analytics. It also signals to Google and SAP that their seemingly unlimited growth prospects may have limits.