Thursday, 12 May 2016

Why Spark in place of MapReduce?

Apache Spark has numerous advantages over Hadoop's MapReduce execution engine, both in the speed with which it carries out batch-processing jobs and in the wider range of computing workloads it can handle. According to Cloudera, Spark can execute batch-processing jobs 10 to 100 times faster than the MapReduce engine, primarily by reducing the number of writes and reads to disc. As Hadoop moves beyond MapReduce, an enterprise focus, in-memory technology and accessible machine learning are the next frontiers.

In MapReduce, "you have map and reduce tasks and after that there's a synchronisation barrier and you persist all of the data to disc". While this feature was designed to allow a job to be recovered in case of failure, "the side effect of that is that we weren't leveraging the memory of the cluster to the fullest".

"What Spark does really well is this concept of a Resilient Distributed Dataset (RDD), which allows you to transparently store data on memory and persist it to disc if it's needed. But there's no synchronisation barrier that's slowing you down. The usage of memory makes the system and the execution engine really fast."
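
To make the contrast concrete, here is a toy model in plain Python. It is not Spark or Hadoop code, only a rough sketch of the idea in the quotes above: one pipeline writes every stage's output to disc and reads it back before the next stage starts, while the other keeps intermediate results in memory. The function names `run_with_disc_barrier` and `run_in_memory`, the three stages and the I/O counter are all made up for illustration.

```python
import json
import os
import tempfile


def run_with_disc_barrier(data, stages, workdir):
    """Toy MapReduce-style run: each stage's full output is written to
    disc and read back before the next stage may start."""
    io_ops = 0
    for i, stage in enumerate(stages):
        result = [stage(x) for x in data]
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(result, f)   # persist the whole stage output
        io_ops += 1
        with open(path) as f:
            data = json.load(f)    # the next stage starts from disc
        io_ops += 1
    return data, io_ops


def run_in_memory(data, stages):
    """Toy in-memory run: intermediate results never leave memory."""
    io_ops = 0
    for stage in stages:
        data = [stage(x) for x in data]
    return data, io_ops


# Hypothetical three-stage job over a tiny dataset.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
numbers = list(range(5))

with tempfile.TemporaryDirectory() as workdir:
    staged_result, staged_ops = run_with_disc_barrier(numbers, stages, workdir)
memory_result, memory_ops = run_in_memory(numbers, stages)

print(staged_result == memory_result)  # both runs compute the same answer
print(staged_ops, memory_ops)          # disc operations used by each run
```

Both runs produce the same result; the difference is only in how many times the data touches disc, which is where the time goes on a real cluster. In Spark itself, the real counterpart of the "persist it to disc if it's needed" behaviour is calling `persist` on an RDD with a storage level such as `StorageLevel.MEMORY_AND_DISK`.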
