Toolbox for developers / MapReduce (Big Data)

MapReduce program in JavaScript

This article make a short presentation of MapReduce algorithm, and shows how to use JavaScript to implement a simple MapReduce program (with shortcuts and some simplifications). The objective of this tutorial is to dissect the algorithm with a exemple without specialized framework.

Prerequisites

This tutorial is for developers who already know JavaScript. We use only the JavaScript language implemented as standard in the browser.

The MapReduce algorithm

MapReduce is a programming model for processing large data sets. Some calculations on large data sets are not possibles with a SQL query in relational database. For example MapReduce can be used to compute segmentations / scoring.

The scalability and fault-tolerance are the advantages of MapReduce.

MapReduce works on keys and values,data should be converted into key-value pair.

MapReduce is composed of several steps:

  • Split phase: Splits data
  • Map phase: Each worker applies the Map function to one split.
  • Sort & Shuffle phase: Redistributes data based on the output keys of Map phase, in order to one key is located on the same worker node.
  • Reduce phase: Collects the results of subproblems to provide a global result
  • Aggregation phase: Aggregate the results
MapReduce schema

MapReduce example

We will take the example of the ranking of students according to their notes (typically French...). The goal is to calculate a scoring: good or bad student.

Data input

Student notes will be stored in one CSV file, one note per line.

john,mathematics,17
jack,mathematics,12
joe,mathematics,12
john,english,10
...

We will take as key the student firstname, as values the notes.

Split

It would consisted in splitting the file, so that in the next step each worker applies the map function to one of these files.
For simplicity, we will not implement this phase in our example program (No split, and only one worker).

Map

This first step consists of returning the list of keys / values:

{"key":"john", values:[17]}
{"key":"jack", values:[12]}
{"key":"joe", values:[10]}
{"key":"john", values:[10]}
...

Sort & Shuffle

This step builds a list in which each key is associated with the list of all its values.

{"key":"john", values:[17, 10]}
{"key":"jack", values:[12, 16]}
{"key":"joe", values:[10, 8]}
...

Reduce

This last step presents a synthetic result of the previous operations.
It will assign a score to students : < 10: bad, >= 10 : good (In France, the ratings are out of 20)

{"key":"john", "scoring":"good"}
{"key":"jack", "scoring":"good"}
{"key":"joe", "scoring":"bad"}
...

MapReduce Implementation

Map function

/* Generates array of key / values */
function map (fileData /* CSV file data */) {
	var output = [];
	fileData.split(/\r?\n/).forEach(function(line) {
		var lineParts = line.split(',');
		output.push({"key":lineParts[0], "values":[lineParts[2]]}); 
	});
	return output;
}

Sort & Shuffle functions

/* Groups values by key */
function shuffle (output /* map output */) {
	var results = {};
	output.forEach(function(object) {
		if (results.hasOwnProperty(object.key)) {
			results[object.key].values = results[object.key].values.concat(object.values);
		} else {
			results[object.key] = object;
		}
	});
	return results;
}

Reduce function

/* Groups values by key */
function reduce (output /* Shuffle output */) {
	var results = [];
	for (var property in output) {
		var sum = output[property].values.reduce(function(a, b) { return parseInt(a) + parseInt(b); });
		results.push({"name":output[property].key, "scoring": ((sum/output[property].values.length) < 10 ? "bad" : "good")});
	}
	return results;
}

Full code:

/* Generates array of key / values */
function map (fileData /* CSV file data */) {
	var output = [];
	fileData.split(/\r?\n/).forEach(function(line) {
		var lineParts = line.split(',');
		output.push({"key":lineParts[0], "values":[lineParts[2]]});
	});
	return output;
}
/* Groups values by key */
function shuffle (output /* map output */) {
	var results = {};
	output.forEach(function(object) {
		if (results.hasOwnProperty(object.key)) {
			results[object.key].values = results[object.key].values.concat(object.values);
		} else {
			results[object.key] = object;
		}
	});
	return results;
}
/* Groups values by key */
function reduce (output /* Shuffle output */) {
	var results = [];
	for (var property in output) {
		var sum = output[property].values.reduce(function(a, b) { return parseInt(a) + parseInt(b); });
		results.push({"name":output[property].key, "scoring": ((sum/output[property].values.length) < 10 ? "bad" : "good")});
	}
	return results;
}
document.getElementById('csv-input').addEventListener('change', function(e) {
	if (e.target && e.target.files) {
		var file = e.target.files[0]; /* only one file is processed */
		var reader = new FileReader();
		reader.onload = function() {
			var mapped = map(reader.result);
			var shuffled = shuffle(mapped);
			var reduced = reduce(shuffled);
			document.getElementById('mapreduce-output').innerHTML = JSON.stringify(reduced, null, "\t");
		};
		reader.readAsText(file);
	}
}, false);

Tests

You can try, select a csv file with notes:
Output:

					

Conclusions & To go even further

This implementation suffers from several shortcomings: The treatments are neither parallelized nor distributed! Furthermore treatments are not optimized.

This tutorial is only a first approach of MapReduce algorithm in order to understand the principles. To go even further: https://en.wikipedia.org/wiki/MapReduce

You can report a bug or give feedback by adding a comment (above) or by clicking "Contact us" link (at the top right hand corner of the page).

Comments





The tools are provided "as is", without warranty of any kind, either express or implied.