CSE 124: Networked Services

Welcome to Hadoop

Hadoop is a platform for distributed computation based on Google's MapReduce (paper).

Jobs are submitted to a JobTracker daemon, which runs once per cluster. The JobTracker does some preprocessing, then divides the job into smaller tasks and hands them to a number of TaskTracker daemons running on other (or the same) machines. The TaskTrackers continually report their progress back to the JobTracker and, when they finish, their results.
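
As a rough sketch of what such a job looks like, here is the classic WordCount example written against the Hadoop 1.x (org.apache.hadoop.mapred) API; the job name is a placeholder and the input/output paths come from the command line:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {

      // Map tasks run on TaskTrackers, one per input split, and emit (word, 1) pairs.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer tokenizer = new StringTokenizer(value.toString());
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reduce tasks, also run on TaskTrackers, sum the counts for each word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the configured job to the JobTracker and waits for it to finish.
        JobClient.runJob(conf);
      }
    }

Launching this with the hadoop jar command submits the job to the JobTracker, which then schedules the map and reduce tasks on the available TaskTrackers as described above.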

There is a distributed filesystem (HDFS) available to all the Hadoop nodes. This filesystem is managed by a NameNode (another daemon, usually run on the same machine as the JobTracker). Files are broken up into Blocks, which are replicated and spread out among many DataNodes. The NameNode handles requests for files, determines which DataNodes hold the appropriate Blocks for each request, and then directs those DataNodes to provide the requested Blocks.
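
Clients talk to this filesystem through Hadoop's FileSystem API. The sketch below (with a made-up path) writes and then reads a file; in both cases the client asks the NameNode for Block locations and moves the data to or from the DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Reads the cluster configuration (including the NameNode address)
        // from the config files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/example/hello.txt");  // hypothetical path

        // Write: the NameNode chooses DataNodes for each Block, and the
        // client streams the data to them.
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("Hello, HDFS!");
        out.close();

        // Read: the NameNode returns the Block locations, and the data is
        // fetched from the DataNodes holding those Blocks.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();
      }
    }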

A description of some of the files the system uses can be found here.

Links

Official Links