Wednesday, September 3, 2008

A Scalable Language, and a Scalable Framework

I'm a pretty big fan of the MapReduce framework. With two fairly simple classes, a Map and a Reduce (influenced by, but not the same as, the functional programming constructs of the same name), you can easily write programs that operate on terabytes of data. Apache's Hadoop is a popular open source implementation of MapReduce, and it's used by Yahoo, Amazon, and Facebook, to name a few.

I'm also a pretty big fan of Scala, and luckily Hadoop is written in Java, which is to say that you can use Hadoop from Scala pretty easily. Unfortunately, like many Java libraries, Hadoop requires a lot of boilerplate, and it's of course not well integrated with some of the higher-order paradigms that Scala supports so well. But maybe we can fix that.

Since my group is just now moving to both Scala (slowly) and Hadoop (somewhat more quickly), I thought it would be good to help combine the two. The result is SMR, which provides a wrapper around Hadoop to make everything much more Scala-like.

WordCount: Java

A pretty basic introduction to MapReduce is the word count example. The task: given a collection of files, tokenize all the words in them and return, for each word, the number of times it appears. Here's the vanilla Java Hadoop example, taken from Hadoop's tutorial, leaving out imports:

public class WordCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

The basic idea is pretty simple. The map takes each line of each file, tokenizes it, and for each word emits a pair of that word and 1. The reduce receives a word and an iterator of counts, sums them, and emits the word with the sum.
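That shape is easy to see on plain in-memory collections. Here's a small, self-contained Scala sketch (no Hadoop involved, purely illustrative), where an ordinary List stands in for the input and groupBy plays the role of the framework's shuffle:

```scala
// Toy word count on an in-memory List, mirroring the two phases.
val lines = List("to be or not to be", "to be is to do")

// Map phase: emit (word, 1) for every token on every line.
val pairs = for (line <- lines; word <- line.split("\\s+")) yield (word, 1)

// Shuffle + reduce phase: group the pairs by word and sum the counts.
val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

counts.toSeq.sortBy(-_._2) foreach println   // e.g. (to,4), (be,3), ...
```

Hadoop distributes exactly these two phases across machines; the comprehension-plus-reduce structure is what the wrapper below preserves.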

WordCount: Scala

The problem is that this remarkably simple program gets lost in all that boilerplate. Let's take a look at how SMR tackles the problem:
object WordCount {
  def main(args : Array[String]) {
    val (h, remainingArgs) = Hadoop.fromArgs(args, new Path("output"));
    val words = for( (offset, line) <- h.loadLines(remainingArgs);
                     word <- line.split("[ \n\r\t\f]+"))
                yield (word, 1);

    val sums = words.hreduce { (word, iter) =>
      (word, iter.reduceLeft(_+_));
    }
    sums.elements foreach println
  }
}

Both programs accomplish the same task, but one takes a quarter the number of lines. There are several reasons for the disparity, but a few key features of Scala account for most of it:
  1. Syntactic Support for Map: By treating for loops as syntactic sugar for map and its cousins, Scala lets you create new classes that interact with the primitive looping construct in intuitive ways.
  2. Strong Type System: Scala's type system not only strips a lot of boilerplate from the code, it makes library writers' lives easier by making much richer type information available at compile time. In SMR, the type system is used to automatically select the correct "input format type" and to figure out the types of the Maps and the Reduces.
  3. Higher-Order and Anonymous Functions: The combination of functions that take other functions and the ability to create new functions on the fly, easily and succinctly, obviates a lot of the boilerplate associated with defining a new "Mapper" and "Reducer" for every task. Instead, the compiler automatically creates the functions that perform the task for you.
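To make the first point concrete, here is roughly how the compiler rewrites a for-comprehension like the one in the SMR example into calls to flatMap and map. This is a sketch on a plain List standing in for the distributed collection; because SMR's collections define their own map and flatMap, the same syntax drives a Hadoop job instead:

```scala
val lines = List((0L, "a b"), (4L, "b c"))

// The for/yield form, as written in the SMR word count:
val sugared = for ((offset, line) <- lines; word <- line.split(" ")) yield (word, 1)

// What the compiler roughly turns it into:
val desugared = lines.flatMap { case (offset, line) =>
  line.split(" ").map(word => (word, 1))
}

assert(sugared == desugared)   // both are List((a,1), (b,1), (b,1), (c,1))
```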
SMR is still very much in development, as my group is still transitioning to Hadoop, but I hope that people find the code useful in scaling up their problems to deal with terabytes and terabytes of data.

Update:

I was criticized for unfairly comparing Scala to Java, and told that I should instead compare Scala/SMR to a more inspiring language like Erlang. So I found this example in this blog post:



test_map_reduce() ->
    M_func = fun(Line) ->
                 lists:map(
                     fun(Word) ->
                         {Word, 1}
                     end, Line)
             end,

    R_func = fun(V1, Acc) ->
                 Acc + V1
             end,

    map_reduce(3, 5, M_func, R_func, 0,
               [[this, is, a, boy],
                [this, is, a, girl],
                [this, is, lovely, boy]]).



I don't know Erlang, but this looks pretty reasonable to me (though it doesn't have to deal with string tokenization), and it's competitive in legibility with SMR (and, in some ways, I'll readily admit, better). But for someone like me, stuck in JVM-land, I'm still pretty satisfied with the Scala wrapper, especially since it buys compatibility with all the goodies available in Hadoop.

36 comments:

Stephan.Schmidt said...

I really really really wish that someday one blogger who compares a solution in language X tries to write proper Java code and not the worst he can do. That would really lead to some insights for Java developers.

When writing stupid Java code, people dismiss the post as stupid, without looking for insights.

Some points:
1.) Why use split in Scala but not Java (but the noisy StringTokenizer)?
2.) Why use while (values.hasNext()) instead of a Java 5 for loop?
3.) The Scala solution seems to have custom code which isn't shown (hreduce, Hadoop.fromArgs, ...)

With SMR in Java the Java solution would be much shorter.

So you say: With my private framework which wraps everything, the solution in Scala is shorter than the Solution in Java without a wrapper?

No big insight there.

I guess the Java solution could be written much shorter with the usage of Google collections and the for hack in Java.

http://stephan.reposita.org/archives/2008/08/06/for-hack-with-option-monad-in-java/

It would be great to assume SMR in your Java solution, use the for hack, add Google collections and THEN show how Scala is better.

Peace
-stephan

Stephan.Schmidt said...

Don't shoot fish in a barrel is what I've meant -stephan

john said...

I agree this is an unfair comparison. Why have the Scala folks and bloggers become so arrogant lately? If they continue that way, their little language will stay in a niche for a long time. Don't compare to Java; Java is your brother, and thanks to Java, Scala is here. Compare with rivals like C# or Erlang and so on. This blog is so stupid.

David Hall said...

@stephan

re: 1) and 2) I ripped the Java code right out of the Hadoop tutorial, but you're right I should have fixed that.

3) the custom code is in SMR, which brings me to your next point:

I actually disagree. SMR would look horrible in Java (unless you think Functional Java looks good). Without the three things I enumerated (and also Manifests, which let you recover the erasure at runtime), the "equivalent" Java code would look much worse than the Java tutorial. (Your "Option" hack can't support "yield", which is what makes the for comprehensions work in SMR, for example.)

@john As soon as you point out a commercial-scale Open Source implementation of MapReduce in C#, I'll do what I can. I look forward to Microsoft releasing Dryad, for instance. As for Erlang, you're probably right. I just found a blog post, and I'll update the blog to compare it to Erlang.

James Iry said...

Stephan,

Before you go touting the for-hack (which I do think is slick), be aware that it has two problems as compared to Scala/Haskell monads. First, you have to manually lift the result into the monad, possibly using mutation (just write a list monad to see what I mean). That's what Dan was talking about with "yield". But more importantly, your for-hack only supports eager, sequential evaluation. That's okay with Option, but it would totally defeat the purpose here.

The whole point of parallel map/reduce is that map can be done completely in parallel and fold (reduce) with an associative function can be done using a kind of tree-shaped parallelism.
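That tree-shaped fold can be sketched in a few lines of plain Scala (a sequential illustration, not framework code; the point is that the two halves are independent whenever the operation is associative, so a framework is free to evaluate them on different workers):

```scala
// Tree-shaped reduction: split the input, reduce each half, combine.
// With an associative op, the recursive calls never depend on each other.
def treeReduce[A](xs: Seq[A])(op: (A, A) => A): A =
  if (xs.length <= 2) xs.reduceLeft(op)
  else {
    val (left, right) = xs.splitAt(xs.length / 2)
    op(treeReduce(left)(op), treeReduce(right)(op))
  }

val counts = Seq(3, 1, 4, 1, 5, 9, 2, 6)
assert(treeReduce(counts)(_ + _) == counts.sum)   // regrouping is safe because + is associative
```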

Stephan.Schmidt said...

@James: The point of my post wasn't that Java is shorter or better than Scala. Obviously (for type inference alone) Scala is shorter, the yield in Scala helps too.

What I said was that he should have tried to write the best available solution in Java - perhaps in a functional style - and THEN compare it.

Peace
-stephan

James Iry said...

Stephan: "Obviously (for type inference alone) Scala is shorter, the yield in Scala helps too."

I'm with Dan on this one. Type inference and for/yield are really only the tip of the iceberg. I'm looking forward to a blog posting where you write enough of "JMR" to solve the word count problem in a cleaner way.

Stephan.Schmidt said...

James: You're right of course,

"Type inference and for/yield are really only the tip of the iceberg"

but type inference (which I oppose ;-) reduces a lot of noise in the example.

Oh, my blog list is quite long, we'll see when I have some time, but it really sounds interesting, at least for me to gain more insight into Scala/Java view.

Peace
-stephan

Brendan said...

For another language comparison for the Hadoop word count example, here's one for a Jython Hadoop library:

http://code.google.com/p/happy/

I am not going to attempt to post code in a Blogger comment.

Is the word count example the "Hello, World" of MapReduce? We should make a big page of examples in as many Hadoop bindings as possible!

Brendan said...

Hm, here's a better link. http://www.mqlx.com/~colin/happy.html

Ted Dunning ... apparently Bayesian said...

I built something like this quite some time ago using groovy. The results were even more impressive for toy programs. For instance, here is a fully hadoop compatible map-reduce version of word count:

wc = mapReduce(
{key, line, out, report -> line.split().each {w -> out.collect(w,1)}},
{w, counts, out, report -> out.collect(w, counts.sum())}
)

In this example, wc is a function that can then be applied to various inputs as in

wc(["this is input text", "for the word-counter"])

or

wc(new File("/local/input/file"))

or even

wc(functionUsingMapReduce(new HdfsFile("/existing/file")))

See http://tdunning.blogspot.com/2008/03/hello-world-for-map-reduce.html for details.

The problem with this sort of hack is that it doesn't really solve the problems you face in writing map-reduce programs.

The big problem is that map-reduce is just a tad too low-level for most problems. A secondary problem is that programs written using a facade like this are harder to debug than programs written using raw Java + map-reduce. A tertiary problem is that the boilerplate of map-reduce isn't proportional to problem size, so large programs are proportionally much more efficient to write than small programs relative to the wrapped code.

What I would much rather have is something that provides much higher-level semantics such as those provided by Pig, but which provides for better integration into real programming languages.

If Scala (SMR really) could do that cleanly, then I would be much more interested.

ZeoS said...

I wrote an example for the same code but with a different approach: I used scala just for the map function. The rest of the code is in java.
Link:
http://developersnightmare.blogspot.com/2011/09/first-steps-with-rails-inheritance.html

motmot said...

You can find scala version wordcount using latest hadoop apis at https://github.com/smishra/scala-hadoop
