I'm a pretty big fan of the MapReduce framework. With two fairly simple classes, a Map and a Reduce (influenced by, but not the same as, functional programming constructors of the same name), you can easily write programs that operate on terabytes of data. Apache's Hadoop is a popular open source version of MapReduce, and it's used by Yahoo, Amazon, and Facebook, to name a few.
I'm also a pretty big fan of Scala, and luckily Hadoop is written in Java, which is to say that you can use Hadoop from Scala pretty easily. Unfortunately, like many Java libraries, Hadoop requires a lot of boiler plate, and it's of course not well integrated with some of the higher-order paradigms that Scala supports so well. But maybe we can fix that.
Since my group is just now moving to both Scala (slowly) and Hadoop (somewhat more quickly), I thought it would be good to help combine the two. The result is SMR, which provides a wrapper around Hadoop to make everything much more Scala-like.
WordCount: Java
A pretty basic introduction to MapReduce is the word count example. The task is, given a series of files, to segment all the words in a collection of files, and return a list of each word and the number of time it appears. Here's the vanilla Java Hadoop example, taken from Hadoop's tutorial, leaving out imports:
public class WordCount {
public static class Map extends MapReduceBase
implements Mapper<LongWritable, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
The basic idea is pretty simple. For the map: take each line in each file, tokenize it, and for each word emit a pair of that word and 1. The reduce receives a word and an iterator of counts and then sums them together, and emits the word with the sum.WordCount: Scala
The problem is that this remarkably simple program gets lost in all that boilerplate. Let's take a look at how SMR tackles the problem:
object WordCount {
def main(args : Array[String]) {
val (h,remainingArgs) = Hadoop.fromArgs(args,new Path("output"));
val words = for( (offset,line) <- h.loadLines(remainingArgs);
word <- line.split(" \n\r\t\f"))
yield(word,1);
val sums = words.hreduce{ (word,iter) =>
(word,iter.reduceLeft(_+_));
}
sums.elements foreach println
}
}
Both programs accomplish the same task, but one takes 1/4 the number of lines. The reasons for the disparity are multiple, but there are a few key features of Scala that account for most of it:- Syntactic Support for Map: By treating for loops as syntactic sugar for map and its cousins, Scala allows you to create new classes that interact with the primitive looping construct in intuitive ways.
- Strong Type System: Scala's type system not only takes a lot of the boiler plate away from the code, but it makes library writer's lives easier by making much richer type information available at compile time. In SMR, the type system is used to automatically select the correct "input format type" and figure out what the types of the Maps and the Reduces are.
- Higher-Order and Anonymous Functions: The combination of functions that take other functions and the ability to create new functions on the fly easily and succinctly obviates a lot of the boiler plate associated with defining a new "Mapper" and "Reducer" for every task. Instead, the compiler automatically creates the functions that perform the task for you.
Update:
I was criticized for comparing unfairly comparing Scala to Java, and that I should compare Scala/SMR to a less uninspired language like Erlang. So I found this from this blog post:
test_map_reduce() ->
M_func = fun(Line) ->
lists:map(
fun(Word) ->
{Word, 1}
end, Line)
end,
R_func = fun(V1, Acc) ->
Acc + V1
end,
map_reduce(3, 5, M_func, R_func, 0,
[[this, is, a, boy],
[this, is, a, girl],
[this, is, lovely, boy]]).
I don't know Erlang, but this looks pretty reasonable to me (thought it doesn't have to deal with string tokenization), and it's is competitive in legibility with SMR (and, in some ways, I'll readily admit, better). But as someone like me stuck in JVM-land, I'm still pretty satisfied with the Scala wrapper, especially since it buys compatibility with all the goodies available in Hadoop.
25 comments:
I really really really wish that someday one blogger who compares a solution in language X tries to write proper Java code and not the worst he can do. That would really lead to some insights for Java developers.
When writing stupid Java code, people dismiss the post as stupid, without looking for insights.
Some points:
1.) Why use split in Scala but not Java (but the noisy StringTokenizer)?
2.) Why use while (values.hasNext()) instead of a Java 5 for loop?
3.) The Scala solution seems to have custom code which isn't shown (hreduce, Hadoop.fromArgs, ...)
With SMR in Java the Java solution would be much shorter.
So you say: With my private framework which wraps everything, the solution in Scala is shorter than the Solution in Java without a wrapper?
No big insight there.
I guess the Java solution could be written much shorter with the usage of Google collections and the for hack in Java.
http://stephan.reposita.org/archives/2008/08/06/for-hack-with-option-monad-in-java/
It would be great to assume SMR in your Java solution, use the for hack, add Google collections and THEN show how Scala is better.
Peace
-stephan
Don't shoot fish in a barrel is what I've meant -stephan
Im agree this is an unfair comparing, Why lately the Scala folks and bloggers are so arrogant?, If continue that way their little language will continue stay on a niche for a long time. Dont compare to Java, Java is your brother thanks to Java Scala is here, Compare with the rivals as C# or Erlang so on. This blog is so stupid.
@stephan
re: 1) and 2) I ripped the Java code right out of the Hadoop tutorial, but you're right I should have fixed that.
3) the custom code is in SMR, which brings me to your next point:
I actually disagree. SMR would look horrible in Java (unless you think Functional Java looks good). Without the three things I enumerated (and also Manifests, which let you recover the erasure at runtime.) The "equivalent" Java code would look much worse than the Java tutorial. (Your "Option" hack can't support "yield" which is what makes the for comprehensions work in SMR, for example.)
@john As soon as you point out a commercial-scale Open Source implementation of MapReduce in C#, I'll do what I can. I look forward to Microsoft releasing Dryad, for instance. As for Erlang, you're probably right. I just found a blog post, and I'll update the blog to compare it to Erlang.
Stephan,
Before you go touting the for-hack (which I do think is slick), be aware that it has has two problems as compared to Scala/Haskell monads. First, you have to manually lift the result into the monad, possibly using mutation (just write a list monad to see what I mean). That's what Dan was talking about with "yield". But more importantly, your for-hack only supports eager, sequential evaluation. That's okay with Option, but it would totally defeat the purpose here.
The whole point of parallel map/reduce is that map can be done completely in parallel and fold (reduce) with an associative function can be done using a kind of tree-shaped parallelism.
@James: The point of my post wasn't that Java is shorter or better than Scala. Obviously (for type inference alone) Scala is shorter, the yield in Scala helps too.
What I said was that he should have tried to write the best available solution in Java - perhaps in a functional style - and THEN compare it.
Peace
-stephan
Stephan: "Obviously (for type inference alone) Scala is shorter, the yield in Scala helps too."
I'm with Dan on this one. Type inference and for/yield are really only the tip of the iceberg. I'm looking forward to a blog posting where you write enough of "JMR" to solve the word count problem in a cleaner way.
James: You're right of course,
"Type inference and for/yield are really only the tip of the iceberg"
but type inference (which I oppose ;-) reduces a lot of noise in the example.
Oh, my blog list is quite long, we'll see when I have some time, but it really sounds interesting, at least for me to gain more insight into Scala/Java view.
Peace
-stephan
For another language comparison for the Hadoop word count example, here's one for a Jython Hadoop library:
http://code.google.com/p/happy/
I am not going to attempt to post code in a Blogger comment.
Is the word count example the "Hello, World" of MapReduce? We should make a big page of examples in as many Hadoop-binded libraries as possible!
Hm, here's a better link. http://www.mqlx.com/~colin/happy.html
I built something like this quite some time ago using groovy. The results were even more impressive for toy programs. For instance, here is a fully hadoop compatible map-reduce version of word count:
wc = mapReduce(
{key, line, out, report -> line.split().each {w -> out.collect(w,1)}},
{w, counts, out, report -> out.collect(w, counts.sum())}
)
In this example, wc is a function that can then be applied to various inputs as in
wc(["this is input text", "for the word-counter"])
or
wc(new File("/local/input/file")
or even
wc(functionUsingMapReduce(new HdfsFile("/existing/file")))
See http://tdunning.blogspot.com/2008/03/hello-world-for-map-reduce.html for details.
The problem with this sort of hack is that it doesn't really solve the problems that you face in writing map-reduce programs.
The big problem is that map-reduce is just a tad too low-level for most problems. A secondary problem is that debugging programs written using a facade like this are harder to debug than programs written using raw java + map-reduce. A tertiary problem is that the boilerplate of map-reduce isn't proportional to problem size so large programs are proportionally much more efficient to write than small programs relative to the wrapped code.
What I would much rather have is something that provides much higher-level semantics such as those provided by Pig, but which provides for better integration into real programming languages.
If Scala (SMR really) could do that cleanly, then I would be much more interested.
Many institutions limit access to their online information. Making this information available will be an asset to all.
Thesis Writing Help
I just want to emphasize the good work on this , has excellent views and a clear vision of what you are looking for.
thesis writing | undergraduate essay
Cool! Author the best! Thank you!
автошкола
I am newbie to this java field. Now only I am learning the basics of java. This information is very much useful for me. Thanks for sharing.
best website hosting
wow i like scalable lang......easy can do it....thanks for posting
how to write a resume
I am a newbie in this field java. Now I'm learning the basics of Java. This information is very useful for me. Thank you for sharing.
garage shoes
I wrote an example for the same code but with a different approach: I used scala just for the map function. The rest of the code is in java.
Link:
http://developersnightmare.blogspot.com/2011/09/first-steps-with-rails-inheritance.html
I have enjoyed a lot by reading your blog. Really good to read. Thanks for sharing. buy domain name India
You can find scala version wordcount using latest hadoop apis at https://github.com/smishra/scala-hadoop
This site is excellent and so is how the subject matter was explained.I also like some of the comments too.Waiting for next post.
Schaumburg Personal Trainer
Ab Roller How to Get Rid of Leg Cellulite How to Get Rid of Cellulite Fast How to Get rid of Cellulite Naturally Can You Get Rid of Cellulite Without Exercise How to Get Rid of Cellulite on Thighs and Bum CELLULITE TREATMENT CELLULITE ON STOMACH EXERCISE STOMACH CELLULITE
I'm satisfied to seek out so many helpful information here in the submit, we need develop extra techniques in this regard, thank you for sharing. PTFE
Post a Comment