Thursday, April 25, 2013

MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems 1st edition, Donald Miner



In the 1990s O'Reilly books had a well-earned reputation for quality. O'Reilly authors such as Simson Garfinkel explained technical topics with precision, clarity, and wit. I proudly kept a whole shelf of O'Reilly books at work, and I imbibed copious java from their tenth anniversary mug. I'm sorry to see that O'Reilly's traditional quality has gone the way of the Internet bubble. MapReduce Design Patterns represents the absolute nadir of technical writing, and it never should have been published in its current form.

One of the most poorly written parts of the book is Appendix A on Bloom filters. As I was writing my original review of the book, I thought it might be helpful to point readers to a better explanation of the topic. Turning to Wikipedia as a potential reference, I was struck by the number of similarities between it and Appendix A. It now appears that this appendix plagiarizes the Wikipedia article "Bloom filter." To see this, compare the opening paragraph of the Wikipedia article (January 19, 2013) to the first two paragraphs of the book's appendix (which you can see in the sample pages here):

Wiki: A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. (Paragraph 1, sentence 1)

MRDP: Conceived by Burton Howard Bloom in 1970, a Bloom filter is a probabilistic data structure used to test whether a member is an element of a set. (Page 221, paragraph 1, sentence 1)

Wiki: False positive retrieval results are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". (Paragraph 1, sentence 2)

MRDP: While false positives are possible, false negatives are not. This means the result of each test is either a definitive "no" or "maybe." You will never get a definitive "yes." (Page 221, paragraph 2, sentences 2 - 4)

Wiki: Elements can be added to the set, but not removed (though this can be addressed with a counting filter). (Paragraph 1, sentence 3)

MRDP: With a traditional Bloom filter, elements can be added to the set, but not removed. There are a number of Bloom filter implementations that address this limitation, such as a Counting Bloom Filter, but they typically require more memory. (Page 221, paragraph 2, sentences 5 and 6)

Wiki: The more elements that are added to the set, the larger the probability of false positives. (Paragraph 1, sentence 4)

MRDP: As more elements are added to the set, the probability of false positives increases. (Paragraph 2, sentence 7)
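For readers who have not met the data structure that both sets of passages describe, here is a minimal sketch of a Bloom filter. It is only an illustration of the behavior quoted above, not the book's code or Hadoop's implementation; the class name, the `might_contain` method, and the salted SHA-256 hashing scheme are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array of size m probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                  # number of bits in the filter
        self.k = k                  # number of hash probes per element
        self.bits = [False] * m

    def _positions(self, item: str):
        # Derive k bit positions by salting a single hash with the probe index.
        # (An illustrative choice, not the scheme used by any real library.)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Elements can be added, but bits are never cleared, so nothing
        # can be removed from a plain Bloom filter.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in set"; True means only "maybe",
        # because other elements may have set the same bits.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for word in ["hadoop", "pig", "hive"]:
    bf.add(word)
```

Every added word is reported as a possible member, which is the "no false negatives" property. As more words are added, more bits are set, so an unrelated word is increasingly likely to find all of its positions already occupied, which is the rising false-positive rate that both texts mention.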

When confronted with examples like these, authors typically claim that the similarities are due to their unintentionally copying verbatim from their notes. While that may be true in some cases, it is the task of the publisher to see that problems like this are corrected before books are released. Clearly the authors and the editors at O'Reilly have failed to diagnose this problem and provide a timely appendectomy. The result is a book with a fatal case of appendicitis left to die a humiliating death in the marketplace.

Although MapReduce Design Patterns would have benefitted from an appendectomy, such an operation would have been insufficient to restore the book to good health. For much of the book suffers from a sort of write-once-copy-everywhere mentality that leads to dreadful writing and programming. A few choice examples should suffice to illustrate this point.

Until the book's penultimate chapter every example except two includes this pattern of statements:

"The following descriptions of each code section explain the solution to the problem.
Problem: ..."

Apparently it occurred to neither the authors nor the editors that it might be premature to refer to "the problem" and its solution before that problem had been stated. And certainly no one thought to ask whether or not the first sentence of the pattern clearly sets forth what's coming next in the book. Yet through the magic of the Ctrl-C, Ctrl-V sequence, this statement appears dozens of times throughout the book.

The first hint of an editorial hand finally appears at the beginning of the Generating Data Examples section of Chapter 7, where at last we find the statement of a problem in paragraph form followed by our now familiar sentence. Unfortunately, the book's remaining four examples revert to the authors' text design pattern with an ungrammatical twist:

"The sections below with its corresponding code explain the following problem.
Problem: ..."

Perhaps a NullWritable object would have made a better editor.

Fortunately, not all of the book's wretched writing is as annoying as this. Some of it, such as this garbled thought from page 185, is hilarious:

"There is no implementation for any of the overridden methods, or for methods requiring return values return basic values."

Programmers may be amused by how the class MRDPUtils seems to appear and disappear randomly with the invocation of the method transformXmlToMap() in the book's code examples. They may also laugh at the erroneous comments in the source code on pages 20, 23, 26, and 29. Since the book's sample code contains the same errors, one might begin to wonder if anyone read or tested that code after it was written. Considering the map() method of the UserIdReputationEnrichmentMapper class given on page 165, that seems unlikely. An astute reader will easily see that this method emits the wrong key, and testing certainly would have revealed it. Since the map() method's actual output clearly contradicts the specification for the reducer implementation on the same page, the problem could have been spotted by a conscientious editor.

Almost two decades have passed since Simson Garfinkel typed "buy more O'Reilly books" in an example in one of his books. After reading MapReduce Design Patterns, I no longer agree with his recommendation. Readers who are interested in this topic will do well to look elsewhere for more information on the subject.

This book is a good catalog of the different patterns any big data solutions programmer should know in order to do the job effectively. While the authors admit that writing some of these patterns as a map/reduce job on Hadoop can be counterproductive when tools like Pig are available, they make the compelling argument that understanding these patterns is still important.

The technical examples in the book are sometimes missing blocks of code, which, while easily derived, may be a source of frustration for some readers. (I have my implementations of the exercises on GitHub, under my username of cfeduke; I learn best by doing, so keying in and executing examples is paramount.)

I had a moderate level of experience with Hadoop, from 0.18 to 1.x, before tackling this book. I felt that it taught me a fair amount about the guts of writing a map/reduce job, though without a solid foundation in Hadoop I might have found the examples difficult to grok.

The authors chose to use Stack Overflow community data to demonstrate the patterns presented, and I felt that was an excellent decision, as it's easy to derive other queries to answer - and implement - once you have some knowledge of the corpus.

The book gives a good introduction to MapReduce design patterns. But what I found really missing are good examples.
I had studied Jimmy Lin's book [...] before I read this one, and it gives some really good examples of algorithm design. I was hoping to find something that focused on how some of the design patterns can be leveraged to implement more complicated, non-trivial algorithms in MapReduce more effectively.
But I feel that the book uses some fairly straightforward algorithms to explain the patterns and does not go deep.
Another thing that I did not like is that the book is far too Hadoop-specific and ignores other MapReduce implementations that are becoming very popular.
Overall, the book is a good step toward introducing patterns and algorithms in a more systematic manner in the MapReduce programming paradigm. It gives a good survey of some of the emerging areas in the last few chapters. The chapter on Meta Patterns was my favorite, as it gives some good introductory material on building more complicated pipelines using MapReduce and on how one could take steps to optimize the runtime of bigger pipelines.

Product Details :
Paperback: 230 pages
Publisher: O'Reilly Media; 1 edition (December 22, 2012)
Language: English
ISBN-10: 1449327176
ISBN-13: 978-1449327170
Product Dimensions: 7.5 x 0.6 x 9.2 inches

