Monday, April 22, 2013

Hadoop: The Definitive Guide 3rd edition, Tom White



Hadoop is a pretty complex technology for even seasoned engineers to grasp and appreciate fully. Attempting to explain its core concepts and usage in a book is no small feat, but I think the author did an admirable job in capturing the essence of Hadoop and the surrounding landscape. The thing that makes Hadoop so fascinating but so hard to fully grasp is that you need to understand its surrounding, complementary technologies to truly understand what Hadoop is and why it is so popular.

Can this book serve as a beginner's guide? I am not sure. I have read a few Hadoop blogs and articles and have some prior hello-world setup experience with Hadoop, and yet I couldn't always follow the book. It is definitely not a beginner's book with foolproof, detailed instructions to set up and run every example. It is, however, an excellent book for introducing readers to the world of Hadoop: what Hadoop really is, what it involves, and the complementary technologies that integrate with or build on top of Hadoop and make it even more useful.

I walk away from this book with a much better understanding of the inner workings of Hadoop (HDFS, MapReduce), a solid grasp of its surrounding technologies (Pig, Hive, HBase) and a much better appreciation of the power of Hadoop, especially when used alongside its many complementary technologies. This is not a beginner's introductory book, nor does it cover any high-level data analysis or BI solution scenarios. This is also not an admin/configuration guide to set up, design and maintain complex Hadoop clusters. But if you read this book with the right expectations, you won't be disappointed.

My take on the current state of Hadoop is that it is still in its infancy, with an overly complex set of technologies functioning at a pretty low level. In due time, Hadoop will form the backbone distributed technology but will be pretty much shielded from, and invisible to, most users. Higher-level data analysis solutions and real-time queries will be the new rage, powered by Hadoop in the background. I am looking forward to the next battleground!

This book is far from being a beginner's book. It is far from being organized, and far from what I expect of a book sold on Amazon under the O'Reilly brand (which I will carefully consider from now on).

First of all, there are no simple guidelines or walkthroughs. Nothing about which external jars to add and where, nothing about order or systematic learning. Just 600 pages full of details, without any order of importance.

I bought this book for a project at work, to prototype a log analysis system using Hadoop. I haven't bought very many technical books in the last few years, but the quality of most online documentation for Hadoop is poor and books seemed like a better option. This book is considered the "bible" for Hadoop. It was useful, and I kept it open on my desk for quite a while as I worked to get the infrastructure set up. Consider it a high-level intro to lots of different Hadoop topics, and you'll be happy with it. Just don't expect it to answer all of your questions. You'll probably still end up doing a lot of digging through other online sources, because the Hadoop ecosystem is large and complicated, and no book can really cover all of it. Besides this book, I also bought Hadoop In Action (not quite as big as this book, but a useful counter-point) and Data Intensive Text Processing With MapReduce (which gave me a good intro to the Map Reduce algorithm, but wasn't that useful once I had a general idea what was going on).

This book has it all: breadth, depth, great descriptions and code. And at the same time great readability. I have been using Hadoop for two years and still find items that I had not fully digested the first time around. It is a SOLID five hundred plus pages, covering: Hadoop overview, setup and configuration, cluster config, troubleshooting, filesystem/compression/serialization formats, MapReduce and algorithms, and very good beefy sections on the major ecosystem components (HBase/Hive/Pig/Hadoop/ZooKeeper/Sqoop).

I have owned and read many hundreds of technical books in 25 years and I can't bring to mind any other single reference that is of such quality and completeness.

If you're looking to learn about what Hadoop is, all of the buzzwords/terms you've heard about (e.g. HDFS, MapReduce), and get an overview of software in the Hadoop ecosystem (Pig, Hive, etc.), this is a good book that will give you a good overview and pointers in the right direction.

However, the book isn't going to give you a lot of detail on programming MapReduce and things like that.

In other words, it's a good breadth book, not a good depth book. So YMMV depending on what you're looking for.

I bought the previous edition of this book and gave it 4 stars. I bought this newer edition looking for information about Hadoop 2.0, YARN, and all of the new stuff coming out. It provided a little bit of information about this, but overall was lacking in these details. So I notched it down 1 star because of that. It was just too much duplicate information from the prior edition.

I purchased this book a few months ago based on many earlier 5-star reviews. I had high hopes that it would be as good as those reviewers claimed. However, the book is actually unbelievably poorly organized - essentially written in a spaghetti fashion. Yes - it contains a lot of information about Hadoop, but with three basic issues: 1) examples are trivial and hard to get working due to insufficient, unclear or no procedures; 2) many subjects (e.g. streaming) are spread over several chapters and readers have to stitch them together after reading all relevant chapters; and 3) many statements are either inaccurate or lack supporting data. Ironically, one has to apply MapReduce to all the subjects in order to sort them into a more logical order. I look forward to a 4th edition with significant quality improvements.

I bought this book as a very experienced programmer but with no prior experience with Hadoop, which I need to come up to speed on for a new project. I am extremely disappointed in the book and feel I wasted my money. If there's one thing you want from a book on a new technology, it's the ability to get a basic "Hello World" equivalent program running, from which you can then start iterating. This book completely falls down on this most basic requirement - when you get to the very first example program in the book, it tells you that you need to first compile a bunch of example code from the book's website. That shouldn't be required, but ok, whatever. Then when you go to the book's website, you are told that you first need to install a bunch of extra stuff covered later in the book before you can compile the libraries apparently needed to get anything at all to run. This really makes no sense at all - there's no way I should be having to read all the later chapters to figure out what these things are in order to get my very first example program running. Tossed it into the trash and off in search of a resource done by someone who understands how to structure a tutorial properly.

I read the book with attention mainly to Hadoop's underlying premises and platform architecture, and note that this review focuses on the book itself, not the subject of Hadoop in general.

Firstly, I agree with the reviewer noting the book's a "mishmash". It's rather unorganized and thus presented poorly in that it delivers a series of ad-hoc "how tos". After three editions, this should have been remedied.

But what I feel is the largest shortcoming is that, while the author certainly seems to demonstrate deep knowledge of Hadoop and its related projects, he makes numerous assertions about underlying platform concepts that are either unsubstantiated or completely incorrect. Given the complexities and efforts expected of large-scale, distributed systems, this is a critical weakness.

For example, page 3 under "Data Storage and Analytics" (and available under "Look Inside") illustrates a naïve and incorrect understanding of disk performance; research "understanding IOPS" to understand why this is. Ironically, actual and not theoretical performance would likely be worse than what he outlines, so had he provided perhaps just a tad more accuracy, he would not only have maintained credibility, but also in turn made a stronger case for the limitations of disk I/O (albeit rotating in this context). This is not splitting hairs since, by his own statement, the focal point of Hadoop is mitigating mass storage and processing scalability bottlenecks, and Hadoop is the focal point of the book. Foundational knowledge of the problem itself, such as how to measure disk performance, is expected.
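
To give a feel for the IOPS argument, here is a minimal Python sketch. It is not taken from the book, and every number in it (the 8.5 ms average seek time, the 7200 RPM spindle, the 4 KiB request size) is a hypothetical assumption chosen only for illustration. It uses the common rule of thumb that a rotating disk completes roughly one random request per average seek plus half a rotation, and it ignores transfer time, caching and queuing.

```python
def random_iops(avg_seek_ms: float, rpm: float) -> float:
    """Rule-of-thumb random IOPS for one rotating disk (transfer time ignored)."""
    # On average the platter must turn half a revolution before the data arrives.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


def random_throughput_mb_s(iops: float, request_kib: float) -> float:
    """Throughput in MB/s when every request is a separate random access."""
    return iops * request_kib * 1024.0 / 1_000_000.0


if __name__ == "__main__":
    # Hypothetical drive parameters, assumed for illustration only.
    iops = random_iops(avg_seek_ms=8.5, rpm=7200)
    print(f"approx. random IOPS: {iops:.1f}")
    print(f"approx. MB/s at 4 KiB requests: {random_throughput_mb_s(iops, 4):.2f}")
```

Under these assumptions the drive manages well under 1 MB/s on small random reads, which is the sense in which real-world performance can fall far below a quoted sequential transfer rate.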

His knowledge of RAID concepts is also demonstrably quite lacking, and various RAID levels have to date been the standard mechanism to speed up disk I/O and mitigate consequences of disk failure. HDFS has its own counterpart to RAID so a definitive guide to Hadoop must provide a definitive understanding of RAID. Again, this is squarely within the scope of the book so to expect the author to understand the topic is not unreasonable, but unfortunately here too his credibility suffers.

Page 3 also describes "how RAID works", but even that statement is inherently inaccurate. In practice, "RAID" itself isn't an absolute term and must be accompanied by a level, and certain levels serve completely different purposes (research "RAID levels"); his comment would be accurate rephrased as "how RAID 1 (mirroring) works". Later, in chapter 9, he does in fact refer to RAID 0, but then states that RAID 0 "is" (as opposed to "may be") slower than JBOD with HDFS. Regardless of whether that could be the case or not, it's presented as fact and, inexplicably, he offers a hyperlink to an email outlining a brief, one-off experiment as "proof". This is far from scientific or objective; to extrapolate a single cause from such an "experiment" is tantamount to junk science. The authors of the experiment's results themselves didn't even offer it as conclusive.

He also makes careless logical and mathematical generalizations, like in the following statement: "[i]n JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk.". That is not a true statement because if all of the disks are same speed (however that's measured...) then mean speed and each disk's speed would be equal. Furthermore, "[d]isk performance often shows considerable variation in practice, even for disks of the same model.". Period. End of story. No evidence, no citations, not even a logical proof. Nothing. A completely subjective and baseless assertion that the reader is expected to simply accept. This pattern unfortunately permeates the entire text.
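
The point about the mean is easy to check. Here is a minimal Python sketch with made-up throughput figures (the function name and the numbers are hypothetical, not measurements from the book): the average is strictly greater than the slowest disk only when the disks actually differ.

```python
from statistics import mean


def mean_vs_slowest(throughputs_mb_s):
    """Return (average, slowest, average_is_strictly_greater) for a set of disks."""
    average = mean(throughputs_mb_s)
    slowest = min(throughputs_mb_s)
    return average, slowest, average > slowest


if __name__ == "__main__":
    # Hypothetical per-disk throughputs in MB/s, assumed for illustration only.
    print(mean_vs_slowest([100.0, 100.0, 100.0]))  # identical disks
    print(mean_vs_slowest([90.0, 100.0, 110.0]))   # disks that vary
```

With identical disks the mean equals the slowest disk, so the accurate wording would be "greater than or equal to".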

His recommendation of JBOD, however, applies only to a certain class of Hadoop servers and for another he does in fact recommend RAID. Whether that reflects general consensus, I don't know, but after claiming that JBOD under HDFS outperforms RAID 0, he adds that JBOD is superior also because "if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to become unavailable.". I'm sure that gave a chuckle to those who possess even the most basic understanding of RAID levels and level nesting. And besides the proposition being simply false at face value, it also logically contradicts his suggestion a few paragraphs prior that RAID should be used, albeit for a certain server role, but used nevertheless. Whether he's gaming his RAID explanations to suit a particular purpose or he's playing fast-and-loose with terms he doesn't understand is unclear, but what is clear is that his information is unreliable.
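
To spell out why the quoted statement only holds for some levels, here is a minimal Python sketch of the number of disk failures each common RAID level is guaranteed to survive. The table and function names are my own hypothetical helpers, and RAID 10 is assumed to mean striped two-way mirrors.

```python
# (minimum disks, guaranteed failures tolerated as a function of disk count)
RAID_RULES = {
    "RAID 0": (2, lambda n: 0),       # striping only, no redundancy
    "RAID 1": (2, lambda n: n - 1),   # n-way mirror survives all but one disk
    "RAID 5": (3, lambda n: 1),       # single parity
    "RAID 6": (4, lambda n: 2),       # double parity
    "RAID 10": (4, lambda n: 1),      # assumes striped two-way mirrors
}


def guaranteed_failures_tolerated(level: str, disks: int) -> int:
    """Worst-case number of disk failures the array survives."""
    minimum_disks, rule = RAID_RULES[level]
    if disks < minimum_disks:
        raise ValueError(f"{level} needs at least {minimum_disks} disks")
    return rule(disks)


def survives_single_disk_failure(level: str, disks: int) -> bool:
    """True if losing any one disk leaves the array available."""
    return guaranteed_failures_tolerated(level, disks) >= 1


if __name__ == "__main__":
    for level, disks in [("RAID 0", 4), ("RAID 1", 2), ("RAID 5", 4),
                         ("RAID 6", 6), ("RAID 10", 4)]:
        print(level, survives_single_disk_failure(level, disks))
```

Of these levels only RAID 0 loses the whole array when a single disk fails.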

Another example includes asserting a SAN impacts data center bandwidth. With virtually no exceptions, SANs run over dedicated Fibre Channel links, not "the network", and thus "network bandwidth" potentially being a bottleneck, as he describes, is completely inapplicable.

He refers to a "1 GB" switch in several places and we're left to assume it's actually "1 Gbps". Similarly, references to "rational" rather than "relational" databases appear repeatedly early in the book. Misprints or not, they further erode credibility.

"Linear scalability" through parallel processing is a repeated reference, but at any scale--from multicore to thousand-node grids--engineers know that Amdahl's Law proves this is simply not possible. "Less non-linear" or a similar description would be accurate and not mislead the reader to believe doubling compute doubles speedup.

Ultimately, I'm disappointed in the extremely limited depth the author demonstrates in understanding distributed systems and even simple computing fundamentals. Perhaps these topics have been rushed and perhaps other flaws are attributable to the publisher, but they are so central to the subject that to speak to them at all requires speaking to them intelligently and scientifically. Because the author unfortunately indicates little of either, I cannot recommend this book and will instead seek credibility on the subject elsewhere.

There are many great reviews for Tom White's latest incarnation of Hadoop: The Definitive Guide, and almost all of them are right, whether they call this book "the bible" or a "mishmash of many concepts or chapters". The fact is that Hadoop is a fast, ever-changing field, and whether you are a beginner or have plenty of experience building and deploying large Hadoop clusters, this is one of the few books that has most if not all of the fundamentals you need for day-to-day development and operations on Hadoop.

For example, if you are creating a Hive table, you can quickly jump to the Hive section near the end of the book to review your options of sequence files, Avro, or RCFiles. As you dig into the book, you realize that you may need to dive into the concepts of compression, so you'll flip back to the beginning of the book, which discusses all of your different compression options. Admittedly, an end-to-end approach would be much easier to read, but I'm not sure that this would be possible. After all, there are many different types of developers jumping into Hadoop. For a database developer, the Hive-to-compression approach would be more natural, while a coder would prefer going into the details first and then worrying about the abstractions later.

But the key thing here is that most (if not all) of the information is there so you can build your solution. Whether you are diving into Hadoop or are a regular user, this is the reason you will get this book!

Product Details :
Paperback: 688 pages
Publisher: O'Reilly Media; 3 edition (May 26, 2012)
Language: English
ISBN-10: 1449311520
ISBN-13: 978-1449311520
Product Dimensions: 7 x 1.4 x 9.2 inches
