Colin McNamara – CCIE 18233, VCP, EMCIE, NCDA, GEEK

Technical reviews and articles from a CCIE with extensive experience in designing and implementing converged enterprise networks.


I used to be fat – How I beat the bulge

January 13th, 2012 · Health

I used to be obese. I tipped the scales at 290 pounds, and could barely walk up a flight of stairs without getting short of breath.

The picture on the left was taken back then. I was staying in a hotel 110 nights a year and was on a flight at least twice a week. Living the jet-set life was bringing my life to a quick end.

While I can talk all day about the journey from fat to skinny(er), I’d like to take the chance to share a couple of key tools that helped me shed the pounds.

Getting back on the bike

I didn’t get to be 290 pounds in one day. I gained it one day at a time, while sitting at a desk typing on a computer (while eating something incredibly yummy). The human body is an amazing machine that reacts well to physical activity.

I personally found that getting back onto a bicycle provided a way for me to burn some calories while doing something that is very enjoyable. It also provided a physical activity that wasn’t as hard on my joints (which at 290 is a big risk) as running.

I ended up dragging one of my old racing bikes out of the garage; however, that is not necessary for everybody. You burn just as many calories on a 150 dollar Walmart bike as you do riding a 5,000 dollar custom bike. What is important is that you are out being active, not what you are being active on.

What can’t be measured can’t be improved

Nobody wants to hear that you need to track your calories and weight to lose weight. But here it is: you have to track your calories and weight. Sorry, I know it sucks, but you have to do it. Losing weight is simple math. Take in fewer calories than you need each day and you lose. Eat more than you need and you gain. How do you find out what your magic number of calories per day is? It is simple. You have what is called your Basal Metabolic Rate (BMR) and the calories burned during daily activity. Put those both together and that is your calorie budget for the day. Now you just have to find some tool to track it.
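The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the function names and every number in it are made up, and the BMR is taken as an input rather than calculated.

```python
def daily_calorie_budget(bmr, activity_calories):
    """Calorie budget for the day: BMR plus calories burned through activity."""
    return bmr + activity_calories


def daily_balance(calories_eaten, bmr, activity_calories):
    """Negative means a deficit (losing), positive means a surplus (gaining)."""
    return calories_eaten - daily_calorie_budget(bmr, activity_calories)


# Hypothetical numbers, for illustration only.
budget = daily_calorie_budget(bmr=2000, activity_calories=500)
balance = daily_balance(calories_eaten=2200, bmr=2000, activity_calories=500)
print(budget)   # 2500
print(balance)  # -300
```

A negative balance at the end of the day means you ate less than your budget.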

I have used a tool called the daily burn tracker for a couple years now. There is a free option that allows you to track via a webpage, and also a low cost iPhone app that allows you to look up foods and log them throughout the day. This allows me to keep an eye on my food intake, and make sure that calories aren’t sneaking up on me.

After measuring the calories you put into yourself, it is important to measure the results. When I started losing weight I just used a spreadsheet to track my progress. As time moved on I got introduced to the Withings scale.

I have to say, this scale is awesome for the inner geek in you. It measures your weight, your body fat, and your BMI (Body Mass Index), all automatically. Not only does it do that, but it uploads your statistics via WiFi to a personal private account on www.withings.com. After your data is there you can set it up to sync to other services (such as the daily burn tracker mentioned before) or to Twitter if you are up for some public support and/or embarrassment.

One other item that Withings makes, and that I use, is the blood pressure monitor.

This plugs into your iPhone or iPad and automatically takes your blood pressure and resting pulse. This is uploaded to the same interface that you use to view your weight and fat percentages. I find that it provides yet another window into the state of my health, and also provides a great feedback loop when I am training too hard (my resting heart rate in the morning will be elevated).

You have to find balance and enjoy yourself

It is easy to become myopic in focus and become consumed with hitting a calorie goal each and every day.

While it is good to be focused, it is important to remember that becoming fat didn’t happen in a day; it took time. The same is true of getting skinny. It is a long road, and it is ok to have fun for a day, enjoy some drinks and a good meal in moderation and have a good time. Just remember to get back on track the next day, capture those calories and continue on the road to the skinny you.


Tuning Hadoop and Hadoop driven applications

December 28th, 2011 · Hadoop

Hadoop is an open source framework for processing and querying big data on clusters of commodity hardware. It grew out of the open source search engine Nutch, and from 2006 was developed largely at Yahoo as a clone of Google File System (GFS) and Google’s MapReduce framework, storing web search indexes and crawl data.

In the last few years, however, developers have embraced MapReduce (the ability to map input into key/value pairs and reduce them, as small bite-size computing chunks distributed across hybrid storage/processing nodes), and have begun developing a vast array of applications that can utilize the distributed storage and compute capacity.
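To make the map and reduce steps concrete, here is a minimal single-process sketch of the classic word count in Python. It does not use any Hadoop API, and all of the function names are invented for illustration. In a real cluster the map and reduce calls would run on different nodes and the framework would handle the grouping step in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data on commodity hardware", "big clusters big data"])
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across as many nodes as you have data for.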

My Background with Hadoop

Back in 2006 I was working for a startup in San Diego that did high dimensional mathematical analysis of financial transactions to quantify identity theft risk. Over the time I was there we went from a scale-up batch system serving 10,000 transactions a day to a scale-out web service (today you would call it a cloud) that served millions of transactions a day, each in under 250 milliseconds.

To scale to that size under such strict latency requirements it was necessary to experiment with and implement some pretty cutting edge open source technologies. I cheated off the notes of Jeremy Zawodny at Yahoo almost daily (thanks Jeremy, your knowledge and tools totally saved my butt many times). At that same time Jeremy’s team started doing some interesting work around distributed computing with Hadoop. Needless to say, this was a technology I had to try. Hadoop was extremely young at the time; however, for certain analytics workloads I was able to use 10 PCs to outperform a half million dollars in compute and fibre channel storage.

Flash forward 6 years – Hadoop is all grown up

hadoop-hbase-extended-applications-2

Over the past six years not only have Hadoop’s file system (HDFS) and processing (MapReduce) capabilities matured, but a suite of applications has been developed. These include tools to manage Hadoop clusters, large scale log analysis tools, scale-out analytics packages and large scale distributed database applications.

The list of clients using Hadoop has grown too. This ranges from Yahoo, eBay and Facebook to enterprise customers like Fox, T-Mobile, Equifax and the New York Stock Exchange using Greenplum (Project R running on Hadoop). No longer is Hadoop a tool for a select few; it is now the next logical extension of the standard web service LAMP stack, and increasingly useful for Data Warehouse workloads.

Tuning the foundation – Hadoop and MapReduce

hadoop-mapreduce-red-3

Many times when people talk about tuning parallel compute clusters like Hadoop, Sun Grid Engine or LSF, they forget the obvious: squeezing out performance is about managing the delicate balance between applications and infrastructure. When tuning that balance, you first have to separate applications that directly access the hardware resources from applications that access those applications. To create a frame of reference, think of the relationship between Apache, MySQL and disks in a LAMP architecture.

When dealing with Hadoop Distributed File System (HDFS) and the MapReduce jobs that run on it, there are three primary dimensions of tuning. These dimensions are:

1. Tuning keystones in the infrastructure, such as optimizing NameNode and Job Tracker server performance (note memory sizing, TCP performance, CPU scaling).

2. Optimizing transfer of data between slave nodes in the HDFS cluster (note, bundled 1 gig / CPU)

3. Balancing I/O systems in slave nodes such as memory, server side flash, and spinning disk.

Optimizing NameNode and Job Tracker server performance

The NameNode in a Hadoop cluster is used to track the locations of the different file shards distributed across all slave nodes in the cluster. It is also used to house metadata for certain applications that reside in the Hadoop cluster. This puts specific strain on CPU, Memory and Network interfaces.

CPU / Network Interface

Certain processes inside the NameNode do not take advantage of the multitude of cores available on today’s servers. The biggest offender in this case is the RPC server, which processes network requests in a serial manner. Using the fastest CPU possible, in conjunction with low latency network adapters such as the Mellanox MNPH29D-XTR 10 Gig NIC and low latency fabric switches such as the Nexus 5548, has a significant effect on minimizing bottlenecks due to serialization delay of RPC requests.

Memory

NameNodes can use a lot of memory when servicing HDFS alone. The addition of layered applications on top of HDFS that utilize the NameNode as well as the increase in file numbers in HDFS only increase the importance of sufficient amounts of high speed memory.
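A back-of-the-envelope sketch of how NameNode memory grows with file count is below. It relies on a commonly quoted rule of thumb of roughly 150 bytes of heap per file, directory or block held in the namespace. That figure is an assumption brought in for illustration, not something measured for this article, and the function name is made up.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_heap_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Rough NameNode heap needed to hold the HDFS namespace in memory."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


# 10 million single-block files
estimate = namenode_heap_bytes(10_000_000)
print(estimate / 1e9)  # 3.0 (GB)
```

The point is the shape of the curve rather than the exact number: memory scales with the count of files and blocks, not with the amount of data stored, so many small files cost far more NameNode memory than a few large ones.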

Optimizing transfers between nodes in the HDFS cluster

Certain types of jobs such as sorts and greps (the basis for index generation) move significant amounts of data between nodes in the Hadoop cluster. Since the inception of Intel’s Nehalem processor family, single gigabit interfaces have presented bottlenecks when transmitting and receiving data. This inserts “slack time” in the cluster, reducing the time that a slave node is actually processing data. The net result is either slower job completion / response times or the unnecessary addition of nodes to the cluster (increasing your cost per job/transaction).

Impact of server bandwidth on job completion time

hadoop-data-transfer-bonding

To illustrate this point, consider this test done by Intel on their own Hadoop cluster with a first generation Nehalem processor. Even then a single gigabit interface was not sufficient to service a node. In this case doubling the bandwidth to two gigabit by bonding interfaces together rebalanced the node. However, if you follow Moore’s law, nodes utilizing Sandy Bridge CPUs (due for release some time in 2012) will need four-plus gigabits of network bandwidth during a data transfer to avoid unnecessary wait times. Luckily this generation of server will have 10 Gig adapters built into the motherboard.
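The effect of link speed on transfer time is simple arithmetic, sketched below in Python. The 100 GB data size is a hypothetical figure and the function name is invented. The calculation ignores latency and protocol overhead unless you pass an efficiency factor below 1.0.

```python
def transfer_seconds(data_gigabytes, link_gigabits_per_sec, efficiency=1.0):
    """Time to move data over a link at the given usable fraction of line rate."""
    data_gigabits = data_gigabytes * 8
    return data_gigabits / (link_gigabits_per_sec * efficiency)


# Hypothetical 100 GB of data arriving at one node.
for gbps in (1, 2, 10):
    print(gbps, "Gbit/s:", transfer_seconds(100, gbps), "seconds")
```

With these made-up inputs, the same transfer takes 800 seconds on a single gigabit link, 400 seconds on a bonded pair and 80 seconds on 10 Gig. Any time beyond what the CPU needs to process the data is time the node sits idle.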

Network bandwidth and design

cisco-ucs-hadoop-rack-4

HDFS and many of the applications that reside on top of it have the notion of a Rack ID. This can be used for fault isolation. For example, if you had A/B racks on different power feeds you could ensure that redundant data shards are stored on nodes in different racks, and therefore increase the system’s tolerance of faults.

This Rack ID can also be queried by higher level applications to ensure that jobs requiring high bandwidth data transfers are localized within, say, a pair of Nexus 5500s with 10 gig fabric extenders. This would minimize the utilization of typically oversubscribed uplinks north of the access layer, ensuring again that nodes are not sitting idle while receiving data.

If, however, your application requirements force you to expand jobs beyond the scaling capacity of a pair of low latency fabric extended switches, maximizing your active paths between pods (groupings with the same rack.id) can help minimize the wait time that a node may experience. Tools like FabricPath, which create a layer two mesh between your pods, are one way to do this.
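Hadoop can call an external script to resolve a node’s Rack ID. In the Hadoop releases current at the time of writing this is configured through the topology.script.file.name property, but check the property name for your version. The sketch below shows what such a script might look like in Python. The IP addresses and rack names are hypothetical; the only contract assumed is that Hadoop passes hosts as arguments and reads whitespace-separated rack paths from standard output.

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping of node IP to rack; replace with your own inventory.
RACKS = {
    "10.1.1.11": "/row-a/rack-1",
    "10.1.1.12": "/row-a/rack-1",
    "10.1.2.11": "/row-b/rack-2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host):
    """Return the rack path for one host, or the default rack if unknown."""
    return RACKS.get(host, DEFAULT_RACK)


if __name__ == "__main__":
    # One rack path is printed per host argument, in the same order.
    print("\n".join(rack_for(host) for host in sys.argv[1:]))
```

With A/B racks on different power feeds, mapping each feed to its own rack path is what lets HDFS place redundant copies on both sides.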

Minimizing I/O wait times in disk subsystems

There are many places where RAID or higher performance disk systems can yield benefits in the Hadoop cluster. One place is the MapReduce local directory. This is where mapped files are stored locally, and adding multiple disks to this mount is one option. A second option, which is gaining more and more traction, is utilizing Solid State Disk (SSD) or PCIe based flash cards to provide optimal I/O for certain functions.

hadoop-sort-io-double-disk-5

The graphic above, again from Intel, demonstrates in a very simple fashion the impact of going from two to four disks in a node (doubling the I/O). The result was completing the required job in half the time. In simple terms, increasing the cost of the server by 10% increased its sort performance by 100%. Again, performance increases vary by workload. However, in a strict sense this increases the per server cost while decreasing the cost per job / transaction.
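The cost per job claim can be checked with one division. The sketch below uses the 10% and 100% figures from the paragraph above. The function name is made up, and it assumes the server is kept busy so that throughput translates directly into jobs completed.

```python
def relative_cost_per_job(cost_multiplier, throughput_multiplier):
    """Cost per job relative to the baseline server (baseline = 1.0)."""
    return cost_multiplier / throughput_multiplier


# 10% more server cost, 100% more sort performance.
ratio = relative_cost_per_job(cost_multiplier=1.10, throughput_multiplier=2.0)
print(round(ratio, 2))  # 0.55
```

In other words, each job on the four disk server costs a little over half of what it cost on the two disk server.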

Reducing the Impact of Disk Spill

Disk spill is the result of the majority of the server’s buffer being full of data during the map operation. Once a certain percentage of utilization is hit (normally 80%), a job is kicked off to write this data to disk, making room for more data. Adding more memory to the server to be used as a buffer for the map operation minimizes disk spill ratios. This may increase the per server costs, but depending on your workload may end up lowering your cost per job/transaction due to more efficient operation of your nodes. A second option, first explored by Intel in their chipset design clusters, is to extend RAM into solid state cards inside of their servers.
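A simplified way to see how buffer size affects spilling is sketched below in Python. It assumes the sort buffer and threshold are the io.sort.mb and io.sort.spill.percent settings, with defaults of 100 MB and 0.80 in the Hadoop releases of this period. Treat both the names and the defaults as assumptions to verify against your version. The model ignores the part of the buffer reserved for record metadata, so real spill counts will differ somewhat.

```python
import math


def estimated_spills(map_output_mb, buffer_mb=100, spill_threshold=0.80):
    """Rough number of spill files one map task writes.

    A spill starts each time the buffer reaches the threshold, so each
    spill holds roughly buffer_mb * spill_threshold of map output.
    """
    per_spill_mb = buffer_mb * spill_threshold
    return math.ceil(map_output_mb / per_spill_mb)


# Hypothetical map task producing 400 MB of output.
print(estimated_spills(400))                 # 5
print(estimated_spills(400, buffer_mb=512))  # 1
```

Going from five spills to one means the map output is written to disk once instead of being written in pieces and then merged, which is where the extra memory pays for itself.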

Bringing it all together

Hadoop supports a suite of applications that are used by everyone from the world’s largest web service providers to large enterprises. Uses include Data Warehousing, Analytics, Log analysis and large horizontally scaled databases.

Similar to other parallel compute systems such as Sun Grid Engine or Platform LSF, a system wide approach to performance tuning must be used to ensure optimal performance as measured by cost per job / transaction. This system wide approach includes server optimizations for specific server roles, such as large memory and PCIe flash cards, as well as the use of network equipment and topologies, such as Nexus 5500 switches and fabric extenders, to create low latency, high bandwidth backplanes ideally suited for Hadoop clusters.

Want to learn more?

Yahoo developers network Hadoop blog – http://developer.yahoo.com/hadoop/

Hadoop Distributed File Systems Architecture Guide – http://hadoop.apache.org/common/docs/current/hdfs_design.html

Big Data Network Design Considerations (Cisco) – http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html

Hadoop and Hbase applications for read intensive search – http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/

Evolution of Google File System – http://queue.acm.org/detail.cfm?id=1594206

 



Tracking my progress during PBP 2011

August 16th, 2011 · Sports

For those of you who don’t know, I am a bit of a cycling fanatic. My events of choice are ultra-endurance events lasting 200 miles or more.

For the last two years I have been training with the goal of finishing the oldest bicycling event still run, Paris-Brest-Paris. First held in 1891, it now takes place every four years and is 1200 kilometers (roughly 750 miles) long. Roughly 700 of us Americans qualified to go this time, and I am one of them.

For those of you who want to track my progress while I am on the road, ACP (the group who runs PBP) has implemented an RFID tracking system that will track me as I move between checkpoints on the course. You can monitor my progress starting on Aug 21st at 6:00 pm CEST (Paris time) by going to the page below and entering the frame number 4500.

http://www.paris-brest-paris.org/pbp2011/index2.php?lang=en&cat=accueil&page=edito

Frame number = 4500

Bonne Route!  –Colin

—-UPDATE —-

My results are below – I came in at 76 hours and 57 minutes (well under the 90 hour time limit for my group).
