Velocity 2011 - Part 1: Workshops
My notes on the workshop day at the Velocity Conference.
A lot of Chef stuff, but of course Opscode was a sponsor. Real gems were Decisions in the Face of Uncertainty and Advanced Postmortem Fu and Human Error 101.
Also read about the first and second day.
Workshop: OpenStack
- Compute Nodes (hardware and virtualization agnostic)
- Storage Nodes ("Swift") for HTTP Object Storage; "Glance" is the image service
- Rest APIs: OpenStack API, EC2 Compatibility Module
- Open & Modular design
- Storage
- http://swift.openstack.org/
- Distributed Shared-nothing Architecture
- HTTP only
- Availability Zones provide independent outage scenarios
- Put data into >=3 different availability zones
- Swift is independent from OpenStack and can be used stand-alone (see the sketch below)
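Since the API is plain HTTP, storing and fetching an object can be sketched with nothing but HTTP calls. This is my own minimal illustration, not from the workshop; the storage URL, token, container and object names are placeholders that would normally come from the Swift auth service.

```python
import requests

# Placeholder values - in a real setup the storage URL and token come from the auth service.
storage_url = "http://swift.example.com:8080/v1/AUTH_demo"
headers = {"X-Auth-Token": "AUTH_tk_example"}

# Create a container (PUT is idempotent).
requests.put(storage_url + "/backups", headers=headers)

# Upload an object; Swift replicates it into >= 3 availability zones behind the scenes.
with open("db-dump.tar.gz", "rb") as f:
    resp = requests.put(storage_url + "/backups/db-dump.tar.gz", headers=headers, data=f)
print(resp.status_code)  # 201 on success

# Read it back - plain HTTP GET.
obj = requests.get(storage_url + "/backups/db-dump.tar.gz", headers=headers)
```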
- OpenStack installation
- Using Chef to script the installation
- There are many cookbooks for automated OpenStack setups
- Mostly Chef advertising, no detailed setup & installation examples :-(
Workshop: Scale Dirty
- Yet Another Chef Advertising Session
Workshop: Decisions in the Face of Uncertainty
Just Enough Statistics to be Dangerous - John Rauser, Principal Quantitative Engineer, Amazon
- Any numerical information without a stated precision is worthless. All numbers derived from the real world are actually estimates
- Example: How old is Jeff Bezos?
- Wrong Answers: 47, 55, …
- Correct Answers: 42 - 55, 40 - 50, 45 +/- 5
- Always give a range unless forced to give precise numbers!
- BTW, Jeff is 47
- Statistics is a method to calibrate our estimates
- Measurements are a tool to reduce uncertainty
- Classroom exercise
- Participants estimate the answer to 10 questions
- What would the distribution of right answers (x out of 10 right) look like if the probability of a right answer were equal for each question?
- The binomial distribution is the statistical tool that helps us here, see http://www.wolframalpha.com/input/?i=binomial+distribution for the definition
- The audience turns out to be a badly calibrated estimator for this task: on average people get 4 answers right, while according to the binomial distribution the average should be much, much higher.
- Conclusion: for each question we have to find a suitable way to calibrate the estimates → statistical inference (see the sketch below)
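As a small illustration of the binomial argument (my own sketch, not from the workshop): if every answer were a well-calibrated 90% interval, i.e. an assumed p = 0.9 per question, the expected number of right answers out of 10 would be 9 - far above the observed average of 4.

```python
from math import comb

n, p = 10, 0.9  # 10 questions; p = 0.9 is an assumed per-question probability (well-calibrated 90% intervals)

# Binomial probability mass function: P(exactly k right answers out of n)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

for k in range(n + 1):
    print(k, round(binom_pmf(k, n, p), 4))

print("expected right answers:", n * p)  # 9.0 - much higher than the observed ~4
```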
- John told his personal story of how he became interested in statistics and data analysis
- Before the advent of computers, analytical statistics was the only way to reach results that require lots of calculations.
- With computers we can use direct simulation to find a simple answer to the question "what happens if we run the experiment many, many times?"
- Statistical Inference
- Randomize data production, find a random process that generates the data
- Repeat by simulation
- Reject any model that does not agree with the data (see the simulation sketch below)
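A minimal direct-simulation sketch of that recipe, again using the 10-question exercise as the random process (the per-question probability is an assumed parameter): simulate the experiment many times, look at what the process generates, and reject a hypothesis that cannot plausibly produce the observed average of 4.

```python
import random

def run_experiment(n_questions=10, p_right=0.9):
    """One simulated participant: count right answers when each question
    is answered correctly with probability p_right."""
    return sum(random.random() < p_right for _ in range(n_questions))

# Repeat by simulation: what does the process generate if people were well calibrated (p = 0.9)?
trials = 100_000
results = [run_experiment(p_right=0.9) for _ in range(trials)]

print("simulated average:", sum(results) / len(results))                 # ~9
print("P(4 or fewer right):", sum(r <= 4 for r in results) / len(results))  # vanishingly small
# The observed classroom average of 4 is essentially impossible under p = 0.9,
# so we reject that model: the audience is not well calibrated.
```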
- Decisions in the face of uncertainty by the example of estimating the number of business cards in a stack.
- You had to be there - a nice rollercoaster between math, statistics and life experience as a data analyst
Workshop: Advanced Postmortem Fu and Human Error 101
- John Allspaw, etsy.com
- The "System" you operate also contains people, not only hardware & software
- Postmortems rely on having good data to analyze
- Each graph needs to be put into context by marking important events on it (e.g. deployments); see the sketch at the end of this section
- Rich internal communications (IRC, Blog, Twitter) act as a flight recorder, everything is timestamped
- Define and discuss various crisis patterns
- Human error is an inevitable by-product of strained complex systems.
- pre-mortems are better than post-mortems: How to prepare for new features
- contingency planning
- what could go wrong?
- Just culture
- How to live with and embrace human error
- The culture required to perform blameless post-mortems
- Problem: negligence is often found during an outage; usually the amount of negligence found corresponds with the severity of the outage
- Holding people accountable != Blaming people
- No bad apples, only bad theories of error
- Increase Accountability by supporting learning
- Organizational Roots: Accountability = Responsibility + Requisite Authority
- The culture of an organization has great influence
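A small sketch of the "mark important events on the graph" idea (my own illustration, not from the talk): plot a metric and draw a vertical marker for every deployment, so the graph can be read in context during a postmortem. The data points and deployment times below are made up.

```python
import matplotlib.pyplot as plt

# Made-up response-time samples (one per minute) and hypothetical deployment times.
response_ms = [120, 125, 118, 130, 420, 410, 380, 150, 140, 135]
deploy_minutes = [3, 7]

plt.plot(range(len(response_ms)), response_ms, label="response time (ms)")
for m in deploy_minutes:
    plt.axvline(m, linestyle="--", color="red")                      # mark the deployment
    plt.annotate("deploy", (m, max(response_ms)), rotation=90)       # label the marker
plt.xlabel("minute")
plt.ylabel("ms")
plt.legend()
plt.savefig("response-time-with-deploys.png")
```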
Workshop: Hadoop
- Hadoop: Open Source Storage and Processing Engine
- MapReduce for processing
- Hadoop Distributed File System (HDFS) for distributed storage
- Hadoop separates distributed system fault-tolerance code from application logic (see the word-count sketch below)
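To make the "application logic only" point concrete, here is a minimal word count as it could be written for Hadoop Streaming, which runs plain scripts as mapper and reducer over stdin/stdout while Hadoop handles distribution, retries and shuffling. File names and paths are my own placeholders, not from the workshop.

```python
#!/usr/bin/env python
# mapper.py - emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts per word can be summed in one pass
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

A job like this would be submitted via the streaming jar, roughly `hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`; exact jar and HDFS paths depend on the installation.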
- Gotchas:
- Configuration and version divergence within a cluster. This can lead to hard-to-catch bugs.
- Cluster state: is it up, network partitioning, …
- Cloudera Service and Configuration Manager (SCM)
- Available to Cloudera customers
- Integrated configuration and service management for Hadoop services
- Process supervision: what processes are running where
- Configuration management, with Hadoop-specific dependencies
- No plans right now to open source the SCM!
- Related work
- Google’s cluster manager
- Procfile & Foreman
- LinkedIn’s glu - https://github.com/linkedin/glu/
- Hadoop planning tips:
- NameNode and JobTracker often on beefier hardware
- Configure disks as JBOD
- Gigabit Ethernet
- Top of rack switches
- Avoid virtualization
- Hadoop installation tips:
- CentOS 5 / RHEL 5 most common
- Oracle JVM, bugs are known and worked around
- Mount noatime
- Adjust swappiness
- Use Cloudera’s Distribution (CDH3), install as .rpm or .deb
- Brings all relevant components for the Hadoop ecosystem in a tested and compatible fashion
- Hue, Oozie, Hive, Flume, Sqoop, Pig, HBase, Zookeeper
- Hadoop configuration tips:
- Use source control
- XML files *-site.xml and hadoop-env.sh
- Most important config items (see the example snippet below):
- dfs.name.dir (NameNode): typically two volumes + NFS (mounted correctly)
- dfs.data.dir (DataNodes): one directory per physical hard disk
- mapred.tasktracker.map.tasks.maximum: max number of map tasks per machine (1 per core)
- mapred.tasktracker.reduce.tasks.maximum: max number of reduce tasks per machine (1/3 per core)
- Hadoop requires DNS with correct reverse lookups.
- IPv6: Everyone turns it off
- Secondary NameNode not checkpointing: logs grow forever.
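As an illustration of how those items end up in the XML config files (my own sketch; the values are made up and need tuning for the actual hardware):

```xml
<!-- hdfs-site.xml (illustrative values) -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>

<!-- mapred-site.xml (example for an 8-core machine: 8 map slots, 3 reduce slots) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```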