Velocity 2011 - Part 1: Workshops

My notes on the workshop day at the Velocity Conference.

A lot of Chef stuff, but of course Opscode was a sponsor. Real gems were Decisions in the Face of Uncertainty and Advanced Postmortem Fu and Human Error 101.

Also read about the first and second day.

Workshop: OpenStack

  • Compute Nodes (hardware and virtualization agnostic)
  • Storage Nodes ("Swift") for HTTP Object Storage; "Glance" is the image service
  • Rest APIs: OpenStack API, EC2 Compatibility Module
  • Open & Modular design
  • Storage
    • http://swift.openstack.org/
    • Distributed Shared-nothing Architecture
    • HTTP only
    • Availability Zones provide independent outage scenarios
    • Put data into >=3 different availability zones
    • Swift is independent of OpenStack and can be used stand-alone; a minimal HTTP sketch follows this list
  • OpenStack installation
    • Using Chef to script the installation
    • There are many cookbooks for automated OpenStack setups
    • Mostly Chef advertising, no detailed setup & installation examples :-(
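
Since Swift speaks nothing but HTTP, here is a minimal sketch of storing one object via a plain PUT. The proxy URL and auth token are hypothetical placeholders, not values from the workshop:

    # Store an object in Swift using nothing but HTTP.
    # SWIFT_URL and TOKEN are made-up placeholders.
    import urllib.request

    SWIFT_URL = "https://swift.example.com/v1/AUTH_demo"
    TOKEN = "AUTH_tk_0123456789abcdef"

    def put_object(container, name, data):
        """PUT the object; Swift replicates it across availability zones."""
        req = urllib.request.Request(
            url="%s/%s/%s" % (SWIFT_URL, container, name),
            data=data,
            method="PUT",
            headers={"X-Auth-Token": TOKEN},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status  # 201 Created on success

    print(put_object("backups", "notes.txt", b"velocity 2011"))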

Workshop: Scale Dirty

  • Yet Another Chef Advertising Session

Workshop: Decisions in the Face of Uncertainty

Just Enough Statistics to be Dangerous
  • John Rauser, Principal Quantitative Engineer, Amazon
  • Any numerical information without a stated precision is worthless; all numbers derived from the real world are actually estimates
    • Example: How old is Jeff Bezos?
    • Wrong Answers: 47, 55, …
    • Correct Answers: 42 - 55, 40 - 50, 45 +/- 5
    • Always give a range unless forced to give precise numbers!
    • BTW, Jeff is 47
  • Statistics is a method to calibrate our estimations
  • Measurements are a tool to reduce uncertainty
  • Classroom exercise
    • Participants estimate the answer to 10 questions
    • What would the distribution of right answers (x out of 10 right) look like if the probability of a right answer were equal for each question?
    • The binomial distribution is the most important statistical function that helps us here, see http://www.wolframalpha.com/input/?i=binomial+distribution for the definition
    • The audience turned out to be a badly calibrated estimator for this task: on average people got 4 answers right, while according to the binomial distribution the average should be much, much higher.
    • Conclusion: For each question we have to find a suitable way to calibrate the estimates. → Statistical Inference
  • John told his personal story of how he became interested in statistics and data analysis
  • Before the advent of computers, analytical statistics was the only way to reach results that would otherwise require lots of calculation.
  • With computers we can use direct simulation to find a simple answer to the question "what happens if we run the experiment many, many times?" (see the sketch after this list)
  • Statistical Inference
    • Randomize data production, find a random process that generates the data
    • Repeat by simulation
    • Reject any model that does not agree with the data
  • Decisions in the face of uncertainty, illustrated by the example of estimating the number of business cards in a stack
    • You had to be there: a nice rollercoaster between math, statistics and life experience as a data analyst
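
A direct-simulation sketch of the classroom exercise. The per-question hit probability p is my assumption: 0.9 would be a well-calibrated estimator for 90% confidence ranges, while 0.4 reproduces the audience's observed average of 4 out of 10:

    # Run the 10-question experiment many, many times and average the hits.
    # The average converges on the mean of the binomial distribution, n * p.
    import random

    def simulate(p, questions=10, trials=100000):
        total = 0
        for _ in range(trials):
            total += sum(random.random() < p for _ in range(questions))
        return total / trials

    print(simulate(0.9))  # ~9.0: what a calibrated estimator would score
    print(simulate(0.4))  # ~4.0: matches the audience's actual result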

Workshop: Advanced Postmortem Fu and Human Error 101

  • John Allspaw, etsy.com
  • The "System" you operate also contains people, not only hardware & software
  • Postmortem relies on having good data to analyze
  • Each graph needs to be put into context by marking important events on it (e.g. deployments); see the sketch after this list
  • Rich internal communications (IRC, Blog, Twitter) act as a flight recorder, everything is timestamped
  • Define and discuss various crisis patterns
  • Human error is an inevitable by-product of strained complex systems.
  • pre-mortems are better than post-mortems: How to prepare for new features
    • contingency planning
    • what could go wrong?
  • Just culture
    • How to live with and embrace human error
    • The culture required to perform blameless post-mortems
    • Problem: Negligence is often found during an outage, and the amount of negligence attributed usually corresponds with the severity of the outage
    • Holding people accountable != Blaming people
    • No bad apples, only bad theories of error
    • Increase Accountability by supporting learning
    • Organizational Roots: Accountability = Responsibility + Requisite Authority
  • The culture of an organization has great influence
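
To illustrate the point about putting graphs into context, here is a sketch that overlays deployment events on a metrics plot. The use of matplotlib and all data values are my own invention for illustration, not from the talk:

    # Mark deployment timestamps on a metrics graph so a postmortem reader
    # can correlate events with the metric. All data below is invented.
    import matplotlib.pyplot as plt

    minutes = list(range(60))
    latency_ms = [120 + (15 if t > 42 else 0) for t in minutes]  # fake metric
    deployments = [10, 42]                                       # fake deploy times

    plt.plot(minutes, latency_ms, label="p95 latency (ms)")
    for t in deployments:
        plt.axvline(t, linestyle="--", color="red")
        plt.annotate("deploy", xy=(t, max(latency_ms)))
    plt.xlabel("minutes")
    plt.legend()
    plt.savefig("latency_with_deploys.png")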

Workshop: Hadoop

  • Hadoop: Open Source Storage and Processing Engine
    • MapReduce for processing
    • Hadoop Distributed File System (HDFS) for distributed storage
    • Hadoop separates distributed system fault-tolerance code from application logic
  • Gotchas:
    • Configuration and version divergence within a cluster. This can lead to hard-to-catch bugs.
    • Cluster state: knowing whether the cluster is up, detecting network partitions, etc.
  • Cloudera Service and Configuration Manager (SCM)
    • Available to Cloudera customers
    • Integrated configuration and service management for Hadoop services
    • Process supervision, what processes are running where
    • Configuration management, with hadoop-specific dependencies
    • No plans right now to open source the SCM!
  • Related work
  • Hadoop planning tips:
    • NameNode and JobTracker often on beefier hardware
    • Configure disks as JBOD
    • Gigabit Ethernet
    • Top of rack switches
    • Avoid virtualization
  • Hadoop installation tips:
    • CentOS 5 / RHEL 5 most common
    • Oracle JVM, bugs are known and worked around
    • Mount filesystems with the noatime option
    • Adjust swappiness
    • Use Cloudera’s Distribution (CDH3), install as .rpm or .deb
      • Brings all relevant components for the Hadoop ecosystem in a tested and compatible fashion
      • Hue, Oozie, Hive, Flume, Sqoop, Pig, HBase, Zookeeper
  • Hadoop configuration tips:
    • Use source control
    • XML files *-site.xml and hadoop-env.sh
    • Most important config items (a sample hdfs-site.xml follows this list):
      • dfs.name.dir NameNode. Typically two volumes + NFS (mounted correctly)
      • dfs.data.dir DataNodes. One directory per physical hard disk
      • mapred.tasktracker.map.tasks.maximum Max number of map tasks per machine (1 per core)
      • mapred.tasktracker.reduce.tasks.maximum Max number of reduce tasks per machine (1/3 per core)
    • Hadoop requires DNS with correct reverse lookups.
    • IPv6: Everyone turns it off
    • If the secondary NameNode is not checkpointing, the edit logs grow forever.
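
A minimal hdfs-site.xml fragment wiring up the two dfs.* items above. The property names are the ones from the workshop; all mount points are invented placeholders:

    <?xml version="1.0"?>
    <!-- hdfs-site.xml: directory layout for NameNode and DataNodes.
         All paths below are made-up examples. -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <!-- two local volumes plus an NFS mount, comma-separated -->
        <value>/data/1/dfs/nn,/data/2/dfs/nn,/nfs/dfs/nn</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <!-- one directory per physical hard disk -->
        <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
      </property>
    </configuration>

The two mapred.tasktracker.*.tasks.maximum values go into mapred-site.xml in the same property/value form.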
