Merge lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks into lp:~bigdata-dev/charms/trusty/apache-hadoop-client/trunk

Proposed by Adam Israel
Status: Merged
Merge reported by: Cory Johns
Merged at revision: not available
Proposed branch: lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks
Merge into: lp:~bigdata-dev/charms/trusty/apache-hadoop-client/trunk
Diff against target: 266 lines (+213/-0)
8 files modified
README.md (+44/-0)
actions.yaml (+38/-0)
actions/parseTerasort.py (+54/-0)
actions/teragen (+21/-0)
actions/terasort (+49/-0)
hooks/benchmark-relation-changed (+3/-0)
hooks/install (+2/-0)
metadata.yaml (+2/-0)
To merge this branch: bzr merge lp:~aisrael/charms/trusty/apache-hadoop-client/benchmarks
Reviewer: Juju Big Data Development (review pending)
Review via email: mp+260526@code.launchpad.net

Description of the change

This merge proposal adds support for benchmarking and implements a 'terasort' benchmark. It introduces two external dependencies: python-pip (which may already be installed via other requirements) and charm-benchmark, which is installed via pip.
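For example, the terasort parameters defined in actions.yaml (see the diff below) can be overridden when the action is invoked; a quick sketch, using the unit name from the README:

    $ juju action do apache-hadoop-client/0 terasort maps=4 reduces=4
    $ juju action fetch --wait 0 <action-id>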

80. By Adam Israel

Add a Benchmarking section to the README

Revision history for this message
Cory Johns (johnsca) wrote :

Awesome! Thanks for this. See my two inline comments, though, regarding the /etc/environment issues you ran into.
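A minimal sketch of one possible alternative, assuming JAVA_HOME and HADOOP_HOME are the only variables the scripts need from /etc/environment:

    # Source once in the root shell, then pass the expanded values explicitly
    # to the ubuntu shell rather than re-sourcing inside the heredoc.
    . /etc/environment
    su ubuntu -c "JAVA_HOME=$JAVA_HOME hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen 10000000 tera_demo_in"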

Revision history for this message
Cory Johns (johnsca) wrote :

Merged this, with some modifications, into ~bigdata-dev.

Revision history for this message
Cory Johns (johnsca) wrote :

I should have clarified. We decided it made more sense to apply this to the apache-hadoop-plugin charm rather than -client, since the plugin charm now serves as the general connection point, a role previously filled by -client.
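Roughly, with -plugin as the connection point, the same benchmark would be driven like this (unit name illustrative, not taken from this proposal):

    $ juju action do apache-hadoop-plugin/0 terasort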

Preview Diff

=== modified file 'README.md'
--- README.md 2015-05-19 16:00:07 +0000
+++ README.md 2015-05-28 21:05:22 +0000
@@ -22,6 +22,50 @@
     juju ssh client/0
     hadoop jar my-job.jar
 
+## Benchmarking
+
+You can run a terasort benchmark to gauge the performance of your environment:
+
+    $ juju action do apache-hadoop-client/0 terasort
+    Action queued with id: cbd981e8-3400-4c8f-8df1-c39c55a7eae6
+    $ juju action fetch --wait 0 cbd981e8-3400-4c8f-8df1-c39c55a7eae6
+    results:
+      meta:
+        composite:
+          direction: asc
+          units: ms
+          value: "206676"
+      results:
+        raw: '{"Total vcore-seconds taken by all map tasks": "439783", "Spilled Records":
+          "30000000", "WRONG_LENGTH": "0", "Reduce output records": "10000000", "HDFS:
+          Number of bytes read": "1000001024", "Total vcore-seconds taken by all reduce
+          tasks": "50275", "Reduce input groups": "10000000", "Shuffled Maps ": "8", "FILE:
+          Number of bytes written": "3128977482", "Input split bytes": "1024", "Total
+          time spent by all reduce tasks (ms)": "50275", "FILE: Number of large read operations":
+          "0", "Bytes Read": "1000000000", "Virtual memory (bytes) snapshot": "7688794112",
+          "Launched map tasks": "8", "GC time elapsed (ms)": "11656", "Bytes Written":
+          "1000000000", "FILE: Number of read operations": "0", "HDFS: Number of write
+          operations": "2", "Total megabyte-seconds taken by all reduce tasks": "51481600",
+          "Combine output records": "0", "HDFS: Number of bytes written": "1000000000",
+          "Total time spent by all map tasks (ms)": "439783", "Map output records": "10000000",
+          "Physical memory (bytes) snapshot": "2329722880", "FILE: Number of write operations":
+          "0", "Launched reduce tasks": "1", "Reduce input records": "10000000", "Total
+          megabyte-seconds taken by all map tasks": "450337792", "WRONG_REDUCE": "0",
+          "HDFS: Number of read operations": "27", "Reduce shuffle bytes": "1040000048",
+          "Map input records": "10000000", "Map output materialized bytes": "1040000048",
+          "CPU time spent (ms)": "195020", "Merged Map outputs": "8", "FILE: Number of
+          bytes read": "2080000144", "Failed Shuffles": "0", "Total time spent by all
+          maps in occupied slots (ms)": "439783", "WRONG_MAP": "0", "BAD_ID": "0", "Rack-local
+          map tasks": "2", "IO_ERROR": "0", "Combine input records": "0", "Map output
+          bytes": "1020000000", "CONNECTION": "0", "HDFS: Number of large read operations":
+          "0", "Total committed heap usage (bytes)": "1755840512", "Data-local map tasks":
+          "6", "Total time spent by all reduces in occupied slots (ms)": "50275"}'
+    status: completed
+    timing:
+      completed: 2015-05-28 20:55:50 +0000 UTC
+      enqueued: 2015-05-28 20:53:41 +0000 UTC
+      started: 2015-05-28 20:53:44 +0000 UTC
+
 
 ## Contact Information
 
=== added directory 'actions'
=== added file 'actions.yaml'
--- actions.yaml 1970-01-01 00:00:00 +0000
+++ actions.yaml 2015-05-28 21:05:22 +0000
@@ -0,0 +1,38 @@
+teragen:
+  description: Generate sample data with teragen for the terasort benchmark.
+  params:
+    size:
+      description: The number of 100-byte rows to generate; the default produces roughly 1GB of data to generate and sort.
+      type: string
+      default: "10000000"
+    indir:
+      description: The HDFS directory in which to store the generated data.
+      type: string
+      default: 'tera_demo_in'
+terasort:
+  description: Run the terasort benchmark, generating input data first if needed.
+  params:
+    indir:
+      description: The HDFS directory containing the input data to sort.
+      type: string
+      default: 'tera_demo_in'
+    outdir:
+      description: The HDFS directory in which to store the sorted output.
+      type: string
+      default: 'tera_demo_out'
+    size:
+      description: The number of 100-byte rows to generate; the default produces roughly 1GB of data to generate and sort.
+      type: string
+      default: "10000000"
+    maps:
+      description: The default number of map tasks per job. 1-20.
+      type: integer
+      default: 1
+    reduces:
+      description: The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Try 1-20.
+      type: integer
+      default: 1
+    numtasks:
+      description: How many tasks to run per JVM. If set to -1, there is no limit.
+      type: integer
+      default: 1
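With a juju 1.x client, the action schema above can be sanity-checked once the charm is deployed, e.g.:

    $ juju action defined apache-hadoop-client --schema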
=== added file 'actions/parseTerasort.py'
--- actions/parseTerasort.py 1970-01-01 00:00:00 +0000
+++ actions/parseTerasort.py 2015-05-28 21:05:22 +0000
@@ -0,0 +1,54 @@
+#!/usr/bin/env python
+"""
+Simple script to parse terasort's counter output
+and reformat it as JSON for sending back to juju
+"""
+import sys
+import subprocess
+import json
+from charmhelpers.contrib.benchmark import Benchmark
+import re
+
+
+def action_set(key, val):
+    """Set an action result, flattening nested dicts into dotted keys."""
+    action_cmd = ['action-set']
+    if isinstance(val, dict):
+        for k, v in val.iteritems():
+            action_set('%s.%s' % (key, k), v)
+        return
+
+    action_cmd.append('%s=%s' % (key, val))
+    subprocess.check_call(action_cmd)
+
+
+def parse_terasort_output():
+    """
+    Parse the output from terasort and set the action results.
+    """
+    results = {}
+
+    # Find all of the interesting things (Hadoop's tab-indented Key=Value counters)
+    regex = re.compile(r'\t+(.*)=(.*)')
+    for line in sys.stdin.readlines():
+        m = regex.match(line)
+        if m:
+            results[m.group(1)] = m.group(2)
+    action_set("results.raw", json.dumps(results))
+
+    # Calculate what's important: total CPU plus GC time, lower is better
+    if 'CPU time spent (ms)' in results:
+        composite = int(results['CPU time spent (ms)']) + int(results['GC time elapsed (ms)'])
+        Benchmark.set_composite_score(
+            composite,
+            'ms',
+            'asc'
+        )
+    else:
+        print "Invalid test results"
+        print results
+
+
+if __name__ == "__main__":
+    parse_terasort_output()
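For context, the regex in parseTerasort.py targets the tab-indented Key=Value counters Hadoop prints at the end of a job run, for example (values from the sample run in the README above):

    	CPU time spent (ms)=195020
    	GC time elapsed (ms)=11656

The composite score is the sum of those two counters: 195020 + 11656 = 206676 ms, matching the composite value shown in the README.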
=== added file 'actions/teragen'
--- actions/teragen 1970-01-01 00:00:00 +0000
+++ actions/teragen 2015-05-28 21:05:22 +0000
@@ -0,0 +1,21 @@
+#!/bin/bash
+set -eux
+SIZE=`action-get size`
+IN_DIR=`action-get indir`
+
+benchmark-start
+
+# I don't know why, but have to source /etc/environment before and after
+# invoking the bash shell to get it working.
+. /etc/environment
+su ubuntu << EOF
+. /etc/environment
+if JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${IN_DIR}; then
+    JAVA_HOME=${JAVA_HOME} hadoop fs -rm -r -skipTrash ${IN_DIR} || true
+fi
+
+JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR}
+
+EOF
+
+benchmark-finish
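teragen can also be run on its own to pre-populate input data at a non-default size, e.g. (50,000,000 rows of 100 bytes is roughly 5GB):

    $ juju action do apache-hadoop-client/0 teragen size=50000000 indir=tera_demo_in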
=== added file 'actions/terasort'
--- actions/terasort 1970-01-01 00:00:00 +0000
+++ actions/terasort 2015-05-28 21:05:22 +0000
@@ -0,0 +1,49 @@
+#!/bin/bash
+IN_DIR=`action-get indir`
+OUT_DIR=`action-get outdir`
+SIZE=`action-get size`
+OPTIONS=''
+
+MAPS=`action-get maps`
+REDUCES=`action-get reduces`
+NUMTASKS=`action-get numtasks`
+
+OPTIONS="${OPTIONS} -D mapreduce.job.maps=${MAPS}"
+OPTIONS="${OPTIONS} -D mapreduce.job.reduces=${REDUCES}"
+OPTIONS="${OPTIONS} -D mapreduce.job.jvm.numtasks=${NUMTASKS}"
+
+mkdir -p /opt/terasort
+chown ubuntu:ubuntu /opt/terasort
+run=`date +%s`
+
+# HACK: the environment reset below is munging the PATH
+OLDPATH=$PATH
+
+
+# I don't know why, but have to source /etc/environment before and after
+# invoking the bash shell to get it working.
+. /etc/environment
+su ubuntu << EOF
+. /etc/environment
+
+mkdir -p /opt/terasort/results/$run
+
+# If there's no data generated yet, create it using the action defaults
+if ! JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${IN_DIR} &> /dev/null; then
+    JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar teragen ${SIZE} ${IN_DIR} > /dev/null
+
+fi
+
+# If there's already sorted data, remove it
+if JAVA_HOME=${JAVA_HOME} hadoop fs -stat ${OUT_DIR} &> /dev/null; then
+    JAVA_HOME=${JAVA_HOME} hadoop fs -rm -r -skipTrash ${OUT_DIR} || true
+fi
+
+benchmark-start
+JAVA_HOME=${JAVA_HOME} hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar terasort ${OPTIONS} ${IN_DIR} ${OUT_DIR} &> /opt/terasort/results/$run/terasort.log
+benchmark-finish
+
+EOF
+PATH=$OLDPATH
+
+cat /opt/terasort/results/$run/terasort.log | python $CHARM_DIR/actions/parseTerasort.py
=== added file 'hooks/benchmark-relation-changed'
--- hooks/benchmark-relation-changed 1970-01-01 00:00:00 +0000
+++ hooks/benchmark-relation-changed 2015-05-28 21:05:22 +0000
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+relation-set benchmarks=terasort
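This advertises the charm's available benchmarks to the service on the other end of the benchmark relation. A sketch of how a hypothetical consumer charm might read it in its own benchmark-relation-changed hook:

    #!/bin/bash
    # Inside a relation hook, relation-get defaults to the remote unit
    BENCHMARKS=$(relation-get benchmarks)
    juju-log "remote unit advertises benchmarks: ${BENCHMARKS}"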
=== modified file 'hooks/install'
--- hooks/install 2015-05-11 22:25:12 +0000
+++ hooks/install 2015-05-28 21:05:22 +0000
@@ -1,2 +1,4 @@
 #!/bin/bash
+apt-get install -y python-pip && pip install -U charm-benchmark
+
 hooks/status-set blocked "Please add relation to apache-hadoop-plugin"
=== added symlink 'hooks/upgrade-charm'
=== target is u'install'
=== modified file 'metadata.yaml'
--- metadata.yaml 2015-05-12 22:18:09 +0000
+++ metadata.yaml 2015-05-28 21:05:22 +0000
@@ -12,3 +12,5 @@
   hadoop-plugin:
     interface: hadoop-plugin
     scope: container
+  benchmark:
+    interface: benchmark
