Merge lp:~gholt/swift/statsreportdocs into lp:~hudson-openstack/swift/trunk

Proposed by gholt
Status: Merged
Approved by: Chuck Thier
Approved revision: 56
Merged at revision: 56
Proposed branch: lp:~gholt/swift/statsreportdocs
Merge into: lp:~hudson-openstack/swift/trunk
Diff against target: 106 lines (+94/-2)
1 file modified
doc/source/admin_guide.rst (+94/-2)
To merge this branch: bzr merge lp:~gholt/swift/statsreportdocs
Reviewer: Chuck Thier (community)
Review status: Approve
Review via email: mp+32921@code.launchpad.net

Description of the change

Cluster health monitoring docs

Chuck Thier (cthier) wrote :

Thanks Greg!

review: Approve

Preview Diff

1=== modified file 'doc/source/admin_guide.rst'
2--- doc/source/admin_guide.rst 2010-07-30 19:57:20 +0000
3+++ doc/source/admin_guide.rst 2010-08-17 19:41:23 +0000
4@@ -108,8 +108,100 @@
5 Cluster Health
6 --------------
7
8-TODO: Greg, add docs here about how to use swift-stats-populate, and
9-swift-stats-report
10+There is a swift-stats-report tool for measuring overall cluster health. This
11+is accomplished by checking if a set of deliberately distributed containers and
12+objects are currently in their proper places within the cluster.
13+
14+For instance, a common deployment has three replicas of each object. The health
15+of that object can be measured by checking if each replica is in its proper
16+place. If only 2 of the 3 are in place, the object's health can be said to be
17+at 66.67%, where 100% would be perfect.
18+
19+A single object's health, especially an older object, usually reflects the
20+health of the entire partition the object is in. If we make enough objects on
21+a distinct percentage of the partitions in the cluster, we can get a reasonably
22+good estimate of the overall cluster health. In practice, about 1% partition
23+coverage seems to balance well between accuracy and the amount of time it takes
24+to gather results.
25+
26+The first thing that needs to be done to provide this health value is to create a
27+new account solely for this usage. Next, we need to place the containers and
28+objects throughout the system so that they are on distinct partitions. The
29+swift-stats-populate tool does this by making up random container and object
30+names until they fall on distinct partitions. Last, and repeatedly for the life
31+of the cluster, we need to run the swift-stats-report tool to check the health
32+of each of these containers and objects.
33+
34+These tools need direct access to the entire cluster and to the ring files
35+(installing them on an auth server or a proxy server will probably do). Both
36+swift-stats-populate and swift-stats-report use the same configuration file,
37+/etc/swift/stats.conf. Example conf file::
38+
39+ [stats]
40+ auth_url = http://saio:11000/v1.0
41+ auth_user = test:tester
42+ auth_key = testing
43+
44+There are also options for the conf file for specifying the dispersion coverage
45+(defaults to 1%), retries, concurrency, CSV output file, etc., though usually
46+the defaults are fine.
47+
48+Once the configuration is in place, run `swift-stats-populate -d` to populate
49+the containers and objects throughout the cluster.
50+
51+Now that those containers and objects are in place, you can run
52+`swift-stats-report -d` to get a dispersion report, or the overall health of
53+the cluster. Here is an example of a cluster in perfect health::
54+
55+ $ swift-stats-report -d
56+ Queried 2621 containers for dispersion reporting, 19s, 0 retries
57+ 100.00% of container copies found (7863 of 7863)
58+ Sample represents 1.00% of the container partition space
59+
60+ Queried 2619 objects for dispersion reporting, 7s, 0 retries
61+ 100.00% of object copies found (7857 of 7857)
62+ Sample represents 1.00% of the object partition space
63+
64+Now I'll deliberately double the weight of a device in the object ring (with
65+replication turned off) and rerun the dispersion report to show what impact
66+that has::
67+
68+ $ swift-ring-builder object.builder set_weight d0 200
69+ $ swift-ring-builder object.builder rebalance
70+ ...
71+ $ swift-stats-report -d
72+ Queried 2621 containers for dispersion reporting, 8s, 0 retries
73+ 100.00% of container copies found (7863 of 7863)
74+ Sample represents 1.00% of the container partition space
75+
76+ Queried 2619 objects for dispersion reporting, 7s, 0 retries
77+ There were 1763 partitions missing one copy.
78+ 77.56% of object copies found (6094 of 7857)
79+ Sample represents 1.00% of the object partition space
80+
81+You can see the health of the objects in the cluster has gone down
82+significantly. Of course, I only have four devices in this test environment; in
83+a production environment with many devices, the impact of one device change
84+is much less. Next, I'll run the replicators to get everything put back into
85+place and then rerun the dispersion report::
86+
87+ ... start object replicators and monitor logs until they're caught up ...
88+ $ swift-stats-report -d
89+ Queried 2621 containers for dispersion reporting, 17s, 0 retries
90+ 100.00% of container copies found (7863 of 7863)
91+ Sample represents 1.00% of the container partition space
92+
93+ Queried 2619 objects for dispersion reporting, 7s, 0 retries
94+ 100.00% of object copies found (7857 of 7857)
95+ Sample represents 1.00% of the object partition space
96+
97+So that's a summary of how to use swift-stats-report to monitor the health of
98+a cluster. There are a few other things it can do, such as performance
99+monitoring, but those are currently in their infancy and little used. For
100+instance, you can run `swift-stats-populate -p` and `swift-stats-report -p` to
101+get performance timings (warning: the initial populate takes a while). These
102+timings are dumped into a CSV file (/etc/swift/stats.csv by default) and can
103+then be graphed to see how cluster performance is trending.
104
105 ------------------------
106 Debugging Tips and Tools
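
The health percentages in the dispersion reports above are plain arithmetic: copies found divided by copies expected. In the degraded run, 2619 sampled objects at 3 replicas each give 7857 expected copies, and 1763 partitions missing one copy leave 6094 found. The following is a minimal Python sketch of that calculation, assuming the summary lines keep the format shown in the example output. The names `health_percent` and `parse_report` are hypothetical helpers for illustration and are not part of Swift.

```python
import re

# Matches summary lines such as:
#   "77.56% of object copies found (6094 of 7857)"
# The pattern is an assumption based on the example output in the diff above.
COPIES_LINE = re.compile(
    r"(?P<pct>\d+\.\d+)% of (?P<kind>container|object) copies found "
    r"\((?P<found>\d+) of (?P<expected>\d+)\)"
)


def health_percent(found, expected):
    """Return the share of expected copies that were found, as a percentage."""
    if expected <= 0:
        raise ValueError("expected must be a positive number of copies")
    return 100.0 * found / expected


def parse_report(text):
    """Map 'container'/'object' to (found, expected, recomputed percentage)."""
    results = {}
    for match in COPIES_LINE.finditer(text):
        found = int(match.group("found"))
        expected = int(match.group("expected"))
        results[match.group("kind")] = (
            found,
            expected,
            health_percent(found, expected),
        )
    return results


# Sample text copied from the degraded report in the diff above.
SAMPLE = """\
Queried 2621 containers for dispersion reporting, 8s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space

Queried 2619 objects for dispersion reporting, 7s, 0 retries
There were 1763 partitions missing one copy.
77.56% of object copies found (6094 of 7857)
Sample represents 1.00% of the object partition space
"""

if __name__ == "__main__":
    for kind, (found, expected, pct) in parse_report(SAMPLE).items():
        print(f"{kind}: {found} of {expected} copies found ({pct:.2f}%)")
```

A script along these lines could be fed the saved output of `swift-stats-report -d` to flag runs where the object or container percentage drops below a chosen threshold; the threshold itself is a local policy decision and is not defined by the tool.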