Merge ~hyask/autopkgtest-cloud:skia/stats_boot_attempts into autopkgtest-cloud:master

Proposed by Skia
Status: Merged
Merged at revision: d9a5d07124417719fbd9afdc379a5ceda7ed9273
Proposed branch: ~hyask/autopkgtest-cloud:skia/stats_boot_attempts
Merge into: autopkgtest-cloud:master
Diff against target: 182 lines (+94/-10)
1 file modified
dev-tools/stats.ipynb (+94/-10)
Reviewer Review Type Date Requested Status
Tim Andersson Approve
Review via email: mp+467811@code.launchpad.net

Description of the change

New analysis of boot attempts from the log files.

To post a comment you must log in.
Revision history for this message
Tim Andersson (andersson123) wrote :

One inline comment, which you can feel free to ignore. Other than that, approve!

review: Approve
Revision history for this message
Skia (hyask) wrote :

I've added the print of some stats when connecting to the DB. That should help know what's to analyze. Thanks for the feedback.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/charms/focal/autopkgtest-web/webcontrol/stats.ipynb b/dev-tools/stats.ipynb
2similarity index 69%
3rename from charms/focal/autopkgtest-web/webcontrol/stats.ipynb
4rename to dev-tools/stats.ipynb
5index 2dbaddc..4cc8158 100644
6--- a/charms/focal/autopkgtest-web/webcontrol/stats.ipynb
7+++ b/dev-tools/stats.ipynb
8@@ -22,7 +22,9 @@
9 "cell_type": "code",
10 "execution_count": null,
11 "id": "48a388c7",
12- "metadata": {},
13+ "metadata": {
14+ "scrolled": true
15+ },
16 "outputs": [],
17 "source": [
18 "%run ./stats.py --collect-stats --download-db --since-days-ago 4 --until-days-ago 0"
19@@ -32,7 +34,9 @@
20 "cell_type": "code",
21 "execution_count": null,
22 "id": "3d3540d9",
23- "metadata": {},
24+ "metadata": {
25+ "scrolled": true
26+ },
27 "outputs": [],
28 "source": [
29 "import sqlite3\n",
30@@ -40,10 +44,25 @@
31 "import matplotlib.pyplot as plt\n",
32 "\n",
33 "# Update this path with the corresponding path to the database you want to analyze\n",
34- "db_path = \"./autopkgtest_2024-06-03 09:55:39.367132_with_stats.db\"\n",
35+ "db_path = \"./autopkgtest_2024-06-18 15:20:42.817741_with_stats.db\"\n",
36 "\n",
37 "db = sqlite3.connect(f\"file:{db_path}?mode=ro\")\n",
38- "sqlite3.paramstyle = \"named\""
39+ "sqlite3.paramstyle = \"named\"\n",
40+ "\n",
41+ "with db as db_con:\n",
42+ " db_con.row_factory = sqlite3.Row\n",
43+ " \n",
44+ " query = \"\"\"\n",
45+ " SELECT COUNT(tests_stats.run_id) as count, concat(datacenter, '-', arch) as datacenter, arch\n",
46+ " FROM tests_stats\n",
47+ " JOIN result ON result.run_id=tests_stats.run_id\n",
48+ " JOIN test ON test.id=result.test_id\n",
49+ " GROUP BY datacenter, arch\n",
50+ " ORDER BY datacenter\n",
51+ " \"\"\"\n",
52+ " print(\"Showing the number of rows per datacenter-arch that have stats data:\")\n",
53+ " for row in db_con.execute(query):\n",
54+ " print(f\"\\t{row[\"datacenter\"]}, {row[\"count\"]}\")"
55 ]
56 },
57 {
58@@ -73,7 +92,7 @@
59 "execution_count": null,
60 "id": "d16523b1",
61 "metadata": {
62- "scrolled": true
63+ "scrolled": false
64 },
65 "outputs": [],
66 "source": [
67@@ -99,7 +118,7 @@
68 " # Plot point for each dc-arch over time\n",
69 " for dc in df['datacenter'].sort_values().unique():\n",
70 " dc_data = df[df['datacenter'] == dc]\n",
71- " plt.plot(dc_data['date'], dc_data['first_boot_time'], label=dc)\n",
72+ " plt.plot(dc_data['date'], dc_data['first_boot_time'], 'o-', label=dc)\n",
73 "\n",
74 " # Add some title and labels\n",
75 " plt.title('Time to first boot for each datacenter-arch over time')\n",
76@@ -119,10 +138,10 @@
77 "id": "2ceba19c",
78 "metadata": {},
79 "source": [
80- "## Cumulated boot attempts\n",
81+ "## Cumulated boot attempts for all datacenters\n",
82 "\n",
83 "The next cell show the cumulated boot attempts. Sometimes, the `nova` script is unable to reach the VM on first try, but will retry some amount of time, logging the failures. Depending on the particular issues, this can show some boot instabilities, network trouble, or anything. \n",
84- "Since this graph shows cumulated boot attempts, it actually kinda counts the number of time a job had to retry to boot successfully, since the `boot_attempts` is almost always 1. This graph isn't scaled to the number of runned jobs to get a relative percentage, so this is heavily dependent on the number of actually run jobs. Still, this is somehow useful to get a hunch of instabilities in some particular datacenters.\n",
85+ "This graph isn't scaled to the number of runned jobs to get a relative percentage, so this is heavily dependent on the number of actually run jobs. Still, this is somehow useful to get a hunch of instabilities in some particular datacenters.\n",
86 "\n",
87 "The same kind of tweaking to the query than the previous cell can be done here."
88 ]
89@@ -131,7 +150,9 @@
90 "cell_type": "code",
91 "execution_count": null,
92 "id": "e4906b31",
93- "metadata": {},
94+ "metadata": {
95+ "scrolled": true
96+ },
97 "outputs": [],
98 "source": [
99 "with db as db_con:\n",
100@@ -157,7 +178,7 @@
101 " # Plot point for each dc-arch over time\n",
102 " for dc in df['datacenter'].unique():\n",
103 " dc_data = df[df['datacenter'] == dc]\n",
104- " plt.plot(dc_data['date'], dc_data['boot_attempts'], label=dc)\n",
105+ " plt.plot(dc_data['date'], dc_data['boot_attempts'], 'o-', label=dc)\n",
106 "\n",
107 " # Add some title and labels\n",
108 " plt.title('Cumulated boot attempts for each datacenter-arch over time')\n",
109@@ -169,6 +190,69 @@
110 " plt.show()\n",
111 " print(df)\n"
112 ]
113+ },
114+ {
115+ "cell_type": "markdown",
116+ "id": "adfb1df3",
117+ "metadata": {},
118+ "source": [
119+ "## Boot attempts distribution for a single datacenter-arch\n",
120+ "\n",
121+ "When an issue arises on a particular datacenter-arch, this might be useful to get a glimpse at the overall behavior when spawning VMs, and answer questions like \"is it worth it to raise the number of retries to spawn a VM?\".\n",
122+ "\n",
123+ "Remember that this cells requires you to set the `datacenter` and `arch` at the beginning."
124+ ]
125+ },
126+ {
127+ "cell_type": "code",
128+ "execution_count": null,
129+ "id": "bd15df1d",
130+ "metadata": {
131+ "scrolled": true
132+ },
133+ "outputs": [],
134+ "source": [
135+ "with db as db_con:\n",
136+ " db_con.row_factory = sqlite3.Row\n",
137+ " \n",
138+ " datacenter = \"bos03\"\n",
139+ " arch = \"arm64\"\n",
140+ " \n",
141+ " query = f\"\"\"\n",
142+ " SELECT boot_attempts, substr(tests_stats.run_id, 1, 8) as date\n",
143+ " FROM tests_stats\n",
144+ " JOIN result ON result.run_id=tests_stats.run_id\n",
145+ " JOIN test ON test.id=result.test_id\n",
146+ " WHERE arch = '{arch}' AND datacenter = '{datacenter}'\n",
147+ " ORDER BY date\n",
148+ " \"\"\"\n",
149+ " df = pd.read_sql_query(query, db_con)\n",
150+ " # Get the date as datetime object\n",
151+ " df[\"date\"] = pd.to_datetime(df.date)\n",
152+ " # Get boot_attempts as integers\n",
153+ " df[\"boot_attempts\"] = pd.to_numeric(df.boot_attempts, downcast=\"integer\")\n",
154+ " # Handle NaN\n",
155+ " df = df.fillna(0)\n",
156+ " \n",
157+ " # Display data as a graph\n",
158+ " plt.figure(figsize=(14, 5))\n",
159+ "\n",
160+ " # Plot point for each dc-arch over time\n",
161+ " for ba in sorted(df['boot_attempts'].unique()):\n",
162+ " ba_data = df[df['boot_attempts'] == ba]\n",
163+ " ba_data = ba_data.groupby(\"date\").count()\n",
164+ " plt.plot(ba_data, 'o-', drawstyle='steps-post', label=ba)\n",
165+ "\n",
166+ " # Add some title and labels\n",
167+ " plt.title(f'Boot attempts counts for {datacenter}-{arch} over time')\n",
168+ " plt.xlabel('Date')\n",
169+ " plt.ylabel('Boot attempts counts')\n",
170+ " plt.legend()\n",
171+ "\n",
172+ " # Plot the graph\n",
173+ " plt.show()\n",
174+ " print(df)\n"
175+ ]
176 }
177 ],
178 "metadata": {
179diff --git a/charms/focal/autopkgtest-web/webcontrol/stats.py b/dev-tools/stats.py
180similarity index 100%
181rename from charms/focal/autopkgtest-web/webcontrol/stats.py
182rename to dev-tools/stats.py

Subscribers

People subscribed via source and target branches