Merge ~jac-karwowski/dpcs:clustering_a_16 into dpcs:master

Proposed by Jacek
Status: Merged
Approved by: Marek Bardoński
Approved revision: 2ce5c861c3fa79da7fe077f57dc116752d89e4bd
Merge reported by: Marek Bardoński
Merged at revision: not available
Proposed branch: ~jac-karwowski/dpcs:clustering_a_16
Merge into: dpcs:master
Diff against target: 36 lines (+26/-0)
1 file modified
docs/Clustering_A.txt (+26/-0)
Reviewer          Review Type   Date Requested   Status
Marek Bardoński                                  Approve
Review via email: mp+288229@code.launchpad.net
Marek Bardoński (bdfhjk) wrote :

It's a professional review of the current state-of-the-art algorithms. Your ideas seem promising!

I think a good part of our job will be to implement constrained spectral clustering in the scikit-learn project. Let's think about it.

review: Approve

Preview Diff

diff --git a/docs/Clustering_A.pdf b/docs/Clustering_A.pdf
new file mode 100644
index 0000000..ef55b27
Binary files /dev/null and b/docs/Clustering_A.pdf differ
diff --git a/docs/Clustering_A.txt b/docs/Clustering_A.txt
new file mode 100644
index 0000000..84c2a9d
--- /dev/null
+++ b/docs/Clustering_A.txt
@@ -0,0 +1,26 @@
+To solve our problem of classifying crashes, we first have to decide how to represent them in our machine learning algorithm. The server receives a report with several fields containing:
+1) The name and version of the crashed application, along with its exit code
+2) System version information (kernel and system version, installed modules)
+3) stderr output, consisting of several lines of text
+
+The first two are easy to feed to the classifier, as they are primarily numbers or proper names (libraries and applications), but the third, which is just as important, is a variable-length blob of text. This is where, in my opinion, the paragraph2vec algorithm (an extension of word2vec) becomes useful.
+
+My idea is to run the paragraph2vec algorithm on the text data, extend the resulting vector with the information from 1) and 2), and then use constrained spectral clustering to obtain labels.
+
+A comparison [3] shows that spectral clustering is currently one of the best clustering algorithms, and with constrained spectral clustering (SC) [4] we can incorporate prior knowledge. Our data is high-dimensional, but with spectral clustering this shouldn't be much of a problem thanks to its PCA-like dimensionality-reduction step. We will probably have to modify the original approach described in the paper, since our task requires “cannot-link” constraints, as opposed to the “must-link” constraints described in [4], which indicate that two elements are in the same cluster.
+
+Paragraph2vec (doc2vec) is already implemented in the gensim package (Python). I couldn't find any Python implementation of constrained spectral clustering, so we may have to implement it ourselves.
+
+Links:
+
+Short explanation of spectral analysis: https://www.youtube.com/watch?v=P-LEH-AFovE
+Word2vec introduction paper: http://arxiv.org/pdf/1411.2738v1.pdf
+Word2vec introduction IPython notebook: https://github.com/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb
+
+[1] The p2v algorithm, with a description of the w2v algorithm: http://arxiv.org/pdf/1405.4053v2.pdf
+[2] A comparison between p2v and other text analysis algorithms, plus an improvement idea for p2v: http://arxiv.org/pdf/1507.07998v1.pdf
+[3] A comparison between different clustering algorithms: http://arxiv.org/pdf/1511.09123v1.pdf
+[4] Constrained spectral clustering overview: https://dl.acm.org/citation.cfm?id=1148241&dl=ACM&coll=DL&CFID=759388251&CFTOKEN=72271786
+
+
+
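
The representation step proposed in docs/Clustering_A.txt (a text vector for stderr, extended with the fields from 1) and 2)) could look roughly like the sketch below. It is only an illustration under assumptions: the report keys (`application`, `kernel`, `exit_code`), the vocabularies, and the three-element `text_vector` standing in for a paragraph2vec embedding are all hypothetical, and the exit code is appended as a raw number, which would probably need scaling or categorical encoding in practice.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Encode a categorical value as a one-hot vector over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def build_feature_vector(
    report: Dict[str, object],
    text_vector: Sequence[float],
    app_vocab: Sequence[str],
    kernel_vocab: Sequence[str],
) -> List[float]:
    """Concatenate the stderr embedding with the encoded report fields."""
    features = list(text_vector)                               # field 3): stderr embedding
    features += one_hot(str(report["application"]), app_vocab)  # field 1): application name
    features += one_hot(str(report["kernel"]), kernel_vocab)    # field 2): kernel version
    features.append(float(report["exit_code"]))                # field 1): exit code (unscaled)
    return features


# Hypothetical crash report; the keys and values are made up for illustration.
report = {"application": "firefox", "kernel": "4.4.0", "exit_code": 139}
# Placeholder for the vector a paragraph2vec model would infer from stderr.
text_vector = [0.12, -0.40, 0.33]

vector = build_feature_vector(
    report, text_vector, ["firefox", "gedit"], ["4.2.0", "4.4.0"]
)
print(len(vector), vector)
```

Keeping the vocabularies fixed makes every report map to a vector of the same length, which the clustering step requires.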

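One simple way to bring “cannot-link” constraints into spectral clustering is to edit the affinity matrix before the eigen-decomposition: the affinity of a cannot-link pair is set to 0 and that of a must-link pair to 1. The NumPy sketch below shows this heuristic for a two-cluster split. It is an assumption for discussion, not the method of [4] and not an existing scikit-learn API; the function names, the RBF affinity, and the toy points are all hypothetical. In this toy data the constraint agrees with the geometry, so it only demonstrates the mechanism.

```python
import numpy as np


def rbf_affinity(points: np.ndarray, gamma: float = 0.1) -> np.ndarray:
    """Pairwise RBF (Gaussian) affinities between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.exp(-gamma * np.sum(diff ** 2, axis=-1))


def apply_constraints(affinity, cannot_link=(), must_link=()):
    """Return a copy of the affinity matrix with pairwise constraints applied."""
    constrained = affinity.copy()
    for i, j in must_link:
        constrained[i, j] = constrained[j, i] = 1.0
    for i, j in cannot_link:
        constrained[i, j] = constrained[j, i] = 0.0
    return constrained


def spectral_bipartition(affinity: np.ndarray) -> np.ndarray:
    """Split the graph in two using the sign of the Fiedler vector."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so column 1 belongs
    # to the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)


# Toy data: two well-separated groups of three points each.
points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
# Prior knowledge: points 2 and 3 must not share a cluster.
constrained = apply_constraints(rbf_affinity(points), cannot_link=[(2, 3)])
labels = spectral_bipartition(constrained)
print(labels)
```

Which of the two groups receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful. For more than two clusters the usual approach is to run k-means on the first k eigenvectors instead of thresholding one of them.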