Data Powered Crash Solver

Merge ~jac-karwowski/dpcs:clustering_a_16 into dpcs:master

Proposed by Jacek on 2016-03-06

Status:	Merged
Approved by:	Marek Bardoński on 2016-03-09
Approved revision:	2ce5c861c3fa79da7fe077f57dc116752d89e4bd
Merge reported by:	Marek Bardoński
Merged at revision:	not available
Proposed branch:	~jac-karwowski/dpcs:clustering_a_16
Merge into:	dpcs:master
Diff against target:	36 lines (+26/-0) 1 file modified docs/Clustering_A.txt (+26/-0)

Reviewer	Review Type	Date Requested	Status
Marek Bardoński		2016-03-06	Approve on 2016-03-08
Review via email: mp+288229@code.launchpad.net

Marek Bardoński (bdfhjk) wrote on 2016-03-08:

It's a professional review of the current state-of-the-art algorithms. Your ideas seem promising!

I think a good part of our job will be to implement constrained spectral clustering in the scikit-learn project. Let's think about it.

review: Approve

Preview Diff

 diff --git a/docs/Clustering_A.pdf b/docs/Clustering_A.pdf
 new file mode 100644
 index 0000000..ef55b27
 Binary files /dev/null and b/docs/Clustering_A.pdf differ
 diff --git a/docs/Clustering_A.txt b/docs/Clustering_A.txt
 new file mode 100644
 index 0000000..84c2a9d
 --- /dev/null
 +++ b/docs/Clustering_A.txt
@@ -0,0 +1,26 @@
++To solve our problem of classifying crashes, we first have to decide how to represent them for our machine learning algorithm. The server receives a report with several fields containing:
++1) The name and version of the crashed application, along with its exit code
++2) System version information (kernel and system version, installed modules)
++3) stderr output, consisting of several lines of text
++
++The first two are easy to feed to the classifier, as they are primarily numbers or proper names (libraries and applications), but the third, which is just as important, is a variable-length blob of text. This is where, in my opinion, the paragraph2vec algorithm (an extension of word2vec) comes in.
++
++My idea is to run paragraph2vec on the text data, extend the resulting vector with the information from 1) and 2), and then use constrained spectral clustering to obtain labels.
++
++A comparison [3] suggests that spectral clustering is currently one of the best clustering algorithms, and with constrained SC [4] we can incorporate prior knowledge. Our data is high-dimensional, but that shouldn't be much of a problem with spectral clustering, since its eigendecomposition step reduces dimensionality much as PCA does. We will probably have to modify the original approach described in the paper, since our task requires “cannot-link” constraints (indicating that two elements are in different clusters), as opposed to the “must-link” constraints described in [4] (indicating that two elements are in the same cluster).
++
++Paragraph2vec (doc2vec) is already implemented in the gensim package (Python). I couldn't find any Python implementation of constrained spectral clustering, so we may have to implement it ourselves.
++
++Links:
++
++Short explanation of spectral analysis: https://www.youtube.com/watch?v=P-LEH-AFovE
++Word2vec introduction paper: http://arxiv.org/pdf/1411.2738v1.pdf
++Word2vec introduction ipython notebook: https://github.com/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb
++
++[1] The paragraph2vec (p2v) algorithm, with a description of word2vec (w2v): http://arxiv.org/pdf/1405.4053v2.pdf
++[2] A comparison between p2v and other text analysis algorithms, plus an improvement idea for p2v: http://arxiv.org/pdf/1507.07998v1.pdf
++[3] A comparison between different clustering algorithms: http://arxiv.org/pdf/1511.09123v1.pdf
++[4] Constrained spectral clustering overview: https://dl.acm.org/citation.cfm?id=1148241&dl=ACM&coll=DL&CFID=759388251&CFTOKEN=72271786
++
++
++
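
The feature-vector step proposed in docs/Clustering_A.txt (a text embedding of stderr, extended with the fields from 1) and 2)) might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the proposal calls for paragraph2vec, but to stay dependency-free this sketch swaps in a hashed bag-of-words as a stand-in, and every name in it (embed_text, one_hot, crash_vector, the report keys, the vocabulary lists) is hypothetical rather than part of the branch.

```python
import zlib


def embed_text(text, dim=8):
    """Stand-in for paragraph2vec (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is used because it is stable across runs, unlike hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


def one_hot(value, vocabulary):
    """Encode a proper name (application, kernel, ...) as a 0/1 vector."""
    return [1.0 if value == known else 0.0 for known in vocabulary]


def crash_vector(report, apps, kernels, dim=8):
    """Text embedding of stderr, extended with the fields from 1) and 2)."""
    return (embed_text(report["stderr"], dim)
            + one_hot(report["app"], apps)
            + one_hot(report["kernel"], kernels)
            + [float(report["exit_code"])])


# Hypothetical report; the field names are illustrative only.
report = {"app": "gedit", "kernel": "4.2.0", "exit_code": 139,
          "stderr": "Segmentation fault (core dumped)"}
vec = crash_vector(report, apps=["gedit", "vim"], kernels=["4.2.0", "4.4.0"])
```

Concatenating the parts keeps the text embedding and the categorical fields in separate coordinates. In practice the parts would probably need rescaling so that a raw exit code does not dominate the distances used by the clustering step.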

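One simple way to experiment with the “cannot-link” constraints discussed in docs/Clustering_A.txt is sketched below in plain Python. This is not the algorithm from [4]: it is a heuristic that sets the affinity of every cannot-link pair to zero and then performs an ordinary two-way normalised spectral split, using power iteration to find the second eigenvector. The function names, the Gaussian kernel, sigma, and the toy data are all assumptions made for illustration.

```python
import math
import random


def affinity(points, cannot_link, sigma=1.0):
    """Gaussian affinities; cannot-link pairs are forced to 0 (a heuristic)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-dist2 / (2 * sigma ** 2))
    for i, j in cannot_link:
        w[i][j] = w[j][i] = 0.0
    return w


def spectral_bipartition(w, iterations=500, seed=0):
    """Split the graph in two by the sign of the second eigenvector."""
    n = len(w)
    deg = [sum(row) for row in w]  # assumes every node keeps some affinity
    # m = (I + D^-1/2 W D^-1/2) / 2 has all eigenvalues in [0, 1].
    m = [[0.5 * w[i][j] / math.sqrt(deg[i] * deg[j]) + (0.5 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Its top eigenvector is proportional to sqrt(deg); project it out.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    return [1 if a > 0 else 0 for a in v]


# Toy data: point 3 sits exactly between two groups; the constraints say it
# must not share a cluster with points 0, 1 and 2.
points = [[0.0], [0.2], [0.4], [1.7], [3.0], [3.2], [3.4]]
cannot_link = [(3, 0), (3, 1), (3, 2)]
labels = spectral_bipartition(affinity(points, cannot_link))
```

Without the constraints the middle point is equidistant from both groups and its assignment is arbitrary; with them it has affinity only to the right-hand group and should land there. Which group is labelled 0 and which 1 depends on the sign of the eigenvector, so only the grouping is meaningful. A real implementation would need more than two clusters and a principled treatment of the constraints.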