Machine Learning Meetup Notes: 2010-07-07
Col-1: patient ID
Col-2: responder status ("1" for patients who improved and "0" otherwise)
Col-3: Protease (PR) nucleotide sequence (if available)
Col-4: Reverse Transcriptase (RT) nucleotide sequence (if available)
Col-5: viral load at the beginning of therapy (log-10 units)
Col-6: CD4 count at the beginning of therapy
features: molecular weight and length of "PR Sequence" and "RT Sequence" from the training data

weka steps:
- start weka
- open mweight.csv
- remove patient
- select resp
- filter->unsupervised->attribute->NumericToNominal
- click to change to first only
- apply
neural network:
- classify->functions->MultilayerPerceptron
- select resp
- start

results (a=0 no improvement, b=1 improvement):
- 738 correct predictions for a=0 (no improvement)
- 66 correct predictions for b=1 (improvement)
- 56 no improvement classified as improvement
- 140 improvement classified as no improvement
how well did it do? 80.4% accuracy (804 of 1000 predictions correct)
in the confusion matrix, rows tell you what really happened and columns tell you what was predicted
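The accuracy figure can be checked by hand from the four counts above. A minimal sketch, assuming the class order (a=0, b=1) and the rows-are-actual layout described in these notes; the variable names are just for illustration:

```python
# Confusion matrix from the run above: rows = actual, columns = predicted.
# Class order is (a=0 no improvement, b=1 improvement).
confusion = [
    [738, 56],   # actual 0: 738 predicted as 0, 56 predicted as 1
    [140, 66],   # actual 1: 140 predicted as 0, 66 predicted as 1
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Per-class recall: fraction of each actual class that was predicted correctly.
recall = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]

print(f"accuracy = {accuracy:.3f}")
print(f"recall for a=0: {recall[0]:.3f}, recall for b=1: {recall[1]:.3f}")
```

The per-class recall is worth a look next to the headline number: most of the correct predictions come from the larger "no improvement" class.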
clustering:
- cluster->SimpleKMeans
- change num clusters to 5
- ok->start
scipy cluster.hierarchy:
- main function is called linkage
- ldist takes the Levenshtein distance between each pair of sequences in the set
- the result is a distance matrix, which is what hierarchical clustering works on
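The ldist helper from the session is not reproduced in these notes, so here is a rough stand-in for the idea in plain Python. The names `levenshtein` and `distance_matrix` and the toy sequences are made up for illustration; they are not the code shown at the meetup.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = levenshtein(seqs[i], seqs[j])
    return dist


# Toy sequences, not taken from the training data.
toy = ["ACGT", "ACGA", "TTGT"]
for row in distance_matrix(toy):
    print(row)
```

If the matrix is passed to scipy's linkage, note that linkage expects the condensed (flattened upper-triangle) form rather than the square one; scipy.spatial.distance.squareform converts between the two.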
single linkage clustering: start with n clusters, take the two that have the shortest distance between them and merge them into one cluster. then keep going until you have 1 cluster.
- when you join two points, you check the distances from both of them to each other point, and take whichever is smaller as the new cluster's distance to that point
complete linkage: you take the largest distance instead
- there is also one that takes the average (average linkage)
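The three linkage rules differ only in how the distance between two clusters is computed from the point-to-point distances. A naive sketch of the merge loop, assuming a square distance matrix as input; `agglomerate` is a made-up name and this is far slower than scipy's linkage, it is only meant to show the rule:

```python
def agglomerate(dist, linkage="single"):
    """Naive agglomerative clustering on a square distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of original point indices.
    """
    combine = {
        "single": min,                            # smallest pairwise distance
        "complete": max,                          # largest pairwise distance
        "average": lambda ds: sum(ds) / len(ds),  # mean pairwise distance
    }[linkage]

    clusters = [(i,) for i in range(len(dist))]   # start with n singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                ds = [dist[i][j] for i in clusters[x] for j in clusters[y]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = tuple(sorted(clusters[x] + clusters[y]))
        merges.append((clusters[x], clusters[y], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


# Toy example: four points on a line, not from the meetup data.
points = [0, 1, 3, 7]
dist = [[abs(a - b) for b in points] for a in points]

for method in ("single", "complete", "average"):
    print(method, [d for _, _, d in agglomerate(dist, method)])
```

On this toy input the merge order is the same for all three rules, but the merge distances grow faster under complete linkage than under single linkage, which is the difference you see in the heights of the resulting dendrograms.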