The course was created with the support of Sberbank.
This is an unconventional course in modern Data Analysis, Machine Learning and Data Mining. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. According to this view, two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relations. The term summarization embraces here both simple summaries like totals and means and more complex summaries: the principal components of a set of features and cluster structures in a set of entities. Similarly, correlation covers both bivariate and multivariate relations between input and target features including Bayes classifiers.
The view of the data as a subject of computational data analysis that is adhered to here has emerged quite recently. Typically, in sciences and in statistics, a problem comes first, and then the investigator turns to data that might be useful in advancing towards a solution. Yet nowadays the situation is frequently reversed, especially with the advent of Big Data. Typical questions then are: Take a look at this dataset - what sense can be made out of it? Is there any structure in the dataset? Can these features help in predicting those? This is more reminiscent of a traveler’s view of the world than of a scientist’s. The scientist sits at a desk, gets reproducible signals from the universe and tries to accommodate them into a great model of the universe. The traveler deals with whatever comes their way; here is the data analysis niche.
A textbook by the instructor along these lines was published by Springer-London in 2011: “Core concepts in data analysis is clean and devoid of any fuzziness. The author presents his theses with a refreshing clarity seldom seen in a text of this sophistication. … To single out just one of the text’s many successes: I doubt readers will ever encounter again such a detailed and excellent treatment of correlation concepts.” (Computing Reviews of ACM, June 2011)
Week 1. Intro: Examples of data and data analysis problems; visualization.
Week 2. 1D analysis. Feature scales. Histogram. Two common types of histograms: Gaussian and Power Law. Central values. Minkowski distance and data recovery view. Validation with Bootstrap.
Weeks 3-4. 2D analysis cases
(Both quantitative: scatter-plot, linear regression, correlation and determinacy coefficients: meaning and properties. Both nominal: contingency table, Quetelet index, Pearson chi-squared coefficient, its double meaning and visualization.)
Weeks 5-6. Learning multivariate correlations
(Bayes approach and Naïve Bayes classifier with a Bag-of-words text model; Decision trees and criteria for building them.)
Week 7. Principal components (PCA) and SVD
(SVD model behind PCA: student marks as the product of student factor scores and subject loadings. Application to deriving a hidden underlying factor. Data visualization with PCA. Conventional PCA and data normalization issues.)
Week 8. Clustering with K-Means
(K-Means iterations and K-Means features. K-Means criterion. Anomalous clusters and intelligent K-Means.)
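
As an illustration of the Week 7 topic, here is a minimal Python sketch, not taken from the course materials, of a rank-one SVD model in which each mark is approximated by the product of a student's hidden factor score and a subject loading. The marks table and the helper name `rank_one_factor` are hypothetical and chosen only for illustration; the decomposition itself relies on NumPy's `numpy.linalg.svd`.

```python
import numpy as np

# Hypothetical marks of 5 students (rows) in 3 subjects (columns).
# Illustrative numbers only; they do not come from the course.
marks = np.array([
    [41.0, 66.0, 90.0],
    [57.0, 56.0, 60.0],
    [61.0, 72.0, 79.0],
    [69.0, 73.0, 72.0],
    [63.0, 52.0, 88.0],
])


def rank_one_factor(x):
    """Best rank-one model x ~ outer(scores, loadings) via SVD.

    Returns the row (student) factor scores, the column (subject)
    loadings, and the share of the data scatter the factor explains.
    """
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    z, c = u[:, 0], vt[0]
    # Singular vectors are defined up to a common sign flip;
    # choose the orientation in which the loadings sum to a positive value.
    if c.sum() < 0:
        z, c = -z, -c
    # Split the first singular value evenly between scores and loadings.
    scores = np.sqrt(s[0]) * z
    loadings = np.sqrt(s[0]) * c
    # Contribution: first squared singular value over the sum of squared entries.
    contribution = s[0] ** 2 / np.sum(s ** 2)
    return scores, loadings, contribution


scores, loadings, contribution = rank_one_factor(marks)
print("student factor scores:", np.round(scores, 2))
print("subject loadings:     ", np.round(loadings, 2))
print("share of data scatter explained:", round(float(contribution), 4))
```

Multiplying `scores[i]` by `loadings[j]` gives the model's approximation of student i's mark in subject j, and `contribution` is the share of the data scatter (the sum of squared entries) captured by that single factor. No centering or normalization is applied here, which is one of the choices the conventional PCA discussion revolves around.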