Machine Learning - What Is It Good For?

Lecturer: Marco Steenbergen

Modality: In presence

Week 1: 14-18 August 2023


Workshop Contents and Objectives

Machine learning is fashionable. But what is it, and how can it be put to good use in the social sciences? This introductory course provides an overview of some of the most important machine learning techniques and their social science applications. Those applications can be grouped into four areas:

  1. Pattern recognition: How do variables hang together, and what groups do our cases form in terms of those variables? For example, political parties take positions on numerous issues. Can we group those issues into ideologies? Based on the issues, can we place the parties into clusters?
  2. Preparing data for statistical analysis: Sometimes data are so voluminous that hand-coding them is near-impossible. We can leverage clever computer algorithms to do the coding for us. For example, we could use an artificial neural network to detect whether tweets, of which there are millions, come from a social bot or from a legitimate account.
  3. Doing statistical analysis: As social scientists, we are used to building models with numerous parametric assumptions. What if we let algorithms learn the model from the data instead? That way, we may detect complex contingencies not previously theorised.
  4. Anomaly detection: Some phenomena, such as war, are fortunately rare. However, this makes analysing them challenging. A whole subfield of machine learning is dedicated to the detection of such rare events or anomalies.
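As a taste of the first area, grouping cases by their positions on a set of variables can be sketched in a few lines of base R. The party names and issue scores below are purely illustrative, not data used in the course:

```r
# Hypothetical issue positions (0 = left, 10 = right) for six parties
positions <- data.frame(
  economy     = c(2, 3, 5, 6, 8, 9),
  immigration = c(3, 2, 5, 7, 8, 9),
  row.names   = c("PartyA", "PartyB", "PartyC",
                  "PartyD", "PartyE", "PartyF")
)

# k-means partitions the parties into two clusters based on their positions
set.seed(42)
fit <- kmeans(positions, centers = 2)
split(rownames(positions), fit$cluster)
```

With these made-up scores, the three left-leaning parties end up in one cluster and the three right-leaning parties in the other.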

Course Design

Through lectures and group exercises, the course shows applications in each area. After discussing the general principles of machine learning, the course spends half a day discussing unsupervised machine learning (relevant for area 1), 3.5 days on supervised machine learning techniques (relevant for application areas 2 and 3), and one day on anomaly detection (application area 4). On the last day, students present a machine learning project in groups.

Each day, students will learn the intuition behind the techniques, how they can be implemented in R, how they should be interpreted, and how they can be applied in the social sciences. The course is designed to minimise the level of mathematical complexity, although students can always look up the details in vignettes made available for the course. Both classification and regression tasks are considered: in the former, we seek to predict class membership; in the latter, we predict a numeric score. Interpretation is key, and we spend a great deal of time on various metrics and their implementations.
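The distinction between the two task types can be illustrated with base R alone; the toy data below are invented for the sake of the example:

```r
# Invented data: hours of study, an exam score, and a pass/fail outcome
d <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 55, 61, 58, 70, 74, 79, 85),
  pass  = c(0, 0, 1, 0, 1, 1, 1, 1)
)

# Regression task: predict a numeric score
reg <- lm(score ~ hours, data = d)
predict(reg, data.frame(hours = 5.5))

# Classification task: predict class membership (pass vs. fail)
clf <- glm(pass ~ hours, data = d, family = binomial)
predict(clf, data.frame(hours = 5.5), type = "response") > 0.5
```

The course uses more flexible learners than these two linear models, but the division of labour is the same: one task returns a number, the other a class.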

The course covers the following algorithms/techniques: (1) k-nearest neighbours; (2) probabilistic learning (including naïve Bayes as well as linear and quadratic discriminant analysis); (3) classification and regression trees, random forests, and model trees; (4) regression with regularisation and partial least squares; (5) artificial neural network analysis; (6) boosting; (7) cluster analysis; (8) SMOTE; (9) support vector machines; and (10) feature selection.
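To give a flavour of how simple some of these learners are at heart, the first one on the list — k-nearest neighbours — can be written from scratch in a few lines of base R. The training points and labels here are invented for illustration; in the course we use existing R implementations:

```r
# Classify a point by majority vote among its k nearest training points
knn_predict <- function(train, labels, x, k = 3) {
  # Euclidean distance from x to every training point
  dists <- sqrt(rowSums((train - matrix(x, nrow(train), length(x),
                                        byrow = TRUE))^2))
  nearest <- labels[order(dists)[1:k]]
  names(which.max(table(nearest)))   # most frequent label among the k
}

train <- matrix(c(1, 1,  2, 1,  1, 2,   # class "a" points
                  8, 8,  9, 8,  8, 9),  # class "b" points
                ncol = 2, byrow = TRUE)
labels <- c("a", "a", "a", "b", "b", "b")

knn_predict(train, labels, c(2, 2))  # nearest neighbours are all class "a"
```

Ties can occur when k is even or the vote is split; production implementations handle these cases, which this sketch ignores.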


Prerequisites

The course assumes a basic familiarity with probability theory and with linear regression analysis. Prior familiarity with machine learning or related fields (e.g., NLP) is not required. However, a good knowledge of R is essential for the successful completion of the course. Students should know how to read data, transform variables, work with model objects, and create graphs using ggplot.


Recommended Reading

  • Aggarwal, Charu C. 2016. Outlier Analysis. New York: Springer. ISBN 9783319475776.
  • James, Gareth et al. 2021. An Introduction to Statistical Learning with Applications in R. New York: Springer, 2nd edition. ISBN 9781071614174.