Web Scraping and Data Mining with R

Instructor: Lukasz Walasek

Modality: In person

Week 1: 12 - 16 August 2024

 

Workshop contents and objectives

The widespread accessibility of the internet and the ongoing digitisation of information have transformed how people communicate and share knowledge. These changes have opened new possibilities for social scientists, who can now use large sources of publicly available online data to gain new insights into individuals and their social environment. Creative uses of unsolicited online data can provide a unique window into people’s behaviours, attitudes, and beliefs in the context of significant social, political, and economic realities. Such “big data” approaches have led to significant advances in our understanding of the dynamics of political ideology, risk attitudes, health, wellbeing, misinformation, consumer behaviour, and sustainability, among many other topics.

The aim of this course is to introduce the core concepts and methodologies of web scraping and data mining in R. During the sessions, participants will learn about the sources and types of online data that social scientists can access, and will develop the skills needed to scrape, harvest, and extract such data. In a series of practical exercises and activities, participants will learn how to manage their own big data projects and how to address the challenges of ethical considerations, data wrangling and formatting, web crawling, and data exploration through visualisation. During the course, participants will apply their new skills as they embark on their first data mining project under the supervision of the course lead. On completion of the course, participants will be equipped with the skills necessary to identify, extract, process, and visualise large volumes of online data.

 

Workshop design

Each day is divided into two main parts. Mornings will begin with a class covering one of the core web scraping and/or data mining topics. During this part, the instructor will discuss key concepts of web scraping and demonstrate how to apply them in the R environment. Afternoons (after lunch) will be devoted to practical exercises and activities, which will require participants to apply a range of data mining techniques to collect, process, and visualise online data. During this part, participants will work together and with the instructor to progress through various exercises on web scraping and data mining.

Participants are welcome to use any remaining time to work on their own data mining projects. The instructor will happily assist with any individual project. Materials (lecture slides, sample datasets, handouts, exercises with solutions, annotated R scripts) will be made openly available via an online repository to all participants.

 

Detailed lecture plan (daily schedule)

Day 1.
Morning: Fundamental aspects of web scraping and data mining; Theory-driven research using large volumes of online data; Basics of HTML and CSS.
Afternoon: Applying CSS selectors using rvest to extract web content.

Day 2.
Morning: Overview of the HTTP protocol; Setting up web connections using R; Ethics and legalities of web scraping.
Afternoon: Building a first web scraper.

Day 3.
Morning: Principles for building robust online crawlers; Handling errors.
Afternoon: Designing, implementing, and testing a data scraping crawler.

Day 4.
Morning: Introduction to APIs and the JSON file format.
Afternoon: Extracting data from REST APIs.

Day 5.
Morning: Understanding different types of API authentication.
Afternoon: Establishing an OAuth connection with an API.

 

Class materials

All materials will be provided online.

 

Prerequisites

Participants are expected to have basic computer and statistical analysis skills. Good knowledge of R is necessary to participate in the practical exercises and activities. The course will introduce participants to the basics of HTML, CSS, and JavaScript, so no prior knowledge of these topics is necessary.

 

Recommended readings or preliminary material

  • Altman, S., Behrman, B., & Wickham, H. (2021). Data Wrangling. https://dcl-wrangle.stanford.edu/
  • Bradley, A., & James, R. J. E. (2019). Web scraping using R. Advances in Methods and Practices in Psychological Science, 2(3), 264-270.
  • Wickham, H. (2022). rvest: Easily Harvest (Scrape) Web Pages. https://rvest.tidyverse.org/