Web Scraping and Data Mining with R

Instructor: Lukasz Walasek

Modality: In person

Week 2: 21 - 25 August 2023

Workshop contents and objectives

The widespread accessibility of the internet and the ongoing digitisation of information have transformed how people communicate and share knowledge. These changes have opened new possibilities for social scientists, who can now use large sources of publicly available online data to gain new insights about individuals and their social environment. Creative uses of unsolicited online data can provide a unique window into people’s behaviours, attitudes, and beliefs in the context of significant social, political, and economic realities. Such “big data” approaches have led to significant advances in our understanding of the dynamics of political ideology, risk attitudes, health, wellbeing, misinformation, consumer behaviour, and sustainability, among many others.

The aim of this course is to introduce the core concepts and methodologies of web scraping and data mining in R. During the sessions, participants will learn about the sources and types of online data that can be accessed by social scientists. They will also develop the skills needed to scrape, harvest, and extract such data. In a series of practical exercises and activities, participants will learn about managing their own big data science projects and solving challenges associated with ethical considerations, data wrangling and formatting, web crawling, and data exploration through visualisation. During the course, participants will apply their new skills as they embark on their first data mining project under the supervision of the course lead. On completion of the course, participants will be equipped with the necessary skills to identify, extract, process, and visualise large volumes of online data.

Course Structure
The entire course is split into theoretical and practical parts. Each day will begin with a class covering one of the core web scraping and/or data mining topics. Early afternoons (after lunch) will be devoted to practical exercises and activities, which will require participants to apply a range of data mining techniques to collect, process, and visualise online data. During the remainder of the day (late afternoon), participants will be given the opportunity to work on their own data science project. Participants are welcome to work on their own chosen topics or select a topic from a series of pre-determined mini-projects. Materials (lecture slides, sample datasets, handouts, exercises with solutions, annotated R scripts) will be made openly available via a GitHub repository to all participants.

Detailed Overview (specific activities are illustrative and subject to change)

Day 1
Morning: Basic concepts.
• Fundamental aspects of web scraping and data mining.
• R tools and skills for acquiring and processing large volumes of online data.
Afternoon: First dataset.
• Accessing and downloading large and unstructured online datasets.

Day 2
Morning: Scraping fundamentals.
• Sources of data and methods for accessing them using web crawlers.
• Practical and theoretical challenges associated with the collection and management of unstructured online datasets.
Afternoon: APIs.
• How to access data via official REST APIs.
• Pre-processing data stored in unique data formats.
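
As an illustration of the kind of pre-processing covered in this session, the sketch below parses a JSON payload into a data frame. It assumes the jsonlite package is installed; the payload, its field names, and the object names are hypothetical stand-ins for the body of a real API response, and the course materials may use different tools.

```r
library(jsonlite)

# Hypothetical JSON payload standing in for the body of an API response.
# In a real project this text would come from an HTTP request to an endpoint.
response_body <- '{
  "results": [
    {"id": 1, "author": {"name": "ana"}, "text": "first post",  "likes": 12},
    {"id": 2, "author": {"name": "ben"}, "text": "second post", "likes": 3}
  ],
  "next_page": null
}'

# Parse the JSON text; flatten = TRUE turns nested objects such as
# "author" into ordinary columns with dotted names (author.name).
parsed <- fromJSON(response_body, flatten = TRUE)

# The "results" array becomes a data frame with one row per record.
posts <- parsed$results
str(posts)
```

Flattening at parse time keeps the result rectangular, which makes later wrangling and visualisation steps simpler.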

Day 3
Morning: Advanced scraping.
• Understanding the HTML structure of public websites.
• Extracting quantitative and textual data from unstructured web source code.
Afternoon: HTML crawlers.
• Building your own HTML scraper.
• Deploying a web crawler and extracting data from HTML and CSS code.
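
A minimal sketch of what an HTML scraper in R can look like, using the rvest package from the recommended reading. The HTML snippet, class names, and column names are invented for illustration; a real scraper would read a live page, whose structure will differ.

```r
library(rvest)

# Hypothetical page source, supplied as a string so the example runs offline.
# read_html() also accepts a URL or a file path.
page <- read_html('
<html><body>
  <div class="post">
    <h2 class="title">First post</h2>
    <a href="/posts/1">Read</a>
    <span class="likes">12</span>
  </div>
  <div class="post">
    <h2 class="title">Second post</h2>
    <a href="/posts/2">Read</a>
    <span class="likes">3</span>
  </div>
</body></html>')

# Select every post container with a CSS selector.
posts <- html_elements(page, "div.post")

# Within each container, pull out text, a link attribute, and a number.
scraped <- data.frame(
  title = html_text2(html_element(posts, "h2.title")),
  link  = html_attr(html_element(posts, "a"), "href"),
  likes = as.integer(html_text2(html_element(posts, "span.likes")))
)
print(scraped)
```

Selecting the container first and then querying inside it keeps each row's fields aligned, even when some posts lack an element (html_element() returns a missing value in that case).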

Day 4
Morning: Big data pitfalls.
• Pitfalls, challenges, and misuses of large volumes of data in social science.
• Good practices for performing and managing data mining projects.
Afternoon: Visualisation.
• Identifying sources of bias in a large and unstructured dataset of online communication.
• Visualisation of online data.

Day 5
Morning:
• Project work and presentation preparation.
Afternoon:
• Presentations of individual projects.

Prerequisites

Participants are expected to have basic computer and statistical analysis skills. Good knowledge of R is necessary to participate in practical exercises and activities. The course will introduce participants to the basics of HTML, CSS, and JavaScript, but no prior knowledge of these topics is necessary.

Recommended Reading
Altman, S., Behrman, B., & Wickham, H. (2021). Data Wrangling. https://dcl-wrangle.stanford.edu/
Bradley, A., & James, R. J. E. (2019). Web scraping using R. Advances in Methods and Practices in Psychological Science, 2(3), 264-270.
Wickham, H. (2022). rvest: Easily Harvest (Scrape) Web Pages. https://rvest.tidyverse.org/