Content Analysis and Natural Language Processing
Lecturer: Thomas Hills
Week 1: 21-25 August 2023
Workshop Contents and Objectives
The aim of this workshop is to provide participants with a practical, hands-on and theoretical understanding of new methods in content analysis made possible by applying digital technology to text corpora.
This approach scales from words to documents to large text corpora.
Some of the issues this approach addresses include the following:
- Understanding the speech of political leaders: Which U.S. president is viewed most negatively? Does political speech on Twitter incite violence?
- Detecting historical changes in happiness: Which nations are happiest, and how has their happiness changed over time? Does national happiness correlate with GDP, longevity, democratisation, etc.?
- Predicting views of brands: What does it mean to be a luxury brand? What associations do people have with different products?
- Using language to predict personality or changes across an individual’s lifespan: How did the writing of Darwin, Mozart, and Van Gogh change over their lifetimes?
The course will begin by providing participants with an understanding of what natural language processing offers content analysis. Automation can allow interesting content questions to be answered in very short periods of time (sometimes minutes), saving weeks or months of research time. It can also introduce new questions that lead to innovative research programmes.
Specific cases will be used to show how natural language processing can be applied to theoretical questions in the social sciences. Each day will present published research and then demonstrate how the research was done, providing code and data.
On completion of the course, participants will be able to recognise and implement many common approaches to content analysis using natural language processing and take the first steps towards formulating and addressing problems of their own in social data science or the digital humanities. Participants will also receive detailed guidance on how to follow up and learn more in their particular area of interest.
Students taking this workshop should have at least basic experience in R or another programming language. A number of free or inexpensive online providers (e.g., DataCamp) offer introductory R courses that are well worth the time and are sufficient preparation for this course. A general introductory book on statistics in R will also work. Although the course primarily uses R, I will provide all the code, so the course can also be a way to improve your R skills.
Students are advised that prior knowledge of R will help them advance more quickly with their applications, but this knowledge is not necessary to learn from this course. The course will provide a general introduction to R and, more importantly, a strong conceptual foundation for understanding what natural language processing can achieve.
- Dodds, P. S., & Danforth, C. M. (2010). Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies, 11(4), 441-456. https://link.springer.com/content/pdf/10.1007/s10902-009-9150-9.pdf
- Lansdall-Welfare, T., Sudhahar, S., Thompson, J., Lewis, J., FindMyPast Newspaper Team, & Cristianini, N. (2017). Content analysis of 150 years of British periodicals. Proceedings of the National Academy of Sciences, 114(4), E457-E465. https://www.pnas.org/content/pnas/114/4/E457.full.pdf
- Müller, K., & Schwarz, C. (2019). Fanning the flames of hate: Social media and hate crime. Available at SSRN 3082972. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3082972
- Hills, T., Proto, E., & Sgroi, D. (2019). Historical analysis of national subjective wellbeing using millions of digitized books. Nature Human Behaviour, 1-5. https://warwick.ac.uk/fac/sci/psych/people/thills/thills/2019_hillsproto...