Predicting inequality from digital textual data


This project aims to build appropriate machine learning methods for predicting multiple inequality outcomes from textual data, and to evaluate whether prediction models developed in text-rich settings transfer to cities with limited data.

Over half of the world's population (4.2 billion people) lives in cities. This share is predicted to grow, with 68% of the world's population expected to reside in cities by 2050. In the UK, 83% of people live in metropolitan areas. Within large cities, significant inequalities exist in income, safety, and wellbeing. In particular, people living in cities are more likely to experience depression, anxiety, and addiction (Sundquist et al., 2018). Predicting economic, social, and health outcomes is critical for informing and evaluating policies aimed at reducing inequality (Islam and Winkel, 2017). It is also difficult, because data on those outcomes come from varied sources in different formats and are usually collected through time-consuming processes (e.g., field work, interviews, surveys). In addition, changes in survey design, data constraints, and coding practices are likely to affect estimates of inequality, which makes it hard to develop a consistent model.

How can we predict inequality from rich data sources in a timely, accurate, and cost-effective manner that allows for a head-to-head comparison of multiple inequality outcomes? Effective predictive indicators and models will help us better understand the factors that drive inequality trends and how inequality responds to policy changes.

The purpose of this research is to assess the feasibility and performance of combining publicly available textual data from online news and social media with machine learning to predict urban inequalities in income, safety, and wellbeing. It will provide a comparative assessment of multiple inequality-related outcomes from a single data source using a consistent methodological framework.

Textual data is increasingly used with machine learning methods as an information source in economics and health research (Guo et al., 2018; Chen et al., 2021). Textual data from online news and social media is available through public and crowdsourced sources, which are comparatively cheap and easy to acquire at scale. The idea that textual data can be used to track and predict multiple inequality outcomes rests on the following premises (Suel et al., 2019): 1) certain aspects of city status, such as housing prices and crime, are directly signalled in local news; 2) others, such as poverty and quality of life, are observable through news about lifestyle, the environment, local businesses, and tourist destinations; 3) perhaps most importantly, certain psychological aspects, such as stress and emotional wellbeing, which are typically difficult to observe by visual inspection in cities, may be detectable through news comments and social media posts (Kosinski et al., 2013).

This project incorporates two novel elements: 

First, using an ‘open-vocabulary’ approach (Schwartz et al., 2013), it will predict inequality outcomes directly from textual data without relying on user-defined words or judgments (some of which may reflect stereotypes or biases). This research will extract data-driven textual features from news and tweets that are most representative of specific inequality outcomes. This enables the discovery of unexpected associations that traditional methods usually do not capture. The interpretable textual features will help to explain the factors that influence inequality in cities.
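As an illustration only, the open-vocabulary idea can be sketched in a few lines of Python: every word and two-word phrase found in the text becomes a candidate feature, and candidates are ranked by how strongly their relative frequency correlates with an outcome across areas. This is a minimal sketch under stated assumptions, not the project's method: the function names (ngrams, pearson, rank_features), the use of Pearson correlation as the ranking score, and the toy texts and outcome values are all hypothetical.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Lower-case word tokens plus all n-grams up to n_max words long."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_features(area_texts, outcomes, min_areas=2):
    """Rank data-driven n-gram features by correlation with an outcome.

    area_texts: one string of pooled text per area (hypothetical input).
    outcomes:   one numeric outcome per area, in the same order.
    """
    counts = [Counter(ngrams(t)) for t in area_texts]
    totals = [sum(c.values()) or 1 for c in counts]
    # Number of areas in which each feature occurs at least once.
    vocab = Counter()
    for c in counts:
        vocab.update(c.keys())
    scored = []
    for feat, n_areas in vocab.items():
        if n_areas < min_areas:
            continue  # drop features seen in too few areas
        freqs = [c[feat] / t for c, t in zip(counts, totals)]
        scored.append((feat, pearson(freqs, outcomes)))
    return sorted(scored, key=lambda fr: fr[1], reverse=True)


# Toy example: invented texts and outcome values, not real data.
texts = [
    "new cafe opens and house prices rise",
    "house prices rise again near the park",
    "shop closes after break in",
    "another break in reported as shop closes",
]
income = [0.9, 0.8, 0.2, 0.1]
ranking = rank_features(texts, income)
print(ranking[0], ranking[-1])
```

No word list is supplied in advance: the features at the top and bottom of the ranking emerge from the data. A real study would need far more text per area and a correction for the large number of features tested.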

Second, the project will create models using data from Glasgow, where extensive, fine-grained datasets on urban inequality are accessible from the Urban Big Data Centre (UBDC), and then test how well the resulting models generalise to other cities. It will examine whether models built on data from a single city can be transferred to other cities, indicating the extent to which textual features associated with inequality outcomes are shared across cities worldwide. A subsequent project will expand the research to include data on additional cities, both nationally and globally.
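The transfer test described above can be illustrated with a small sketch: a model is fitted on the source city only, and its fit is then scored separately on the source city and on a target city. Everything here is an assumption made for illustration: the single text-derived score per area, the one-predictor least-squares model, the use of R-squared as the metric, the function names, and the toy numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(ys, preds):
    """Coefficient of determination; can be negative for poor transfer."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def transfer_report(source, target):
    """Fit on the source city only, then score both cities.

    source, target: (text_scores, outcomes) pairs, one value per area.
    """
    a, b = fit_line(*source)
    report = {}
    for name, (xs, ys) in (("source", source), ("target", target)):
        preds = [a + b * x for x in xs]
        report[name] = r_squared(ys, preds)
    return report


# Toy numbers, not real data: a hypothetical text score and outcome per area.
source_city = ([0.1, 0.2, 0.3, 0.4], [1.0, 2.1, 2.9, 4.0])
target_city = ([0.1, 0.2, 0.3, 0.4], [1.5, 2.0, 3.5, 3.8])
report = transfer_report(source_city, target_city)
print(report)
```

The gap between the two scores is one simple indicator of how much predictive signal is shared between cities. The source-city score here is an in-sample fit, so a fairer comparison would use held-out areas within the source city.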


The outputs of this project will provide key evidence on the feasibility and performance of predicting urban inequality using textual data and machine learning. The research will compare a variety of economic, social, and health outcomes. It will provide scientific evidence on whether combining news data and machine learning can predict inequality outcomes, and on whether inequalities in some outcomes can be predicted more accurately than others. This will be of interest to researchers in economics and management, as well as in other fields such as politics, sociology, and psychology. The extracted textual features (e.g., words, phrases, and topics) and their associations with inequality will serve as useful indicators of what support vulnerable groups need and how policymakers and third-sector practitioners can meet those needs.