The increasing use of big data, which cannot accurately account for the most vulnerable, has the potential to exacerbate socio-economic gaps and result in states’ failure to meet their obligation to protect those “most left behind”. This Think Piece addresses the challenges of both identifying and understanding the position of the most vulnerable in big data, and demonstrates the limitations of existing alternatives, such as disaggregation, for including the most vulnerable statistically. The piece concludes by suggesting how the gap between the increasing use of big data and the exclusion of the most vulnerable from data can be filled.
The Challenges and Opportunities of Big Data
The inclusion of the most vulnerable in national development policies is established in the UN Declaration on the Right to Development, and was recently reiterated in the Sustainable Development Goals (SDGs), which emphasize the need to “leave no one behind”. At the same time, another trend has also emerged: the use of big data to support decision making and guide social policy. While big data presents many exciting opportunities for development, there are also numerous challenges that must be urgently addressed to ensure that it can contribute to the sustainable, inclusive future envisioned by the SDGs.
An important starting point is, of course, the question of what exactly we mean by big data. In short, big data consists of the accumulation and use of aggregated information. This means the combination, on a large scale, of related categories of data, to provide a bigger picture from which observations can be taken. One example is social media mining, in which searches for key words can indicate public opinion on a particular matter; another is information automatically delivered to a supplier through the use of their product, such as a phone or web application.
While there is certainly great potential to use big data to address development challenges, the size of the data and the method through which it is collected and aggregated mean that it cannot be sufficiently categorized to distinguish individuals or groups, making it difficult if not impossible under current approaches to identify the most vulnerable and their needs. It should be noted that while the identification of the most vulnerable within an aggregate data set is in our interest, identification post-aggregation can amount to a rights violation, notably of the right to privacy. However, when trying to understand how specific demographics of a population are impacted by a certain change, or how they are using a particular tool, it is important to understand who those groups are so as to address their specific needs.
Is Disaggregation Really the Solution?
Disaggregated data is one way in which the most vulnerable can be identified; it breaks down observations into finer detail, for example by income, sex, age, race, ethnicity, migratory status, disability and geographic location, or other characteristics. This is possible when primary observations are coded in a way that allows for subdivision into a more detailed analysis.
This is the approach used by the Inter-agency and Expert Group on SDG Indicators (IAEG-SDG) to monitor the SDGs in their attempt to ensure that no one is left behind, for example with SDG 1, Target 1, which aims to assess the proportion of the population below the international poverty line (currently defined at USD 1.25 per day, adjusted for purchasing power parity). Rather than generally stating what percentage of people fall within this category, according to the IAEG-SDG, data should be collected in such a way that allows distinctions to be made within the data set. These categories would, in this particular case, include disaggregation by sex, age, ethnicity, disability and potentially other qualifiers that would present a more accurate picture of who falls below the poverty line, ultimately assisting policy makers in identifying and addressing the needs of the most vulnerable.
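The difference between a single aggregate figure and a disaggregated one can be illustrated with a minimal Python sketch. The survey records, field names and function names below are hypothetical and invented purely for illustration; only the USD 1.25 threshold is taken from the text above. This is not the IAEG-SDG's own methodology, merely one way the idea could be expressed.

```python
from collections import defaultdict

POVERTY_LINE = 1.25  # USD per day (PPP), the figure quoted in the text

# Hypothetical survey records, invented for illustration only.
records = [
    {"sex": "female", "age_group": "15-24", "income": 1.10},
    {"sex": "female", "age_group": "25-64", "income": 2.40},
    {"sex": "female", "age_group": "25-64", "income": 0.90},
    {"sex": "female", "age_group": "65+", "income": 1.00},
    {"sex": "male", "age_group": "15-24", "income": 3.10},
    {"sex": "male", "age_group": "25-64", "income": 1.20},
    {"sex": "male", "age_group": "25-64", "income": 2.80},
    {"sex": "male", "age_group": "65+", "income": 4.00},
]


def headcount_ratio(rows, line=POVERTY_LINE):
    """Share of rows whose daily income is below the poverty line."""
    if not rows:
        return None  # no observations, so no ratio can be stated
    poor = sum(1 for row in rows if row["income"] < line)
    return poor / len(rows)


def disaggregate(rows, key, line=POVERTY_LINE):
    """Headcount ratio computed separately for each value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {group: headcount_ratio(members, line)
            for group, members in groups.items()}


overall = headcount_ratio(records)       # one aggregate figure
by_sex = disaggregate(records, "sex")    # the same data, broken down by sex
```

In this invented data set the aggregate figure hides a large difference between the two groups, which is exactly the kind of distinction disaggregation is meant to surface. It also shows the precondition: each record must already carry the attribute (here `sex` or `age_group`) by which it is to be subdivided.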
Nevertheless, there are disadvantages associated with the use of disaggregated data, namely the added data collection and processing burden, which the statistical infrastructure of some countries is not capable of bearing, and the lack of access by the most vulnerable to the means to participate. The least well off often cannot participate in digital activities because of limited access to technology and education, remoteness and lack of infrastructure, and are therefore not represented in big data sets. Developing capabilities to build adequate statistical infrastructure for the collection of disaggregated data requires substantial time, resources and political will and often still excludes the most vulnerable, as Paul Hunt and Carmel Williams have concluded in their recent paper.
Another Way: the Need for a Feasible Alternative
If disaggregated data collection and management is not practically feasible and the most vulnerable are not being included in big data (even if unintentionally), then new methods are needed to make the most vulnerable part of the equation and give them adequate weight in calculations and decision making. Promising new methodological design elements centre on the idea of rapidly identifying what is known as skewness, or asymmetry in any statistical distribution of big data. This approach can avoid manipulation by any particular party and provide information to fill existing identified knowledge gaps through available data.
Skewness in big data would mean that in combining various aggregate data sets there is an imbalance in the distribution of sources from which the data is collected. For example, if one were interested in monitoring the economic resilience of people affected by natural disaster in country X to create better humanitarian aid strategies, and financial transaction data was set as an indicator, those who do not use or have access to banks or debit and credit cards would be excluded from the data set and effectively ignored. However, combining this information with environmental data sets, or with data on the areas in country X affected by this natural disaster, together with available demographic information, could provide an assessment of whether the data points originated from one particular area. If so, they would not be representative of the population affected by the natural disaster, and the financial transaction data set would be skewed.
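One way to put a number on this kind of imbalance is to compare each area's share of the data points with its share of the affected population. The sketch below uses the total variation distance for that comparison; this measure is our own choice for illustration, not one prescribed by the approach described here, and all area names and counts are hypothetical.

```python
def shares(counts):
    """Convert raw counts per area into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return {area: n / total for area, n in counts.items()}


def representation_gap(data_counts, population_counts):
    """Total variation distance between two distributions over areas.

    0.0 means the data set mirrors the affected population exactly;
    1.0 means the two do not overlap at all.
    """
    data = shares(data_counts)
    population = shares(population_counts)
    areas = set(data) | set(population)
    return 0.5 * sum(abs(data.get(a, 0.0) - population.get(a, 0.0))
                     for a in areas)


# Hypothetical figures: transactions come only from two areas, while
# the affected population is spread evenly over five.
transactions = {"D": 600, "E": 400}
affected_population = {"A": 200, "B": 200, "C": 200, "D": 200, "E": 200}

gap = representation_gap(transactions, affected_population)
```

A value well above zero, as in this invented case, would suggest that the transaction data should not be read as describing the affected population as a whole. What counts as "too high" is a judgment that depends on the purpose the data set is meant to serve.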
Using Data to Find Knowledge Gaps
Considerations of both aggregated and disaggregated data illustrate the difficulty of identifying the most vulnerable, since both make the underlying assumption that we must first know who the most vulnerable are and then see how they are impacted by a particular change. What if we were to change this strategy? Rather than being obliged to first identify the most vulnerable in order then to include them in a calculation, we might be able to develop techniques within the aggregated calculations themselves that could warn us about skewness in data pointing to knowledge gaps. This could then alert us to the need for targeted field work and qualitative research that could reveal special impacts on vulnerable parts of the population that had previously been ignored.
Taking the resilience example above, if all data points in a big data set come from a given location in country X, the sheer volume of big data means that a mere look at the data points would not reveal that they are skewed. However, an algorithm or computation would be able to identify an imbalance through simple statistical calculations of the distribution of the given data sets and notify the controller of such skewness. With this knowledge, the controller may even be able to identify areas not contained in the data set. For example, if all financial data points are coming from cities D and E, but we know the natural disaster has had an equal environmental impact on cities A, B and C, then the available data and the identified skewness together highlight knowledge gaps. Knowing these gaps exist can then trigger further targeted research that may take the form of qualitative field work. In this way we allow big data to notify us about things we don’t know, rather than only drawing conclusions from what the given data tells us about the data subjects.
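The notification step in the cities example can be sketched in a few lines of Python. The function name, the 5 per cent threshold and the counts are all assumptions made for illustration; a real safeguard would need a threshold chosen for the purpose at hand, and would probably weight areas by their affected population.

```python
def find_knowledge_gaps(data_counts, affected_areas, min_share=0.05):
    """Return affected areas that are absent or under-represented.

    data_counts:    number of data points observed per area
    affected_areas: areas known, from other sources, to be affected
    min_share:      hypothetical threshold; an area supplying less than
                    this share of the data points is flagged
    """
    total = sum(data_counts.get(area, 0) for area in affected_areas)
    gaps = []
    for area in sorted(affected_areas):
        observed = data_counts.get(area, 0)
        share = observed / total if total else 0.0
        if share < min_share:
            gaps.append(area)
    return gaps


# Hypothetical counts mirroring the example in the text: financial data
# points come only from cities D and E, yet A, B and C were hit equally.
financial_points = {"D": 600, "E": 400}
affected = {"A", "B", "C", "D", "E"}

gaps = find_knowledge_gaps(financial_points, affected)
if gaps:
    print("Skewness detected; no adequate data for:", ", ".join(gaps))
```

The output of such a check is not a conclusion about cities A, B and C. It is a prompt that nothing is known about them from this data set, and therefore that targeted field work is needed there.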
Safeguards, such as the notification of skewness demonstrated above, need to be put in place to assess the accuracy of representation that any particular data set offers in accordance with the purpose it aims to fulfil. This is particularly important for those data sets used to support social policy, such as those aiming to increase the resilience of individuals affected by natural disasters. If such precautions are not taken, bigger questions may have to be asked: if policy makers increasingly rely on big data without due safeguards, could this eventually constitute a failure to uphold the duty to protect and the right to development? Focus on qualitative research should not be undermined amid the rising trend of big data, especially in the effort to identify the position and needs of the most vulnerable. Big data, including in aggregate form, has tremendous potential to support development objectives, as UN Global Pulse is demonstrating. However, the weight and focus given to the most vulnerable must be firmly consolidated in the methodological approach. Careful consideration must be given to those aspects of big data that have the ability to undermine a state’s obligations and may even strengthen its ability to justify overriding fundamental human rights.