Inclusivity and Diversity in AI Data

Apr 4, 2023

Last year, we wrote about how Buddywise uses synthetic data for our deep learning models. While the world of synthetic data has exploded to help train artificial intelligence (AI) models with an abundance of data that is privacy protection compliant, low-cost, time efficient, and more, the bias we discussed in the data is still there and stronger than ever. As AI becomes more popular with the surge of tools like OpenAI’s ChatGPT, we should ask ourselves how to handle this inherent bias, and how we can be more inclusive and diverse in the source data that is powering our AI models.

An image generated by OpenAI’s DALL-E using the title of this article as a prompt. Experiment with your own image here: https://labs.openai.com/

Data: How AI Models Work

For AI models to make accurate predictions, they must first ingest large amounts of data as their “pre-training” dataset. This pre-training dataset is used to “train” AI models such as the neural networks Buddywise uses for computer vision recognition tasks. Once the models are trained and able to recognize specific motions or images, they can be used on live customer work sites with real camera data and real humans.

Overall data quality is one of the most important aspects of data science, but even defining high-quality data is highly subjective. The ways we judge data range from reliability (i.e., if we measure a variable multiple times for the same observation, we should get the same result) to coverage (i.e., what proportion of the observations of interest are in the data). Ultimately, this data needs to reflect real-life situations as closely as possible so that our AI models are as accurate as possible.
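To make these two measures concrete, here is a minimal sketch in Python. The function names, the site identifiers, and the tolerance value are assumptions made up for illustration; they are not part of any Buddywise tooling.

```python
def coverage(observed_ids, population_ids):
    """Fraction of the population of interest that appears in the data."""
    population = set(population_ids)
    if not population:
        raise ValueError("population must not be empty")
    return len(population & set(observed_ids)) / len(population)


def is_reliable(repeated_measurements, tolerance=0.0):
    """True if repeated measurements of one observation agree within tolerance."""
    spread = max(repeated_measurements) - min(repeated_measurements)
    return spread <= tolerance


# Hypothetical example: four work sites of interest, two of them in the data.
population = ["site_a", "site_b", "site_c", "site_d"]
observed = ["site_a", "site_c", "site_c"]  # duplicates do not add coverage

site_coverage = coverage(observed, population)
print(f"coverage: {site_coverage:.0%}")

# Hypothetical repeated measurements of the same variable for one observation.
print(is_reliable([1.70, 1.70, 1.70]))
print(is_reliable([1.70, 1.82], tolerance=0.05))
```

A low coverage value is an early hint that some part of the real-life population is missing from the data before any model is trained.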

At a high level, there are three types of data that can be used in AI models:

Structured: Typically numbers or words that are stored in relational databases. Examples: Quantitative (discrete, continuous; measured on an interval or ratio scale), qualitative (nominal, ordinal).
Unstructured: Any type of data that has no predefined format and is not stored in relational databases. Examples: Images, video, audio, sensor data.
Semi-Structured: Data that carries tags or other markers but does not obey the organization of relational databases. Examples: HTML code, graphs/tables, e-mails, XML documents.

As a computer vision company, Buddywise mostly works with unstructured video data, both synthetically created video and live video feeds from our customers’ work sites. We use this data in our data pipeline to pre-train our computer vision AI models before rolling them out to individual customer work sites to detect live workplace safety incidents like trips and slips. Pre-training our neural networks means first assigning random values to the weights and then optimizing those weights so the networks perform image recognition more accurately.
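The idea of starting from random weights and then optimizing them can be shown with a toy example. The sketch below is not our computer vision pipeline: it assumes a model with a single weight and a bias, made-up data, and plain gradient descent, purely to illustrate the two steps.

```python
import random


def train(xs, ys, steps=2000, learning_rate=0.05, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Step 1: assign the weights random starting values.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    n = len(xs)
    # Step 2: repeatedly nudge the weights in the direction that reduces the error.
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up training data that follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = train(xs, ys)
print(f"w is about {w:.3f}, b is about {b:.3f}")
```

Whatever random values the weights start from, the optimization moves them toward the values that best explain the training data, which is why the content of that data matters so much.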

Since our source data is video, this unstructured data captures everything the human eye can detect and even more. As a result, differences in the human form are filmed and subsequently detected by the AI models, so the outputs can vary from person to person. For example, if a person’s body type or skin color in the live video data differs from what appeared in the source video data the model was pre-trained on, the model may perform differently for that person. Similarly, if people who use walking assistance devices like canes are not included in the source video data, the AI model in production may not recognize their walking pattern properly. Building an inclusive and diverse pre-training dataset is a major question facing the AI industry today.
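One way to surface this kind of gap is to measure model accuracy separately for each group rather than only in aggregate. The following is a minimal sketch; the group names, labels, and records are invented for illustration and are not real evaluation results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, predicted_label, true_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


# Invented predictions from a hypothetical fall-detection model.
records = [
    ("no_walking_aid", "fall", "fall"),
    ("no_walking_aid", "no_fall", "no_fall"),
    ("no_walking_aid", "fall", "fall"),
    ("no_walking_aid", "no_fall", "no_fall"),
    ("uses_cane", "fall", "no_fall"),
    ("uses_cane", "fall", "fall"),
]

scores = accuracy_by_group(records)
gap = max(scores.values()) - min(scores.values())
for group, score in scores.items():
    print(f"{group}: {score:.0%}")
print(f"largest accuracy gap between groups: {gap:.0%}")
```

An overall accuracy figure would hide the difference between the two groups here, which is why a per-group breakdown is worth computing.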

Issue of Lack of Inclusivity and Diversity

As the CEO of OpenAI, Sam Altman, said in a recent interview, “No two people will ever agree [an AI] model is unbiased”. It is challenging even to define what counts as bias or as a lack of inclusivity and diversity in AI. However, we believe that the first step is ensuring that different types of people are included in the pre-training source data, with appropriate labels, so differences can be accounted for. When data does not include everyone, the AI models systematically ignore segments of the population and exclude them from the benefits AI can provide. It is important to note that removing demographic data entirely can actually make AI models even more discriminatory, because these differences do exist in reality.
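A simple first check along these lines is to count how often each labeled group appears in the source data and flag the groups that fall below a chosen share. This is only a sketch: the group labels, the 20% threshold, and the function name are assumptions, and the right threshold depends on the population a model is meant to serve.

```python
from collections import Counter


def underrepresented(labels, expected_groups, min_share=0.2):
    """Return the share of each expected group that falls below min_share."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    shares = {group: counts[group] / total for group in expected_groups}
    return {group: share for group, share in shares.items() if share < min_share}


# Hypothetical labels attached to 20 source video clips.
labels = ["group_a"] * 17 + ["group_b"] * 3
expected_groups = ["group_a", "group_b", "group_c"]

flagged = underrepresented(labels, expected_groups)
for group, share in flagged.items():
    print(f"{group} makes up only {share:.0%} of the source data")
```

Listing the expected groups explicitly matters: a group that is missing from the data entirely would never show up in a plain count of the labels.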

Harvard Business Review recently published an article on this topic, showcasing real-world examples of what happens when gender or race is removed as a variable from the pre-training source data. Analysts saw that including gender “significantly decreases discrimination” by a factor of 2.8. Including gender helps account for unfair discrimination that may be introduced into the AI models, because it allows weights to be adjusted and bias to be properly examined. Even for unstructured data, it is important to include diversity, since the models may pick up on the differences on their own and produce biased insights as a result.

There are many other examples of how biased source data affects what AI, and tech overall, learns. Caroline Criado Perez examined the role of gender in her book Invisible Women: Exposing Data Bias in a World Designed for Men. Many everyday objects like the seatbelt were designed for the male experience, which led to women being 47% more likely to be seriously injured in a car accident. When differences are not taken into consideration while building tools, those tools will not perform with the same accuracy for everyone.

Because innovation in the AI field is moving so fast, regulators cannot keep up and put in place the kind of wide-sweeping guidelines needed to enforce diversity and inclusivity in AI. We must hold ourselves accountable and ask how we are dealing with issues like race and gender in the AI source data we use. Ethical thinking in AI is critical because, as scientists and developers, we are the ones able to set constraints and limits.

How Buddywise Combats Lack of Inclusion and Diversity

At Buddywise, we are committed to building inclusive and diverse AI products that benefit everyone rather than reinforce existing prejudices. We believe that the first step in building inclusive AI products is ensuring we have an inclusive team! That’s why we are committed to diversity in our hiring process and make sure our team represents different backgrounds, from genders to nationalities.

Learn more at buddywise.co.

Written by Buddywise