Ethical & Legal AI Data Collection in 2024: Comprehensive Best Practices and Analysis

The data fueling AI systems has become a gold mine for tech companies. But without ethical practices for sourcing and managing that data, organizations face a rising tide of public distrust, regulatory fines, and reputational damage.

In this expert guide, we will dig deep on the emerging best practices, troubling examples, and key regulations shaping ethical AI data collection in 2024 and beyond.

Why Ethics are Fundamental for AI Data Collection

Remember Tay, Microsoft's disastrous experimental chatbot? Within 24 hours of launch, internet trolls had manipulated Tay into tweeting offensive language, forcing Microsoft to shut it down.

The core issue was Tay's training data. Microsoft sourced it reactively from public social media, leading to an AI model that reflected the worst of online behavior – not an ethical approach.

This cautionary tale illustrates how flawed or biased data can lead AI astray, undermining public trust. Yet improper data practices can cause harm even when the model itself works as intended.

Without deliberate efforts to align data practices with ethical principles, organizations leave themselves open to major backlash, loss of user trust, and regulatory action.

Watch Out for These Ethical and Legal Considerations While Collecting AI Training Data

So what should organizations prioritize when sourcing data for AI systems? Frameworks like the OECD AI Principles, Gov.uk Data Ethics Framework, and Deloitte AI Ethics practices all provide guidance centered around transparency, fairness, accountability and minimizing harm.

These core tenets establish a strong ethical foundation when specifically applied to the process of data collection and curation:

Obtain Explicit Informed Consent

Gaining informed consent entails clearly communicating how data will be used and ensuring people can make specific, deliberate choices to opt-in or opt-out of data collection. Vague notifications buried in dense legal documents do not qualify.

Best practices include:

  • Granular options to consent to only specific data uses, not bundled "take it or leave it" deals. If you want to use data for improving A and analyzing B, get separate permissions.

  • Active opt-in prompts at the point where data is collected, like an upload dialogue box. Don't assume silence is consent.

  • Dashboard views and downloadable reports of all user data an organization holds, with retraction mechanisms.
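
These practices translate directly into a consent data model. Below is a minimal sketch in Python; the `Purpose` names and `ConsentRecord` fields are illustrative, not drawn from any specific framework:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Purpose(Enum):
    # Each distinct use gets its own consent flag -- no bundled deals.
    PRODUCT_IMPROVEMENT = "product_improvement"
    ANALYTICS = "analytics"
    MODEL_TRAINING = "model_training"

@dataclass
class ConsentRecord:
    user_id: str
    # Only purposes the user actively opted into; silence is not consent.
    granted: set = field(default_factory=set)
    # Audit trail of every opt-in and revocation, with timestamps.
    history: list = field(default_factory=list)

    def opt_in(self, purpose: Purpose):
        self.granted.add(purpose)
        self.history.append((purpose, "opt_in", datetime.now(timezone.utc)))

    def revoke(self, purpose: Purpose):
        self.granted.discard(purpose)
        self.history.append((purpose, "revoke", datetime.now(timezone.utc)))

    def allows(self, purpose: Purpose) -> bool:
        return purpose in self.granted

record = ConsentRecord(user_id="u123")
record.opt_in(Purpose.ANALYTICS)
print(record.allows(Purpose.ANALYTICS))       # True
print(record.allows(Purpose.MODEL_TRAINING))  # False: never bundled in
```

Keeping a per-purpose audit trail like this also makes it easy to prove, later, exactly what each user agreed to and when.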

Ensure Transparency Around Data Practices

Transparency builds trust. Be clear about:

  • What types of data are collected and via what methods.

  • How data is managed and secured.

  • What problems the organization aims to solve with the data.

  • Who can access the data and under what policies.

  • How users can get copies of their data or request deletion.

Iterate based on user feedback and audit regularly for compliance.

Enable Individual Data Control

Don't lock users in indefinitely once they provide data. Maintain:

  • Easy retraction mechanisms to revoke consent and delete data if desired.

  • An ability to review and edit data for inaccuracies.

  • Portable data downloads in interoperable formats.

Periodically remind users of these rights.
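
Export and deletion are straightforward to support if built in early. A minimal sketch, assuming a simple in-memory store keyed by user (the field names are illustrative):

```python
import json

def export_user_data(user_id: str, store: dict) -> str:
    """Return all data held for a user as portable, machine-readable JSON."""
    records = store.get(user_id, [])
    return json.dumps({"user_id": user_id, "records": records}, indent=2)

def delete_user_data(user_id: str, store: dict) -> int:
    """Honor a deletion request; returns the number of records removed."""
    removed = len(store.get(user_id, []))
    store.pop(user_id, None)
    return removed

store = {"u123": [{"type": "upload", "file": "photo.jpg"}]}
print(export_user_data("u123", store))
print(delete_user_data("u123", store))  # 1
```

In production the same two operations would run against your real datastore, but the contract is identical: interoperable export, verifiable deletion.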

Limit Data Uses to Stated Purposes

Avoid function creep. Only use data for the specific problems and products the user consented to.

Get new permissions for any additional purposes, like training new AI models.

Secure and Protect User Data

No amount of consent makes up for negligent data practices. Reasonable security steps include:

  • Access controls to limit internal data access to only required staff on a need-to-know basis. Never leave data openly accessible on the internet.

  • Encryption to protect data in transit and at rest. Properly hash and salt personal identifiers.

  • Vulnerability auditing and penetration testing to catch security holes proactively.

  • Breach monitoring and notification plans to promptly alert users and authorities in the event of compromise.
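
The hash-and-salt step can be sketched with the standard library alone. A per-dataset random salt, stored separately from the data, prevents simple rainbow-table lookups (the salt handling here is illustrative; for stronger guarantees use a keyed HMAC or a slow KDF):

```python
import hashlib
import secrets

# One random salt per dataset, stored separately from the data itself.
SALT = secrets.token_bytes(16)

def pseudonymize(identifier: str, salt: bytes = SALT) -> str:
    """Replace a personal identifier with a salted hash.

    The same identifier always maps to the same token (so records
    still join), but the token cannot be reversed without the salt.
    """
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()

a = pseudonymize("alice@example.com")
b = pseudonymize("alice@example.com")
print(a == b)   # True: deterministic, so records can still be matched
print(len(a))   # 64 hex characters, no trace of the original email
```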

Anonymize Where Possible

When collecting personal data, strip identifying fields or employ pseudonymization unless you can demonstrate those details are essential. This limits risks.

Assess Broader Societal Impacts

Looking beyond immediate business goals, consider how your data collection program could impact rights and freedoms, economic outcomes, public safety, user autonomy, and equity.

Troubling Examples of Unethical Data Collection for AI Training Sets

Unfortunately, clear guidelines have not yet prevented organizations from sourcing AI training data through dubious means:

Scraping Facebook and YouTube to Train Voice Assistants

Meta and Microsoft were recently found to have scraped audio data from their own platforms – including private voice messages – to train AI voice assistants. While legal, this violates user trust and expectations when not transparently communicated. It also risks imbalanced or low-quality training sets.

"Meta cannot reasonably claim that users who sent audio messages intended for a specific individual consented to the use of those messages to train AI," noted Albert Fox Cahn, Executive Director of the Surveillance Technology Oversight Project.

Exploiting Driver's License Photos for Facial Recognition

ICE and the FBI have tapped into state DMV databases with millions of Americans' driver's license photos to power facial recognition aimed at tracking undocumented immigrants.

Beyond inherently biased targeting, this co-opts data collected for ID purposes for entirely unrelated AI training. It violates consent, transparency and the expectation of regulatory oversight.

Swiping Hospital Patient Data to De-identify

In 2022, AI startup Paige sought to acquire hospitals' historical imaging data to pool into a massive training dataset, promising to de-identify the data using AI techniques. But efforts to strip protected health information (PHI) can be highly fallible.

Patients reasonably expect their scans to be used for direct care, not external research. And even where legal, hospitals arguably cannot unilaterally volunteer patient data for third-party use without consent.

"It's hard to make the case that you're benefiting the public good when you're accessing millions of patients' data without them really understanding how it's being used," said Maggie Horton, a lawyer focused on AI ethics.

These examples demonstrate the need for thoughtful oversight and accountability around AI training data practices.

Key Regulations Guiding Ethical AI Data Collection and Use

Various emerging laws, regulations and proposals specifically govern the ethical handling of personal data that may be used for AI systems:

European Union GDPR

The EU's General Data Protection Regulation grants users rights to:

  • Access their collected data
  • Receive details on data use
  • Have data deleted and transfer it elsewhere
  • Opt out of data sales or certain uses

It also requires informed consent for data collection and limits collection to only what is necessary. Violators face fines of up to €20 million or 4% of global annual revenue, whichever is higher.

United States COPPA

The Children's Online Privacy Protection Act requires verifiable parental consent before collecting data on children under 13. Data practices must be transparent, and tracking without consent is prohibited. Platforms must allow parents to view, edit or delete their children's data.

United States GINA

The Genetic Information Nondiscrimination Act prohibits health insurance companies and employers from requesting, buying or using individuals' genetic data, such as for setting insurance rates. This recognizes the sensitivity of biological data.

China PIPL

China's Personal Information Protection Law requires consent for collecting sensitive biometric, financial, health or location data. It restricts transferring data overseas unless to countries with "comparable" protection to China, aiming to keep data localized.

Proposed Algorithmic Accountability Act

The proposed US Algorithmic Accountability Act would require corporations to study high-risk AI systems for biases and discrimination while protecting citizens‘ right to opt out of data collection. It emphasizes understanding AI impacts on accuracy, fairness, bias, discrimination, privacy, and security.

Proposed European Union AI Act

The draft EU AI Act puts forward comprehensive regulations, including required transparency into how training data is sourced, mechanisms to fix errors, risk assessments for high-risk systems, and more baked-in privacy protections.

Practical Steps for ML Teams Collecting Training Data

Turning ethical principles into practice requires diligent work across the organization. For machine learning teams, I recommend these steps to ensure responsible and ethical data sourcing:

Perform bias audits on datasets using tools like IBM's AI Fairness 360. Document where your data comes from and any gaps in representation.
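
Toolkits like AI Fairness 360 implement standard fairness metrics such as disparate impact: the ratio of favorable-outcome rates between an unprivileged and a privileged group. A hand-rolled sketch of that one metric, assuming a simple list of labeled records (the field names are illustrative):

```python
def disparate_impact(records, group_key, favorable_key):
    """Ratio of favorable-outcome rates: unprivileged / privileged.

    Values well below 1.0 (a common rule of thumb flags < 0.8)
    suggest the dataset or model disfavors the unprivileged group.
    """
    rates = {}
    for group in (0, 1):  # 0 = unprivileged, 1 = privileged
        subset = [r for r in records if r[group_key] == group]
        favorable = sum(r[favorable_key] for r in subset)
        rates[group] = favorable / len(subset)
    return rates[0] / rates[1]

# Illustrative audit data: group membership and a favorable-outcome flag.
data = [
    {"group": 1, "outcome": 1}, {"group": 1, "outcome": 1},
    {"group": 1, "outcome": 1}, {"group": 1, "outcome": 0},
    {"group": 0, "outcome": 1}, {"group": 0, "outcome": 0},
    {"group": 0, "outcome": 0}, {"group": 0, "outcome": 0},
]
print(disparate_impact(data, "group", "outcome"))  # 0.25 / 0.75 ≈ 0.33
```

Running a metric like this per protected attribute, and recording the results alongside the dataset documentation, is what a bias audit looks like in practice.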

Evaluate on test sets drawn from different data distributions than your main training data to catch failures generalizing across demographics.

Pseudonymize data by replacing personal identifiers with random strings. Hash identifiers if needed to match records.
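
The random-string approach can be sketched with a lookup table kept separate from the released data, so records still match without exposing identifiers (the class and names here are illustrative):

```python
import secrets

class Pseudonymizer:
    """Map identifiers to random tokens; the table stays under lock and key."""

    def __init__(self):
        self._table = {}

    def tokenize(self, identifier: str) -> str:
        # Reuse the token for a known identifier so records still match.
        if identifier not in self._table:
            self._table[identifier] = secrets.token_hex(16)
        return self._table[identifier]

p = Pseudonymizer()
t1 = p.tokenize("patient-42")
t2 = p.tokenize("patient-42")
print(t1 == t2)            # True: same record, same token
print(t1 == "patient-42")  # False: no identifier leaks through
```

Unlike a plain hash, a random token table can also be destroyed outright, making re-identification impossible once the mapping is no longer needed.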

Layer consent and access by tagging data with allowed usage conditions based on user permissions.
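
Tagging records with allowed uses makes purpose limitation enforceable in code rather than in policy documents. A minimal sketch, with illustrative tag names:

```python
def filter_for_purpose(records, purpose):
    """Keep only records whose owner consented to this purpose."""
    return [r for r in records if purpose in r["allowed_uses"]]

# Each record carries the usage conditions its owner agreed to.
records = [
    {"id": 1, "allowed_uses": {"analytics", "model_training"}},
    {"id": 2, "allowed_uses": {"analytics"}},
]
training_set = filter_for_purpose(records, "model_training")
print([r["id"] for r in training_set])  # [1]
```

With this in place, a new AI training run simply cannot include data whose owner only consented to analytics.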

Establish data pipelines that enable updating, correcting or removing data in bulk in response to user feedback.
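
Deletion requests then become a routine bulk operation rather than a manual scramble. A minimal sketch, assuming records carry a `user_id` field:

```python
def apply_deletion_requests(dataset, requests):
    """Drop every record belonging to users who revoked consent.

    Returns the cleaned dataset and a removal count for the audit log.
    """
    revoked = set(requests)
    kept = [r for r in dataset if r["user_id"] not in revoked]
    return kept, len(dataset) - len(kept)

dataset = [
    {"user_id": "u1", "text": "first upload"},
    {"user_id": "u2", "text": "second upload"},
    {"user_id": "u1", "text": "third upload"},
]
cleaned, removed = apply_deletion_requests(dataset, ["u1"])
print(removed)                          # 2
print([r["user_id"] for r in cleaned])  # ['u2']
```

Note that removing records from the dataset does not remove their influence from already-trained models; retraining or unlearning policies need to be decided separately.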

Document everything in a metadata repository detailing dataset lineages, curation and monitoring.

Enlist ethics advisors and independent auditors to provide accountability around data practices.

Looking Ahead: Building an Ethical Data Culture

Approaching AI development with an "ethics after the fact" mentality is dangerous and shortsighted. Privacy and transparency should be baked into data pipelines from the start.

Establishing an ethical data culture requires buy-in across teams – from onboarding data science teams, to evaluating models for bias, to writing transparent AI principles customers can trust.

It's also an evolving process as risks emerge and standards crystallize. But organizations that embrace that nuance and prioritize users' long-term interests will build loyalty and trust that fuels sustainable success.

The data fueling AI holds immense opportunity to benefit society – but only if collected and managed thoughtfully. With ethical foundations, thoughtful governance and continuous engagement, organizations can ensure their data practices align with their principles.