Protecting Your Users' Data from AI Training Crawlers
How to Protect Your Users' Data from AI Training with FGA
AI technology is advancing faster than ever, and while this development creates many exciting new tools and opportunities, it also comes with a long list of concerns about data privacy.
It seems like everyone creating original content, be it artists, writers, or software developers, is worried about how their data is being used by AI. This is especially true when AI models are trained on data gathered by crawlers, which can result in the unauthorized use of personal or proprietary information and lead to significant privacy violations and ethical dilemmas.
The issues don’t end with the exploitation of original content. Sensitive personal data can also get caught up in the mix, potentially violating privacy laws like GDPR, CCPA, and HIPAA.
Cloudflare has already declared war on Microsoft's, Google's, and OpenAI's bots, offering tools to block all crawler services. Meanwhile, Adobe had to issue apologies and clarifications amid the crossfire of intellectual property lawsuits that followed its announcement of new terms of service, which many users understood as forcing them to grant unlimited access to their work for the purpose of training Adobe's generative AI.
On the other hand, not every bot entity interacting with our original content is harmful. Some are genuinely helpful - whether it is Grammarly (or any other grammar AI application) helping us correct grammar mistakes or one of Google’s search indexing bots helping our content appear in Google’s search results.
One thing is certain—as developers, we need an effective way to monitor and manage bots' access to our applications and provide our users with an experience that safeguards their data in our applications.
We’ve already published an article on how you can safely introduce relevant AI bots into your application. This time, we discuss how to safeguard our users' data from bots. Here are three steps you can take:
- Identify and rank AI bots to distinguish between beneficial and malicious interactions.
- Classify user data based on its sensitivity and apply Fine-Grained Authorization (FGA) controls.
- Give users direct control over which services can access their data through easy-to-use interfaces.
In the following sections, we will explore in detail how tools like ArcJet and Permit.io can help us achieve these goals. Let's get to it.
Understanding and Ranking AI Bot Activity
Traditionally, the question of 'Who is our user?' was answered by our authentication layer, while 'What can this user do?' was left in the hands of authorization. Now, hybrid identities like automated systems or bots interact with our application programmatically and can easily authenticate through a non-native method (think of Grammarly being granted access to read a document). This means that trusting a binary answer from our authentication service can lead to permission flaws, and we now have to find a new way to ask the "Who" question on the authorization side.
To better answer this question, we need a method for ranking these identities while considering multiple factors. This system helps us classify bots somewhere between machines and humans, allowing us to make better decisions. For this, we recommend using ArcJet, an open-source startup that specializes in application security tools, including a bot ranking system.
ArcJet’s system provides ranks for identities, which help us get a more in-depth answer for our “Who” question. Here’s an example of an ArcJet-based bot rating system:
- NOT_ANALYZED: Request analysis failed; might lack sufficient data. Score: 0.
- AUTOMATED: Confirmed bot-originated request. Score: 1.
- LIKELY_AUTOMATED: Possible bot activity with varying certainty, scored 2-29.
- LIKELY_NOT_A_BOT: Likely human-originated request, scored 30-99.
- VERIFIED_BOT: Confirmed beneficial bot. Score: 100.
Each rank also comes with a detailed numeric score, allowing us to fine-tune the thresholds in our configuration.
By ranking user activity in real-time, we establish the first aspect required to answer the question of “Who is accessing the data?” when determining “What can they do?” This ranking provides a more sophisticated and accurate basis for making authorization decisions.
Using ArcJet’s ranking system, we can distinguish between different types of bots and human interactions, taking one step further in protecting user data from being inadvertently or maliciously used for AI training.
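To make the ranking actionable, we can map ArcJet's bot types onto the user categories we will define later in this article. Below is a minimal TypeScript sketch; the `verifiedBotKind` parameter and the category names are illustrative assumptions, and the exact way you read the bot type off ArcJet's decision object depends on your SDK version.

```typescript
// A minimal mapping from ArcJet's bot ranking to the user categories used
// later in this article. The bot type values are taken as plain inputs so
// the function stays independent of any specific ArcJet SDK version.

type ArcjetBotType =
  | "NOT_ANALYZED"
  | "AUTOMATED"
  | "LIKELY_AUTOMATED"
  | "LIKELY_NOT_A_BOT"
  | "VERIFIED_BOT";

type UserCategory = "harmful" | "ai_crawler" | "search_crawler" | "human";

// `verifiedBotKind` is a hypothetical input: in practice, you would decide
// whether a verified bot is a search indexer or an AI crawler based on your
// own allowlist or the bot's user agent.
function categorizeIdentity(
  botType: ArcjetBotType,
  verifiedBotKind?: "search" | "ai"
): UserCategory {
  switch (botType) {
    case "VERIFIED_BOT":
      // Beneficial bot, but we still distinguish search indexing from AI training.
      return verifiedBotKind === "ai" ? "ai_crawler" : "search_crawler";
    case "LIKELY_NOT_A_BOT":
      // Most likely a human visitor.
      return "human";
    default:
      // AUTOMATED, LIKELY_AUTOMATED, and NOT_ANALYZED requests are treated
      // as harmful until proven otherwise.
      return "harmful";
  }
}
```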
Fine-Grained Authorization
Once you’ve got your users (human or bot) ranked, the next step is to classify the data they’re trying to access. Not all data is created equal; some of it is more sensitive and needs stronger protection.
This is where Fine-Grained Authorization (FGA) comes into play. Essentially, you’re asking the question, “Who can do what inside our application, and under what circumstances?”
To implement FGA, you need to consider both user attributes (like the rankings we just talked about) and resource attributes (the sensitivity of the data).
Creating access control rules (policies) like “An AI crawler bot can only have read access to documents marked as public” is a classic use case for Attribute-Based Access Control (ABAC). Fortunately, you don’t have to write these rules from scratch. Instead, you can create these policies by using Permit.io’s no-code policy editor.
Permit.io allows us to define FGA policies, including ReBAC and ABAC. With ABAC in Permit, we can define User Sets (combinations of users and their attributes) and Resource Sets (combinations of data and its attributes). This way, we can set up strict rules about which users get access to which types of data.
Example use case - Blogging platform
Consider a platform where users can post blogs. Here’s how you might want it to work:
- The platform has publicly available blogs that anyone, except for bots defined as harmful, can read.
- Some articles are available only to users who have signed up as members of the platform. Search engine crawlers should also access these blogs to improve their search engine ranking. AI crawlers, however, should not have access to them.
- Some articles should be accessible only to signed-up members who subscribe to a specific author; these should be available only to the subscribed users and the authors themselves.
- Articles defined as drafts should only be accessible by their authors.
With this example in mind, let's see how we can model such a system in Permit.io:
Classifying User Data by Defining Resource Sets:
Resource sets combine the resources we want to protect (in this case, blogs) with the attributes that define them. Here are the resources from our example and their conditions as defined in Permit:
Public Blogs: Freely available on the platform.
The screenshots display how we define resource attributes in the Permit UI.
Member Blogs: Intended for people with an account on the platform.
Subscriber-Limited Blogs: Intended for people with an account on the platform who are subscribed to the blog author.
As you can see here, the resource type is defined as ‘subscriber-limited-blog,’ and its condition references a user attribute that contains all of the user's subscriptions. This means we can check whether the user trying to access this resource has the relevant subscription among their user attributes.
Owned Drafts: These can be read only by the blog author.
As in the previous example, this condition will check if the user trying to access this resource is its owner (author) based on the user's unique key.
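While the screenshots above show the no-code editor, the same resource and attribute definitions can also be created programmatically. Here's a minimal sketch assuming Permit's Node SDK (the permitio package) and its management API; the attribute names (`visibility`, `author`) are illustrative, and exact method names and schemas may differ between SDK versions.

```typescript
import { Permit } from "permitio";

const permit = new Permit({
  token: process.env.PERMIT_API_KEY!, // your Permit environment API key
  pdp: "http://localhost:7766",       // address of your local PDP container
});

// Declare the "blog" resource together with the attributes that our
// resource sets (public / member / subscriber-limited / draft) condition on.
await permit.api.resources.create({
  key: "blog",
  name: "Blog",
  actions: {
    read: {},
    create: {},
    update: {},
    delete: {},
  },
  attributes: {
    visibility: { type: "string" }, // e.g. "public" | "member" | "subscriber-limited" | "draft"
    author: { type: "string" },     // the authoring user's key
  },
});
```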
With our resources defined, we can now define our users.
Classifying Users and Bots:
Based on our example, we can use the data extracted from ArcJet to categorize our users into four different categories:
Harmful: A bot deemed harmful based on its ArcJet ranking
AI Crawler: A bot designated for AI training
Search Crawler: A bot designated for search engine optimization
Human: A verified human user
To make the implementation more efficient, we can define three types of user sets:
Reader: Any non-harmful bot and every user who isn’t a member of our blogging platform.
Based on the example, this type of user should only have access to blogs defined as “Public”.
Member: Members are either users who have accounts on our platform or search crawlers.
Again, based on the example, these users should have access to all “Member” blogs.
Human Member:
Lastly, users who subscribe to specific blogs and blog authors can be categorized as human members. The resource-based attributes we defined in the previous section govern their access to the blogs labeled as “Subscriber-Limited” or “Drafts.”
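Bots usually aren't registered users, so their category can simply be passed as a user attribute at check time (shown in the next section). Registered members, however, can be synced to Permit with the attributes our user sets and resource sets condition on. Here is a minimal sketch, again assuming the permitio Node SDK and illustrative attribute names; method names may differ between SDK versions.

```typescript
import { Permit } from "permitio";

const permit = new Permit({
  token: process.env.PERMIT_API_KEY!, // your Permit environment API key
  pdp: "http://localhost:7766",       // address of your local PDP container
});

// Keep a platform member's attributes in sync with Permit so the
// "Human Member" user set can evaluate its conditions against them.
await permit.api.users.sync({
  key: "user-42",
  email: "reader@example.com",
  attributes: {
    category: "human",                        // derived from the ArcJet ranking
    subscriptions: ["author-7", "author-19"], // authors this member follows
  },
});
```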
Based on these defined resource and user sets, we can proceed to define our ABAC policy.
Defining an ABAC Authorization Policy:
As you can easily see, we configured the following rules in the table:
- Human Members (Authors and Subscribers) can read the blogs to which they are subscribed and create, delete, edit, and read the blogs they authored.
- Members and search crawlers can read member blogs.
- Readers (All users except for harmful bots) can read blogs labeled as public.
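At runtime, enforcing these rules comes down to a single check call against Permit's PDP. Here is a minimal sketch assuming the permitio Node SDK; the user key, attribute names, and attribute values are illustrative and need to match whatever you configured in the policy editor.

```typescript
import { Permit } from "permitio";

const permit = new Permit({
  token: process.env.PERMIT_API_KEY!,
  pdp: "http://localhost:7766",
});

// Can this identity read this blog post? User and resource attributes are
// passed inline, so the same call covers anonymous readers, crawlers
// categorized by ArcJet, and signed-in members alike.
const allowed = await permit.check(
  {
    key: "googlebot",                           // any stable identifier
    attributes: { category: "search_crawler" }, // derived from the ArcJet ranking
  },
  "read",
  {
    type: "blog",
    attributes: { visibility: "member", author: "author-7" },
  }
);

if (!allowed) {
  // Deny the request, e.g. respond with HTTP 403 from your route handler.
}
```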
With all of our user sets, resource sets, and policies defined, Permit generates the policy-as-code for us. Thus, by utilizing ArcJet's bot ranking system and Permit's ABAC authorization, we are able to manage which users and bots are allowed to access our users' data. But there is another part to this: what if we want to give our users the ability to make their own decisions about who can access their data?
Putting Control in Users’ Hands with Embeddable Interfaces
So far, we’ve set limitations on which user data can be accessed by which entities in our application. But this is only a partial solution - in many cases, it is crucial to let our users define for themselves who, whether human or bot, is allowed to access their data.
To achieve this, we can use Permit.io’s Elements feature - a suite of prebuilt, embeddable UI components designed to streamline access control in applications. These components allow users to manage their own permissions within safe, pre-defined boundaries, making sure their data is only accessible to those they trust.
If we take our blogging platform as an example, Permit Elements allows you to embed access control elements directly into your application, through which your users can control access to their content.
The User Management Element, for example, allows an author who published a “Member” blog to grant specific bots or users permission to view or edit their blog posts directly within the platform.
The Access Request Element could be employed to allow a specific user to request access to specific content. The author receives this request and can approve or deny it based on the requestor’s identity and purpose.
Permit Elements puts fine-grained authorization directly in the users' hands, allowing them to manage access according to their needs and the platform’s security requirements.
- Authors can use Permit Share-If to grant temporary access to their content, whether to a colleague or a content analyzer bot, and get an easy overview of who exactly has access to their content.
- Platform Members can request additional access to content, which can then be reviewed and approved by authors.
- Platform Admins can oversee and audit all access requests and approvals, ensuring that no unauthorized access slips through the cracks.
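Embedding an Element involves generating a login session for the current user on your backend and rendering the Element (typically as an iframe or via Permit's frontend SDK) on the page. Here is a rough backend sketch assuming an Express app, a hypothetical `req.user` populated by your authentication middleware, and the permitio Node SDK's `permit.elements.loginAs` call; parameter names may differ between SDK versions.

```typescript
import express from "express";
import { Permit } from "permitio";

const app = express();
const permit = new Permit({ token: process.env.PERMIT_API_KEY! });

// The embedded Element calls this route to authenticate the signed-in
// author against Permit before rendering their sharing controls.
app.post("/permit-login", async (req, res) => {
  // req.user is assumed to be set by your (hypothetical) auth middleware.
  const userId = (req as any).user.id;
  const auth = await permit.elements.loginAs({ userId, tenant: "default" });
  res.status(200).send(auth);
});

app.listen(3000);
```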
This approach enhances the user experience by providing more control and strengthens security by maintaining the principle of least privilege. Users are empowered to make informed decisions about who accesses their content, ensuring that sensitive data is protected while allowing flexibility where needed.
Empowering Users While Protecting Their Data
As AI continues to integrate into our lives, protecting user data from being misused for AI training is more important than ever - a task for which developers have a critical responsibility. By ranking AI bots, implementing Fine-Grained Authorization (FGA), and putting control in users' hands, developers can ensure that sensitive information stays safe.
Tools like ArcJet and Permit.io make it easier to strike the right balance between utilizing AI and maintaining data privacy. As we move forward, staying proactive in these areas will not only keep us compliant with regulations but also build trust with our users.
In the end, it's all about giving users peace of mind that their data is safe while still enjoying the benefits of AI.
Written by
Daniel Bass
Application authorization enthusiast with years of experience in customer engineering, technical writing, and open-source community advocacy. Community Manager, Dev. Convention Extrovert, and Meme Enthusiast.