Authenticating Web Crawlers
Websites traditionally rely on web crawlers to identify themselves correctly via the User-Agent header field and to follow robots.txt rules. In practice, however, many crawlers do not follow these conventions: they spoof well-known crawlers or ignore the restrictions defined in the robots.txt file.
With RSL, websites can enforce stricter control over their content usage rights by blocking crawlers that have not obtained a free or paid license from an RSL License Server. When a crawler requests a page on your website that is covered by an RSL license, it must include a valid RSL License Token for that page in the Authorization header, using the proposed License scheme for the RFC 7235 HTTP Authentication framework. This ensures that only licensed crawlers that have agreed to the terms of your RSL license can access your content.
Example Code: Defining a Crawling License
Below is an example RSL license file that specifies that crawlers must first obtain a free license from the RSL license server at https://rslcollective.org/api:
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/" server="https://rslcollective.org/api">
    <license>
      <permits type="usage">tdm</permits>
    </license>
  </content>
</rsl>
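To illustrate how a crawler might discover which license server governs a path, the following minimal sketch reads the example file above with Python's standard XML parser. The function name `read_license_servers` and the shape of the returned dictionaries are assumptions made for illustration; they are not part of RSL.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the example above.
RSL_NS = {"rsl": "https://rslstandard.org/rsl"}

RSL_DOCUMENT = """\
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/" server="https://rslcollective.org/api">
    <license>
      <permits type="usage">tdm</permits>
    </license>
  </content>
</rsl>
"""


def read_license_servers(rsl_xml: str) -> list:
    """Return one entry per <content> element (hypothetical helper)."""
    root = ET.fromstring(rsl_xml)
    entries = []
    for content in root.findall("rsl:content", RSL_NS):
        # Collect (type, value) pairs from every <permits> under <license>.
        permits = [
            (p.get("type"), (p.text or "").strip())
            for p in content.findall("rsl:license/rsl:permits", RSL_NS)
        ]
        entries.append({
            "url": content.get("url"),
            "server": content.get("server"),
            "permits": permits,
        })
    return entries


entries = read_license_servers(RSL_DOCUMENT)
```

Here `entries[0]["server"]` is the license server URL that a crawler would contact before requesting content under `/`.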
Example: Crawler Request with License Authentication
A licensed crawler would authenticate each request by sending the license token in the Authorization header:
GET /data HTTP/1.1
User-Agent: GPTBot
Authorization: License <license_token>
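On the crawler side, attaching the token could look like the sketch below, which builds the request with Python's standard `urllib.request` without sending it. The helper name `build_licensed_request`, the URL, the token value, and the user agent are placeholders, and the whitespace check on the token is an assumption rather than a documented RSL rule.

```python
import urllib.request


def build_licensed_request(url: str, license_token: str,
                           user_agent: str) -> urllib.request.Request:
    """Build a GET request carrying the token in the License scheme (sketch)."""
    # Assumption: a token is a single non-empty value without whitespace,
    # so it can follow the scheme name after one space.
    if not license_token or any(ch.isspace() for ch in license_token):
        raise ValueError("license token must be non-empty and contain no whitespace")
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            "Authorization": f"License {license_token}",
        },
        method="GET",
    )


# Placeholder values; a real crawler would use the token issued by the license server.
request = build_licensed_request("https://example.com/data", "example-token", "ExampleBot")
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.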
Handling Unauthorized or Unlicensed Crawlers
If a crawler does not present an Authorization header that uses the License scheme, or presents an invalid, expired, or revoked <license_token>, the server should respond with an HTTP 401 Unauthorized status code. The response must include a WWW-Authenticate header with information about how to obtain a valid license.
Example: HTTP 401 Unauthorized Response
HTTP/1.1 401 Unauthorized
WWW-Authenticate: License error="invalid_request", \
error_description="Access to this resource requires a valid license", \
authorization_uri="https://rslcollective.org/api"
Content-Type: text/plain

License required. Please obtain a license at https://rslcollective.org/api
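A server could assemble this response along the lines of the sketch below. It is framework-independent and returns a status code, a header list, and a body; the function name `build_unauthorized_response` and the return shape are assumptions for illustration. The challenge is emitted on a single line, without the backslash continuations used above for readability.

```python
def build_unauthorized_response(error: str, description: str,
                                authorization_uri: str):
    """Return (status, headers, body) for an unlicensed request (sketch)."""

    def quote(value: str) -> str:
        # HTTP quoted-string: escape backslashes and double quotes.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    challenge = "License " + ", ".join([
        f"error={quote(error)}",
        f"error_description={quote(description)}",
        f"authorization_uri={quote(authorization_uri)}",
    ])
    body = f"License required. Please obtain a license at {authorization_uri}\n"
    headers = [
        ("WWW-Authenticate", challenge),
        ("Content-Type", "text/plain"),
    ]
    return 401, headers, body


status, headers, body = build_unauthorized_response(
    "invalid_request",
    "Access to this resource requires a valid license",
    "https://rslcollective.org/api",
)
```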
WWW-Authenticate Header Fields
Field | Description |
---|---|
License (scheme) | Indicates that the request must use the RSL authentication protocol. |
error | Provides a machine-readable error code, such as invalid_request or invalid_license. |
error_description | A human-readable explanation of why the authentication failed. |
authorization_uri | A URL where the crawler can obtain or request a valid RSL license. |
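A crawler that receives a 401 can read these fields to decide where to obtain a license. The following sketch parses a single License challenge with a regular expression; the function name `parse_license_challenge` is hypothetical, and the parser deliberately handles only one challenge per header value rather than the full RFC 7235 grammar.

```python
import re

# name=value pairs, where value is either a quoted-string or a bare token.
_PARAM = re.compile(
    r'([A-Za-z_][A-Za-z0-9_-]*)\s*=\s*(?:"((?:[^"\\]|\\.)*)"|([^\s,"]+))'
)


def parse_license_challenge(header_value: str):
    """Return the challenge parameters as a dict, or None for other schemes."""
    scheme, _, rest = header_value.strip().partition(" ")
    # Authentication scheme names are case-insensitive.
    if scheme.lower() != "license":
        return None
    params = {}
    for name, quoted, bare in _PARAM.findall(rest):
        # Undo backslash escapes inside quoted strings.
        value = re.sub(r"\\(.)", r"\1", quoted) if quoted else bare
        params[name.lower()] = value
    return params


params = parse_license_challenge(
    'License error="invalid_request", '
    'error_description="Access to this resource requires a valid license", '
    'authorization_uri="https://rslcollective.org/api"'
)
```

With the example response shown earlier, `params["authorization_uri"]` gives the URL the crawler should contact to request a license.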
RSL Authentication Error Codes
Error Code | Meaning |
---|---|
invalid_request | Bad request format (missing or broken license_token) |
invalid_license | Bad license_token (expired, revoked, or malformed) |
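One way a server might choose between the two codes is sketched below. It reflects a single reading of the table: a missing header or one not shaped as `License <token>` maps to invalid_request, and a well-formed header whose token is not accepted maps to invalid_license. The function name `classify_authorization` and the in-memory `valid_tokens` set are assumptions; a real deployment would verify tokens against the license server.

```python
from typing import Optional


def classify_authorization(header_value: Optional[str],
                           valid_tokens: set) -> Optional[str]:
    """Return an RSL error code, or None when the request is licensed (sketch)."""
    if header_value is None:
        return "invalid_request"      # no Authorization header at all
    parts = header_value.split()
    if len(parts) != 2 or parts[0].lower() != "license":
        return "invalid_request"      # wrong scheme or missing token
    if parts[1] not in valid_tokens:
        return "invalid_license"      # token unknown, expired, or revoked
    return None


# Hypothetical token store used only for this example.
tokens = {"good-token"}
missing = classify_authorization(None, tokens)
rejected = classify_authorization("License expired-token", tokens)
accepted = classify_authorization("License good-token", tokens)
```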