
Authenticating Web Crawlers

Websites traditionally rely on web crawlers to correctly identify themselves via the User-Agent header field and follow robots.txt rules. However, in practice, many crawlers do not follow these conventions, often spoofing well-known crawlers or ignoring the restrictions defined in the robots.txt file.

Using RSL, websites can enforce stricter control over how their content is used by blocking crawlers that have not obtained a free or paid license from an RSL License Server. When a crawler requests a page on your website that is covered by an RSL license, it must include a valid RSL License Token for that page in the Authorization header, using the proposed License authentication scheme, which builds on the HTTP authentication framework defined in RFC 7235. This ensures that only licensed crawlers that have agreed to the terms of your RSL license can access your content.

Example: Defining a Crawling License

Below is an example RSL license file that specifies that crawlers must first obtain a free license from the RSL license server at https://rslcollective.org/api:

<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/" server="https://rslcollective.org/api">
    <license>
      <permits type="usage">tdm</permits>
    </license>
  </content>
</rsl>
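
To illustrate how a crawler might read this file, here is a minimal Python sketch that extracts the license server and the permitted usage for each content entry. The function name `read_license_rules` and the shape of the returned dictionaries are assumptions made for this example and are not part of RSL; only the elements and attributes shown in the license above are handled.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the example license.
RSL_NS = {"rsl": "https://rslstandard.org/rsl"}

LICENSE_XML = """\
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/" server="https://rslcollective.org/api">
    <license>
      <permits type="usage">tdm</permits>
    </license>
  </content>
</rsl>
"""


def read_license_rules(xml_text):
    """Return one dict per <content> element: url, server, permitted usages."""
    root = ET.fromstring(xml_text)
    rules = []
    for content in root.findall("rsl:content", RSL_NS):
        # Collect the text of every <permits type="usage"> under <license>.
        permits = [
            p.text.strip()
            for p in content.findall(
                "rsl:license/rsl:permits[@type='usage']", RSL_NS
            )
            if p.text
        ]
        rules.append(
            {
                "url": content.get("url"),
                "server": content.get("server"),
                "permits": permits,
            }
        )
    return rules


rules = read_license_rules(LICENSE_XML)
print(rules)
```

A crawler could use the `server` value to decide where to request a license before fetching any URL under the listed `url` prefix.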

Example: Crawler Request with License Authentication

A licensed crawler would authenticate each request by sending the license token in the Authorization header:

GET /data HTTP/1.1
User-Agent: GPTBot
Authorization: License <license_token>
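
The request above could be assembled in Python roughly as follows. This is a sketch using only the standard library; the helper name `build_licensed_request`, the URL `https://example.com/data`, the user agent `ExampleBot/1.0`, and the token value are placeholders invented for illustration. The request is only constructed here, not sent.

```python
import urllib.request


def build_licensed_request(url, license_token, user_agent="ExampleBot/1.0"):
    """Build a GET request that carries the token in the License scheme."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("User-Agent", user_agent)
    # Scheme name first, then a single space, then the token.
    request.add_header("Authorization", f"License {license_token}")
    return request


request = build_licensed_request("https://example.com/data", "example-token")
print(request.get_header("Authorization"))
```

Passing the returned object to `urllib.request.urlopen` would send it; a 401 reply would surface as `urllib.error.HTTPError`.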

Handling Unauthorized or Unlicensed Crawlers

If a crawler does not present an Authorization header using the License scheme, or presents an invalid, expired, or revoked <license_token>, the server should respond with an HTTP 401 Unauthorized status code. The response must include a WWW-Authenticate header that tells the crawler how to obtain a valid license.

Example: HTTP 401 Unauthorized Response

HTTP/1.1 401 Unauthorized
WWW-Authenticate: License error="invalid_request", \
  error_description="Access to this resource requires a valid license", \
  authorization_uri="https://rslcollective.org/api"
Content-Type: text/plain

License required. Please obtain a license at https://rslcollective.org/api
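
On the server side, a response like the one above might be assembled as in the following framework-neutral sketch. The function name `build_unauthorized_response` and its return shape (status code, header dict, body) are assumptions for this example. It also assumes that the backslashes in the example are display line breaks only, so the header is emitted on a single line, and that parameter values contain no double quotes.

```python
def build_unauthorized_response(error, description, authorization_uri):
    """Return (status, headers, body) for a 401 reply using the License scheme."""
    # One challenge with three quoted parameters, written on a single line.
    challenge = (
        f'License error="{error}", '
        f'error_description="{description}", '
        f'authorization_uri="{authorization_uri}"'
    )
    headers = {
        "WWW-Authenticate": challenge,
        "Content-Type": "text/plain",
    }
    body = f"License required. Please obtain a license at {authorization_uri}\n"
    return 401, headers, body


status, headers, body = build_unauthorized_response(
    "invalid_request",
    "Access to this resource requires a valid license",
    "https://rslcollective.org/api",
)
print(status)
print(headers["WWW-Authenticate"])
```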

WWW-Authenticate Header Fields

| Field | Description |
| --- | --- |
| License (scheme) | Indicates that the request must use the RSL authentication protocol. |
| error | Provides a machine-readable error code, such as invalid_request or invalid_license. |
| error_description | A human-readable explanation of why the authentication failed. |
| authorization_uri | A URL where the crawler can obtain or request a valid RSL license. |
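
A crawler that receives such a challenge needs to pull these fields out of the header value. The sketch below shows one simple way to do that in Python. The function name `parse_license_challenge` is hypothetical, and the parser assumes every parameter is written as `key="value"` with no escaped quotes inside the value, as in the example response; it is not a general WWW-Authenticate parser.

```python
import re

# Matches key="value" pairs; assumes values contain no escaped quotes.
_PARAM = re.compile(r'(\w+)="([^"]*)"')


def parse_license_challenge(header_value):
    """Return the challenge parameters as a dict, or None for other schemes."""
    scheme, _, rest = header_value.strip().partition(" ")
    # Authentication scheme names are case-insensitive.
    if scheme.lower() != "license":
        return None
    return dict(_PARAM.findall(rest))


challenge = (
    'License error="invalid_request", '
    'error_description="Access to this resource requires a valid license", '
    'authorization_uri="https://rslcollective.org/api"'
)
params = parse_license_challenge(challenge)
print(params["error"], params["authorization_uri"])
```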

RSL Authentication Error Codes

| Error Code | Meaning |
| --- | --- |
| invalid_request | Bad request format (missing or broken license_token) |
| invalid_license | Bad license_token (expired, revoked, or malformed) |
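
One possible reading of this table, sketched in Python: a request with no usable `License <token>` value maps to invalid_request, and a token that is present but rejected maps to invalid_license. The function `choose_error_code` and the `is_valid_token` callback are hypothetical names; how a token is actually checked for expiry or revocation is not shown here and is left to the callback.

```python
def choose_error_code(authorization_value, is_valid_token):
    """Return None if the request is licensed, else an RSL error code."""
    if not authorization_value:
        return "invalid_request"  # header missing entirely
    scheme, _, token = authorization_value.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "license" or not token:
        return "invalid_request"  # wrong scheme or no token supplied
    if not is_valid_token(token):
        return "invalid_license"  # token rejected by the validator
    return None


# Stand-in validator for the example: a fixed set of accepted tokens.
known_tokens = {"good-token"}


def is_valid(token):
    return token in known_tokens


print(choose_error_code("License expired-token", is_valid))
```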

RSL™, Really Simple Licensing™, and the RSL Logo are trademarks of RSL Foundry.