Bot Wrangler is a Traefik plugin designed to improve your web application's security and performance by managing bot traffic effectively. With the rise of large language model (LLM) data scrapers, it has become crucial to control automated traffic from bots. Bot Wrangler provides a solution to log, block, or otherwise handle traffic from unwanted LLM bots, ensuring that your resources are protected and your content remains accessible only to the visitors you intend.
By default, bot user agents are retrieved from ai-robots-txt. Any service where this middleware is installed will serve that list when /robots.txt is queried. If an incoming request matches the bot list, meaning the bot is ignoring your robots.txt, a configurable remediation action is taken:
- PASS: Do nothing (no-op)
- LOG: Write a log message about the visitor (the default behavior)
- BLOCK: Reject the request with a static response (a 403 error by default)
- PROXY: Proxy the request to a "tarpit" or other service for handling bot traffic, such as Nepenthes, iocaine, etc.

Please read the following if you plan to deploy this plugin!
- If you use robotsTxtFilePath, ensure that the robots.txt template is available to Traefik at startup. For Docker, this means passing the file in as a mount. For Kubernetes, mounting the template in a ConfigMap is easiest.
- The request to botProxyUrl is unbuffered. If you are passing this request to another reverse proxy in front of a tarpit-style application, ensure proxy buffering is disabled.

The following parameters are exposed to configure this plugin:
| Name | Default Value | Description |
|---|---|---|
| enabled | true | Whether or not the plugin should be enabled |
| botAction | LOG | How the bot should be wrangled. Available: PASS (do nothing), LOG (log bot info), BLOCK (log and return static error response), PROXY (log and proxy to botProxyUrl) |
| botProxyUrl | "" | The URL to pass a bot's request to, if botAction is set to PROXY |
| botBlockHttpCode | 403 | The HTTP response code that should be returned when a BLOCK action is taken |
| botBlockHttpContent | "Your user agent is associated with a large language model (LLM) and is blocked from accessing this resource" | The value of the 'message' key in the JSON response when a BLOCK action is taken. If an empty string, the response body has no content. |
| cacheUpdateInterval | 24h | How frequently sources should be refreshed for new bots. Also flushes the User-Agent cache. |
| cacheSize | 500 | The maximum size of the cache of User-Agent to Bot Name mappings. Rolls over when full. |
| logLevel | INFO | The log level for the plugin |
| robotsTxtFilePath | "" | The file path to a custom robots.txt Golang template file. This must end in /robots.txt. If omitted, a default will be generated based on the user agents from your robotsSourceUrl. See the Kubernetes ConfigMap example below. |
| robotsTxtDisallowAll | false | A config option to generate a robots.txt file that will disallow all user-agents. This does not change the blocking behavior of the middleware. |
| robotsSourceUrl | https://cdn.jsdelivr.net/gh/ai-robots-txt/ai.robots.txt/robots.json | A comma-separated list of URLs from which to retrieve bot lists. You can provide your own, but read the notes below! |
| robotsSourceRetryInterval | 5m | If retrieving data from a source fails, how frequently to retry |
| setNoArchiveHeader | true | Set the X-Robots-Tag header to noarchive in responses to detected bot traffic. Used by Bing and Amazon, possibly others. |
| useFastMatch | true | When true, use an Aho-Corasick automaton to quickly match uncached User-Agents against bot names; this consumes more memory. When false, a slower, simple substring match is used. |
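For example, a middleware plugin block combining several of these options might look like the following sketch (the values and the tarpit URL are illustrative, not defaults):

```yaml
# Hypothetical option values; any omitted parameter keeps its default
wrangler:
  botAction: PROXY
  botProxyUrl: http://tarpit.internal:8080 # assumed tarpit endpoint
  cacheUpdateInterval: 12h
  setNoArchiveHeader: true
```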
Presently, three different types of source files are supported, including JSON in the format of the default ai-robots-txt robots.json, as well as robots.txt styled formatting, from which a bot list will be extracted. In any case, you should ensure that the server serving your source file provides a proper Content-Type header. Of particular note, raw.githubusercontent.com fails to do this. If you wish to use a file hosted on GitHub, check out jsDelivr, which can proxy the file with the proper headers. It is recommended to pin the source to a specific git tag or commit.
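For example, jsDelivr can pin the default source to a tag or commit with its @version path syntax (the tag below is hypothetical; substitute a real release):

```yaml
# Pinned bot list source; replace v1.28 with an actual tag or commit SHA
robotsSourceUrl: https://cdn.jsdelivr.net/gh/ai-robots-txt/ai.robots.txt@v1.28/robots.json
```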
There are many applications folks have written that are meant to handle LLM traffic in some way that wastes the bots' time, usually based on Markov chains, or even a local LLM instance generating random text. Some you need to provide training data to; some are already trained. Some are more malicious in nature than others, so deploy at your own risk!
I have not tested this plugin with every project in this list, nor is it an exhaustive list of all projects in this space. If you find this plugin has issues with one, please open an issue. Thanks to iocaine for providing the initial list!
The Traefik static configuration must define the module name:
```yaml
# Static configuration
experimental:
  plugins:
    wrangler:
      moduleName: github.com/holysoles/bot-wrangler-traefik-plugin
      version: vX.Y.Z # find latest release here: https://github.com/holysoles/bot-wrangler-traefik-plugin/releases
```
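If you configure Traefik with CLI arguments instead of a file, the same static configuration can be passed as flags; here is a minimal docker-compose sketch (the image tag and service layout are assumptions):

```yaml
# docker-compose sketch: the same static configuration expressed as CLI flags
services:
  traefik:
    image: traefik:v3 # assumed image tag
    command:
      - "--experimental.plugins.wrangler.modulename=github.com/holysoles/bot-wrangler-traefik-plugin"
      - "--experimental.plugins.wrangler.version=vX.Y.Z"
```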
After including the plugin module in Traefik's static configuration, you'll need to set up the dynamic configuration to actually use it as middleware.
Here is an example of a file provider dynamic configuration (given here in YAML); note the http.routers.my-router.middlewares and http.middlewares sections:
```yaml
# Dynamic configuration
http:
  routers:
    my-router:
      rule: host(`demo.localhost`)
      service: service-foo
      entryPoints:
        - web
      middlewares:
        - bot-wrangler
  services:
    service-foo:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:5000
  middlewares:
    bot-wrangler:
      plugin:
        wrangler:
          logLevel: INFO
          botAction: BLOCK
```
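If you use the Docker provider rather than the file provider, the equivalent router and middleware can be declared with container labels; a sketch, assuming a whoami test service:

```yaml
# docker-compose labels sketch: the same dynamic configuration via the Docker provider
services:
  whoami:
    image: traefik/whoami
    labels:
      - "traefik.http.routers.my-router.rule=Host(`demo.localhost`)"
      - "traefik.http.routers.my-router.middlewares=bot-wrangler"
      - "traefik.http.middlewares.bot-wrangler.plugin.wrangler.logLevel=INFO"
      - "traefik.http.middlewares.bot-wrangler.plugin.wrangler.botAction=BLOCK"
```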
If Traefik is deployed with the official helm chart, you'll need to include these values in your Values.yaml for the release:
```yaml
experimental:
  plugins:
    wrangler:
      moduleName: "github.com/holysoles/bot-wrangler-traefik-plugin"
      version: vX.Y.Z # find latest release here: https://github.com/holysoles/bot-wrangler-traefik-plugin/releases
volumes:
  - name: botwrangler-robots-template
    mountPath: /etc/traefik/bot-wrangler/
    type: configMap
```
If you want to use a custom robots.txt template for the plugin to render, we'll need to create the ConfigMap being referenced (ensure it's in the same namespace!):
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: botwrangler-robots-template
data:
  robots.txt: |
    {{ range $agent := .UserAgentList }}
    User-agent: {{ $agent }}
    {{- end }}
    Disallow: /
```
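For reference, rendered against a bot list containing, say, GPTBot and Bytespider (both in the default ai-robots-txt source), this template produces output along these lines:

```
User-agent: GPTBot
User-agent: Bytespider
Disallow: /
```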
Then we'll need to create the middleware object:
```yaml
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: botwrangler
spec:
  plugin:
    wrangler:
      robotsTxtFilePath: /etc/traefik/bot-wrangler/robots.txt
      # Any other config options go here
```
To actually apply the middleware, you can either include it per-IngressRoute:
```yaml
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: whoami-ingress
spec:
  entryPoints:
    - web
  routes:
    - kind: Rule
      match: Host(`example.com`)
      middlewares:
        - name: botwrangler
      services:
        - name: whoami
          port: 8080
```
Or per-entrypoint (in your Values.yaml) if you want to broadly apply the plugin:
```yaml
additionalArguments:
  - "--entrypoints.web.http.middlewares=traefik-botwrangler@kubernetescrd"
  - "--entrypoints.websecure.http.middlewares=traefik-botwrangler@kubernetescrd"
  - "--providers.kubernetescrd"
```
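Note that Kubernetes CRD middleware references take the form `<namespace>-<name>@kubernetescrd`, so the arguments above assume the botwrangler Middleware was created in the traefik namespace; adjust the prefix if yours lives elsewhere.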
Traefik offers a developer mode that can be used for temporary testing of unpublished plugins.
To use a plugin in local mode, the Traefik static configuration must define the module name (as is usual for Go packages) and a path to a Go workspace, which can be the local GOPATH or any directory.
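A minimal sketch of that static configuration, assuming the plugin source has been cloned into Traefik's plugins-local directory:

```yaml
# Static configuration for local/dev mode; assumes the source is checked out at
# ./plugins-local/src/github.com/holysoles/bot-wrangler-traefik-plugin
experimental:
  localPlugins:
    wrangler:
      moduleName: github.com/holysoles/bot-wrangler-traefik-plugin
```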
A test dev instance can be easily set up by using commands in the provided makefile (e.g. `make run_local`, `make restart_local`) and modifying the `docker-compose.local.yml` file.
Contributions to this project are welcome! Please use conventional commits, and retain a linear git history.
Special thanks to the following projects: