## Configuration
Everything is set via .env (see .env.example).
### Environment variables
| var | default | purpose |
|---|---|---|
| `SEARXNG_SECRET` | (required) | HMAC signing key for SearXNG — not an API key |
| `CRAWL4AI_API_TOKEN` | (empty) | optional bearer token if Crawl4AI is locked down |
| `MCP_TRANSPORT` | `http` | `http` (streamable-http) or `stdio` (subprocess) |
| `MCP_PORT` | `8000` | host port for the MCP server (also used for `make smoke`) |
| `PLAYGROUND_PORT` | `8001` | host port for `make playground` |
| `MCP_CACHE_TTL` | `300` | `web_search` cache TTL in seconds (`0` disables) |
| `EXTRACT_CACHE_TTL` | `1800` | `web_extractor` cache TTL in seconds (`0` disables) |
| `REQUEST_TIMEOUT` | `30` | SearXNG request timeout (s) |
| `EXTRACT_TIMEOUT` | `60` | Crawl4AI request timeout (s) |
| `MAX_CONCURRENCY` | `5` | parallel Crawl4AI requests per `web_extractor` call |
| `VALKEY_IMAGE` | `valkey/valkey:8-alpine` | pin Valkey image |
| `SEARXNG_IMAGE` | `searxng/searxng:latest` | pin SearXNG image |
| `CRAWL4AI_IMAGE` | `unclecode/crawl4ai:latest` | pin Crawl4AI image (recommended for prod) |
After changing MCP_PORT, run make restart (not just up) so the new host-port mapping takes effect.
SEARXNG_SECRET is not an API key — nothing sends it. SearXNG uses it server-side to sign image-proxy URLs (HMAC) and internal tokens; it just needs to be random and stable. install.sh generates it; the SearXNG container won't start without one.
Search quality is tuned in searxng/settings.yml (engines enabled, weights, categories) — restart the searxng service after editing.
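For reference, a minimal `.env` might look like the fragment below. The values are illustrative; copy `.env.example` for the full list and let `install.sh` generate the secret.

```bash
# Required: random, stable secret for SearXNG's HMAC signing (install.sh generates one)
SEARXNG_SECRET=change-me-to-a-long-random-string

# Transport and host ports
MCP_TRANSPORT=http
MCP_PORT=8000
PLAYGROUND_PORT=8001

# Caching (seconds; 0 disables)
MCP_CACHE_TTL=300
EXTRACT_CACHE_TTL=1800

# Pin images for reproducible deployments
CRAWL4AI_IMAGE=unclecode/crawl4ai:latest
```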
## MCP tools
Tool implementations live in mcp/tools.py and are shared with the dev playground.
### web_search
Queries the self-hosted SearXNG instance and returns ranked, de-duplicated results. Cached in-process for MCP_CACHE_TTL seconds. On failure the dict includes an "error" key and "results": [].
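The in-process cache can be pictured as a time-stamped dict keyed by the call's parameters. This is an illustrative sketch, not the actual implementation in `mcp/tools.py`:

```python
import time


class TTLCache:
    """Minimal time-based cache; a TTL of 0 disables caching entirely."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: dict = {}  # key -> (expiry timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() >= expiry:
            del self._store[key]  # stale: evict and report a miss
            return None
        return value

    def set(self, key, value):
        if self.ttl <= 0:  # TTL of 0 disables the cache
            return
        self._store[key] = (time.monotonic() + self.ttl, value)


# Key on the full parameter tuple so different options cache separately
cache = TTLCache(ttl=300)  # MCP_CACHE_TTL
cache.set(("python asyncio", 10, "general", "auto", None), {"results": []})
```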
| param | type | default | notes |
|---|---|---|---|
| `query` | string | required | search query |
| `num_results` | int | `10` | clamped to 1..50 |
| `categories` | string | `"general"` | SearXNG category: `general`, `news`, `science`, `it`, `images`, … |
| `language` | string | `"auto"` | `"en"`, `"vi"`, …; `"auto"` lets SearXNG decide |
| `time_range` | string \| null | `null` | `"day"`, `"week"`, `"month"`, `"year"` |
Returns:
```json
{
  "query": "...",
  "results": [
    { "title": "...", "url": "https://...", "snippet": "...", "engine": "...", "score": 1.0 }
  ],
  "answers": [],
  "suggestions": ["..."],
  "number_of_results": 12345
}
```
Results are de-duplicated by normalized URL and truncated to num_results.
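De-duplication by normalized URL can be sketched as follows. The real normalization in `mcp/tools.py` may differ; this version lowercases the host, drops the fragment, and trims trailing slashes:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Illustrative normalization: lowercase host, drop fragment, trim trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(results: list[dict], num_results: int) -> list[dict]:
    """Keep the first hit per normalized URL, preserving rank order, then truncate."""
    seen: set[str] = set()
    out: list[dict] = []
    for r in results:
        key = normalize_url(r["url"])
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out[:num_results]
```

With this scheme, `https://example.com/a/` and `https://Example.com/a` collapse into one result.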
### web_extractor
Fetches one or more URLs via Crawl4AI and returns clean Markdown. URLs are fetched in parallel up to MAX_CONCURRENCY, max 20 per call. Cached for EXTRACT_CACHE_TTL seconds.
| param | type | default | notes |
|---|---|---|---|
| `urls` | string \| list[string] | required | one URL or a list — max 20 per call |
| `mode` | string | `"fit"` | `fit` (pruned main content), `raw` (full page), `bm25` / `llm` (relevance-filtered — requires `query`) |
| `query` | string \| null | `null` | focus query for `bm25` / `llm` mode |
| `bypass_cache` | bool | `false` | skip the local cache and ask Crawl4AI to re-fetch |
Returns:
```json
{
  "results": [
    {
      "url": "https://...",
      "status": "ok",
      "markdown": "# Heading\n...",
      "word_count": 1234,
      "error": "..."
    }
  ]
}
```
Order matches the input URL list. Per-URL failures surface as `{"status": "error", "error": "...", "markdown": ""}` without aborting the whole call. URLs must use http or https; the `bm25` and `llm` modes return a per-URL error when `query` is missing.
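The fan-out behavior (parallel fetches bounded by MAX_CONCURRENCY, with per-URL errors captured rather than raised) can be sketched with an `asyncio.Semaphore`; `fetch_one` here is a stand-in for the real Crawl4AI request:

```python
import asyncio

MAX_CONCURRENCY = 5


async def fetch_one(url: str) -> str:
    """Stand-in for the Crawl4AI request; the real tool returns extracted Markdown."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must use http or https")
    return f"# Extracted from {url}"


async def extract_all(urls: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)  # at most N requests in flight

    async def worker(url: str) -> dict:
        async with sem:
            try:
                md = await fetch_one(url)
                return {"url": url, "status": "ok", "markdown": md, "error": ""}
            except Exception as exc:  # a per-URL failure does not abort the batch
                return {"url": url, "status": "error", "markdown": "", "error": str(exc)}

    # gather preserves input order, matching the tool's contract
    return await asyncio.gather(*(worker(u) for u in urls))


results = asyncio.run(extract_all(["https://example.com", "ftp://bad.example"]))
```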
The URL policy currently performs lightweight validation only. Crawl4AI's domain/link filters are useful for crawl behavior, but they are not treated as this project's SSRF protection boundary. If you expose the MCP server beyond a trusted local network, put it behind access control and add any deployment-specific URL restrictions in mcp/url_policy.py.
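A deployment-specific restriction in `mcp/url_policy.py` could be as simple as a scheme check plus a host blocklist. This is a sketch; the `validate` function and `BLOCKED_HOSTS` set are illustrative assumptions, not the module's actual API:

```python
from urllib.parse import urlsplit

# Illustrative policy: the names below are assumptions, not the actual
# contents of mcp/url_policy.py.
BLOCKED_HOSTS = {"localhost", "127.0.0.1", "169.254.169.254"}  # e.g. cloud metadata endpoint


def validate(url: str) -> None:
    """Raise ValueError for URLs this deployment refuses to fetch."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = (parts.hostname or "").lower()
    if host in BLOCKED_HOSTS:
        raise ValueError(f"host {host!r} is blocked by deployment policy")
```

Note that a hostname blocklist alone is not full SSRF protection (DNS rebinding and redirects bypass it); it is one layer on top of the access control the text recommends.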
## Dev playground (FastAPI)
Useful for poking the tools from curl without wiring up an MCP client. Not in docker-compose.yml — run on demand as a one-shot container off the existing web-mcp image (the main stack must already be up).
Listens on http://localhost:${PLAYGROUND_PORT} (default 8001). Ctrl-C stops it; container is removed automatically.
| method | path | body | description |
|---|---|---|---|
| GET | `/healthz` | — | liveness probe → `{"ok": true}` |
| POST | `/search` | `SearchReq` | calls `web_search_impl`, same shape as the MCP tool |
| POST | `/extract` | `ExtractReq` | calls `web_extractor_impl`, same shape as the MCP tool |
Bodies mirror the tool signatures one-for-one. Swagger UI is auto-generated at /docs, ReDoc at /redoc.
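As an illustration of the request shape, `SearchReq` mirrors the `web_search` parameters above. This dataclass is a sketch of the fields (including the 1..50 clamp), not the playground's actual model definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchReq:
    """Sketch of the POST /search body; field names follow the web_search table."""
    query: str
    num_results: int = 10
    categories: str = "general"
    language: str = "auto"
    time_range: Optional[str] = None

    def __post_init__(self):
        # web_search clamps num_results to 1..50
        self.num_results = max(1, min(50, self.num_results))
```

A body like `{"query": "asyncio semaphore", "num_results": 5}` maps onto these fields; Swagger UI at `/docs` shows the authoritative schema.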
**Dev-only:** no auth, verbose errors, accepts arbitrary URLs. Don't expose `PLAYGROUND_PORT` publicly.