Configuration

Everything is set via .env (see .env.example).

Environment variables

| var | default | purpose |
| --- | --- | --- |
| SEARXNG_SECRET | (required) | HMAC signing key for SearXNG; not an API key |
| CRAWL4AI_API_TOKEN | (empty) | optional bearer token if Crawl4AI is locked down |
| MCP_TRANSPORT | http | http (streamable-http) or stdio (subprocess) |
| MCP_PORT | 8000 | host port for the MCP server (also used for make smoke) |
| PLAYGROUND_PORT | 8001 | host port for make playground |
| MCP_CACHE_TTL | 300 | web_search cache TTL in seconds (0 disables) |
| EXTRACT_CACHE_TTL | 1800 | web_extractor cache TTL in seconds (0 disables) |
| REQUEST_TIMEOUT | 30 | SearXNG request timeout (s) |
| EXTRACT_TIMEOUT | 60 | Crawl4AI request timeout (s) |
| MAX_CONCURRENCY | 5 | parallel Crawl4AI requests per web_extractor call |
| VALKEY_IMAGE | valkey/valkey:8-alpine | pin the Valkey image |
| SEARXNG_IMAGE | searxng/searxng:latest | pin the SearXNG image |
| CRAWL4AI_IMAGE | unclecode/crawl4ai:latest | pin the Crawl4AI image (recommended for prod) |

After changing MCP_PORT, run make restart (not just up) so the new host-port mapping takes effect.

SEARXNG_SECRET is not an API key — nothing sends it. SearXNG uses it server-side to sign image-proxy URLs (HMAC) and internal tokens; it just needs to be random and stable. install.sh generates it; the SearXNG container won't start without one.
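install.sh is the normal way to get one, but if you ever need to regenerate the secret by hand, any long random string works. For example (openssl is an assumption; any random generator is fine):

```sh
# Append a fresh 64-hex-char secret to .env (keep it stable afterwards)
echo "SEARXNG_SECRET=$(openssl rand -hex 32)" >> .env
```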

Search quality is tuned in searxng/settings.yml (engines enabled, weights, categories) — restart the searxng service after editing.
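For reference, a minimal .env sketch (values are illustrative; unset keys fall back to the defaults in the table above):

```sh
# Illustrative .env; see .env.example for the full list
SEARXNG_SECRET=<generated-by-install.sh>
MCP_TRANSPORT=http
MCP_PORT=8010             # changed from 8000: run make restart afterwards
PLAYGROUND_PORT=8011
MCP_CACHE_TTL=300
EXTRACT_CACHE_TTL=1800
CRAWL4AI_IMAGE=unclecode/crawl4ai:latest   # pin a specific tag for prod
```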

MCP tools

Tool implementations live in mcp/tools.py and are shared with the dev playground.

web_search

Queries the self-hosted SearXNG instance and returns ranked, de-duplicated results. Responses are cached in-process for MCP_CACHE_TTL seconds. On failure the returned dict includes an "error" key and "results": [].

| param | type | default | notes |
| --- | --- | --- | --- |
| query | string | required | search query |
| num_results | int | 10 | clamped to 1..50 |
| categories | string | "general" | SearXNG category: general, news, science, it, images, … |
| language | string | "auto" | "en", "vi", …; "auto" lets SearXNG decide |
| time_range | string \| null | null | "day", "week", "month", "year" |

Returns:

```json
{
  "query": "...",
  "results": [
    { "title": "...", "url": "https://...", "snippet": "...", "engine": "...", "score": 1.0 }
  ],
  "answers": [],
  "suggestions": ["..."],
  "number_of_results": 12345
}
```

Results are de-duplicated by normalized URL and truncated to num_results.
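Since failures keep the same shape (an "error" key plus an empty results list), callers can handle both cases uniformly. A minimal sketch in Python (the helper name is illustrative):

```python
# Minimal consumer of a web_search response dict (shape documented above).
def print_search_results(resp: dict) -> None:
    if "error" in resp:                # failure: {"error": ..., "results": []}
        print(f"search failed: {resp['error']}")
        return
    for r in resp["results"]:          # ranked, de-duplicated, <= num_results
        print(f"{r['score']:.2f}  {r['title']}  <{r['url']}>")
```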

web_extractor

Fetches one or more URLs via Crawl4AI and returns clean Markdown. URLs are fetched in parallel up to MAX_CONCURRENCY, max 20 per call. Cached for EXTRACT_CACHE_TTL seconds.

| param | type | default | notes |
| --- | --- | --- | --- |
| urls | string \| list[string] | required | one URL or a list; max 20 per call |
| mode | string | "fit" | fit (pruned main content), raw (full page), bm25 / llm (relevance-filtered; requires query) |
| query | string \| null | null | focus query for bm25 / llm mode |
| bypass_cache | bool | false | skip the local cache and ask Crawl4AI to re-fetch |

Returns:

```json
{
  "results": [
    {
      "url": "https://...",
      "status": "ok",
      "markdown": "# Heading\n...",
      "word_count": 1234,
      "error": "..."
    }
  ]
}
```

Order matches the input URL list. Per-URL failures surface as {"status": "error", "error": "...", "markdown": ""} without aborting the whole call. URLs must use http or https; the bm25 and llm modes return a per-URL error when query is missing.
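Because errors are per-entry, a caller can keep the successful documents and log the rest. A minimal sketch (the helper name is illustrative; the fields are those documented above):

```python
# Split a web_extractor response into usable Markdown and logged failures.
def collect_markdown(resp: dict) -> dict[str, str]:
    docs: dict[str, str] = {}
    for item in resp["results"]:   # same order as the input URL list
        if item["status"] == "ok":
            docs[item["url"]] = item["markdown"]
        else:                      # {"status": "error", "error": "...", "markdown": ""}
            print(f"skipping {item['url']}: {item['error']}")
    return docs
```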

The URL policy currently performs lightweight validation only. Crawl4AI's domain/link filters are useful for crawl behavior, but they are not treated as this project's SSRF protection boundary. If you expose the MCP server beyond a trusted local network, put it behind access control and add any deployment-specific URL restrictions in mcp/url_policy.py.
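As a starting point for such restrictions, a host allowlist could look like the sketch below. The function name and the allowlist approach are deployment choices, not existing project API; wire it into mcp/url_policy.py however that module expects.

```python
# Hypothetical hardening sketch for mcp/url_policy.py: only allow http(s)
# URLs whose host appears on an explicit allowlist.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.example.com", "example.org"}   # your domains here

def is_url_allowed(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):        # mirrors the existing scheme check
        return False
    return parsed.hostname in ALLOWED_HOSTS
```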

Dev playground (FastAPI)

Useful for poking the tools from curl without wiring up an MCP client. It is not part of docker-compose.yml; run it on demand as a one-shot container from the existing web-mcp image (the main stack must already be up).

```sh
make up           # if not already running
make playground   # alias: make play
```

Listens on http://localhost:${PLAYGROUND_PORT} (default 8001). Ctrl-C stops it; container is removed automatically.

| method | path | body | description |
| --- | --- | --- | --- |
| GET | /healthz | – | liveness probe → {"ok": true} |
| POST | /search | SearchReq | calls web_search_impl, same shape as the MCP tool |
| POST | /extract | ExtractReq | calls web_extractor_impl, same shape as the MCP tool |

Bodies mirror the tool signatures one-for-one. Swagger UI is auto-generated at /docs, ReDoc at /redoc.
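For example, with the default port (request bodies mirror the tool parameters documented above):

```sh
# web_search via the playground
curl -s localhost:8001/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "self-hosted search", "num_results": 5}'

# web_extractor via the playground
curl -s localhost:8001/extract \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com"], "mode": "fit"}'
```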

Dev-only

No auth, verbose errors, accepts arbitrary URLs. Don't expose PLAYGROUND_PORT publicly.