# MCP Web Tools — Self-Hosted Web Search & Extraction MCP Server for AI Agents

Self-hosted Model Context Protocol (MCP) server giving AI agents free `web_search` and `web_extractor` tools — search via SearXNG, extraction via Crawl4AI, no paid API keys.

- Version: 1.0.0
- Last updated: 2026-05
- Repository: https://github.com/datvietvac-techhub/mcp-web-tools
- Documentation: https://datvietvac-techhub.github.io/mcp-web-tools/
- Previous names: agent-web-tool-mcp, Agent Web Tool MCP
- License: Apache-2.0

## Summary

MCP Web Tools is a self-hosted Model Context Protocol (MCP) server that lets developers serve two web tools — `web_search` and `web_extractor` — to their AI agents locally, instead of paying per query for Brave Search API, Tavily, Serper, or Bing. `web_search` is powered by a self-hosted SearXNG metasearch instance aggregating about 70 engines. `web_extractor` is powered by Crawl4AI and returns clean Markdown from web pages. The server is implemented with FastMCP and runs as a Docker Compose stack consisting of SearXNG, Crawl4AI, Valkey, and the MCP server itself. It requires no Brave, Tavily, Serper, Bing, or other third-party API keys.

## Why self-host

Self-hosted MCP Web Tools replaces paid per-query SaaS search APIs with infrastructure you control. Agent prompts and URLs are never sent to a third-party search API; the only traffic that leaves your network is SearXNG's queries to upstream engines and Crawl4AI's page fetches. The whole stack is one `docker compose up` on a dev laptop, internal VM, or shared box; pin the upstream images (`VALKEY_IMAGE`, `SEARXNG_IMAGE`, `CRAWL4AI_IMAGE`) for reproducible deploys. The MCP surface is standard, so the same instance serves Claude Code, Cursor, Hermes, or any MCP-compatible client without code changes.

## Core features

- Self-hosted MCP server for Claude Code, Cursor, and other MCP-compatible clients.
- Two tools: `web_search` and `web_extractor`.
- Search backend: SearXNG, self-hosted metasearch (~70 engines).
- Extraction backend: Crawl4AI, headless Chromium, Markdown output.
- Server framework: FastMCP.
- Deployment: Docker Compose.
- Cache: Valkey for SearXNG, in-process TTL caches for MCP tool responses.
- Transport: Streamable HTTP by default, stdio optional.
- Cost: zero per-query SaaS cost; only local infrastructure cost.
- API keys: none required for web search or extraction.

## Install quickstart

```bash
curl -fsSL https://github.com/datvietvac-techhub/mcp-web-tools/releases/latest/download/install.sh | bash
cd ~/.local/share/mcp-web-tool
make up
make smoke
claude mcp add --transport http web-tool http://localhost:8000/mcp
```

Requirements: Docker Engine, Docker Compose v2, git, and a free host port 8000 for the MCP server. SearXNG and Crawl4AI are internal Docker services by default.

## MCP client config

```json
{
  "mcpServers": {
    "web-tool": {
      "transport": "http",
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

## web_search

`web_search` is an MCP tool that queries a self-hosted SearXNG metasearch instance and returns ranked, de-duplicated JSON results.

Signature:

```python
web_search(
    query: str,
    num_results: int = 10,
    categories: str = "general",
    language: str = "auto",
    time_range: str | None = None,
) -> dict
```

Parameters:

- `query`: required search query.
- `num_results`: maximum number of results, clamped to 1..50; default 10.
- `categories`: SearXNG category such as `general`, `news`, `science`, `it`, `images`.
- `language`: language code such as `en` or `vi`; `auto` lets SearXNG decide.
- `time_range`: optional recency filter: `day`, `week`, `month`, `year`.
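For orientation, here is a minimal sketch of calling `web_search` from Python, assuming a recent version of the official MCP Python SDK (`mcp` package) with its Streamable HTTP client and the quickstart endpoint; reading the result as JSON text content is an assumption about how FastMCP serializes the dict return value, not a documented contract of this project.

```python
# Hypothetical client-side call; assumes the official MCP Python SDK ("mcp" package)
# and the quickstart endpoint http://localhost:8000/mcp.
import asyncio
import json

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main() -> None:
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Invoke the web_search tool with a recency filter.
            result = await session.call_tool(
                "web_search",
                {"query": "model context protocol", "num_results": 5, "time_range": "month"},
            )
            # Assumption: the dict return value arrives as JSON text content.
            payload = json.loads(result.content[0].text)
            for hit in payload.get("results", []):
                print(hit["title"], hit["url"])


asyncio.run(main())
```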
Return shape:

```json
{
  "query": "...",
  "results": [
    {
      "title": "...",
      "url": "https://...",
      "snippet": "...",
      "engine": "...",
      "score": 1.0
    }
  ],
  "answers": [],
  "suggestions": ["..."],
  "number_of_results": 12345
}
```

Failure shape: the response includes an `error` key and `results: []`.

## web_extractor

`web_extractor` is an MCP tool that fetches one or more URLs via Crawl4AI and returns clean Markdown.

Signature:

```python
web_extractor(
    urls: str | list[str],
    mode: str = "fit",
    query: str | None = None,
    bypass_cache: bool = False,
) -> dict
```

Parameters:

- `urls`: a single URL or a list of URLs, at most 20 per call.
- `mode`: `fit`, `raw`, `bm25`, or `llm`.
- `query`: focus query; required for the `bm25` and `llm` modes.
- `bypass_cache`: skip the local cache and ask Crawl4AI to re-fetch.

Return shape:

```json
{
  "results": [
    {
      "url": "https://...",
      "status": "ok",
      "markdown": "# Heading\n...",
      "word_count": 1234,
      "error": "..."
    }
  ]
}
```

Results are returned in the same order as the input URLs. Per-URL failures return `status: "error"` and do not abort the whole batch. URLs must use `http` or `https`; deeper SSRF hardening is intentionally separated into `mcp/url_policy.py` for future optional controls.

## Environment variables

- `SEARXNG_SECRET`: required HMAC signing key for SearXNG, not an API key.
- `CRAWL4AI_API_TOKEN`: optional bearer token for Crawl4AI.
- `MCP_TRANSPORT`: `http` or `stdio`, default `http`.
- `MCP_PORT`: host port for the MCP server, default 8000.
- `PLAYGROUND_PORT`: host port for the FastAPI dev playground, default 8001.
- `MCP_CACHE_TTL`: `web_search` cache TTL in seconds, default 300.
- `EXTRACT_CACHE_TTL`: `web_extractor` cache TTL in seconds, default 1800.
- `REQUEST_TIMEOUT`: SearXNG request timeout in seconds, default 30.
- `EXTRACT_TIMEOUT`: Crawl4AI request timeout in seconds, default 60.
- `MAX_CONCURRENCY`: parallel Crawl4AI requests per call, default 5.
- `VALKEY_IMAGE`, `SEARXNG_IMAGE`, `CRAWL4AI_IMAGE`: image tags for reproducible deploys.

## Architecture

Four services run on a single Docker bridge network named `web-tool-net` (the network and service names retain the project's previous slug for deployment compatibility).

- `web-mcp`: FastMCP server, host port `${MCP_PORT:-8000}`, exposes `/mcp`.
- `searxng`: internal metasearch frontend on port 8080, JSON API enabled.
- `crawl4ai`: internal headless-browser crawler on port 11235, exposes `/md` and `/crawl`, uses `shm_size: 1g`.
- `valkey`: internal cache and rate-limiter backend for SearXNG, persisted in the `valkey-data` volume.

Request flow:

1. The agent calls the MCP server over HTTP `/mcp` or stdio.
2. For `web_search`, the MCP server checks its in-process cache, then calls SearXNG `/search?format=json`, stores the normalized JSON, and returns it to the agent.
3. For `web_extractor`, the MCP server checks its per-URL cache, then calls Crawl4AI `/md` in parallel up to `MAX_CONCURRENCY`, stores Markdown per URL, and returns ordered results.

An end-to-end client sketch of this flow follows the comparison below.

## Comparison

MCP Web Tools is self-hosted, requires no API key, outputs JSON and Markdown, and costs $0 per query (your compute only). Brave Search MCP and Tavily MCP use hosted SaaS APIs, require API keys, send queries to a third-party service, and meter usage on free and paid tiers. Pick MCP Web Tools when zero per-query cost, no vendor lock-in, and query privacy matter; pick Brave or Tavily when you don't want to run infrastructure.
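The sketch below ties the two tools together along the request flow described above: it searches, then extracts the top hit as Markdown. It makes the same assumptions as the earlier `web_search` example (official MCP Python SDK client, default endpoint, dict results serialized as JSON text content); `research` is a hypothetical helper, not part of the repository.

```python
# Hypothetical end-to-end flow: web_search, then web_extractor on the top hit.
# Assumes the official MCP Python SDK and the default http://localhost:8000/mcp endpoint.
import asyncio
import json

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def research(topic: str) -> str:
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Step 1: search for candidate pages.
            search = await session.call_tool("web_search", {"query": topic, "num_results": 3})
            hits = json.loads(search.content[0].text).get("results", [])  # assumed JSON text content
            if not hits:
                return ""
            # Step 2: extract the top hit as Markdown in the default "fit" mode.
            extract = await session.call_tool(
                "web_extractor", {"urls": hits[0]["url"], "mode": "fit"}
            )
            first = json.loads(extract.content[0].text)["results"][0]
            return first.get("markdown", "") if first.get("status") == "ok" else ""


print(asyncio.run(research("SearXNG JSON API")))
```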
## Links

- README: https://github.com/datvietvac-techhub/mcp-web-tools#readme
- Install: https://datvietvac-techhub.github.io/mcp-web-tools/install/
- Configuration: https://datvietvac-techhub.github.io/mcp-web-tools/config/
- Architecture: https://datvietvac-techhub.github.io/mcp-web-tools/architecture/
- Changelog: https://github.com/datvietvac-techhub/mcp-web-tools/blob/main/CHANGELOG.md
- Contributing: https://github.com/datvietvac-techhub/mcp-web-tools/blob/main/CONTRIBUTING.md
- Security: https://github.com/datvietvac-techhub/mcp-web-tools/blob/main/SECURITY.md
- License: https://github.com/datvietvac-techhub/mcp-web-tools/blob/main/LICENSE
- Notices: https://github.com/datvietvac-techhub/mcp-web-tools/blob/main/NOTICE