How It Works
Gostly is a transparent HTTP proxy with three operating modes: LEARN, MOCK, and TRANSITIONING. Understanding the pipeline between them is the key to getting the most out of the tool.
The pipeline
1. LEARN
The proxy forwards every request to your upstream and records the verbatim response. Your app sees no difference.
2. TRANSITION
Recorded traffic is scrubbed, pattern-extracted, and written to the mock library. A brief interstitial mode returns 503 + Retry-After.
3. MOCK
All requests are served from the mock library. No upstream required. Unmatched requests fall through to AI generation if enabled.
LEARN mode: recording traffic
In LEARN mode the proxy is a transparent pass-through. Every inbound request is forwarded to the configured upstream URL. The response is returned to the caller and simultaneously written to a local JSONL file on disk:
# Each line in traffic/{service}.jsonl is one recorded interaction
{
"timestamp": "2026-04-23T09:14:22Z",
"method": "GET",
"uri": "/users/42",
"request_headers": { "accept": "application/json" },
"request_body": null,
"status": 200,
"response_headers": { "content-type": "application/json" },
"response_body": { "id": 42, "name": "Jane Smith", "role": "admin" }
}
On sensitive headers
JSONL files live on the customer's machine and are never transmitted anywhere. The verbatim format preserves full fidelity: tests that pattern-match on specific field values work correctly because the recorded data is production-accurate.
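Because each recording is one JSON object per line, the traffic file can be inspected with a few lines of Python. The following is a minimal sketch, not part of Gostly itself: the function name `summarize_traffic` is hypothetical, and it assumes only the field names shown in the sample record above.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_traffic(lines: Iterable[str]) -> Counter:
    """Count recorded interactions per (method, uri, status)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        counts[(entry["method"], entry["uri"], entry["status"])] += 1
    return counts


# One record in the shape shown above, collapsed onto a single line.
sample = [
    '{"timestamp": "2026-04-23T09:14:22Z", "method": "GET", "uri": "/users/42", '
    '"request_headers": {"accept": "application/json"}, "request_body": null, '
    '"status": 200, "response_headers": {"content-type": "application/json"}, '
    '"response_body": {"id": 42, "name": "Jane Smith", "role": "admin"}}',
]
summary = summarize_traffic(sample)
print(summary)
```

To run it against a real recording, pass an open file handle instead of the sample list, for example `summarize_traffic(open("traffic/users.jsonl"))`, where `users` stands in for your own service name.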
Transition: building the mock library
When you trigger a transition, the API reads the raw JSONL, runs it through a scrub pipeline, and writes the results to the Postgres-backed mock library:
Request/response bodies are scanned for credentials, PII patterns, and any field paths you've configured. Matched values are replaced with [REDACTED]. The scrubbed_at timestamp is set; this is the permanent safety boundary.
URI paths are normalised to templates (e.g. /users/42 → /users/{id}). The extracted patterns drive AI training and smart-swap matching.
Scrubbed entries are inserted into the mock_library table. The proxy is signalled to reload: it reads the library and serves from it on the next request.
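The scrub step can be illustrated with a small sketch. This is not Gostly's actual scrub pipeline: the dotted field-path syntax, the email regex standing in for "PII patterns", and the function names are all assumptions made for illustration. Only the `[REDACTED]` replacement value comes from the description above.

```python
import copy
import re
from typing import Any, Sequence

REDACTED = "[REDACTED]"
# Assumption: a simple email-like pattern stands in for the real PII patterns.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_path(body: Any, path: str) -> None:
    """Replace the value at a dotted field path in place, if it exists."""
    keys = path.split(".")
    node = body
    for key in keys[:-1]:
        if not isinstance(node, dict) or key not in node:
            return
        node = node[key]
    if isinstance(node, dict) and keys[-1] in node:
        node[keys[-1]] = REDACTED


def redact_patterns(value: Any) -> Any:
    """Recursively replace any string value that contains a PII-like match."""
    if isinstance(value, str):
        return REDACTED if EMAIL_RE.search(value) else value
    if isinstance(value, dict):
        return {k: redact_patterns(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_patterns(v) for v in value]
    return value


def scrub(body: Any, field_paths: Sequence[str]) -> Any:
    """Return a scrubbed copy; the recorded body is left untouched."""
    scrubbed = copy.deepcopy(body)
    for path in field_paths:
        redact_path(scrubbed, path)
    return redact_patterns(scrubbed)


body = {"id": 42, "token": "abc", "contact": {"email": "jane@example.com"}}
print(scrub(body, ["token"]))
```

The sketch works on a deep copy so the verbatim recording stays intact, which mirrors the separation between the raw JSONL and the scrubbed library entries.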
During transition the proxy enters TRANSITIONING mode and returns 503 Service Unavailable with a Retry-After header. This is intentional: it prevents partial-library matches during the write.
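A client or test harness that talks to the proxy during a transition can honour the Retry-After header rather than fail on the 503. The sketch below shows one way to do that. The names `call_with_retry` and `retry_delay` are hypothetical, `send` is any callable you supply that performs the request, and the sketch assumes Retry-After carries a number of seconds under exactly that header name (the HTTP-date form falls back to a default delay).

```python
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Mapping[str, str], bytes]


def retry_delay(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds; fall back to `default` if absent or not numeric."""
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        return default


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send`, waiting and retrying while the proxy answers 503."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 503:
            break
        sleep(retry_delay(headers))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. After `max_attempts` tries the last response is returned as-is, so the caller still sees the 503 if the transition has not finished.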
Start a transition via the API (or use the dashboard at localhost:3000):
curl -X POST http://localhost:8000/v1/transition/start
# Returns: { "job_id": "..." }
# Poll until complete
curl http://localhost:8000/v1/transition/{job_id}/status
MOCK mode: serving responses
Switch to MOCK mode via the dashboard or the API:
curl -X POST http://localhost:8000/v1/mode \
-H 'Content-Type: application/json' \
-d '{"mode": "MOCK"}'
In MOCK mode the proxy matches each inbound request against the library using a tiered strategy. Earlier tiers are cheaper; later tiers are more capable:
Exact match
All tiers
Method + URI + request body hash matches a recorded entry exactly. Instant: O(1) hash lookup.
Smart swap
All tiers¹
URI path parameters are normalised to templates (/users/{id}) and matched structurally. A recording of /users/42 will serve a request to /users/99. Enable with SMART_SWAP_ENABLED=true on the proxy.
AI generation
Pro+
No recorded match. A fine-tuned model (or retrieval-augmented generation) generates a realistic response based on recorded patterns for this service.
¹ Smart swap is available on all tiers but requires SMART_SWAP_ENABLED=true on the proxy.
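The first two tiers can be sketched as two dictionary lookups. This is an illustration under stated assumptions rather than the proxy's real matcher: the `MockLibrary` class, SHA-256 as the body hash, and the rule that only purely numeric path segments become `{id}` are all hypothetical choices, and query strings are ignored.

```python
import hashlib
import re
from typing import Dict, Optional, Tuple


def body_hash(body: bytes) -> str:
    """Hash the request body (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(body).hexdigest()


def to_template(uri: str) -> str:
    """Replace purely numeric path segments with {id}, e.g. /users/42 -> /users/{id}."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", uri)


class MockLibrary:
    def __init__(self, smart_swap_enabled: bool = False) -> None:
        self.exact: Dict[Tuple[str, str, str], dict] = {}
        self.by_template: Dict[Tuple[str, str], dict] = {}
        self.smart_swap_enabled = smart_swap_enabled

    def record(self, method: str, uri: str, body: bytes, response: dict) -> None:
        self.exact[(method, uri, body_hash(body))] = response
        # Keep the first recording seen for each template.
        self.by_template.setdefault((method, to_template(uri)), response)

    def match(self, method: str, uri: str, body: bytes = b"") -> Tuple[Optional[dict], str]:
        # Tier 1: exact match on method + URI + body hash (one dict lookup).
        hit = self.exact.get((method, uri, body_hash(body)))
        if hit is not None:
            return hit, "exact"
        # Tier 2: structural match on the URI template, only when enabled.
        if self.smart_swap_enabled:
            hit = self.by_template.get((method, to_template(uri)))
            if hit is not None:
                return hit, "smart_swap"
        # Anything else would fall through to AI generation, if available.
        return None, "unmatched"


lib = MockLibrary(smart_swap_enabled=True)
lib.record("GET", "/users/42", b"", {"id": 42})
print(lib.match("GET", "/users/99"))
```

Trying the cheap exact lookup first means the template normalisation only runs for requests that were never recorded verbatim.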
Chaos injection
AI pipeline (Pro+)
When a request has no recorded match, Gostly routes it to the inference server. The inference server runs two optional modes, both disabled by default and enabled via environment variables:
ENABLE_RAG=true
Loads the all-MiniLM-L6-v2 sentence encoder and builds a per-service semantic index from your mock library. Incoming requests are matched by cosine similarity: above 0.92 the recorded response is replayed directly; between 0.75 and 0.92 it becomes a grounded generation template; at 0.75 or below, the response is generated from scratch. This is the recommended first step.
ENABLE_GENERATION=true
Loads Qwen2.5-0.5B-Instruct (configurable via GEN_MODEL) and serves LoRA adapter responses. For teams with 50+ recorded interactions per endpoint, optional fine-tuning produces a per-service adapter that improves consistency. Requires ~2 GB RAM; the first request after startup may briefly return 503 while the model loads.
The AI pipeline is entirely local: the inference server runs inside your Docker stack. No request bodies or response contents are sent to any external model provider.
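The similarity thresholds described for ENABLE_RAG amount to a three-way routing decision. The sketch below shows that decision with a plain-Python cosine similarity. The function names and the route labels are hypothetical, and the real inference server computes similarity over sentence-encoder embeddings rather than the toy vectors used here. Only the 0.92 and 0.75 thresholds come from the text.

```python
import math
from typing import Sequence

REPLAY_THRESHOLD = 0.92    # above this, replay the recorded response
GROUNDED_THRESHOLD = 0.75  # above this, use the recording as a generation template


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(similarity: float) -> str:
    """Pick a handling strategy from the best similarity score found."""
    if similarity > REPLAY_THRESHOLD:
        return "replay"
    if similarity > GROUNDED_THRESHOLD:
        return "grounded_generation"
    return "pure_generation"


# Toy 2-dimensional vectors standing in for request embeddings.
print(route(cosine_similarity([1.0, 0.0], [1.0, 0.0])))
print(route(cosine_similarity([1.0, 0.0], [0.0, 1.0])))
```

Both comparisons are strict, matching the "above 0.92" and "above 0.75" wording, so a score of exactly 0.75 falls into pure generation.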