GEO Index Docs

Technical Documentation · 01

How GEO Index works, end to end.

GEO Index turns ordinary WordPress content into machine-readable schema.org JSON-LD — the structured data that search engines and AI answer engines read to understand a page. It is two cooperating systems: a WordPress plugin that owns the content and the page, and a Python service that does the language-model heavy lifting.

2 cooperating systems · 3 plugin phases · ~75 schema.org types · 3 LLM failover models

01 · The two halves

WordPress plugin

GEOData v1.7.0

Lives inside WordPress (C:\_DEV\Claude\GEOJson2). It indexes the site's content into a custom database table, gives the administrator a UI to classify each post with a schema.org type, calls the Python service to generate the JSON-LD, stores the result, and injects it into the public page <head>.

Prefix gd78_ · table wp_gd78_GEOData · admin-only.

Python API

GetJsonLd FastAPI · v0.2.0

A stateless FastAPI microservice (C:\_DEV\Claude\GEOPythonAPI) reachable at https://geoapipy.com. Given a URL, a target type, and metadata, it fetches the page, extracts the readable content, builds a strictly-constrained LLM prompt, and routes it through a 3-model failover on OpenRouter to return clean JSON-LD.

No database · gunicorn + Nginx · property-whitelisted output.

The contract between them. The plugin always sends the same compact JSON — { url, classification, metadata } — to two endpoints. Everything the API needs to ground its output travels in that single payload. The API owns no state; the plugin owns all of it.

02 · The plugin's three phases

Phase 1 · Index

From the GEO Index screen, the admin clicks Populate and picks which post types to scan. The work runs in AJAX batches of 20 (ajax_init_population → ajax_process_batch) so large sites never time out. Each post becomes one row in wp_gd78_GEOData via GD78_DB::index_post(), capturing its URL, slug, post type, template, title, and a whitelisted slice of post-meta (Yoast / Rank Math fields, featured image, etc.). A synthetic image_url is resolved with a three-tier fallback: featured image → SEO social image → first real in-content image (site chrome like logos and icons is filtered out).
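The batching and the image fallback described above can be sketched as follows. This is an illustrative Python rendering, not the plugin's PHP: the function names, the CHROME_HINTS list, and the filename-based chrome test are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional

BATCH_SIZE = 20  # batch size stated in the docs

# Hypothetical filename hints for "site chrome"; the plugin's real filter is not shown.
CHROME_HINTS = ("logo", "icon", "sprite", "avatar")


def batches(ids: list[int], size: int = BATCH_SIZE) -> Iterator[list[int]]:
    """Yield consecutive slices of at most `size` post IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def is_site_chrome(src: str) -> bool:
    """Guess whether an image is site chrome from its file name (assumed heuristic)."""
    name = src.rsplit("/", 1)[-1].lower()
    return any(hint in name for hint in CHROME_HINTS)


def resolve_image_url(
    featured: Optional[str],
    social: Optional[str],
    content_images: Iterable[str],
) -> Optional[str]:
    """Three-tier fallback: featured image, then SEO social image, then first real content image."""
    if featured:
        return featured
    if social:
        return social
    for src in content_images:
        if not is_site_chrome(src):
            return src
    return None
```

Each batch is small enough to finish inside one request, which is what keeps a long indexing run from hitting a timeout.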

Phase 2 · Classify

On the Classification screen, every indexed row gets a schema.org type from a dropdown sourced from data/SchemaTypes.json (~75 types across groups like Creative Work, Commerce, Events, Food). Admins can filter, search, group by post type or template, and bulk-apply a type to a whole group. Classification can also happen one post at a time from the GEO Index meta box on the post editor. Saving routes through GD78_DB::save_classification(), which is also the cache gatekeeper — see below.

Phase 3 · Generate & inject

From the Schema Writer screen (single rows or "generate for all"), or the post meta box, the plugin POSTs to the API's /GetJsonLd endpoint, validates that the response is real JSON (never an HTML error page), and stores it in the json_ld column with a schema_last_updated stamp. On the public side, GD78_Public::output_json_ld() — hooked to wp_head — emits that stored JSON-LD into the page head for every singular view.
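The "real JSON, never an HTML error page" check can be sketched like this. It is a Python illustration under assumptions: the function names and the ISO-8601 timestamp format are hypothetical, and only the column names json_ld and schema_last_updated come from the plugin.

```python
import json
from datetime import datetime, timezone


def parse_json_ld_response(body: str) -> dict:
    """Accept only a JSON object; reject HTML error pages and other non-JSON bodies."""
    text = body.strip()
    if text.startswith("<"):
        raise ValueError("response looks like HTML, not JSON")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def build_row_update(body: str) -> dict:
    """Return the two column values to store once the body has been validated."""
    data = parse_json_ld_response(body)
    return {
        "json_ld": json.dumps(data, ensure_ascii=False),
        # Timestamp format is an assumption; the plugin's actual format is not shown.
        "schema_last_updated": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

Validating before storing means a proxy error page can never end up in the json_ld column.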

03 · Inside the API pipeline

A request to /GetJsonLd moves through four stages. Each stage fails loudly and specifically so the plugin can show the admin exactly what went wrong.

1. Fetch

extractor.fetch_html() pulls the page with httpx (browser user-agent, follows redirects, 20 s timeout) and verifies that the response is actually HTML.

2. Extract

extract_main_content() runs Trafilatura in precision mode to strip nav/ads/boilerplate; extract_action_links() harvests off-domain anchors for URL-typed properties.
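Harvesting off-domain anchors might look roughly like the sketch below. It uses only the standard library and is not the service's extract_action_links(); the name off_domain_links and the exact filtering rules (http/https only, exact host comparison, de-duplication) are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def off_domain_links(html: str, page_url: str) -> list[str]:
    """Return unique absolute http(s) links whose host differs from the page's host."""
    collector = _AnchorCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if parts.netloc.lower() == page_host:
            continue  # same-site link, not an action link
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```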

3. Prompt

build_prompt() loads the per-type property whitelist from schemadata/fields/{Type}.json, truncates content to 12k chars, and assembles 10 hard rules forbidding invented properties or fabricated facts.
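A rough sketch of the whitelist lookup and truncation follows. Two things are assumed rather than known: that each whitelist file is a JSON array of property names, and the wording of the rules (only three illustrative rules are shown, not the service's ten). The function names are hypothetical.

```python
import json
from pathlib import Path

MAX_CONTENT_CHARS = 12_000  # truncation limit stated in the docs


def load_whitelist(fields_dir: Path, schema_type: str) -> list[str]:
    """Load the allowed property names for one type (file format is an assumption)."""
    if not schema_type.isalnum():
        raise ValueError(f"invalid type name: {schema_type!r}")  # blocks path tricks
    path = fields_dir / f"{schema_type}.json"
    if not path.is_file():
        raise FileNotFoundError(f"no whitelist file for type {schema_type!r}")
    return json.loads(path.read_text(encoding="utf-8"))


def build_prompt_sketch(schema_type: str, allowed: list[str], content: str, metadata: dict) -> str:
    """Assemble a constrained prompt from the whitelist, metadata, and truncated content."""
    rules = [
        f"Use only these properties: {', '.join(allowed)}.",
        "Do not state facts that are absent from the content or metadata.",
        "Return a single JSON object and nothing else.",
    ]
    return "\n".join([
        f"Generate schema.org JSON-LD of type {schema_type}.",
        "Rules:",
        *[f"{i}. {rule}" for i, rule in enumerate(rules, 1)],
        "Metadata:",
        json.dumps(metadata, ensure_ascii=False, sort_keys=True),
        "Content:",
        content[:MAX_CONTENT_CHARS],
    ])
```

A missing whitelist file is the case the error table reports as a 422.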

4. Generate

generate_json_ld() tries each model in order with a 45s timeout, strips code fences, and validates the result parses as a JSON object before returning it.

3-model failover. The order is google/gemini-3.1-flash-lite → openai/gpt-oss-20b → openai/gpt-5.4-nano. If a model times out, errors, or returns non-JSON, the next one is tried. Only if all three fail does the caller get a 502 with a per-model failure list.
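The failover loop can be sketched as below. The model order comes from this page; everything else is an assumption for illustration. In particular, call_model is a hypothetical callable standing in for the OpenRouter request (and is expected to enforce the 45s timeout itself), and AllModelsFailed is a made-up exception that a handler could translate into the 502.

```python
import json
from typing import Callable

MODELS = [
    "google/gemini-3.1-flash-lite",
    "openai/gpt-oss-20b",
    "openai/gpt-5.4-nano",
]

FENCE = "`" * 3  # Markdown code-fence marker


class AllModelsFailed(Exception):
    """Raised when every model fails; carries one failure entry per model."""

    def __init__(self, failures: list[dict]) -> None:
        super().__init__("all models failed")
        self.failures = failures


def strip_code_fences(text: str) -> str:
    """Remove a surrounding Markdown code fence, if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    return text


def generate_with_failover(
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> raw text
    models: list[str] = MODELS,
) -> dict:
    """Try each model in order; return the first output that parses as a JSON object."""
    failures = []
    for model in models:
        try:
            data = json.loads(strip_code_fences(call_model(model, prompt)))
            if not isinstance(data, dict):
                raise ValueError("model output is not a JSON object")
            return data
        except Exception as exc:  # timeout, transport error, or bad output
            failures.append({"model": model, "reason": f"{type(exc).__name__}: {exc}"})
    raise AllModelsFailed(failures)
```

Passing the model call in as a parameter keeps the loop testable without network access.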

04 · The request / response contract

Both API endpoints accept the identical request body. /LLM_Prompt_Tester stops after building the prompt (no model call) and is used by the plugin's API Tester screen for inspection; /GetJsonLd goes all the way to JSON-LD.

Request — sent by the plugin

// POST https://geoapipy.com/GetJsonLd  (Content-Type: application/json)
{
  "url": "https://example.com/best-sourdough",
  "classification": "Recipe",
  "metadata": {
    "_yoast_wpseo_title": "No-Knead Sourdough",
    "image_url": "https://example.com/img/loaf.jpg"
  }
}

Response — the raw JSON-LD body

{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "url": "https://example.com/best-sourdough",
  "name": "No-Knead Sourdough",
  "recipeIngredient": ["500g flour", "10g salt", "…"],
  "image": "https://example.com/img/loaf.jpg"
}

Endpoint | Method | Returns | Plugin uses it for
/GetJsonLd | POST | Raw JSON-LD object (the whole body) | Schema Writer & meta box → stored in json_ld
/LLM_Prompt_Tester | POST | Extraction + assembled llm_prompt (no model call) | API Tester → prompt stored in notes
/healthz | GET | { "status": "ok" } | Load-balancer / uptime checks
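Because both POST endpoints take the same body, a caller can build the request once and vary only the path. The sketch below is a Python stand-in for the plugin's PHP caller; build_request is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request

API_BASE = "https://geoapipy.com"


def build_request(endpoint: str, url: str, classification: str, metadata: dict) -> urllib.request.Request:
    """Build the POST request shared by /GetJsonLd and /LLM_Prompt_Tester."""
    payload = {"url": url, "classification": classification, "metadata": metadata}
    return urllib.request.Request(
        f"{API_BASE}/{endpoint.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be a matter of passing the result to urllib.request.urlopen() with a timeout and then validating the body as JSON.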

When things go wrong

Status | Meaning | Typical cause
422 | Unprocessable | Page couldn't be fetched/extracted, or the classification has no whitelist file
500 | Server error | Unexpected extraction failure
502 | Bad gateway | All three LLM models failed; body lists each model and reason

05 · Storage, caching & invalidation

GEO Index uses durable database caching, not transients. Generated JSON-LD lives permanently in the json_ld column until something changes it, so page loads never trigger an API call — injection is a single, cheap database read.

The one rule that keeps schema honest lives in save_classification(): if an admin changes a post's classification and stored JSON-LD already exists, both json_ld and schema_last_updated are cleared. A Recipe reclassified as an Article can't keep serving recipe markup — the row is flagged for regeneration and the UI tells the admin to re-run it.
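The invalidation rule is small enough to state in a few lines. This is a Python sketch of the logic only; the Row dataclass is a stand-in for one wp_gd78_GEOData record, and the boolean return value is an assumption about how the UI might learn that regeneration is needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Stand-in for one wp_gd78_GEOData record (only the relevant columns)."""
    classification: Optional[str] = None
    json_ld: Optional[str] = None
    schema_last_updated: Optional[str] = None


def save_classification(row: Row, new_type: str) -> bool:
    """Store the new type; clear stale JSON-LD if the type changed.

    Returns True when stored JSON-LD was cleared and must be regenerated.
    """
    changed = row.classification != new_type
    invalidated = changed and bool(row.json_ld)
    row.classification = new_type
    if invalidated:
        row.json_ld = None
        row.schema_last_updated = None
    return invalidated
```

Re-saving the same type leaves the cached JSON-LD untouched, so bulk-applying a type a second time does not force a regeneration.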

Column | Holds
id | WordPress post ID (primary key)
url · slug · path · canonical_url | Where the content lives
post_type · template_name · title · author_id | What the content is
classification | Chosen schema.org type (e.g. Recipe)
post_meta | Whitelisted post-meta as JSON (becomes the API's metadata)
notes | Admin notes; also stores the returned llm_prompt from the API Tester
json_ld | The generated structured data (LONGTEXT)
schema_last_updated | When the JSON-LD was last produced

06 · Safe injection into the page

The public output path is deliberately conservative — it would rather emit nothing than emit broken structured data. output_json_ld() bails silently unless every check passes: front-end context, a singular view, a record exists, and the stored JSON-LD still parses. Before printing, a literal </ is rewritten to <\/ so a stray </script> inside a string value can't close the tag early — a round-trip-safe transform, since JSON parsers read \/ back as /.

<!-- what a visitor (and an AI crawler) receives in <head> -->
<script type="application/ld+json">
{ "@context": "https://schema.org", "@type": "Recipe", … }
</script>
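
Why not escape it? Running the value through esc_html() would HTML-encode the quotes, and passing the already-encoded string through wp_json_encode() would wrap it in a second layer of JSON string escaping; either way the output would no longer be the original JSON. The stored value is trusted: it was validated as JSON both when it was saved and again right before output.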
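The parse check and the "</" rewrite can be demonstrated in a few lines of Python. This mirrors the behaviour described for output_json_ld() but is not the plugin's PHP; render_json_ld_script is a hypothetical name, and the front-end and singular-view checks are omitted.

```python
import json
from typing import Optional


def render_json_ld_script(stored: Optional[str]) -> str:
    """Return the script tag for stored JSON-LD, or an empty string if it is unusable."""
    if not stored:
        return ""  # no record: emit nothing
    try:
        json.loads(stored)
    except json.JSONDecodeError:
        return ""  # stored value no longer parses: emit nothing
    # "\/" is a valid JSON escape for "/", so parsers read the value back unchanged.
    safe = stored.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{safe}\n</script>'
```

A string value containing a closing script tag survives the round trip, while the emitted markup contains only the one genuine closing tag.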

07 · Security & deployment

Plugin hardening

  • Every admin screen & AJAX route checks manage_options
  • Nonce verification on all AJAX (gd78_ajax_nonce)
  • Prepared statements for all SQL
  • Output escaped everywhere except the validated JSON-LD
  • CSV exports guard against formula injection
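For the CSV formula-injection guard, one common approach is to prefix risky cells with a single quote so spreadsheet software treats them as text. The sketch below shows that approach in Python; it is an assumption about the technique, not a description of how the plugin's exporter is written, and both function names are hypothetical.

```python
import csv
import io

# Leading characters that spreadsheet applications may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralise_cell(value: str) -> str:
    """Prefix a quote so a spreadsheet shows the cell as text instead of evaluating it."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def export_csv(rows: list[list[object]]) -> str:
    """Write rows as CSV with every cell neutralised."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([neutralise_cell(str(cell)) for cell in row])
    return buffer.getvalue()
```

One trade-off of this simple rule is that legitimate values starting with a minus sign, such as negative numbers, are also prefixed.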

API deployment

  • gunicorn (3 uvicorn workers) bound to loopback
  • Nginx reverse proxy terminating TLS
  • Stateless — scales horizontally, no DB
  • Per-request logging incl. each model attempt
  • OpenRouter key held in app/config.py (server-side only)
Hardening note. The API endpoints are currently unauthenticated and the OpenRouter key is committed in config.py. Before exposing the service beyond trusted callers, move the key to an environment variable and add a shared secret or IP allow-list between the plugin and the API.
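A minimal sketch of the two suggested mitigations, assuming the key and the shared secret are supplied through environment variables. The variable name OPENROUTER_API_KEY and both function names are hypothetical; how the secret travels (for example in a request header) is left open.

```python
import hmac
import os
from typing import Mapping


def load_openrouter_key(env: Mapping[str, str] = os.environ) -> str:
    """Read the API key from the environment instead of a committed config file."""
    key = env.get("OPENROUTER_API_KEY", "")  # hypothetical variable name
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return key


def is_authorised(presented: str, expected: str) -> bool:
    """Compare the caller's shared secret in constant time; an unset secret rejects everyone."""
    if not expected:
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Failing closed when the expected secret is empty avoids accidentally running an open endpoint after a misconfigured deploy.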
