Build a RAG Pipeline Inside Joomla for Intelligent Site Search

PHP CMS Frameworks March 31, 2026

Joomla's built-in search has always had the same fundamental limitation. It is keyword-based. A visitor types "how do I reset my account" and the search engine looks for articles containing those exact words. If your article uses the phrase "recover your login credentials" instead, it does not show up. The visitor gets no results, concludes your site does not have the answer, and leaves.

This is not a Joomla problem specifically. It is what keyword search does. It matches strings, not meaning. RAG, Retrieval-Augmented Generation, solves this at the architecture level. Instead of matching keywords, it converts both your content and the search query into vector embeddings, finds content that is semantically similar, and uses an LLM to generate a direct answer from that content. A visitor asking "how do I reset my account" gets a proper answer even if none of your articles use those exact words.

I will walk through the full implementation. We will cover the three main vector storage options honestly so you can make the right choice for your setup, then go deep on building the complete RAG pipeline inside a custom Joomla component using PostgreSQL with pgvector and OpenAI.

What you need: Joomla 4 or 5, PHP 8.1+, Composer, PostgreSQL with the pgvector extension installed, and an OpenAI API key.

What RAG Actually Does, Step by Step

Before writing any code, it is worth being clear about what the pipeline actually does at each stage. RAG is one of those terms that gets thrown around loosely, and the implementation details matter.

Phase 1: Indexing (runs once, then on content updates)
        ↓
Fetch all published Joomla articles
        ↓
Split long articles into chunks
        ↓
Send each chunk to OpenAI Embeddings API
        ↓
Store the chunk text and its embedding vector in PostgreSQL

Phase 2: Search (runs on every user query)
        ↓
User submits a search query
        ↓
Convert query to an embedding vector via OpenAI
        ↓
Find the most semantically similar chunks using pgvector
        ↓
Send retrieved chunks plus the original query to GPT-4o
        ↓
GPT-4o generates a direct answer grounded in your content
        ↓
Return the answer and source article links to the user

The indexing phase is the slower one and only needs to run when content changes. The search phase is what your visitors experience and it needs to be fast. Keeping those two concerns separate in the architecture makes both easier to manage.

Three Vector Storage Options for Joomla

This is the decision that shapes the rest of the implementation. There is no universally correct answer here; the right choice depends on your infrastructure, team, and content volume.

Option 1: PostgreSQL with pgvector

pgvector is an open source PostgreSQL extension that adds a native vector data type and similarity search operators. You store embeddings directly in a PostgreSQL table alongside your chunk text and metadata. Similarity search runs as a standard SQL query using the cosine distance operator.

The big advantage is that you are not adding a new infrastructure dependency. If you are already running PostgreSQL, this is just an extension install and a new table. Queries are fast, the data lives in your existing database stack, and you have full control. The limitation is that at very large scale, hundreds of thousands of chunks, you need to tune the index carefully to maintain query speed.

This is what we are building in this post. It is the right default for most Joomla sites.

Option 2: MySQL with a Vector Similarity Workaround

Joomla ships with MySQL as its default database, so this is the path of least resistance from an infrastructure standpoint. MySQL 9.0 introduced a VECTOR column type, but its similarity-search support is still limited and not production-ready for most self-hosted setups. The practical workaround is to store embeddings as JSON or a serialised float array, fetch candidate chunks using a broad text filter, then do the cosine similarity calculation in PHP.

This works for small content sets of a few hundred articles. It gets slow quickly as content volume grows because you are doing the similarity math in PHP rather than in an optimised database index. If your Joomla site runs MySQL and you cannot add PostgreSQL, it is a viable starting point, but plan for a migration if search volume grows.
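To make the workaround concrete, here is a sketch of the in-application similarity step, shown in Python for brevity (a PHP implementation would do the same arithmetic; function and variable names here are illustrative): deserialise each candidate row's stored embedding, compute cosine similarity against the query embedding, and keep the top results.

```python
import json
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_embedding, rows, top_k=5):
    """rows: (chunk_text, embedding_json) tuples fetched with a broad SQL filter."""
    scored = [
        (text, cosine_similarity(query_embedding, json.loads(embedding_json)))
        for text, embedding_json in rows
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy candidate rows as they might come back from MySQL
rows = [
    ("reset password guide", json.dumps([0.9, 0.1, 0.0])),
    ("pricing overview",     json.dumps([0.0, 0.2, 0.9])),
]
best = rank_candidates([1.0, 0.0, 0.0], rows, top_k=1)
print(best[0][0])  # prints "reset password guide"
```

Every query scans and scores all candidates in application code, which is O(n) per search. That is exactly why this approach degrades as the chunk count grows, while a pgvector index does not.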

Option 3: External Vector Store, Pinecone or Qdrant

Pinecone and Qdrant are purpose-built vector databases. You send embeddings to their API, they handle storage and indexing, and you query them via HTTP. Both have generous free tiers for getting started.

The advantage is performance at scale and zero infrastructure management on your end. The disadvantages are an additional external dependency, data leaving your infrastructure, API rate limits, and ongoing costs that grow with your content volume. For enterprise Joomla sites with strict data residency requirements, an external service is often a non-starter.

Good fit for teams that want to move fast without managing PostgreSQL, or sites with very high search volume where a dedicated vector store makes sense operationally.
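To give a feel for the "query via HTTP" model, here is a sketch of building a similarity-search request against Qdrant. The endpoint shape follows Qdrant's REST search API at the time of writing (newer versions also offer a unified `/points/query` endpoint, so check the current docs); the cluster URL and collection name are hypothetical placeholders.

```python
import json

QDRANT_URL = "https://YOUR-CLUSTER.qdrant.io"  # hypothetical cluster URL
COLLECTION = "joomla_chunks"                   # hypothetical collection name

def build_search_request(query_embedding, top_k=5):
    """Build the URL and JSON body for a Qdrant similarity search:
    POST /collections/<name>/points/search with the query vector
    and a result limit in the body."""
    url = f"{QDRANT_URL}/collections/{COLLECTION}/points/search"
    body = {
        "vector": query_embedding,
        "limit": top_k,
        "with_payload": True,  # also return stored chunk text / metadata
    }
    return url, json.dumps(body)

url, body = build_search_request([0.1, 0.2, 0.3], top_k=3)
print(url)
```

The structural point: the whole "vector store" becomes one authenticated HTTP call, which is the appeal, and also the external dependency the previous paragraph warns about.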

We are going with pgvector. Here is the full build.

Install pgvector and Set Up the Database Table

First, install the pgvector extension in your PostgreSQL database. If you have superuser access:

CREATE EXTENSION IF NOT EXISTS vector;

If you are on a managed PostgreSQL service such as AWS RDS or Supabase, pgvector can usually be enabled from the provider's console without superuser access.

Create the table that will store your article chunks and their embeddings. Run this in your PostgreSQL database; note that this is separate from the MySQL database Joomla itself uses:

CREATE TABLE joomla_article_embeddings (
    id           SERIAL PRIMARY KEY,
    article_id   INTEGER NOT NULL,
    article_title TEXT NOT NULL,
    chunk_index  INTEGER NOT NULL,
    chunk_text   TEXT NOT NULL,
    embedding    vector(1536),
    url          TEXT,
    created_at   TIMESTAMP DEFAULT NOW(),
    updated_at   TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON joomla_article_embeddings
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

CREATE INDEX ON joomla_article_embeddings (article_id);

The embedding dimension is 1536 because that is what OpenAI's text-embedding-3-small model outputs. If you use text-embedding-3-large instead, change this to 3072. The ivfflat index is what makes similarity search fast at scale. The lists value of 100 is a reasonable starting point; the pgvector documentation suggests lists ≈ rows / 1000 for tables up to around a million rows, so tune it upward as your chunk count grows past 100,000.

Custom Joomla Component Structure

We will build this as a custom Joomla component. Create the following structure under components/com_ragsearch:

components/com_ragsearch/
    ragsearch.xml
    src/
        Service/
            OpenAIService.php
            VectorStoreService.php
            ArticleChunkerService.php
            RAGSearchService.php
        Controller/
            SearchController.php
        View/
            Search/
                HtmlView.php
                tmpl/
                    default.php
    tmpl/
        index.php

administrator/components/com_ragsearch/
    src/
        Controller/
            IndexController.php

Install the OpenAI PHP client and a PostgreSQL driver via Composer in your Joomla root:

composer require openai-php/client
composer require doctrine/dbal

The OpenAI Service

Create src/Service/OpenAIService.php:

<?php

namespace Joomla\Component\Ragsearch\Site\Service;

use OpenAI;

class OpenAIService
{
    private $client;

    public function __construct()
    {
        $params      = \JComponentHelper::getParams('com_ragsearch');
        $apiKey      = $params->get('openai_api_key');
        $this->client = OpenAI::client($apiKey);
    }

    public function embed(string $text): array
    {
        $response = $this->client->embeddings()->create([
            'model' => 'text-embedding-3-small',
            'input' => $text,
        ]);

        return $response->embeddings[0]->embedding;
    }

    public function embedBatch(array $texts): array
    {
        $response = $this->client->embeddings()->create([
            'model' => 'text-embedding-3-small',
            'input' => $texts,
        ]);

        $embeddings = [];
        foreach ($response->embeddings as $item) {
            $embeddings[$item->index] = $item->embedding;
        }

        return $embeddings;
    }

    public function generateAnswer(string $query, array $chunks): string
    {
        $context = implode("\n\n---\n\n", array_column($chunks, 'chunk_text'));

        $response = $this->client->chat()->create([
            'model'       => 'gpt-4o',
            'temperature' => 0.3,
            'max_tokens'  => 600,
            'messages'    => [
                [
                    'role'    => 'system',
                    'content' => 'You are a helpful site assistant. Answer the user question '
                               . 'using only the content provided below. If the content does '
                               . 'not contain enough information to answer, say so honestly. '
                               . 'Do not make up information. Keep answers clear and concise.',
                ],
                [
                    'role'    => 'user',
                    'content' => "Content from our site:\n\n{$context}\n\nQuestion: {$query}",
                ],
            ],
        ]);

        return $response->choices[0]->message->content;
    }
}

Notice the embedBatch method. When indexing articles, sending texts in batches rather than one at a time cuts the number of API calls significantly and speeds up the indexing process. Use it during the indexing phase, and use embed for single query embeddings at search time.

The Article Chunker Service

Long articles need to be split into chunks before embedding. Embedding an entire 3,000-word article as a single vector produces a representation that is too diffuse to be useful for retrieval. Smaller focused chunks give the similarity search something meaningful to match against.

Create src/Service/ArticleChunkerService.php:

<?php

namespace Joomla\Component\Ragsearch\Site\Service;

class ArticleChunkerService
{
    private int $chunkSize    = 400;
    private int $chunkOverlap = 50;

    public function chunk(string $text): array
    {
        // Strip HTML tags from article body
        $clean = strip_tags($text);

        // Normalise whitespace
        $clean = preg_replace('/\s+/', ' ', $clean);
        $clean = trim($clean);

        $words  = explode(' ', $clean);
        $total  = count($words);
        $chunks = [];
        $start  = 0;

        while ($start < $total) {
            $end        = min($start + $this->chunkSize, $total);
            $chunkWords = array_slice($words, $start, $end - $start);
            $chunks[]   = implode(' ', $chunkWords);

            // Stop once the last word has been consumed; advancing by the
            // overlap step here would emit a short tail chunk that is
            // entirely contained in the previous chunk
            if ($end >= $total) {
                break;
            }

            // Move forward by chunkSize minus overlap
            // so consecutive chunks share some context
            $start += ($this->chunkSize - $this->chunkOverlap);
        }

        // Drop fragments under 50 characters and reindex keys so callers
        // get a clean 0..n-1 sequence
        return array_values(array_filter($chunks, fn($c) => strlen(trim($c)) > 50));
    }
}

The overlap between chunks matters more than it might seem. If a key sentence sits right at the boundary between two chunks, without overlap it gets split in half and neither chunk represents that idea well. A 50-word overlap means boundary content appears in both adjacent chunks, so the similarity search is more likely to retrieve it when it is relevant.
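A toy sketch of the sliding window, shown in Python with a tiny chunk size so the effect is visible, demonstrates how boundary words land in both adjacent chunks:

```python
def chunk_words(words, size, overlap):
    """Sliding-window chunking: each chunk holds `size` words, and
    consecutive chunks share `overlap` words so boundary content
    is represented on both sides of the split."""
    chunks, start = [], 0
    while start < len(words):
        end = min(start + size, len(words))
        chunks.append(words[start:end])
        if end >= len(words):
            break
        start += size - overlap
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, size=6, overlap=2)
# chunk 1: w0..w5, chunk 2: w4..w9 -> w4 and w5 appear in both
print(chunks)
```

With size 400 and overlap 50, the same logic means any sentence near a chunk boundary still appears intact in at least one chunk.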

The Vector Store Service

Create src/Service/VectorStoreService.php:

<?php

namespace Joomla\Component\Ragsearch\Site\Service;

use Doctrine\DBAL\DriverManager;

class VectorStoreService
{
    private $conn;

    public function __construct()
    {
        $params      = \JComponentHelper::getParams('com_ragsearch');

        $this->conn = DriverManager::getConnection([
            'dbname'   => $params->get('pg_database'),
            'user'     => $params->get('pg_user'),
            'password' => $params->get('pg_password'),
            'host'     => $params->get('pg_host', 'localhost'),
            'port'     => $params->get('pg_port', 5432),
            'driver'   => 'pdo_pgsql',
        ]);
    }

    public function upsertChunk(
        int    $articleId,
        string $title,
        int    $chunkIndex,
        string $chunkText,
        array  $embedding,
        string $url
    ): void {
        // Delete existing chunks for this article and index first
        $this->conn->executeStatement(
            'DELETE FROM joomla_article_embeddings
             WHERE article_id = :id AND chunk_index = :idx',
            ['id' => $articleId, 'idx' => $chunkIndex]
        );

        $vectorLiteral = '[' . implode(',', $embedding) . ']';

        $this->conn->executeStatement(
            'INSERT INTO joomla_article_embeddings
                (article_id, article_title, chunk_index, chunk_text, embedding, url)
             VALUES
                (:article_id, :title, :chunk_index, :chunk_text, :embedding, :url)',
            [
                'article_id'  => $articleId,
                'title'       => $title,
                'chunk_index' => $chunkIndex,
                'chunk_text'  => $chunkText,
                'embedding'   => $vectorLiteral,
                'url'         => $url,
            ]
        );
    }

    public function similaritySearch(array $queryEmbedding, int $topK = 5): array
    {
        $vectorLiteral = '[' . implode(',', $queryEmbedding) . ']';

        // CAST(:embedding AS vector) avoids the :param::vector shorthand,
        // which some DBAL versions misparse; the limit is bound as an
        // integer so PostgreSQL accepts it in the LIMIT clause
        $sql = "SELECT
                    article_id,
                    article_title,
                    chunk_text,
                    url,
                    1 - (embedding <=> CAST(:embedding AS vector)) AS similarity
                FROM joomla_article_embeddings
                ORDER BY embedding <=> CAST(:embedding AS vector)
                LIMIT :limit";

        $stmt = $this->conn->executeQuery(
            $sql,
            [
                'embedding' => $vectorLiteral,
                'limit'     => $topK,
            ],
            [
                'limit' => \Doctrine\DBAL\ParameterType::INTEGER,
            ]
        );

        return $stmt->fetchAllAssociative();
    }

    public function deleteArticle(int $articleId): void
    {
        $this->conn->executeStatement(
            'DELETE FROM joomla_article_embeddings WHERE article_id = :id',
            ['id' => $articleId]
        );
    }
}

The <=> operator is pgvector's cosine distance operator. Cosine distance measures the angle between two vectors rather than the straight-line distance between them, which works better for text embeddings because it focuses on direction (which carries the meaning) rather than magnitude. The similarity score in the SELECT is calculated as 1 - cosine_distance, so a score of 1.0 is a perfect match and 0.0 means the vectors are unrelated.
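To see the direction-versus-magnitude point concretely, here is a quick Python check: scaling a vector does not change its cosine distance to another vector, while a perpendicular vector lands at distance 1.0. (pgvector's <=> operator computes the same quantity, just inside the database.)

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cos(angle between a and b),
    the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

v = [0.3, 0.4]
w = [0.6, 0.8]   # same direction as v, twice the magnitude
u = [-0.4, 0.3]  # perpendicular to v

print(round(cosine_distance(v, w), 6))  # 0.0 -> similarity 1.0, same "meaning"
print(round(cosine_distance(v, u), 6))  # 1.0 -> similarity 0.0, unrelated
```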

The Indexing Controller

This runs from the Joomla administrator backend. It fetches all published articles, chunks them, embeds them in batches, and stores everything in PostgreSQL. You run this once to build the initial index and then on a schedule or via a hook when articles are updated.

Create administrator/components/com_ragsearch/src/Controller/IndexController.php:

<?php

namespace Joomla\Component\Ragsearch\Administrator\Controller;

use Joomla\CMS\MVC\Controller\BaseController;
use Joomla\Component\Ragsearch\Site\Service\OpenAIService;
use Joomla\Component\Ragsearch\Site\Service\VectorStoreService;
use Joomla\Component\Ragsearch\Site\Service\ArticleChunkerService;

class IndexController extends BaseController
{
    private int $batchSize = 20;

    public function build(): void
    {
        // Fetch Joomla's database driver from the DI container
        $db      = \Joomla\CMS\Factory::getContainer()->get(\Joomla\Database\DatabaseInterface::class);
        $chunker = new ArticleChunkerService();
        $openai  = new OpenAIService();
        $store   = new VectorStoreService();

        // Fetch all published Joomla articles
        $query = $db->getQuery(true)
            ->select(['a.id', 'a.title', 'a.introtext', 'a.fulltext'])
            ->from($db->quoteName('#__content', 'a'))
            ->where($db->quoteName('a.state') . ' = 1');

        $articles = $db->setQuery($query)->loadObjectList();

        $indexed = 0;
        $errors  = 0;

        foreach ($articles as $article) {
            try {
                $fullContent = $article->title . "\n\n"
                             . strip_tags($article->introtext) . "\n\n"
                             . strip_tags($article->fulltext);

                $chunks = $chunker->chunk($fullContent);

                if (empty($chunks)) {
                    continue;
                }

                // Route::link('site', ...) builds a front-end URL even though
                // this controller runs in the administrator application
                $url = \Joomla\CMS\Router\Route::link(
                    'site',
                    'index.php?option=com_content&view=article&id=' . $article->id
                );

                // Delete old embeddings for this article before re-indexing
                $store->deleteArticle($article->id);

                // Process chunks in batches to reduce API calls
                $chunkBatches = array_chunk($chunks, $this->batchSize);

                foreach ($chunkBatches as $batchIndex => $batch) {
                    $embeddings = $openai->embedBatch($batch);

                    foreach ($batch as $i => $chunkText) {
                        $globalIndex = ($batchIndex * $this->batchSize) + $i;
                        $embedding   = $embeddings[$i] ?? null;

                        if (!$embedding) {
                            continue;
                        }

                        $store->upsertChunk(
                            $article->id,
                            $article->title,
                            $globalIndex,
                            $chunkText,
                            $embedding,
                            $url
                        );
                    }

                    // Small pause between batches to stay within API rate limits
                    usleep(200000);
                }

                $indexed++;

            } catch (\Exception $e) {
                $errors++;
                \JLog::add(
                    'RAG indexing failed for article ' . $article->id . ': ' . $e->getMessage(),
                    \JLog::ERROR,
                    'com_ragsearch'
                );
            }
        }

        $this->app->enqueueMessage(
            "Indexing complete. Articles indexed: {$indexed}. Errors: {$errors}.",
            $errors > 0 ? 'warning' : 'success'
        );

        $this->setRedirect('index.php?option=com_ragsearch');
    }
}

The RAG Search Service

Create src/Service/RAGSearchService.php:

<?php

namespace Joomla\Component\Ragsearch\Site\Service;

class RAGSearchService
{
    public function __construct(
        private OpenAIService     $openai,
        private VectorStoreService $store
    ) {}

    public function search(string $query): array
    {
        if (strlen(trim($query)) < 3) {
            return [
                'answer'  => 'Please enter a more specific question.',
                'sources' => [],
            ];
        }

        // Convert the query to a vector embedding
        $queryEmbedding = $this->openai->embed($query);

        // Find the most semantically similar chunks
        $chunks = $this->store->similaritySearch($queryEmbedding, topK: 5);

        if (empty($chunks)) {
            return [
                'answer'  => 'No relevant content found for your query. Try rephrasing your question.',
                'sources' => [],
            ];
        }

        // Filter out low-similarity results
        $relevantChunks = array_filter(
            $chunks,
            fn($c) => ($c['similarity'] ?? 0) > 0.75
        );

        if (empty($relevantChunks)) {
            return [
                'answer'  => 'I could not find content closely matching your question. Please try different keywords.',
                'sources' => [],
            ];
        }

        // Generate a direct answer grounded in the retrieved chunks
        $answer = $this->openai->generateAnswer($query, $relevantChunks);

        // Deduplicate sources by article ID
        $sources = [];
        foreach ($relevantChunks as $chunk) {
            $aid = $chunk['article_id'];
            if (!isset($sources[$aid])) {
                $sources[$aid] = [
                    'title' => $chunk['article_title'],
                    'url'   => $chunk['url'],
                ];
            }
        }

        return [
            'answer'  => $answer,
            'sources' => array_values($sources),
        ];
    }
}

The similarity threshold of 0.75 is worth paying attention to. Below that score the retrieved chunks are probably not relevant enough to be useful for generating an answer. Bear in mind that absolute similarity values vary by embedding model; the text-embedding-3 family tends to score relevant matches lower than older models did, so if every result is being filtered out, try a lower threshold before assuming your content does not match. Start at 0.75 and tune based on real search results.

The Search Controller and View

Create src/Controller/SearchController.php:

<?php

namespace Joomla\Component\Ragsearch\Site\Controller;

use Joomla\CMS\MVC\Controller\BaseController;
use Joomla\Component\Ragsearch\Site\Service\OpenAIService;
use Joomla\Component\Ragsearch\Site\Service\VectorStoreService;
use Joomla\Component\Ragsearch\Site\Service\RAGSearchService;

class SearchController extends BaseController
{
    public function search(): void
    {
        // The search form posts a CSRF token, so validate it before doing work
        $this->checkToken();

        $query = trim($this->input->getString('q', ''));

        $result = ['answer' => '', 'sources' => [], 'query' => $query];

        if (!empty($query)) {
            try {
                $service = new RAGSearchService(
                    new OpenAIService(),
                    new VectorStoreService()
                );

                $result = array_merge($result, $service->search($query));

            } catch (\Exception $e) {
                $result['answer'] = 'Search is temporarily unavailable. Please try again shortly.';
                \JLog::add('RAG search error: ' . $e->getMessage(), \JLog::ERROR, 'com_ragsearch');
            }
        }

        $this->app->setUserState('com_ragsearch.result', $result);
        $this->setRedirect(\JRoute::_('index.php?option=com_ragsearch&view=search'));
    }
}

The Joomla view template at src/View/Search/tmpl/default.php:

<?php defined('_JEXEC') or die; ?>

<div class="rag-search">

    <form method="POST" action="<?php echo JRoute::_('index.php?option=com_ragsearch&task=search.search'); ?>">
        <?php echo JHtml::_('form.token'); ?>
        <input type="text"
               name="q"
               value="<?php echo htmlspecialchars($this->result['query'] ?? ''); ?>"
               placeholder="Ask anything about our site..."
               autocomplete="off">
        <button type="submit">Search</button>
    </form>

    <?php if (!empty($this->result['answer'])) : ?>

        <div class="rag-answer">
            <h3>Answer</h3>
            <p><?php echo nl2br(htmlspecialchars($this->result['answer'])); ?></p>
        </div>

        <?php if (!empty($this->result['sources'])) : ?>
            <div class="rag-sources">
                <h4>Sources</h4>
                <ul>
                    <?php foreach ($this->result['sources'] as $source) : ?>
                        <li>
                            <a href="<?php echo htmlspecialchars($source['url']); ?>">
                                <?php echo htmlspecialchars($source['title']); ?>
                            </a>
                        </li>
                    <?php endforeach; ?>
                </ul>
            </div>
        <?php endif; ?>

    <?php endif; ?>

</div>

Keeping the Index Fresh

The index goes stale the moment an article is updated and not re-indexed. There are two clean ways to handle this in Joomla.

The first is a Joomla plugin that hooks into onContentAfterSave and triggers re-indexing for the saved article specifically. This keeps the index fresh in real time but adds latency to every article save operation.

<?php

use Joomla\CMS\Plugin\CMSPlugin;
use Joomla\CMS\Router\Route;
use Joomla\Component\Ragsearch\Site\Service\ArticleChunkerService;
use Joomla\Component\Ragsearch\Site\Service\OpenAIService;
use Joomla\Component\Ragsearch\Site\Service\VectorStoreService;

class PlgContentRagsearchIndex extends CMSPlugin
{
    public function onContentAfterSave(
        string $context,
        object $article,
        bool   $isNew
    ): void {
        if ($context !== 'com_content.article') {
            return;
        }

        if ((int) $article->state !== 1) {
            return;
        }

        // Joomla has no built-in job queue, so re-index this one article
        // synchronously. A single article is only a handful of embedding
        // calls, so the extra latency on save is usually acceptable.
        $chunker = new ArticleChunkerService();
        $openai  = new OpenAIService();
        $store   = new VectorStoreService();

        $store->deleteArticle((int) $article->id);

        $chunks = array_values($chunker->chunk(
            $article->title . "\n\n" . $article->introtext . "\n\n" . $article->fulltext
        ));

        if (empty($chunks)) {
            return;
        }

        $url = Route::link(
            'site',
            'index.php?option=com_content&view=article&id=' . $article->id
        );

        foreach ($openai->embedBatch($chunks) as $i => $embedding) {
            $store->upsertChunk((int) $article->id, $article->title, $i, $chunks[$i], $embedding, $url);
        }
    }
}

The second approach is a scheduled CLI task that re-indexes all articles on a schedule, say every hour or every night. For sites where content does not change frequently, a nightly re-index via Joomla's task scheduler is simpler and puts zero overhead on the save operation.

For most sites the scheduled approach is the right default. Use the plugin approach only if your content changes continuously throughout the day and freshness matters within minutes.

What This Looks Like for a Real Visitor

Here is a concrete example. Say your Joomla site has articles about software products and a visitor types: "what happens if I cancel my subscription mid-month?"

Your articles probably use phrases like "pro-rata refund policy", "billing cycle", "account downgrade", not the exact words the visitor used. Keyword search returns nothing. The RAG pipeline converts the query to a vector, finds three chunks from your billing and account articles that are semantically close to that question, feeds them to GPT-4o, and returns something like:

"If you cancel mid-month, your account remains active until the end of your current billing period. You will not be charged for the following month. Refunds for unused days are not issued automatically but can be requested within 7 days of cancellation by contacting support."

Below the answer, the visitor sees links to the two source articles that information came from. They got a direct answer, they can read the full policy if they want to, and they did not have to trawl through search results guessing which article might be relevant.

A Few Things to Know Before Going Live

The embedding cost for the initial indexing run is usually smaller than people expect. A site with 500 articles at 400 words each, chunked at 400 words with a 50-word overlap, produces somewhere on the order of 500 to 1,000 chunks depending on exact article lengths. At OpenAI's current pricing for text-embedding-3-small, that initial index costs well under a dollar. Ongoing costs per search query are minimal, one embedding call per query.
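The arithmetic behind that estimate, as a quick sketch (the $0.02 per million tokens figure is OpenAI's published text-embedding-3-small rate at the time of writing; verify against current pricing, and the 1.3 tokens-per-word ratio is a rough rule of thumb for English):

```python
articles = 500
words_per_article = 400
chunk_size, overlap = 400, 50

# With 400-word articles and 400-word chunks, each article is roughly one chunk
chunks = articles * max(1, round(words_per_article / (chunk_size - overlap)))

tokens_per_chunk = 400 * 1.3  # ~1.3 tokens per English word, rough heuristic
total_tokens = chunks * tokens_per_chunk

price_per_million = 0.02  # USD per 1M tokens, text-embedding-3-small (assumed)
cost = total_tokens / 1_000_000 * price_per_million
print(f"~{chunks} chunks, ~{total_tokens:,.0f} tokens, ~${cost:.4f}")
```

Even if your articles run several times longer, the total stays in the cents range, which is why the indexing bill is rarely the thing to optimise first.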

Caching search results is worth adding early. Many visitors on the same site ask very similar questions. Store recent query-answer pairs in Joomla's cache layer with a TTL of a few hours. The cache hit rate on popular queries tends to be high and it cuts both API costs and response time meaningfully.
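A minimal sketch of the caching idea, in Python for illustration (in the component itself, Joomla's own cache layer would be the natural home; all names here are illustrative): normalise the query so trivially different phrasings share an entry, and expire entries after a TTL.

```python
import hashlib
import time

class QueryCache:
    """Tiny in-memory TTL cache for query -> answer pairs."""

    def __init__(self, ttl_seconds=3 * 3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, query: str) -> str:
        # Lowercase and collapse whitespace so near-identical
        # phrasings map to the same cache entry
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get(self, query):
        entry = self.store.get(self._key(query))
        if entry is None:
            return None
        answer, expires = entry
        if time.time() > expires:
            return None
        return answer

    def set(self, query, answer):
        self.store[self._key(query)] = (answer, time.time() + self.ttl)

cache = QueryCache()
cache.set("How do I  reset my password?", "Use the reset link on the login page.")
print(cache.get("how do i reset my password?"))  # hit despite different casing/spacing
```

A cache hit skips both the embedding call and the GPT-4o call, which is where the cost and latency savings come from.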

Finally, keep an eye on what people actually search for. Log the queries, log whether the similarity search returned results above the threshold, and log whether users clicked the source links. After a few weeks you will see which questions the pipeline handles well and which ones consistently miss. That data tells you whether your chunking strategy needs adjusting, whether your similarity threshold is set correctly, and whether there are content gaps on your site worth addressing.

The RAG pattern is one of the most practically useful things you can add to a content-heavy Joomla site. It turns a search box that frustrates visitors into one that actually helps them find what they need, in their own words, without requiring your content to match their exact phrasing.
