Duplicate content is a silent problem in many Symfony-based CMS and blog applications. Editors rewrite similar articles, documentation grows organically, and eventually several pages express the same idea in different words.
Traditional duplicate detection relies on exact text matching, which fails as soon as the wording changes.
In this tutorial, we'll use OpenAI embeddings to build an AI-powered duplicate content detector in Symfony. Instead of matching keywords, we'll compare semantic meaning, which is the same general idea behind modern semantic search.
What We're Constructing
After completing this guide, you will have:
- AI-generated embeddings for every article
- A semantic similarity checker based on cosine similarity
- A console command that finds duplicates
- A similarity threshold (e.g., 0.85 or higher) for flagging content
- A foundation you can integrate into any Symfony CMS
This approach works well for:
- Blogs
- Knowledge bases
- Documentation portals
- E-commerce content pages
Requirements
- Symfony 6 or 7
- PHP 8.1+
- Doctrine ORM
- MySQL / PostgreSQL
- An OpenAI API key
Step 1: Add an Embedding Column to Your Entity
Assume an Article entity.
src/Entity/Article.php
#[ORM\Column(type: 'json', nullable: true)]
private ?array $embedding = null;

public function getEmbedding(): ?array
{
    return $this->embedding;
}

public function setEmbedding(?array $embedding): self
{
    $this->embedding = $embedding;

    return $this;
}
Create and run the migration:
php bin/console make:migration
php bin/console doctrine:migrations:migrate
Step 2: Generate Embeddings for Articles
Create a Symfony command:
php bin/console make:command app:generate-article-embeddings
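src/Command/GenerateArticleEmbeddingsCommand.php
<?php

namespace App\Command;

use App\Entity\Article;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\DependencyInjection\Attribute\Autowire;

// #[AsCommand] replaces the static $defaultName property, which was
// deprecated in Symfony 6.1 and removed in 7.0.
#[AsCommand(name: 'app:generate-article-embeddings')]
class GenerateArticleEmbeddingsCommand extends Command
{
    public function __construct(
        private EntityManagerInterface $em,
        // Injects the key from the environment. The attribute needs Symfony 6.1+;
        // on 6.0, bind $apiKey in config/services.yaml instead.
        #[Autowire('%env(OPENAI_API_KEY)%')]
        private string $apiKey
    ) {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $articles = $this->em->getRepository(Article::class)->findAll();

        foreach ($articles as $article) {
            // Skip articles that already have an embedding.
            if ($article->getEmbedding()) {
                continue;
            }

            $embedding = $this->getEmbedding(strip_tags((string) $article->getContent()));

            // Never store an empty vector; leave the article for the next run.
            if ($embedding === []) {
                $output->writeln("Embedding failed for article ID {$article->getId()}");
                continue;
            }

            // The article is already managed by Doctrine, so no persist() call is needed.
            $article->setEmbedding($embedding);
            $output->writeln("Embedding generated for article ID {$article->getId()}");
        }

        $this->em->flush();

        return Command::SUCCESS;
    }

    private function getEmbedding(string $text): array
    {
        $payload = [
            'model' => 'text-embedding-3-small',
            // Truncate long articles to keep the request small.
            'input' => mb_substr($text, 0, 4000),
        ];

        $ch = curl_init('https://api.openai.com/v1/embeddings');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER => [
                'Content-Type: application/json',
                "Authorization: Bearer {$this->apiKey}",
            ],
            CURLOPT_POST => true,
            CURLOPT_POSTFIELDS => json_encode($payload),
            CURLOPT_TIMEOUT => 30,
        ]);
        $response = curl_exec($ch);
        curl_close($ch);

        // curl_exec() returns false on transport errors.
        if ($response === false) {
            return [];
        }

        // Error responses carry no "data" key, so this falls back to an empty array.
        return json_decode($response, true)['data'][0]['embedding'] ?? [];
    }
}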
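Network calls can fail or come back with an error payload, so it is worth validating the response before anything is stored. The sketch below shows one way to do that as a standalone function. The name extractEmbedding() is hypothetical, and the only response shape it assumes is the data[0].embedding path the command already reads.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the embedding vector from a raw response body,
 * or null when the body is missing, is not JSON, or has no usable vector.
 */
function extractEmbedding(string|false $responseBody): ?array
{
    if ($responseBody === false || $responseBody === '') {
        return null; // transport failure (curl_exec() returned false) or empty body
    }

    $decoded = json_decode($responseBody, true);
    if (!is_array($decoded)) {
        return null; // body was not a JSON object or array
    }

    $vector = $decoded['data'][0]['embedding'] ?? null;
    if (!is_array($vector) || $vector === []) {
        return null; // error payload or unexpected shape
    }

    // Normalise to a plain list of floats so later math never sees strings.
    return array_map('floatval', array_values($vector));
}

// Hand-written sample bodies, not real API output.
$ok    = extractEmbedding('{"data":[{"embedding":[0.1,-0.2,0.3]}]}');
$error = extractEmbedding('{"error":{"message":"Invalid API key"}}');
$down  = extractEmbedding(false);

var_dump($ok !== null, $error === null, $down === null);
```

Returning null instead of an empty array makes the failure explicit, so the caller can skip the article and retry it on the next run.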
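Store the API key in .env.local:
OPENAI_API_KEY=your_key_here
Symfony cannot autowire a plain string argument, so the command's $apiKey has to be wired to this variable explicitly, for example with an #[Autowire] attribute on the constructor argument or a bind in config/services.yaml.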
Step 3: Cosine Similarity Helper
Create a reusable helper.
src/Service/SimilarityService.php
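<?php

namespace App\Service;

class SimilarityService
{
    // Cosine similarity of two equal-length vectors, from -1.0 to 1.0.
    public function cosine(array $a, array $b): float
    {
        if (count($a) !== count($b)) {
            throw new \InvalidArgumentException('Vectors must have the same length.');
        }

        $dot = 0.0;
        $magA = 0.0;
        $magB = 0.0;
        foreach ($a as $i => $val) {
            $dot += $val * $b[$i];
            $magA += $val ** 2;
            $magB += $b[$i] ** 2;
        }

        // Avoid division by zero for empty or all-zero vectors.
        if ($magA == 0.0 || $magB == 0.0) {
            return 0.0;
        }

        return $dot / (sqrt($magA) * sqrt($magB));
    }
}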
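To get a feel for the scores, here is a small standalone sketch that runs the same calculation on made-up vectors. These are toy values rather than real embeddings, and cosineSimilarity() is a hypothetical free-function copy of the method above so the snippet runs without Symfony.

```php
<?php
declare(strict_types=1);

// Standalone copy of the cosine logic, for illustration only (hypothetical name).
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $magA = 0.0;
    $magB = 0.0;
    foreach ($a as $i => $val) {
        $dot  += $val * $b[$i];
        $magA += $val ** 2;
        $magB += $b[$i] ** 2;
    }

    if ($magA == 0.0 || $magB == 0.0) {
        return 0.0; // a zero vector has no direction to compare
    }

    return $dot / (sqrt($magA) * sqrt($magB));
}

$same      = cosineSimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]); // same direction, close to 1
$unrelated = cosineSimilarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]); // orthogonal, 0
$opposite  = cosineSimilarity([1.0, 0.0], [-1.0, 0.0]);          // opposite direction, -1

printf("%.2f %.2f %.2f\n", $same, $unrelated, $opposite);
```

Only the direction of the vectors matters, not their length, which is why the first pair scores as identical even though one vector is twice as long as the other.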
Step 4: Detect Duplicate Articles
Create another command:
php bin/console make:command app:detect-duplicates
src/Command/DetectDuplicateContentCommand.php
<?php

namespace App\Command;

use App\Entity\Article;
use App\Service\SimilarityService;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

// #[AsCommand] replaces the static $defaultName property, which was
// deprecated in Symfony 6.1 and removed in 7.0.
#[AsCommand(name: 'app:detect-duplicates')]
class DetectDuplicateContentCommand extends Command
{
    public function __construct(
        private EntityManagerInterface $em,
        private SimilarityService $similarity
    ) {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $articles = $this->em->getRepository(Article::class)->findAll();
        $threshold = 0.85;

        foreach ($articles as $i => $a) {
            foreach ($articles as $j => $b) {
                // Compare each pair once and never compare an article with itself.
                if ($j <= $i) {
                    continue;
                }

                // Skip articles that have no embedding yet.
                if (!$a->getEmbedding() || !$b->getEmbedding()) {
                    continue;
                }

                $score = $this->similarity->cosine(
                    $a->getEmbedding(),
                    $b->getEmbedding()
                );

                if ($score >= $threshold) {
                    $output->writeln(
                        sprintf(
                            "⚠ Duplicate detected (%.2f): Article %d and %d",
                            $score,
                            $a->getId(),
                            $b->getId()
                        )
                    );
                }
            }
        }

        return Command::SUCCESS;
    }
}
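Because the inner loop skips every $j <= $i, each unordered pair of articles is compared exactly once, so n articles need n(n-1)/2 comparisons. The short sketch below just works out that arithmetic; comparisonCount() is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

// Number of unordered pairs among $n articles: n * (n - 1) / 2.
function comparisonCount(int $n): int
{
    return intdiv($n * ($n - 1), 2);
}

foreach ([100, 1000, 10000] as $n) {
    printf("%d articles -> %d comparisons\n", $n, comparisonCount($n));
}
// 100 articles -> 4950 comparisons
// 1000 articles -> 499500 comparisons
// 10000 articles -> 49995000 comparisons
```

The count grows quadratically, which is why the improvements listed later mention vector databases for large collections.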
Step 5: Run via Cron (Optional)
To scan regularly, add a cron job:
0 2 * * * php /path/to/project/bin/console app:detect-duplicates
You can store results in a table or send email notifications.
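If you want to store or email the results, it helps to separate the comparison from the console output. The sketch below collects matches into plain arrays that could later be written to a table or rendered into a message. The function names and the array shape are assumptions for illustration, and the vectors are toy values rather than real embeddings.

```php
<?php
declare(strict_types=1);

// Cosine similarity for two equal-length vectors; returns 0.0 for zero vectors.
function vectorCosine(array $a, array $b): float
{
    $dot = $magA = $magB = 0.0;
    foreach ($a as $i => $val) {
        $dot  += $val * $b[$i];
        $magA += $val ** 2;
        $magB += $b[$i] ** 2;
    }

    return ($magA > 0.0 && $magB > 0.0) ? $dot / sqrt($magA * $magB) : 0.0;
}

/**
 * Hypothetical helper: compares every pair once and returns the matches,
 * highest score first.
 *
 * @param array<int, float[]> $embeddingsById article ID => embedding
 * @return list<array{a: int, b: int, score: float}>
 */
function findDuplicatePairs(array $embeddingsById, float $threshold = 0.85): array
{
    $ids   = array_keys($embeddingsById);
    $count = count($ids);
    $pairs = [];

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            $score = vectorCosine($embeddingsById[$ids[$i]], $embeddingsById[$ids[$j]]);
            if ($score >= $threshold) {
                $pairs[] = ['a' => $ids[$i], 'b' => $ids[$j], 'score' => $score];
            }
        }
    }

    // Sort descending by score so the strongest matches come first.
    usort($pairs, fn (array $x, array $y): int => $y['score'] <=> $x['score']);

    return $pairs;
}

$pairs = findDuplicatePairs([
    12 => [1.0, 0.0, 0.0],
    37 => [0.9, 0.1, 0.0],
    44 => [0.0, 1.0, 0.0],
]);

foreach ($pairs as $p) {
    printf("%d and %d: %.2f\n", $p['a'], $p['b'], $p['score']);
}
```

With these toy vectors only articles 12 and 37 pass the 0.85 threshold. Each returned row maps naturally onto one record in a results table or one line in a notification.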
Example Output
⚠ Duplicate detected (0.91): Article 12 and 37
⚠ Duplicate detected (0.88): Article 18 and 44
Useful Improvements
You can extend this system with:
- An admin UI for reviewing duplicates
- Automatic canonical page suggestions
- Title and excerpt weighting
- Section-level similarity detection
- Batch processing with Messenger
- A vector database for large-scale datasets
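As one example from this list, title and excerpt weighting could be sketched as follows. It assumes you have embedded each field separately and already computed one cosine score per field. The weightedSimilarity() helper and the weights themselves are illustrative assumptions, not tuned values.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: blends per-field similarity scores into one number.
 *
 * @param array<string, float> $scores  field name => cosine similarity
 * @param array<string, float> $weights field name => relative weight
 */
function weightedSimilarity(array $scores, array $weights): float
{
    $total = 0.0;
    $weightSum = 0.0;

    foreach ($weights as $field => $weight) {
        if (!isset($scores[$field])) {
            continue; // field missing on one of the articles, so leave it out
        }
        $total     += $scores[$field] * $weight;
        $weightSum += $weight;
    }

    // Divide by the weights actually used so missing fields do not drag the score down.
    return $weightSum > 0.0 ? $total / $weightSum : 0.0;
}

$weights = ['title' => 0.3, 'excerpt' => 0.2, 'body' => 0.5];

$full      = weightedSimilarity(['title' => 0.95, 'excerpt' => 0.80, 'body' => 0.90], $weights);
$noExcerpt = weightedSimilarity(['title' => 0.95, 'body' => 0.90], $weights);

printf("%.3f %.3f\n", $full, $noExcerpt);
```

The combined score can then be compared against the same threshold as before, although a different cut-off may suit a blended score better.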
Cost & Performance Advice
- Generate the embedding for each article only once.
- Limit the length of the content before embedding it.
- Skip draft content.
- Cache similarity results.
- Use queues for large datasets.
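The first two tips can be sketched as a pair of small helpers. Both function names are hypothetical, the 4,000-character limit mirrors the one used in the embedding command, and storing the fingerprint next to the embedding would need an extra column that this tutorial does not define.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: turn stored HTML into compact plain text for the API.
function prepareForEmbedding(string $html, int $maxChars = 4000): string
{
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace; fall back to the input if the regex fails.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    return mb_substr($text, 0, $maxChars);
}

// Hypothetical helper: fingerprint used to decide whether a new embedding is needed.
function contentFingerprint(string $preparedText): string
{
    return hash('sha256', $preparedText);
}

$before = prepareForEmbedding("<p>Hello   world</p>");
$after  = prepareForEmbedding("<p>Hello world</p>\n");

// Markup-only or whitespace-only edits produce the same fingerprint,
// so the existing embedding can be reused.
var_dump(contentFingerprint($before) === contentFingerprint($after));
```

Comparing the stored fingerprint with a freshly computed one before calling the API means an article is only embedded again when its visible text has actually changed.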
Next up, we’ll build an AI SEO Content Quality Analyzer for WordPress.