December 29, 2025

December 29, 2025

In many Symfony-based CMS and blog applications, duplicate content is a silent issue. Similar articles are rewritten by editors, documentation develops naturally, and eventually you have several pages expressing the same idea in different ways.

The exact text matching used in traditional duplicate detection fails as soon as the wording is altered.

In this tutorial, we'll use OpenAI embeddings to create an AI-powered duplicate content detector in Symfony. We'll compare semantic meaning instead of matching keywords, which is the same method that contemporary search engines employ.

What We're Constructing

After completing this guide, you will have:

  • AI-produced embeddings for every article
  • A cosine similarity-based semantic similarity checker
  • A command for the console to find duplicates
  • A threshold for similarity (e.g., 85%+) to mark content
  • Any Symfony CMS can be integrated with this foundation.

This is effective for:

  • Blogs
  • Knowledge bases
  • Portals for documentation
  • Pages with e-commerce content

Requirements

  • Symfony 6 or 7
  • PHP 8.1+
  • Doctrine ORM
  • MySQL / PostgreSQL
  • An OpenAI API key

Step 1: Add an Embedding Column to Your Entity

Assume an Article entity.

src/Entity/Article.php

    
        #[ORM\Column(type: 'json', nullable: true)]
        private ?array $embedding = null;

        public function getEmbedding(): ?array
        {
            return $this->embedding;
        }

        public function setEmbedding(?array $embedding): self
        {
            $this->embedding = $embedding;
            return $this;
        }
    

Create and run migration:

    
        php bin/console make:migration
        php bin/console doctrine:migrations:migrate
    

Step 2: Generate Embeddings for Articles

Create a Symfony command:

    
        php bin/console make:command app:generate-article-embeddings
    

GenerateArticleEmbeddingsCommand.php

    
        namespace App\Command;

        use App\Entity\Article;
        use Doctrine\ORM\EntityManagerInterface;
        use Symfony\Component\Console\Command\Command;
        use Symfony\Component\Console\Input\InputInterface;
        use Symfony\Component\Console\Output\OutputInterface;

        class GenerateArticleEmbeddingsCommand extends Command
        {
            protected static $defaultName = 'app:generate-article-embeddings';

            public function __construct(
                private EntityManagerInterface $em,
                private string $apiKey
            ) {
                parent::__construct();
            }

            protected function execute(InputInterface $input, OutputInterface $output): int
            {
                $articles = $this->em->getRepository(Article::class)->findAll();

                foreach ($articles as $article) {
                    if ($article->getEmbedding()) {
                        continue;
                    }

                    $embedding = $this->getEmbedding(
                        strip_tags($article->getContent())
                    );

                    $article->setEmbedding($embedding);
                    $this->em->persist($article);

                    $output->writeln("Embedding generated for article ID {$article->getId()}");
                }

                $this->em->flush();
                return Command::SUCCESS;
            }

            private function getEmbedding(string $text): array
            {
                $payload = [
                    'model' => 'text-embedding-3-small',
                    'input' => mb_substr($text, 0, 4000)
                ];

                $ch = curl_init('https://api.openai.com/v1/embeddings');
                curl_setopt_array($ch, [
                    CURLOPT_RETURNTRANSFER => true,
                    CURLOPT_HTTPHEADER => [
                        "Content-Type: application/json",
                        "Authorization: Bearer {$this->apiKey}"
                    ],
                    CURLOPT_POST => true,
                    CURLOPT_POSTFIELDS => json_encode($payload)
                ]);

                $response = curl_exec($ch);
                curl_close($ch);

                return json_decode($response, true)['data'][0]['embedding'] ?? [];
            }
        }
    

Store the API key in .env.local

    
        OPENAI_API_KEY=your_key_here
    

Step 3: Cosine Similarity Helper

Create a reusable helper.

src/Service/SimilarityService.php

    
        namespace App\Service;

        class SimilarityService
        {
            public function cosine(array $a, array $b): float
            {
                $dot = 0;
                $magA = 0;
                $magB = 0;

                foreach ($a as $i => $val) {
                    $dot += $val * $b[$i];
                    $magA += $val ** 2;
                    $magB += $b[$i] ** 2;
                }

                return $dot / (sqrt($magA) * sqrt($magB));
            }
        }
    

Step 4: Detect Duplicate Articles

Create another command:

    
        php bin/console make:command app:detect-duplicates
    

DetectDuplicateContentCommand.php

    
        namespace App\Command;

        use App\Entity\Article;
        use App\Service\SimilarityService;
        use Doctrine\ORM\EntityManagerInterface;
        use Symfony\Component\Console\Command\Command;
        use Symfony\Component\Console\Input\InputInterface;
        use Symfony\Component\Console\Output\OutputInterface;

        class DetectDuplicateContentCommand extends Command
        {
            protected static $defaultName = 'app:detect-duplicates';

            public function __construct(
                private EntityManagerInterface $em,
                private SimilarityService $similarity
            ) {
                parent::__construct();
            }

            protected function execute(InputInterface $input, OutputInterface $output): int
            {
                $articles = $this->em->getRepository(Article::class)->findAll();
                $threshold = 0.85;

                foreach ($articles as $i => $a) {
                    foreach ($articles as $j => $b) {
                        if ($j <= $i) continue;
                        if (!$a->getEmbedding() || !$b->getEmbedding()) continue;

                        $score = $this->similarity->cosine(
                            $a->getEmbedding(),
                            $b->getEmbedding()
                        );

                        if ($score >= $threshold) {
                            $output->writeln(
                                sprintf(
                                    "⚠ Duplicate detected (%.2f): Article %d and %d",
                                    $score,
                                    $a->getId(),
                                    $b->getId()
                                )
                            );
                        }
                    }
                }

                return Command::SUCCESS;
            }
        }
    

Step 5: Run via Cron (Optional)

To scan regularly, add a cron job:

    
        0 2 * * * php /path/to/project/bin/console app:detect-duplicates
    

You can store results in a table or send email notifications.

Example Output

    
        Duplicate detected (0.91): Article 12 and 37
        Duplicate detected (0.88): Article 18 and 44
    

Useful Improvements

This system can be expanded with:

  • Admin UI for reviewing duplicates
  • Canonical page suggestions automatically
  • Weighting of the title and excerpt
  • Similarity detection at the section level
  • Using Messenger for batch processing
  • Large-scale vector databases

Cost & Performance Advice

  • Create embeddings for each article only once.
  • Before embedding, limit the length of the content.
  • Ignore the draft content
  • Cache similarity findings
  • For big datasets, use queues.

Next up, we’ll build a AI SEO Content Quality Analyzer for WordPress.

Next
This is the most recent post.
Older Post

0 comments:

Post a Comment