Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track amount of visits per short URL so that we don't have to do COUNT aggregates at runtime #2036

Closed
acelaya opened this issue Feb 29, 2024 · 3 comments · Fixed by #2074
Closed

Comments

@acelaya
Copy link
Member

acelaya commented Feb 29, 2024

The short URLs list can be inefficient when ordering by amount of visits, if there are a lot of visits in the database. This is because we have to join with visits, and group the list by the COUNT(DISTINCT visits)) aggregate.

In other queries we have solved similar problems by using sub-queries, but this presents other problems and introduces a lot of complexity because of doctrine limitations.

One possible solution would be to track an incremental value per short URL that gets updated every time there's a new visit, and use that value when ordering by short URL or when trying to get the amount of visits for a short URL.

This would introduce another problem though. It would lock the short URL row for every visit, while trying to update it, causing delays and database locks for instances with a lot of traffic.

There's a pattern to work around that though, the Slotted Counter Pattern, which consists on tracking that in multiple rows from a related table, rather than the short URLs table itself.

Let's say there's 100 rows per short URL, and each one of those holds a partial amount of visits. When a visit occurs, one of those rows is picked randomly and updated, reducing the chances of concurrent visits to lock each other 100 times.

In order to get the actual amount of visits, we would have to SUM all partial totals for a short URL, which is way more efficient than COUNTing different visits.

Considerations

  • When migrating to this approach, we should create an initial entry per short URL with current amount of visits, and then let it naturally evolve from there.
  • We return amounts of bot visits and regular visits, so we should track those two values separately.
  • We could add a feature flag that allows this behavior to be disabled, so that people can choose to use the old less efficient query.

This could potentially be used to improve performance when loading tags with stats, as there are also COUNT aggregates and sub-queries involved there. It would not cover the case in which there are a lot of short URLs, but I imagine the usual is to have way more visits than short URLs in a particular instance.

@acelaya
Copy link
Member Author

acelaya commented Mar 1, 2024

How to generate a random number in every DB engine:

  • SQLite: SELECT random()
  • PostgreSQL: SELECT random()
  • MySQL/MariaDB: SELECT RAND()
  • MS SQL: SELECT RAND()

These are not supported by DQL by default, so we may need to add a custom function that can be used with all of them https://www.doctrine-project.org/projects/doctrine-orm/en/3.0/cookbook/dql-user-defined-functions.html

This library can be used for inspiration, as it implements a bunch of them: https://github.com/beberlei/DoctrineExtensions

@acelaya
Copy link
Member Author

acelaya commented Mar 17, 2024

This should allow supporting RAND() in DQL, in a way that it gets translated to the right platform function.

use Doctrine\DBAL\Platforms\PostgreSQLPlatform;
use Doctrine\DBAL\Platforms\SQLitePlatform;
use Doctrine\ORM\Query\AST\Functions\FunctionNode;
use Doctrine\ORM\Query\Parser;
use Doctrine\ORM\Query\SqlWalker;
use Doctrine\ORM\Query\TokenType;

class Rand extends FunctionNode
{
    public function getSql(SqlWalker $sqlWalker): string
    {
        return match ($sqlWalker->getConnection()->getDatabasePlatform()::class) {
            PostgreSQLPlatform::class, SQLitePlatform::class => 'random()',
            default => 'RAND()',
        };
    }

    public function parse(Parser $parser): void
    {
        $parser->match(TokenType::T_IDENTIFIER);
        $parser->match(TokenType::T_OPEN_PARENTHESIS);
        $parser->match(TokenType::T_CLOSE_PARENTHESIS);
    }
}

Registered as

$entityManager->getConfiguration()->addCustomNumericFunction('RAND', Rand::class);

However, considering the insertion may happen at an entity level, it might make sense to generate the random number in PHP code, via rand(1, 100).


EDIT

Actually, the kind of query needed for this to be fully efficient will require a native query, not DQL or entities, so the above is probably not needed.

INSERT INTO short_url_visits_counts(short_url_id, potential_bot, slot_id, count)
VALUES (123, false, RAND() * 100, 1)
ON DUPLICATE KEY UPDATE count = count + 1;

@acelaya
Copy link
Member Author

acelaya commented Mar 22, 2024

A listener to set/update counts every time a Visit is persisted:

use Doctrine\Common\EventSubscriber;
use Doctrine\DBAL\Connection;
use Doctrine\DBAL\Platforms\PostgreSQLPlatform;
use Doctrine\DBAL\Platforms\SQLitePlatform;
use Doctrine\DBAL\Platforms\SQLServerPlatform;
use Doctrine\ORM\EntityManagerInterface;
use Doctrine\ORM\Event\PreFlushEventArgs;
use Shlinkio\Shlink\Core\Visit\Entity\Visit;

$em->getEventManager()->addEventSubscriber(new class implements EventSubscriber {
    public function getSubscribedEvents(): array
    {
        return ['preFlush'];
    }

    public function preFlush(PreFlushEventArgs $args): void
    {
        $em = $args->getObjectManager();
        $entitiesToBeCreated = $em->getUnitOfWork()->getScheduledEntityInsertions();

        foreach ($entitiesToBeCreated as $entity) {
            $this->trackVisitCount($em, $entity);
        }
    }

    private function trackVisitCount(EntityManagerInterface $em, object $entity): void
    {
        // This is not a non-orphan visit
        if (!$entity instanceof Visit || $entity->shortUrl === null) {
            return;
        }
        $visit = $entity;

        // The short URL is not persisted yet
        $shortUrlId = $visit->shortUrl->getId();
        if ($shortUrlId === null || $shortUrlId === '') {
            return;
        }

        $isBot = $visit->potentialBot;
        $conn = $em->getConnection();
        $platformClass = $conn->getDatabasePlatform();

        match ($platformClass::class) {
            PostgreSQLPlatform::class => $this->incrementForPostgres($conn, $shortUrlId, $isBot),
            SQLitePlatform::class, SQLServerPlatform::class => $this->incrementForOthers($conn, $shortUrlId, $isBot),
            default => $this->incrementForMySQL($conn, $shortUrlId, $isBot),
        };
    }

    private function incrementForMySQL(Connection $conn, string $shortUrlId, bool $potentialBot): void
    {
        $this->incrementWithPreparedStatement($conn, $shortUrlId, $potentialBot, <<<QUERY
            INSERT INTO short_url_visits_counts (short_url_id, potential_bot, slot_id, count)
            VALUES (:short_url_id, :potential_bot, RAND() * 100, 1)
            ON DUPLICATE KEY UPDATE count = count + 1;
            QUERY);
    }

    private function incrementForPostgres(Connection $conn, string $shortUrlId, bool $potentialBot): void
    {
        $this->incrementWithPreparedStatement($conn, $shortUrlId, $potentialBot, <<<QUERY
            INSERT INTO short_url_visits_counts (short_url_id, potential_bot, slot_id, count)
            VALUES (:short_url_id, :potential_bot, random() * 100, 1)
            ON CONFLICT (short_url_id, potential_bot, slot_id) DO UPDATE
              SET count = count + 1;
            QUERY);
    }

    private function incrementWithPreparedStatement(
        Connection $conn,
        string $shortUrlId,
        bool $potentialBot,
        string $query,
    ): void {
        $statement = $conn->prepare($query);
        $statement->bindValue('short_url_id', $shortUrlId);
        $statement->bindValue('potential_bot', $potentialBot);
        $statement->executeStatement();
    }

    private function incrementForOthers(Connection $conn, string $shortUrlId, bool $potentialBot): void
    {
        $slotId = rand(1, 100);

        // For engines without a specific UPSERT syntax, do a regular locked select followed by an insert or update
        $qb = $conn->createQueryBuilder();
        $qb->select('id')
           ->from('short_url_visits_counts')
           ->where($qb->expr()->and(
               $qb->expr()->eq('short_url_id', ':short_url_id'),
               $qb->expr()->eq('potential_bot', ':potential_bot'),
               $qb->expr()->eq('slot_id', ':slot_id'),
           ))
           ->setParameter('short_url_id', $shortUrlId)
           ->setParameter('potential_bot', $potentialBot)
           ->setParameter('slot_id', $slotId)
           ->forUpdate()
           ->setMaxResults(1);

        $resultSet = $qb->executeQuery()->fetchOne();
        $writeQb = ! $resultSet
            ? $conn->createQueryBuilder()
                   ->insert('short_url_visits_counts')
                   ->values([
                       'short_url_id' => ':short_url_id',
                       'potential_bot' => ':potential_bot',
                       'slot_id' => ':slot_id',
                   ])
            : $conn->createQueryBuilder()
                   ->update('short_url_visits_counts')
                   ->set('count', 'count + 1')
                   ->where($qb->expr()->and(
                       $qb->expr()->eq('short_url_id', ':short_url_id'),
                       $qb->expr()->eq('potential_bot', ':potential_bot'),
                       $qb->expr()->eq('slot_id', ':slot_id'),
                   ));

        $writeQb->setParameter('short_url_id', $shortUrlId)
                ->setParameter('potential_bot', $potentialBot)
                ->setParameter('slot_id', $slotId)
                ->executeStatement();
    }
});

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant