The Hourly Count Paradox: Why Daily Totals Tell a Different Story / Part 3: Correcting Overcounting with Advanced SQL Queries
Jul 23, 2024
In this section, we'll dive into the SQL query used for retrieving unique visits by hour and explore the results through example images. We will also understand how the ROW_NUMBER() function contributes to accurate visit tracking and how this approach helps prevent overcounting issues in monthly and yearly counts.
SQL Query Breakdown
The query for fetching unique visits by hour handles timezone conversions and counts the first visit per user per hour. Here’s the SQL query:
public function getUniqueVisitsByHour(Request $request)
{
try {
$selectedDate = $request->input('selectedDate');
// Validate the date format
if (!$selectedDate ||
!Carbon::hasFormat($selectedDate, 'Y-m-d')) {
return response()->json(
['error' => 'Invalid date format.'],
400
);
}
$startDateTime = Carbon::parse(
"{$selectedDate} 00:00:00",
'Asia/Bangkok'
)->setTimezone('UTC');
$endDateTime = Carbon::parse(
"{$selectedDate} 23:59:59",
'Asia/Bangkok'
)->setTimezone('UTC');
$sql = "
WITH hourly_first_visit AS (
SELECT
user_id,
DATE_FORMAT(
CONVERT_TZ(created_at, '+00:00', '+07:00'),
'%Y-%m-%d %H:00:00'
) AS hour_slot,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at
) AS rn
FROM
user_visits
WHERE created_at
BETWEEN :startDateTime AND :endDateTime
)
SELECT
hour_slot,
COUNT(user_id) AS unique_visits
FROM
hourly_first_visit
WHERE
rn = 1
GROUP BY
hour_slot
ORDER BY
hour_slot
";
$results = DB::select(DB::raw($sql), [
'startDateTime' => $startDateTime,
'endDateTime' => $endDateTime
]);
return response()->json($results);
} catch (\Exception $e) {
// Return a JSON response with the error message
return response()->json(
['error' => 'An error occurred while processing your request.'],
500
);
}
}
Query Explanation
Common Table Expression (CTE): hourly_first_visit
- DATE_FORMAT(CONVERT_TZ(...)) AS hour_slot: Converts created_at to the local timezone and formats it to the hour level.
- ROW_NUMBER() OVER (...) AS rn: Assigns a unique row number to each visit within the partition of user_id. This means the row numbering restarts for each user, and rows are ordered by created_at.
Main Query
- SELECT hour_slot, COUNT(user_id) AS unique_visits: Counts the number of unique users per hour slot.
- WHERE rn = 1: Ensures only the first visit per user per hour is considered.
Understanding ROW_NUMBER()
The ROW_NUMBER() function is crucial for determining the earliest visit within each hour slot for each user. It provides a unique sequential number to each row within the partition.
Example Data and rn Result:
Consider the following data:
Intermediate Result:
Here, rn is used to identify the first visit (rn = 1) for each user within each hour slot.
Final Aggregated Result:
Visual Representation:
Diagram 1: Sum of Member Visit Count by Hour
This diagram illustrates that the sum of member visit counts by hour on July 12, 2024, is 9.
Diagram 2: Member Visit Count by Day
This diagram summarizes the total daily visits for July 12, 2024:
- Total Visits: 9
Reconciling the Discrepancy
Previously, discrepancies existed between hourly and daily visit counts. With the revised query and counting method, both diagrams now show an equal total of 9 member visits for July 12, 2024. This confirms the accuracy of our tracking method and ensures consistent results across different time frames.
Preventing Overcounting Issues
The approach demonstrated in this query is not only useful for hourly analysis but also for monthly and yearly counts. By applying the ROW_NUMBER() function with the PARTITION BY clause, you ensure that each user's visit is counted only once per time period, thus preventing overcounting.
For example:
- Monthly Count: When aggregating visits by month, the same logic can be applied to ensure that each user is counted only once per month. This prevents the duplication of visit counts if a user visits multiple times within the same month.
- Yearly Count: Similarly, when calculating yearly statistics, using ROW_NUMBER() helps in accurately counting unique visits by considering only the first occurrence within each year for each user.
This method provides a precise count of unique visits, avoiding inflated metrics caused by multiple entries for the same user within the same time period.
Conclusion
In this part, we have detailed the SQL query used to calculate unique visits by hour, explained the role of the ROW_NUMBER() function, and visualized the results. Additionally, we have highlighted how this approach helps prevent overcounting issues in monthly and yearly statistics, ensuring accurate and meaningful analysis of user engagement.