In commit e51e03446, we started using the same code to show stats in the
public area and in the admin area. However, in doing so we introduced a
bug, since stats in the public area are only shown after a certain part
of the process has finished, meaning the stats appearing on the page
never change (in theory), so it's perfectly fine to cache them. However,
in the admin area stats can be accessed while the process is still
ongoing, so caching the stats will lead to the wrong results being
displayed.
We've thought about expiring the cache when new supports or ballot lines
are added; however, that means the methods calculating the stats for the
supporting phase would expire when supports are added/removed but the
methods calculating the stats for the voting phase would expire when
ballot lines are added/removed. It gets even more complex because the
`headings` method calculates stats for both the supporting and the
voting phases.
So, since loading stats in the admin section is fast even without the
cache because they only load very basic statistics, we're taking the
simple approach of disabling the cache in this case, so everything works
the same way it did before commit e51e03446.
Co-authored-by: Javi Martín <javim@elretirao.net>
When we first started caching the stats, generating them was a process
that took several minutes, so we never expired the cache.
However, there have been cases where we run into issues where the stats
shown on the screen were outdated. That's why we introduced a task to
manually expire the cache.
But now, generating the stats only takes a few seconds, so we can
automatically expire them every day, remove all the logic needed to
manually expire them, and get rid of most of the issues related to the
cache being outdated.
We're expiring them every day because it's the same day we were doing in
public stats (which we removed in commit 631b48f58), only we're using
`expires_at:` to set the expiration time, in order to simplify the code.
Note that, in the test, we're using `travel_to(time)` so the test passes
even when it starts an instant before midnight. We aren't using
`:with_frozen_time` because, in similar cases (although not in this
case, but I'm not sure whether that's intentional), `travel_to` shows
this error:
> Calling `travel_to` with a block, when we have previously already made
> a call to `travel_to`, can lead to confusing time stubbing.
Since now generating stats (assuming the results aren't in the cache)
only takes a few seconds even when there are a hundred thousand
participants, as opposed to the several minutes it took to generate them
when we introduced the Cron job, we can simply generate the stats during
the first request to the stats page.
Note that, in order to avoid creating a temporary table when the stats
are cached, we're making sure we only create this table when we need to.
Otherwise, we could spend up to 1 second on every request to the stats
page creating a table that isn't going to be used.
Also note we're using an instance variable to check whether we're
creating a table; I tried to use `table_exists?`, but it didn't work. I
wonder whether `table_exists?` doesn't detect temporary tables.
Since we're doing many queries to get stats for each age group and each
geozone, testing shows these indices make stats calculation about 25%
faster on processes with 100,000 participants.
Debugging shows that the bottleneck in the stats calculation is the
number of times we're querying the users table using the same array of
IDs in the `where` condition but each time combined with other
conditions.
So we're inserting the results of querying the users table with the
array of IDs in a temporary table and using this temporary table for the
other calculations. When querying this temporary table, there's no need
to filter for IDs anymore.
For budget stats, the `generate` method is now about 10-20 times faster
for a budget with 20,000 participants. For budgets with only a few dozen
participants, there's no significant difference in performance.
I thought about modifying the `participants` method and use the
temporary table there. The problem, however, is that in this case it
isn't clear when to drop the temporary table, and we could end up with
thousands of temporary tables in the database if we don't do it right.
Creating and dropping the temporary table in the same transaction, on
the other hand, guarantees that won't be the case.
Note there's no risk of duplicate tables since they're created and
dropped inside a transaction, so we're always using the same table name
for the same resource. We're adding a test that fails with a
`PG::DuplicateTable: ERROR: relation "participants__1"` error if we
don't use a transaction.
Back in commit 383909e16, we said:
> Even if this class looks very simple now, we're trying a few things
> related to these stats. Having a class for it makes future changes
> easier and, if there weren't any future changes, at least it makes
> current experiments easier.
Since there haven't been any changes in the last 5 years and we've found
cases where using the GeozoneStats class results in a slightly worse
performance, we're removing this class. The code is now a bit easier to
read, and is consistent with the way we calculate participants by age.
This reverts commit 383909e16.
We were calculating the age stats based on the age of the users who
participated... at the moment where we were calculating the stats. That
means that, if 20 years ago, 1000 people who were 16 years old
participated, they would be shown as having 36 years in the stats.
Instead, we want to show the stats at the time when the process took
place, so we're implementing a `participation_date` method.
Note that, for polls, we could actually use the `age` column in the
`poll_voters` table. However, doing so would be harder, would only work
for polls but not for budgets, and it wouldn't be statistically very
relevant, since the stats are shown by age groups, and only a small
percentage of people would change their age group (and only to the
nearest one) between the time they participate and the time the process
ends.
We might use the `poll_voters` table in the future, though, since we
have a similar issue with geozones and genders, and using the
information in `poll_voters` would solve it as well (only for polls,
though).
Also note that we're using the `ends_at` dates because some people but
be too young to vote when a process starts but old enough to vote when
the process ends.
Finally, note that we might need to change the way we calculate the
participation date for a budget, since some budgets might not enabled
every phase. Not sure how stats work in that scenario (even before these
changes).
We were already applying these rules in most cases.
Note we aren't enabling the `MultilineArrayLineBreaks` rule because
we've got places with many elements whire it isn't clear whether
having one element per line would make the code more readable.
These methods are only used while stats are being generated; once stats
are generated, they aren't used anymore. So there's no need to store
them using the Dalli cache.
Furthermore, there are polls (and even budgets) with hundreds of
thousands of participants. Calculating stats for them takes a very long
time because we can't store all those records in the Dalli cache.
However, since these records aren't used once the stats are generated,
we can store them in an instance variable while we generate the stats,
speeding up the process.
We need a way to manually expire the cache for a budget or poll without
expiring the cache of every budget or poll.
Using the `updated_at` column would be dangerous because most of the
times we update a budget or a poll, we don't need to regenerate their
stats.
We've considered adding a `stats_updated_at` column to each of these
tables. However, in that case we would also need to add a similar column
in the future to every process type whose stats we want to generate.
If users participated and were hidden after participating, we should
still count them in the participants stats.
In the tests, we set users' `hidden_at` attribute before they vote.
Although in real life they would vote first and then they would be
hidden, I've written the tests like this for the sake of simplicity.
Even if this class looks very simple now, we're trying a few things
related to these stats. Having a class for it makes future changes
easier and, if there weren't any future changes, at least it makes
current experiments easier.
Note we keep the method `participants_by_geozone` to return a hash
because we're caching the stats and storing GeozoneStats objects would
need a lot more memory and we would get an error.
So if we don't have information regarding gender, age or geozone, stats
regarding those topics will not be shown.
Note we're using `spec/models/statisticable_spec.rb` because having the
same file in `spec/models/concerns` caused the tests to be executed
twice.
Also note the implementation behind the `gender?`, `age?` and `geozone?`
methods is a bit primitive. We might need to make it more robust in the
future.
It will make it far easier to call other methods on the stats object,
and we're already caching the methods.
We had to remove the view fragment caching because the stats object
isn't as easy to cache. The good thing about it is the view will
automatically be updated when we change logic regarding which stats to
show, and the methods taking long to execute are cached in the model.
For now we think showing them would be showing too much data and it
would be a bit confusing.
I've been tempted to just remove the view and keep the methods in the
model in case they're used by other institutions using CONSUL. However,
it's probably better to wait until we're asked to re-implement them, and
in the meantime we don't maintain code nobody uses. The code wasn't that
great to start with (I know it because I wrote it).
While we already had "one test to rule all stats", testing each method
individually makes reading, adding and changing tests easier.
Note we need to make all methods being tested public. We could also test
them using methods like `stats.generate[:total_valid_votes]` instead of
`stats.total_valid_votes`, but then the tests would be more difficult to
read.