Saturday, December 9, 2017

Auditing a Failed Election

Auditoría Integral y Seguridad de Sistemas de Información (Audisis), the Colombian firm hired by the TSE on November 13 to audit the results of the November 26th election, has released an unsatisfactory memo detailing what caused the TSE computers to fail on November 29, and what modifications were made to the system to fix it.  In the process it raises more questions than it answers.

On November 13th, the Tribunal Supremo Electoral (TSE) announced it had contracted Audisis to audit the results of the November 26th election.  David Matamoros Batson informed everyone at the time that they were lucky to have found such a qualified firm for only $700,000.  Matamoros stated that Audisis and the TSE had agreed on seven points to be audited:
(1) Security of the actas: verify that the image scanned in the voting center is the same as the image received by the TSE for counting.
(2) Verify that the software used by the TSE for counting and publicizing the tallies is what is actually being used.
(3) Network security.
(4) Verify the functioning of the vote tallying software for all the elections being run on November 26.
(5) Security of the database.
(6) Make sure transmitted actas are shared with the political parties.
(7) Evaluate the suitability of the technology being used to publicize the results.

So with the above in mind, Audisis released, through the TSE, a "report" of what happened when the system went down on November 29th.  According to Audisis, the system went down at 9:42 AM because it had filled up its database.  This, by the way, contradicts President Juan Orlando Hernandez, who claims it never crashed or became unavailable, just slow.

What the graphic in the report shows is two server instances running against a 12 Terabyte SAN (storage area network), but with only a 600 Gigabyte database allocated, apparently shared between the two servers, which are clustered for high availability.  It then took them 3.2 hours to expand the database to 1.8 Terabytes, bring up the servers, and perform a minimal data audit.  The servers were back in production at 1:08 PM and began adding actas again at 1:10 PM.

They continued to observe problems with database performance and decided to bring the system down again at 6 PM the same day.  They increased the database size again, this time to 6 Terabytes.  It took them 5 hours 30 minutes to reconfigure the system to use the additional capacity.  They also added a third server, this one configured with a 1.8 Terabyte database, to receive replicated data from the original database as a check of system integrity.  The system returned to production around 11:30 PM that evening, bringing the day's total downtime to almost 9 hours.
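Adding up the outage windows stated in the report is simple arithmetic; the sketch below uses only the clock times given above:

```python
# Total downtime from the two outage windows stated in the report.
from datetime import datetime

outages = [
    ("09:42", "13:08"),  # first crash: database full, expanded 600 GB -> 1.8 TB
    ("18:00", "23:30"),  # second halt: expanded to 6 TB, third server added
]

total = sum(
    (datetime.strptime(end, "%H:%M") - datetime.strptime(start, "%H:%M")).seconds
    for start, end in outages
) / 3600  # seconds -> hours

print(f"{total:.1f} hours offline")  # → 8.9 hours offline
```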

The first thing you do when you design a database is lay out the table structure, then make a good-faith estimate of how much storage space the data will need.  You always allow a healthy overestimate because you have to take the database out of service to increase its size, and when you're processing election data you don't want that to happen.  With the kind of data involved here, you should be able to estimate the required storage quite precisely.  Yet somehow they failed.

I can't fathom how they filled up a 600 Gigabyte database.  Even with all the acta images from the Presidential, Congressional, and Municipal elections and a complete voter roll stored in the database, I estimated it would take only about 34 Gigabytes of database storage to process the results of the election.  After all, I stored the complete results of the 2013 election in a database on my laptop without it taking up even 20 Gigabytes.  Even with every conceivable kind of transaction logging turned on, I'd be hard pressed to design a database requiring 100 Gigabytes.  What were they doing?
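A back-of-envelope version of that kind of estimate might look like the following; every row count and byte figure here is my own illustrative assumption, not a number from the TSE or Audisis:

```python
# Back-of-envelope database sizing: estimate rows and row sizes per table,
# then pad generously.  All figures are illustrative assumptions.

def estimate_storage_gb(tables):
    """tables: list of (row_count, avg_row_bytes) pairs; returns decimal GB."""
    return sum(rows * row_bytes for rows, row_bytes in tables) / 1e9

# Hypothetical election schema:
tables = [
    (18_000 * 3, 500 * 1024),  # ~18,000 voting tables x 3 acta scans at ~500 KB each
    (18_000 * 3 * 30, 200),    # tally rows: ~30 line items per acta at ~200 bytes
    (6_000_000, 1024),         # voter roll: ~6 million entries at ~1 KB each
]

raw_gb = estimate_storage_gb(tables)
safe_gb = raw_gb * 1.5 * 3.0   # 1.5x for indexes/logs, then a 3x deliberate overestimate
print(f"raw: {raw_gb:.0f} GB, safe allocation: {safe_gb:.0f} GB")
# → raw: 34 GB, safe allocation: 154 GB -- nowhere near 600 GB, let alone 6 TB
```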

Replication, put simply, is the ability to keep two or more databases synchronized to provide greater availability.  If they are in geographically different locations, they can also be used for disaster recovery.  Since the databases need to remain identical, they would normally need to be the same size.  So why would you replicate data from a 6 Terabyte database to a 1.8 Terabyte database?  Doesn't that mean you didn't need the 6 Terabyte database size?  Or even the 1.8 Terabyte size?  The report released by the TSE does not make clear whether the third server already had a cloned database, so that only newly added data needed to be replicated, or whether it started empty and had to replicate all the existing data across the network.  While database vendors design replication to minimize its impact on servers, the impact is still measurable.
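The size mismatch can be sanity-checked with a trivial calculation: a replica only needs enough capacity for the data actually stored on the primary, not for the primary's full allocation.  A minimal sketch, with all sizes illustrative:

```python
# A replica must hold the primary's *data*, not its allocated capacity.
# If a 1.8 TB replica suffices, the primary's 6 TB allocation was mostly empty.

def replica_can_hold(primary_data_tb, replica_capacity_tb, growth_margin=1.2):
    """True if the replica's capacity covers the primary's current data
    plus a modest growth margin (20% here, an arbitrary choice)."""
    return replica_capacity_tb >= primary_data_tb * growth_margin

print(replica_can_hold(primary_data_tb=1.0, replica_capacity_tb=1.8))  # True
print(replica_can_hold(primary_data_tb=5.0, replica_capacity_tb=1.8))  # False
```

In other words, accepting a 1.8 Terabyte replica is itself an admission that the actual data was far smaller than 6 Terabytes.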

In many ways the report released by the TSE, supposedly compiled by Audisis, raises more questions than it answers.  It provides an excuse for why the TSE systems went down for almost 9 hours, but the reason doesn't seem credible.
