Comparing databases (and discovering just how far I have to go)

Since arriving in Mexico I have been steadily building a database of narcomensajes. Over the course of several months, and working with different media sources, that base has grown by the thousands. It was never very clear, however, just how many messages I could expect to find. By one very rough estimate, I thought I should be able to find at least 3,500 messages from 2006 to 2013. When my database passed 4,000 messages, I figured it was time to pause and try to get some sense of the real size of the phenomenon I am studying.

To do so, I planned to cross-check my database with another. One of the few previous studies of narcomensajes uses government-compiled crime data for the period 2006-2011, which was leaked to CIDE (a local university). That base only includes messages left at the scene of a homicide, whereas my database includes other types of messages as well. Still, I figured the government would have unparalleled access to data, and thus that by comparing my collection of messages left at crime scenes to this other collection, I would gain some sense of how well my data collection was going. Furthermore, this other database is the only one I know of that isn’t compiled from media sources, which means that it is the only way I can peer outside of whatever biases my source material introduces into my data. The overall results of the comparison are as follows.

Cross-check results
Entries in my DB: 2,817
Entries in CIDE DB: 2,642
Matches: 915
Unique to my DB (eligible for CIDE DB): 237
Unique to my DB (ineligible for CIDE DB): 1,660
Unique to CIDE DB: 1,727
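
As a rough check on these figures, the short Python sketch below recomputes the overlap from the counts listed above. The variable names are illustrative assumptions, and "eligible" here means entries in my database that could in principle appear in the CIDE base. It is only arithmetic on the reported numbers, not a description of how the matching itself was done.

```python
# Counts copied from the cross-check results above.
my_db_total = 2817
cide_total = 2642
matches = 915
mine_only_eligible = 237
mine_only_ineligible = 1660
cide_only = 1727


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Entries in my database that could have appeared in the CIDE base.
eligible_mine = matches + mine_only_eligible  # 1152

# How much of each collection the other one also captured.
cide_matched_pct = share(matches, cide_total)     # 34.6
mine_matched_pct = share(matches, eligible_mine)  # 79.4

# Distinct eligible messages known to at least one of the two databases.
combined_eligible = matches + mine_only_eligible + cide_only  # 2879

# Consistency checks on the reported totals.
cide_sum_ok = (matches + cide_only == cide_total)  # True
unaccounted_in_mine = my_db_total - (
    matches + mine_only_eligible + mine_only_ineligible
)  # 5

print(f"CIDE entries also in my DB: {cide_matched_pct}%")
print(f"My eligible entries also in CIDE: {mine_matched_pct}%")
print(f"Combined eligible messages: {combined_eligible}")
print(f"Entries in my DB not covered by the three categories: {unaccounted_in_mine}")
```

Read this way, roughly a third of the CIDE entries have a match in my database, while about four fifths of my eligible entries have a match in the CIDE base. The three categories listed for my database sum to 2,812 rather than 2,817; the sketch simply reports that difference.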

At first glance, these results are a little dispiriting. The CIDE base contains 1,727 messages that my data collection approach isn’t capturing, far more than I expected. A fairly comprehensive database of messages for the time period that I am covering should probably contain more than 6,000 messages, which means growing my current base by another 50%. While I can do much of this by importing the data from the other base, the CIDE base contains only sparse information: no message transcriptions, and almost no contextual details. Importing the data will be a start, but I need to dig deeper into the media archives.

There are a couple of other unexpected and more interesting stories within the comparison results. First, the CIDE database officially starts in December 2006, but the first narco-message doesn’t appear until March 2007. This is the year for which my database has the highest relative number of messages that ought to be in the CIDE base but are not. This tells me that the people compiling the data took a while to pick up on the emerging phenomenon of messages, quite a while longer than journalists did. The earliest use of the term narcomensaje that I have found in the media dates back to 1999, and the term became more common, indeed something of a phenomenon, throughout 2006.

Relatedly, the gap between points unique to the CIDE base and points unique to my base (those that would be eligible for the CIDE one) grows with every year. This tells me that over time, government security officials not only get a better sense of the emerging phenomenon, but also increase their ability to control the public perception of it. There are all sorts of reasons why messages might not make it into the press, but the overall trend is that government officials are capturing and recording a greater portion of the messages, and also preventing journalists from doing the same. This accords with the newspaper reports that I have been reading, which over time become more dependent on official government accounts of the contents of messages.

Looking more carefully at the points that should be in the CIDE base but aren’t, I suspect that what is left out is not just a matter of what government officials accidentally overlooked. Some states, such as Sinaloa and Guerrero, have very high levels of violence, punctuated by even higher spikes. It is unsurprising to see that a good number of the points unique to my database come from these states. Something strange happens, however, in the state of Nuevo León (NL). This state saw relatively few messages until a massive increase in 2011. At the same time, however, the number of data points from NL that are unique to my base also spikes. State officials weren’t capturing the increase, despite the fact that it was highly public, widely reported, and came at the time of the state’s greatest control over the dissemination of these messages. The discrepancy is so anomalous that it almost looks like this surge in messages is being kept out of the government data.

Overall, the comparison shows that I have a lot of work still to do. My database can stand to grow a lot, and to add much richer data; I want as many of those message transcriptions as possible. The comparison also shows, however, that the CIDE database is less complete than I expected. I had assumed that this base would capture pretty much every message displayed, but clearly there are limits to even this expansive collection.