Comparing databases (and discovering just how far I have to go)

Since arriving in Mexico I have been steadily building a database of narcomensajes. Over the course of several months, and working with different media sources, that base has grown by the thousands. It was never very clear, however, just how many messages I could expect to find. By one very rough estimate, I thought I should be able to find at least 3,500 messages from the period 2006 to 2013. When my database passed 4,000 messages, I figured it was time to pause, and to try to get some sense of the real size of the phenomenon I am studying.

To do so, I planned to cross-check my database with another. One of the few previous studies of narcomensajes uses government-compiled crime data for the period 2006-2011, which was leaked to CIDE (a local university). That base only includes messages left at the scene of a homicide, whereas my database includes other types of messages as well. Still, I figured the government would have unparalleled access to data, and thus that by comparing my collection of messages left at crime scenes to this other collection of messages, I would gain some sense of how well my data collection was going. Furthermore, this other database is the only one I know of that isn’t compiled from media sources – which means that it is the only way I can peer outside of whatever biases creep into my data, based on the source material. The overall results of the comparison are as follows.

Cross-check results
Entries in my DB: 2,817
Entries in CIDE DB: 2,642
Matches: 915
Unique to my DB: 237
Unique to my DB (ineligible for CIDE DB): 1,660
Unique to CIDE DB: 1,727
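
For readers curious about the mechanics, a cross-check of this kind can be sketched as a comparison of sets. This is only an illustration and not the procedure I actually followed: the matching key (date plus municipality), the function name cross_check, the eligibility flag, and the toy records are all assumptions invented for the example.

```python
def cross_check(mine, cide):
    """Compare two collections of message records.

    mine: dict mapping a record key to True if the message was left at a
          homicide scene (and so would be eligible for the other base).
    cide: iterable of record keys from the other base.
    The key format (date, municipality) is a hypothetical choice.
    """
    mine_keys = set(mine)
    cide_keys = set(cide)
    only_mine = mine_keys - cide_keys
    return {
        "matches": len(mine_keys & cide_keys),
        "unique_mine_eligible": sum(1 for k in only_mine if mine[k]),
        "unique_mine_ineligible": sum(1 for k in only_mine if not mine[k]),
        "unique_cide": len(cide_keys - mine_keys),
    }


# Invented toy records, for illustration only.
toy_mine = {
    ("2010-05-01", "Acapulco"): True,
    ("2010-05-02", "Tijuana"): True,
    ("2010-05-03", "Cuernavaca"): False,
}
toy_cide = {("2010-05-01", "Acapulco"), ("2010-06-10", "Monterrey")}

result = cross_check(toy_mine, toy_cide)
print(result)
```

In practice, exact key matching would likely be too strict, since two sources rarely record the same event in exactly the same way, so some manual review of near-matches would still be needed.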

At first glance, these results are a little dispiriting. There’s a huge number of messages that my data collection approach isn’t capturing – far bigger than I expected. A fairly comprehensive database of messages for the time period that I am covering should probably contain more than 6,000 messages, which means growing my current base by another 50%. While I can do much of this by importing the data from the other base, the CIDE base contains only sparse information – no message transcriptions, and almost no contextual details. Importing the data will be a start, but I need to dig deeper into the media archives.

There are a couple of other unexpected and more interesting stories within the comparison results. First, the CIDE database officially starts in December 2006, but the first narco-message doesn’t appear until March 2007. This is also the year for which my database has the highest relative number of messages that ought to be in the CIDE base. This tells me that the people compiling the data took a while to pick up on the emerging phenomenon of messages – quite a while longer than journalists. The earliest use of the term narcomensaje that I have found in the media dates back to 1999, and the term became more common – became something of a phenomenon – throughout 2006.

Related to this, the gap between the points unique to the CIDE base and the points unique to my base (those that would be eligible for the CIDE one) grows with every year. This tells me that over time, government security officials not only get a better sense of the emerging phenomenon, but also increase their ability to control the public perception of it. There are all sorts of reasons why messages might not make it into the press, but the overall trend is that government officials are capturing and recording a greater portion of the messages – and also preventing journalists from doing the same. This accords with the newspaper reports that I have been reading, which over time become more dependent on official government accounts of the contents of messages.

Looking more carefully at the points that should be but aren’t in the CIDE base, I suspect that what is left out is not just a matter of what government officials accidentally overlooked. Some states, such as Sinaloa and Guerrero, have very high levels of violence, with even higher spikes of violence. It is unsurprising to see that a good number of the points unique to my database come from these states. Something strange happens, however, in the state of Nuevo León. This state saw relatively few messages until a massive increase in 2011. At the same time, however, the number of data points from NL that are unique to my base also spikes. State officials weren’t capturing the increase, despite the fact that it was highly public, widely reported, and came at the time of the state’s greatest control over the dissemination of these messages. It almost looks like this surge in messages is being kept out of the government data, so anomalous is the discrepancy.

Overall, the comparison shows that I have a lot of work still to do. My database can stand to grow a lot, and to add much richer data – I want as many of those message transcriptions as possible. The comparison also shows, however, that the CIDE database is less perfect than expected. I had assumed that this base would capture pretty much every message displayed, but clearly there are limits to even this expansive collection.

The perfect source, and its difficulties

A core part of my research involves the collection of data on the narcomensajes that have been appearing in Mexico since about 2006. Ideally, that data includes information about where and when the messages appear, contextual information such as whether the messages appear at a crime scene, and a full transcription of each message. There are numerous sources from which I can draw for the data collection, from national magazines, to local papers, to social media sites and narco blogs. The difficulty of data collection is not with the volume of sources, but with finding sources that can do what I need them to do: that are reliably searchable, have comprehensive archives, and that report the level of detail that I am looking for.

When I first started this project, conducting a preliminary investigation and working with the sources that I knew best, it would take me at least an hour to collect data on ten messages.

Before returning to my research this semester, I conducted a more comprehensive survey of media sources. This is how I found El Norte, a paper based in Monterrey, and part of the Reforma group of publications. El Norte had it all: an archive that dates back to 2006, a reliable search function, stories that cover all of Mexico, and reporting that includes all of the details that I am searching for (plus a lot more), presented in a succinct style. I had found close to my perfect source. When I started searching, I added about 30 data points in an hour.

Narcomensajes began appearing in Mexico in 2006, but at the time were a rare occurrence. With each passing year, however, the messages became more frequent. Searching the El Norte database, I could see certain patterns appear and disappear within the larger trend of messages, and I could watch certain cities or municipalities – Ciudad Juárez, Tijuana, Acapulco, Cuernavaca – being festooned with messages.

But then, searching the archives for the year 2011, that began to change. The violence that had mostly occurred in other parts of the country came to Monterrey and its surrounding municipalities.

El Norte was now reporting on violence taking place in the streets and neighbourhoods – sometimes literally on the doorsteps – of its core readership. The content of the newspaper articles began to change. Fewer transcriptions of messages were printed, and in their place vague allusions and paraphrasing were offered. Usually this amounted to generic lines such as “the message spoke of rivalry between criminal groups.” Reading between the lines, it is not hard to see that the paper was facing state pressure not to transmit the message of purported criminal groups. Very likely, the paper also faced pressure from rival criminal groups.

The reporting in El Norte also becomes much less outward-looking at this time. Instead of setting the scene with the state and municipality within which a message appeared, stories start with a cross street or local landmark in greater Monterrey. For local readers, such detail provides a crucial sense of certainty, a better grasp of exactly what is happening around them. For the very distant researcher, unfortunately, it means more searching for fewer results.

Even with this shift, El Norte has proven an invaluable source for my research. Thanks to the paper, and the efforts of its staff, I am going to have a halfway decent database. The difficulties encountered by the paper are also a reminder of just how dynamic a research topic violence is. Violence can’t be reduced to an input or output. It changes everything it touches. That includes academics; we may be more removed than our sources, but we need to reflect on what we’re doing, and what our research is doing to us.