Will Obama succeed in rallying Hispanic voters? Some evidence from Wikipedia data.

Just a few hours before the ballots open for the 57th presidential election, the key question for us data scientists is: which data set could reveal some special information that would not be easily available through a classic poll? We have already seen some interesting correlations of Wikipedia usage with the ongoing campaign – just looking at how many people search for the candidates' pages provides a time series with many fascinating details.

Today we focused on the question of whether the Democrats have been successful in rallying Hispanic voters to Obama's support. We took the Spanish Wikipedia, checked the daily views of Obama's Spanish-language Wikipedia page, and compared them with 2008 as well as with the time series of his Republican competitors.

This table shows the monthly averages of the daily views in 2008 and 2012 (November figures are from 2008 only, as the 2012 election has not yet taken place):

       McCain 2008  Romney 2012  Change   Obama 2008  Obama 2012  Change
Feb            549          674    +23%         2297        3154    +37%
Mar            265          532   +101%          949        3009   +217%
Apr            181          399   +120%          574        2748   +379%
May            240          466    +94%          817        2759   +238%
Jun            435          331    -24%         2052        2477    +21%
Jul            423          448     +6%          668        2161   +224%
Aug            501          918    +83%         1289        2226    +73%
Sep           1155         1285    +11%         1757        2915    +66%
Oct           1252         2064    +65%         3005        3502    +17%
Nov           2458            –       –        19110           –       –
Avg            506          841    +66%         1385        2801   +102%
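
The year-over-year comparison in the table is easy to reproduce. Here is a minimal Python sketch, assuming the stats.grok.se JSON interface that served Wikipedia page-view counts at the time of writing – the URL pattern and the "daily_views" field name are assumptions, and the service may have changed:

```python
import json
import urllib.request

# Assumed URL pattern of the stats.grok.se page-view service.
STATS_URL = "http://stats.grok.se/json/{lang}/{yyyymm}/{article}"

def monthly_avg(lang, yyyymm, article):
    """Average daily views of one article in one month."""
    url = STATS_URL.format(lang=lang, yyyymm=yyyymm, article=article)
    with urllib.request.urlopen(url) as f:
        daily = json.load(f)["daily_views"]
    views = [v for v in daily.values() if v > 0]  # skip days without data
    return sum(views) / len(views)

# April of the Obama columns above: 574 views in 2008, 2748 in 2012, +379%
avg_2008 = monthly_avg("es", "200804", "Barack_Obama")
avg_2012 = monthly_avg("es", "201204", "Barack_Obama")
print(f"Apr: {avg_2008:.0f} -> {avg_2012:.0f} ({avg_2012 / avg_2008 - 1:+.0%})")
```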

Obama clearly leads – not only in absolute numbers but in particular regarding the increase of this year's views over four years ago. While Romney gained 66% in views compared with McCain, Obama's views more than doubled.

The daily views of Obama's Spanish Wikipedia page have been consistently higher than four years ago, while the Republican candidates, at least at the beginning of the campaign, remained more or less on the same level. However, the results for the last week show an interesting difference: both candidates are losing attention in terms of their Wikipedia relevance.

Indeed, if we look just at the last week prior to election day, we can see something strange happening: the views of Obama's es.wikipedia page have dropped from a daily average of 5065 in 2008 to merely 3752 in 2012. The same is true for Romney versus McCain: 1795 average views in this year's 44th week compared to 1986 in 2008.

This decreasing interest in the candidates is not reflected in the numbers we see for other election-related search terms. If we take, e.g., 'US Presidential Election', we count 2672 daily views during the week before election day in 2008 and 3812 views in 2012 – the same rise in interest that we found in the English Wikipedia, too (see the last post, "Why the 2012 US elections are more exciting than 2008"). While the general interest in the elections is huge, the candidates no longer draw that much attention from the Spanish-speaking community.
Maybe "Sandy" would work as an explanation, since the campaign was halted during the hurricane – nevertheless, it would not be plausible that only the candidates, but not the election in general, would suffer in awareness from this.

So we cannot draw a clear conclusion from our findings. There is evidence that Obama has succeeded to some extent in activating the interest of Hispanic people, but given the unexpected drop, we will have to drill further down. The real work, though, will start right after the vote anyway: learning what would have been a signal, and how we can separate these signals from the noise next time.

3,4,5 … just how many Vs are there?

I am confident that BigData will harden into a solid and not get vapourised. Perhaps someday it will reach the critical temperature and just become transparent – like so many other disrupting developments.
(Fig. CC BY-SA 3.0 by Matthieumarechal)
It took a while for the three Vs of BigData to take off as one of the industry's most frequently quoted buzzwords. Doug Laney of META Group (since acquired by Gartner) had coined the term in his 2001 paper "3-D Data Management: Controlling Data Volume, Velocity and Variety". But now, with everybody in the industry philosophizing on how to characterize BigData, it is no wonder that we start seeing many mutations in the viral spread of Laney's catchy definition.

Be it veracity or volatility, or no Vs at all – many aspects of BigData are now metaphorically transformed into terms starting with V.

Let's just hope nobody comes along with too much vapour, bursting the bubble before it has matured. But I am confident ;)

wind map – truly beautiful data!

(blow friend to fiend: blow space to time)
—when skies are hanged and oceans drowned,
the single secret will still be man

e. e. cummings, what if a much of a which of a wind

Open data is great. The National Digital Forecast Database offers free access to all the weather forecast data of the US National Weather Service. All of the US is covered with predicted values for the variables that make up the weather, like cloud cover, temperature, wind speed and direction.
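
As an illustration of how accessible the data is, here is a small Python sketch querying the NDFD XML ("DWML") time-series interface for wind speed and direction. The endpoint URL, parameter names and element names are assumptions based on the service's documentation and should be checked against the current interface:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed NDFD XML client endpoint (check graphical.weather.gov for the
# current interface before relying on it).
NDFD = ("https://graphical.weather.gov/xml/sample_products/"
        "browser_interface/ndfdXMLclient.php")

params = {
    "lat": 40.71, "lon": -74.01,  # New York City
    "product": "time-series",
    "wspd": "wspd",               # sustained wind speed
    "wdir": "wdir",               # wind direction
}
with urllib.request.urlopen(NDFD + "?" + urllib.parse.urlencode(params)) as f:
    tree = ET.parse(f)

# DWML nests forecast values under elements such as <wind-speed>.
for node in tree.iter("wind-speed"):
    values = [v.text for v in node.iter("value")]
    print(node.get("type"), values[:5])
```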

Fernanda Viégas and Martin Wattenberg, two artists from Cambridge, MA, have turned the wind forecast into a beautiful visualization. Their remarkable site http://hint.fm hosts many fantastic data-viz projects, like the Flickr Flow, that are the best examples of what treasures can be excavated from open data sources.

Not just because of hurricane Sandy, the Wind Map is one of the best pieces they have on display. This is beautiful data!

10 Petabytes of Culture at Archive.org

Archive.org celebrated crossing the mark of 10 petabytes of data stored.[1] The non-profit organization, based in San Francisco, has been following its mission to archive the Internet and provide universal access to all knowledge since 1996.

The number of 10^16 bytes might look impressive (and if we remember typical server storage capacity ten years ago, it still is, to be honest) – however, the amount of data processed daily by Google alone exceeds double that, not to speak of the several hundred petabytes of images and video stored by Facebook and YouTube. So while the achievements of Archive.org in the preservation of culture are invaluable, the task of keeping track of the daily data deluge seems out of reach, at least for the time being. Coping with mankind's data heritage will surely become a fascinating challenge for BigData.

TechAmerica publishes “Big Data – A practical guide to transforming the business of government”

Earlier this month, the TechAmerica Foundation published its comprehensive reader "Demystifying Big Data: A Practical Guide To Transforming The Business of Government".

Lobbying politicians to follow the Big Data path and to support the industry by issuing the necessary changes in education and research infrastructure is a just and also obvious goal of the text. Nevertheless, the publication offers quite some interesting information on Big Data in general and its application in the public sector in particular.

It is also a good introduction to the field, defining not only the notorious "three Vs" – volume, velocity, variety – that we are used to characterizing Big Data with, but adding a fourth V: veracity, the quality and provenance of received data. Because of the great progress in error and fraud detection, outlier handling, sensitivity analysis, etc., we tend to neglect the fact that data-based decisions still require traceability and justification – with those huge heaps of data more than ever.

To encourage every federal agency "to follow the FCC's decision to name a Chief Data Officer" is one of the sensible conclusions of the text.

How content is propagated might tell what it’s about

Memes – images, jokes, content snippets that get spread virally on the Net – have been a popular topic in the Net's pop culture for some time. A year ago, we started thinking about how we could operationalize the meme concept and detect memetic content. Thus we started the Human Meme Project (the name a play on mixing culture and genetics). We collected all available links to images that had been posted on social networks, together with the metadata that would go with these posts, like date and time, language, count of followers, etc.

With referrers to some 100 million images, we could then look into the interesting question: how would "the real memes" get propagated, and could we see differences in certain types of images regarding their pattern of propagation? Soon we detected several distinct paths of content being spread. And after having done this for a while, these propagation patterns would often tell us more about an image than we could have extracted from the caption or the post's text.

Case 1: detection of "astroturfing" and "Twitter-bombing"

Of course this kind of analysis is not limited to pictorial content. A good example of how the insights of propagation analysis can be used is shown on sciencenews.org. Astroturfing or Twitter-bombing – flooding discussions with messages that seem angry and very critical towards some candidate or position, and that look like authentic rage at first sight, although in reality they are machine-generated spam – could pose a threat to political discussion in social networks and even harm democratic elections.
This new type of "black PR", however, can be detected by analyzing the propagation pattern.

Two examples of how memetic images are shared within communities. The vertices represent the shared images; edges connect images posted by the same user. The left graph results from a set of images that get propagated with almost equal probability within the supporting community; the right graph shows an image that made its way into two disjoint communities before spreading further.
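
Graphs like these are straightforward to build. Here is a toy Python sketch of the construction using networkx, with a handful of invented posts standing in for our real data set:

```python
import itertools
from collections import defaultdict

import networkx as nx

# Toy (user, image) post records standing in for the real data set.
posts = [
    ("alice", "img1"), ("alice", "img2"),
    ("bob",   "img2"), ("bob",   "img3"),
    ("carol", "img4"), ("carol", "img5"),
]

images_by_user = defaultdict(set)
for user, image in posts:
    images_by_user[user].add(image)

G = nx.Graph()
G.add_nodes_from(image for _, image in posts)
for images in images_by_user.values():
    # vertices are images; an edge connects two images posted by the same user
    G.add_edges_from(itertools.combinations(sorted(images), 2))

# disjoint communities, as in the right-hand graph, show up as components
print([sorted(c) for c in nx.connected_components(G)])
# -> [['img1', 'img2', 'img3'], ['img4', 'img5']]
```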

Case 2: identification of insurgent pamphlets

After the first wave of uprisings in Northern Africa, the remaining regimes became more cautious and installed many kinds of surveillance and filter technologies on the Net. To avoid the governmental crawlers, insurgents started to write their pamphlets by hand in a calligraphic style that no OCR would decipher. These handwritten notes would get photographed and then posted on the social web with some inconspicuous text. But what might have tricked spooks in the good old times would not deceive the data scientist: these calls for protest, although artfully disguised, leave a distinct trace on their way through Facebook, Twitter and the like. It is not our intention to deliver our findings to the tyrants so they can close this gap in their surveillance. We are in fact convinced that similar approaches are already in place in many authoritarian regimes (and maybe some democracies as well). Thus we think the fact should be as widely spread and recognized as possible.

Both examples show again that just looking at the containers and their dynamics can tell as much about the content as a direct approach.

Posthuman Advertising

The future of advertising after Siri – or: posthuman advertising.
by Benedikt Koehler and Joerg Blumtritt

The Skynet Funding Bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time. (Terminator 2: Judgment Day, 1991)

The advent of computers, and the subsequent accumulation of incalculable data has given rise to a new system of memory and thought parallel to your own. Humanity has underestimated the consequences of computerization. (Ghost in the Shell, 1996)

Are you a sentient being? – Who cares whether or not I am a sentient being. (ELIZA, 1966)

“Throughout human history, we have been dependent on machines to survive.” This famous punch line from “The Matrix” summarizes what the Canadian media philosopher Herbert Marshall McLuhan had brought to the mind of his time: our technology, like our culture, cannot be separated from our physical nature. Technology is an extension of our body. Commonly quoted are his examples of wheels being an extension of our feet, clothes of our skin, or the microscope of our eyes. Consequently, McLuhan postulated that electronic media would become extensions of our nervous system, our senses, our brain.

Advertising as we know it

Advertising means deliberately reaching consumers with messages. We as advertisers are used to sending an ad to be received through the senses of our "target group". We choose adequate media – the word literally meaning "middle" or "means" – to increase the likelihood of our message being viewed, read or listened to. After our targets have been contacted in that way, we hope that, by a process called advertising psychology, their attitudes and finally their actions will change in our favor – giving consumers a "reason why" they should purchase our products or services.

The whole process of advertising is obviously rather cumbersome and based on many contingencies. In fact, although hardly anyone would deny that consumption is a necessary or even entertaining part of their lives, almost everyone is tired of the "data smog", as David Shenk calls it, of receiving between 5,000 and 15,000 advertising messages every day. Diminishing effectiveness and rising inefficiency are the consequences of our mature mass markets, cluttered with competing brands. Ad campaigns fighting for our attention are ever more often experienced as spam. To get through, you have to out-shout the others, to be just more visible, more present in your target group's life. "The medium is the massage", as McLuhan himself twisted his own famous saying.

Enter Siri

When Apple launched the iPhone 4S, the OS incorporated a peculiar piece of software bearing the poetic name Siri (the Valkyrie of victory). At first sight, Siri just appears to be some versatile interface that allows controlling the device in a way much closer to natural communication. But Siri has legendary ancestors, stemming from DARPA's cognitive-agents program. Software agents have been around for some time; normally we experience them as recommendation engines in shop systems such as Amazon or eBay, offering us items the agent guesses fit our preferences by analyzing our previous behavior.

Such preference algorithms are part of a larger software and database concept, usually called agents – or daemons in the UNIX context. Although there is no general definition, agents should to a certain extent be self-adapting to their environment and its changes, be able to react to real-world or data events, and interact with users. Thus agents may seem somewhat autonomous. Some fulfill monitoring or surveillance tasks, triggering actions when some constellation of inputs occurs; some are made for data mining, to recognize patterns in data; others predict users' preferences and behavior, as shopping recommendation systems do.
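
To make the concept concrete, here is a minimal sketch of such an agent in Python – every name is invented for illustration, not taken from any real agent framework:

```python
class PersonalAgent:
    """Toy agent: reacts to events via registered triggers and adapts a
    simple preference model from user feedback."""

    def __init__(self):
        self.preferences = {}  # item -> learned score in [0, 1]
        self.rules = []        # (predicate, action) pairs

    def on(self, predicate, action):
        """Register a trigger: run `action` when `predicate(event)` holds."""
        self.rules.append((predicate, action))

    def handle(self, event):
        """React to a real-world or data event."""
        for predicate, action in self.rules:
            if predicate(event):
                action(event)

    def observe(self, item, liked):
        """Self-adaptation: nudge the item's score toward the feedback."""
        old = self.preferences.get(item, 0.5)
        self.preferences[item] = 0.8 * old + 0.2 * (1.0 if liked else 0.0)

    def recommend(self, candidates):
        """Predict which candidate item the user will prefer."""
        return max(candidates, key=lambda c: self.preferences.get(c, 0.5))

agent = PersonalAgent()
agent.on(lambda e: e == "low battery", lambda e: print("dimming the screen"))
agent.handle("low battery")
agent.observe("jazz playlist", liked=True)
print(agent.recommend(["jazz playlist", "news podcast"]))
```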

Siri is apparently a rather sophisticated personal agent that monitors not only the behavior on the phone but also many other data sources available through the device. You might, for example, tell Siri: "Call me a cab!" – and the phone will autodial the local taxi operator. Ever more often, people can be watched standing at some street corner, muttering into their phones: "Siri, where am I?" And Siri will dutifully answer, deploying the phone's GPS data.

Our personal agents

Agents like Siri create a data-wise form of representation of ourselves. These representations – we might also call them 'avatars' – are not arbitrarily shaped like the avatars we might take on in playing multiuser games like World of Warcraft. It is not us willingly giving them shape; it is algorithms taking whatever information they can get about us to project us into their data-space. This is similar to what big data companies like Google or Facebook do by collecting and analyzing our search inputs, our surfing behavior or our social graph. But in the case of personal agents, the image that is created from our data is kept in cohesion, stays somehow material, and even becomes personally addressable. Thus these avatars become more and more simulacra of ourselves, projections of our bodily life into the data-sphere.

We hope the reader notices the fundamental difference between our avatar and algorithms that predict something about us from some data collected about us, or generalized from others' behavior, as we find with advertising targeting or retail recommendations. In the case of our avatar, we really take the agent as a second skin, made from data.

And suddenly, advertising is no longer necessary to promote goods. Our avatar is notified of offerings and receives sales proposals. It can autonomously decide what it finds relevant or appropriate, just the way Google already decides in our place which web pages rank higher or lower in our search results. Instead of getting our bodily senses' attention for the ad's message, the advertiser now faces the new task of persuading our avatar's algorithms of the benefits of the advertised good.

Instead of advertising psychology – the science of getting into someone's mind through rhetoric, creative work, media placements, etc. – advertising will be hacking into our avatars' algorithms. This will be very similar to today's search engine optimization: promoting new goods will mean trying to reach the top ranks of as many avatars' preferences as possible. Of course, continuous business will only be sustained if our avatar judges the product satisfying once taken into consideration.
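
What such "avatar optimization" might target can be sketched in a few lines of Python: a toy relevance function an avatar could apply to incoming offers, which advertisers would then try to please. Fields, weights and products are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Offer:
    product: str
    price: float
    tags: frozenset

def score(offer, preferences, budget):
    """The avatar's relevance function that advertisers would try to please."""
    tag_match = len(offer.tags & preferences) / max(len(offer.tags), 1)
    affordability = max(0.0, 1.0 - offer.price / budget)
    return 0.7 * tag_match + 0.3 * affordability

preferences = frozenset({"running", "outdoor"})
offers = [
    Offer("trail shoes", 120.0, frozenset({"running", "outdoor"})),
    Offer("smart tv", 900.0, frozenset({"entertainment"})),
]
# the avatar ranks offers; only the best ever reach its human's attention
for o in sorted(offers, key=lambda o: score(o, preferences, 500.0), reverse=True):
    print(o.product, round(score(o, preferences, 500.0), 2))
```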

A second skin

But why stop at retail? Our avataric agents will be doing much more for us – for better or worse. Apart from residual bursts of spontaneity that might lead us to do things at will – irrationally – our avatars could take over organizing our day-to-day lives, make appointments for us, and navigate us through our business. They would pre-schedule dates for meetings with our peers according to our preferences and the contents of the communication they continuously monitor.

You could imagine our data-skin as some invisible aura, hovering around our physical body in an extra dimension. Like a telepathic extension of our senses, the avatars would make us aware of things not immediately present – like someone trying to reach out to us, or something that has to be done now. And although this might at first sound spooky, we are in fact not very far from these experiences: our social media timeline – the things we notice in the posts of our friends and the other people we follow on Facebook, Twitter or Google+ – already tends to connect us to others in a continuous and non-physical way. Just think of this combined with our personal assistants – like the calendars and notes we keep on our devices – and with the already quite advanced shop agents on Amazon and other retailers, and we have arrived in a post-human age of advertising. Only one thing remains to be built before our avatar is complete: standardized APIs, interfaces that would suck the data of various sources into our avatar's database. Thus every one of us would become a data kraken of our own. And this might be what 'post-privacy' is finally all about.