The Second Index. Search Engines, Personalization and Surveillance
Google’s well-known mission is “to organize the world’s information”. It is, however, impossible to organize the world’s information without an operating model of the world. Melvil(le) Dewey (1851-1931), working at the height of Western colonial power, could simply take the Victorian world view as the basis for a universal classification system, which, for example, put all “religions other than Christianity” into a single category (no. 290). Such a biased classification scheme, for all its ongoing usefulness in libraries, cannot work in the irreducibly multi-cultural world of global communication. In fact, no uniform classification scheme can work, given the impossibility of agreeing on a single cultural framework through which to define the categories.2 This, in addition to the scaling issues, is the reason why internet directories, as pioneered by Yahoo! and the Open Directory Project (dmoz)3, broke down after a short-lived period of success.
Search engines side-step this problem by flexibly reorganizing the index in relation to each query and using the self-referential method of link analysis to construct the ranking of the result list (see Katja Mayer in this volume). This ranking is said to be objective, reflecting the actual topology of the network that emerges unplanned through collective action. Knowing this topology, search engines favor link-rich nodes over link-poor outliers. This objectivity is one of the core elements of search engines, since it both scales well and increases the users’ trust in the system.
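The link-analysis principle can be illustrated with a minimal sketch of a PageRank-style iteration. The graph, damping factor, and iteration count below are illustrative assumptions for exposition, not Google’s actual implementation:

```python
# Minimal PageRank-style link analysis: pages that receive many links
# from well-linked pages rank higher. Purely illustrative.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A link-rich hub ("a") outranks a link-poor outlier ("d").
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = pagerank(graph)
```

The “objectivity” the text describes lies in the fact that the ranking is derived entirely from the link topology created by third parties, with no per-user input.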
Yet, this type of objectivity has its inherent limits. The index knows a lot about the information from the point of view of the providers who create the topology of the network (through creating links), but knows nothing about the searcher’s particular journey through the informational landscape. What might be relevant on an aggregate, topological level could well be irrelevant on the level of the individual search interest. This problem is compounded by the fact that the actual customers of the search engines, the advertisers, are not interested in the topology of the network either, but in the individual users journeying through the network. These two forces, one intrinsic to the task of search engines themselves, the other intrinsic to what has become their dominant business model – advertising – drive the creation of the second index. This one is not about the world’s information, but about the world’s users of information. While the first one is based on publicly available information created by third parties, the second one is based on proprietary information created by the search engines themselves. By overlaying the two indexes the search engines hope to improve their core tasks: to deliver relevant search results to the users, and to deliver relevant users to advertisers. In the process, search engines create a new model of how to organize the world’s information, a model composed of two self-referential worlds: one operating on the collective level – one world emerging from the interactions of everyone (at least in its ideal version) – and the other operating on the individual level – one’s world as emerging from one’s individual history. 
Both of these levels are highly dynamic and by interrelating them, search engines aim to overcome the problem of information overload (too much irrelevant information), which both users and customers constantly encounter, the former when being presented with hundreds of “hits” while looking for just a handful (maybe even just one), the latter when having to relate to a mass of people who don’t care, rather than just a few prospective customers. Speaking about the user’s problem, Peter Fleischer, Google’s global privacy counsel, writes:
Developing more personalized search results is crucial given how much new data is coming online every day. The University of California Berkeley estimates that humankind created five exabytes of information in 2002 – double the amount generated in 1999. An exabyte is a one followed by 18 noughts. In a world of unlimited information and limited time, more targeted and personal results can really add to people’s quality of life.4
Fleischer and others usually present this as a straightforward optimization drive – improved search results and improved advertising. While this is certainly part of the story, it is not everything. This development also raises a number of troubling issues, ranging from surveillance, understood both as individual tracking and social sorting, to a potentially profound loss of autonomy. The latter is related to the fact that we are presented with a picture of the world (at least how it appears in search results) made up of what someone else, based on proprietary knowledge, determines to be suitable to one’s individual subjectivity.
In the following, we will try to address these issues. First, we will provide a sense of the scale of this second index as it is being compiled by Google, the most prolific gatherer of such data. Given the sensitive and proprietary nature of that index, we can only list some publicly visible means through which this information is generated and look at the particular types of information thus gathered. We have no inside knowledge about what is being done with that information. Based on this overview, we will discuss some aspects of surveillance and of personalization. It is key to note that these are not issues we can reasonably opt out of. The pressures that are working to create and use the second index are real, and they are also driven by legitimate user needs. Thus, the question raised at the end of this text is how to address these challenges in an adequately nuanced way.
Since its inception, Google has been accumulating enormous amounts of information about its users. Yet surveys have shown that most users are not aware of the wide range of data collected; many still have an understanding of Google as a search engine rather than a multi-billion dollar corporation making large profits from devising personalized advertising schemes.5
While we concentrate on Google, it is important to note that it does not differ from its competitors in principle. Other search engines like Yahoo! and Live also enable cookies to track users’ search histories; stores like Amazon.com also store information on their customers’ shopping habits or credit card numbers; social networking sites like Facebook have had their own share of privacy problems, such as when users found out they could not permanently delete their accounts.6 Cloud computing is promoted by companies like Microsoft or Dell, too. The case of Google, however, is special because of the sheer quantity of data that is gathered, because of the enormous variety and quality of this information, and because of Google’s growing market dominance in the fields of web search, advertising, and cloud computing.
For purposes of analysis we can differentiate between three types of profiles, which together create a comprehensive profile of each user. First, users are tracked as “knowledge beings” and, in a second step, as “social beings”, exploiting their real identities, contacts, and interaction with those contacts. A final data set captures the users as physical beings in real space.
The following survey is by no means a comprehensive collection of Google’s services, not least of all because they are constantly expanding. Yet we believe that the services which will be added in the future will further contribute to increasing the scope of any (or all) of these three profiles.
Layer 1: Assembling a Knowledge Profile
The basic service of Google Search has never required its users to register for an account or to actively provide any personal details. Standard log information (see above) is, however, always gathered in the background. In addition, Google servers record information specific to the search function: the search query itself, default language settings, the country code top-level domain (whether google.com or google.at is being used), the search engine result pages and the number of search results, as well as settings like safe search (filtering and blocking adult content), or additional search options like file type and region.
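The server-side data just described can be pictured as a single record per query. The field names and structure below are illustrative assumptions, not Google’s actual log format:

```python
from dataclasses import dataclass, field

@dataclass
class SearchLogRecord:
    # Standard log information gathered in the background
    ip_address: str
    timestamp: str
    user_agent: str
    cookie_id: str
    # Search-specific information
    query: str
    language: str = "en"
    cctld: str = "google.com"   # country-code domain used, e.g. google.at
    results_count: int = 0
    safe_search: bool = False
    extra_filters: dict = field(default_factory=dict)  # file type, region, ...

# Hypothetical record for one query
record = SearchLogRecord(
    ip_address="203.0.113.7", timestamp="2009-01-15T10:23:00Z",
    user_agent="Mozilla/5.0", cookie_id="abc123",
    query="second index", cctld="google.at", results_count=118000,
)
```

Even without an account, the cookie identifier allows successive records of this kind to be chained into a search history.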
Various other services can arguably be placed in the same category, firstly because they let users search for web information provided by third parties – be it websites, magazines, books, pictures, or products – and secondly because Google employs them to gather similar sets of data about its users. These services include Google Directory, Image Search, News Search, News Archive Search, Google Scholar, Google Books, Google Video, Blog Search, the LIFE photo archive hosted by Google, or Google Product Search (the service formerly known as Froogle).
Hal Roberts from the Berkman Center for Internet & Society at Harvard University writes that,
[i]n fact, it is likely that this collection of search terms, IP addresses, and cookies represents perhaps the largest, most sensitive single collection of data extant, on- or offline. Google may or may not choose to do the relatively easy work necessary to translate its collection of search data into a database of personally identifiable data, but it does have the data and the ability to query personal data out of the collection at any time if it chooses (or is made to choose by a government, intruder, disgruntled worker, etc).8
The introduction of services like AdSense, AdWords, AdPlanner, or Analytics9 has helped Google to spread even further: Google Analytics can be installed on any website for free, but in exchange Google is entitled to collect information from all sites using its tool. Some estimates say that “the software is integrated into 80 percent of frequently visited German-language Internet sites”.10 The same principle governs Google’s advertising schemes. Visitors to off-Google sites are usually not aware that Google servers could nevertheless be logging their online activities because “[w]eb companies once could monitor the actions of consumers only on their own sites. But over the last couple of years, the Internet giants have spread their reach by acting as intermediaries that place ads on thousands of Web sites, and now can follow people’s activities on far more sites.”11
Google’s new web browser Chrome, a good example of the development towards the Google Cloud, must also be seen as a tool to construct the user’s knowledge profile – with the additional benefit that it blurs the distinction between what the user “knows” online and offline. “Chrome’s design bridges the gap between desktop and so-called ‘cloud computing’. At the touch of a button, Chrome lets you make a desktop, Start menu, or Quick Launch shortcut to any Web page or Web application, blurring the line between what’s online and what’s inside your PC.”12 Chrome’s “Omnibox” feature, an address bar which works like a Google search window with an auto-suggest function, logs every character typed by its users even before they hit the enter button.13 Faced with severe criticism, Google maintained that the data collected by this keystroke logging process (keystrokes and associated data such as the user’s IP address) would be rendered anonymous within 24 hours.14 Concerns have also been voiced regarding Chrome’s history search feature, which indexes and stores sensitive user data like personal financial or medical information even on secure (https://) pages.15
New services are introduced to systematically cover data which has not previously been included in the user profile. In the process, basic personal data is enriched by and may be combined with information covering virtually each and every aspect of a user’s information needs: Google servers monitor and log what users search for (Google Search, Google Toolbar, Google Web History), which websites they visit (Google Chrome, Web Accelerator), which files (documents, e-mails, chat, web history) they have on their computers (Google Desktop), what they read and bookmark (Google Books incl. personalized library, Google Notebook, Google Bookmarks, Google Reader), what they write about in e-mails (Gmail), in work files and personal documents online (Google Docs) and offline (Google Desktop), in discussion groups (Google Groups), blogs (Blogger), or chats (Google Talk), what they want translated and which languages are involved (Google Translate).
Layer 2: Constructing the User as a Social Being
With the advent of Web 2.0 it became clear that users are not just knowledge-seeking individuals, but also social beings intensely connected to each other online and offline. Google’s services have been expanding and more personalized programs have been added to its portfolio to capture the social persona and its connections.
Signing up for a Google Account allows users to access a wide range of services, most of which are not available “outside” the account. If you choose to edit your profile, basic personal details gleaned by Google from your account are not only the number of logins, your password, and e-mail address, but also your location, real-life first name and last name, nickname, address, additional e-mail address(es), phone number(s), date and place of birth, places you’ve lived in, profession, occupations past and present, a short biography, interests, photos, and custom links to online photo albums, social network profiles, and personal websites.
In addition, Google acquired previously independent sites like YouTube and Blogger, “migrated” their account information, and launched additional social offerings such as Orkut, Lively, Chat, Calendar, mobile phone services, and more.
Layer 3: Recreating the User’s Real-Life Embodiment
For efficient advertising, the process of constructing the user increasingly tries to capture information about the individual as an embodied person and as an agent within a real-life environment. In order to add such data to its burgeoning second index, Google collects such disparate data as its users’ blood type, weight, height, allergies, and immunizations, their complete medical history and records such as lists of doctors’ appointments, conditions, prescriptions, procedures, and test results (Google Health). Google already knows where its users live (Google Account, Google Checkout, Google Maps default location) and where they want to go (Google Maps), which places they like (annotated maps with photos of and comments about “favorite places” in Google Maps), what their house looks like (Google Maps satellite function, Google Street View), which videos they watch (Google Video, YouTube), what their phone numbers are (Google Checkout), which mobile carrier and which mobile phones they use, and where (Dodgeball, GrandCentral, G1/Android, MyLocation).
New data-rich sources are accessed through mobile phones, which store their owner’s browsing history and sensitive personal data that identifies them, and combine these with new functions such as location tracking. Android, Google’s new mobile phone platform, offers applications like Google Latitude and MyLocation, which are used to determine a user’s position; MyTracks is used to track users over time.18 Google will thereby be able to assemble even more detailed consumer profiles; by targeting users with geo-located ads, Google is creating a virtual gold mine for associated advertisers.19 This foray into the mobile market enables Google to collect mobile-specific information, which can later be associated with a Google Account or with some other similar account ID. Google Latitude also offers an option to invite a “friends” list to track the inviter. Privacy groups have drawn attention to the problem that user-specific location data can also be accessed by third parties without a user’s knowledge or consent and that the victim may remain unaware of being tracked20, thereby rendering ordinary mobile phones useful tools for personal surveillance.21
The development of Android is perhaps the clearest example of the amount of resources Google is willing to invest in order to generate and get access to data for expanding its second index; an entire communication infrastructure is being created that is optimized for gathering data and for delivering personalized services, which are yet another way of gathering more data.
Thus, Google systematically collects data to profile people in terms of their knowledge interests, social interaction, and physical embodiment. Outsiders can only observe the mechanism of gathering this data, which allows conclusions about the scope of the second index. What we cannot know is what precisely is being done with it, and even less what might be done with it in the future. What we can assess, though, are some issues arising from the mere existence of this index and the published intentions for its use.
Surveillance and Personalization
Considering the scope and detail of personal information gathered by Google (and other search engines), it is clear that very significant surveillance capacities are being generated, and privacy concerns have been voiced frequently.22 As we try to unravel the potential of that surveillance, it is important to distinguish three kinds of surveillance. While they are all relevant to search engines, they are actually structured very differently and subject to very different influences. First is surveillance in the classic sense of Orwell’s Big Brother, carried out by a central entity with direct power over the person under surveillance, for example the state or the employer. Second is surveillance in the sense of an increasing capacity of various social actors to monitor each other, based on a distributed “surveillant assemblage” where the roles of the watchers and the watched are constantly shifting.23 Finally, there is surveillance in the sense of social sorting, i.e. the coding of personal data into categories in order to apply differential treatment to individuals or groups.24
In the first case, powerful institutions use surveillance to unfairly increase and exercise their power. Search engines, however, have no direct power over their users which they can abuse. Assuming that search engines are concerned with their reputation and do not provide data from their second index to third parties, their surveillance capacities of this type appear to be relatively benign. Of course, this assumption is deeply questionable. Complex information processing systems are always vulnerable to human and technical mistakes. Large data sets are routinely lost, accidentally disclosed, and illegally accessed. Just how sensitive such search engine data can be was demonstrated in August 2006, when AOL made available the search histories of more than 650,000 persons. Even though the search histories were anonymized, the highly personal nature of the data made it possible to track down some of the users whose data had been published.25 We can assume that large search engine companies have extensive policies and procedures in place to protect their data, thereby decreasing the likelihood of accidents and breaches. However, the massive centralization of such data makes any such incident all the more consequential, and the pull exerted by the mere existence of such large surveillance capacities is more problematic. Given the insatiable demands of business rivals, law enforcement and national security agencies, requests to access these data will certainly be made.26 Indeed, they have already been made. In August 2005, all major search engines in the US received a subpoena to hand over records on millions of their users’ queries in the context of a review of an online pornography law. Among these, Google was the only one that refused to grant access to these records.27 While this was the only publicized incident of this kind, it is highly unlikely that this was the only such request.
The expanded powers of security agencies and the vast data sets held by search engines are bound to come in close contact with one another. Historically, the collaboration between telecom companies and national security agencies has been close and extensive. Massive surveillance programs, such as ECHELON, would have been impossible without the willing support of private industry.28 It is thus quite possible that search engines are already helping to expand big-brother surveillance capacities of the state, particularly the US government.
Another more subtle effect of surveillance was first exploited by Jeremy Bentham in his design for a correctional facility (1785) and later theorized as a more general governmental technique by Michel Foucault. In a Panopticon, a place where everything can be seen by an unseen authority, the expectation of being watched affects the behavior of the watched. Behavior is normalized, that is, patterns which are assumed to be “normal” are reinforced simply through the knowledge that the authority could detect and react to abnormal behavior.29 This has also been shown to occur30 in the context of CCTV surveillance. In the case of search engines, the situation is somewhat different because the watching is not done by a central organization; rather, the search engines provide the means for everyone to place themselves at the center and watch without being watched (at least not by the person being watched). It is not far-fetched to assume that the blurring of private and public spheres on the web impacts on a person’s behavior. It is much too early to say anything systematic about the effects of this development, not least because this surveillance does not try to impose a uniform normative standard. Thus, it is not clear what normalization means under such circumstances. However, this is not related to the surveillance carried out by search engines themselves, but rather to their function of making any information accessible to the public at large and the willingness of people to publish personal information.
What is specific to the second index is the last dimension of surveillance, social sorting. David Lyon develops the concept in the following way:
Codes, usually processed by computers, sort out transactions, interactions, visits, calls and other activities. They are invisible doors that permit access to, or exclude from participation in a myriad of events, experiences and processes. The resulting classifications are designed to influence and manage populations and persons thus directly and indirectly affecting the choices and chances of data subjects.31
Rather than treating everyone the same, social sorting allows matching people with groups to whom particular procedures, enabling, disabling or modifying behavior, are assigned. With search engines, we encounter this as personalization. It is important to note that, as David Lyon stresses, “surveillance is not itself sinister any more than discrimination is itself damaging.”32 Indeed, the basic purpose of personalization is to help search engines to improve the quality of search results. It enables them to rank results in relation to individual user preferences, rather than to network topology, and helps to disambiguate search terms based on the previous path of a person through the information landscape. Personalization of search is part of a larger trend in the informational economy towards “mass individualization”, where each consumer/user is given the impression, rightly or wrongly, of being treated as a unique person within systems of production still relying on economies of scale.
Technically, this is a very difficult task for which vast amounts of personal data are needed, which, as we have seen, is being collected in a comprehensive and systematic way. A distinction is often made between data that describes an individual on the one hand, and data which is “anonymized” and aggregated into groups on the basis of some set of common characteristics deemed relevant for some reason. This distinction plays an important role in the public debate, for example in Google’s announcement in September 2008 that it would strengthen user privacy by “deleting” (i.e. anonymizing) user data after nine rather than 18 months.33 Questions have also been raised about how easily this “anonymization” can be reversed by Google itself or by third parties.34 In some cases, Google offers the option to disable cookies or use Chrome’s Incognito mode (“porn mode” in colloquial usage), but instructions on how to do this are well hidden, difficult to follow, and arguably only relevant for highly technologically literate users. Moreover, the US-American consumer group Consumer Watchdog has pointed out that “Chrome’s Incognito mode does not confer the privacy that the mode’s name suggests”, as it does not actually hide the user’s identity.35 It is important to note that even if Google followed effective “anonymizing” procedures, this would matter only in terms of surveillance understood as personal tracking. If we understand it as social sorting, anonymization has nearly no impact. The capacity to build a near-infinite number of “anonymized” groups from this database and to connect individuals to these small groups for predictive purposes re-integrates anonymized and personalized data in practice. If the groups are fine-grained, all that is necessary is to match an individual to a group in order for social sorting to become effective.
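The point about fine-grained “anonymized” groups can be made concrete with a toy sketch: once segments are narrow enough, assigning a user to a segment is operationally equivalent to profiling that user. All segment definitions, attributes, and treatments below are invented for illustration:

```python
# Toy social sorting: "anonymized" segments defined by attribute combinations.
# With fine-grained segments, matching a user to one is enough to apply
# differential treatment -- no personal identity is needed.

segments = {
    "S1": {"age_band": "25-34", "locale": "AT", "interest": "finance"},
    "S2": {"age_band": "25-34", "locale": "AT", "interest": "health"},
    "S3": {"age_band": "55-64", "locale": "DE", "interest": "travel"},
}

treatment = {"S1": "premium_credit_ads", "S2": "insurance_upsell",
             "S3": "cruise_ads"}

def sort_user(user_attrs):
    """Return the treatment for the first segment the user matches."""
    for seg_id, profile in segments.items():
        if all(user_attrs.get(k) == v for k, v in profile.items()):
            return treatment[seg_id]
    return "default_ads"

# No name, no ID -- yet the behavioural attributes alone decide
# how this person is treated.
user = {"age_band": "25-34", "locale": "AT", "interest": "health"}
result = sort_user(user)
```

Deleting the user’s name from such a pipeline changes nothing about how she is sorted, which is why anonymization only addresses surveillance understood as personal tracking.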
Thus, Hier concludes that “it is not the personal identity of the embodied individual but rather the actuarial or categorical profile of the collective which is of foremost concern.”36 In this sense, Google’s claim to increase privacy is seriously misleading.
Like virtually all aspects of the growing power of search engines, personalization is deeply ambiguous in its social effects. On the one hand, it promises to offer improvements in terms of search quality, further empowering users by making accessible the information they need. This is not a small feat. By lessening the dependence on the overall network topology, personalization might also help to address one of the most frequently voiced criticisms of the dominant ranking schemes, namely that they promote popular content and thus reinforce already dominant opinions at the expense of marginal ones.37 Instead of only relying on what the majority of other people find important, search engines can balance that with the knowledge of the idiosyncratic interest of each user (group), thus selectively elevating sources that might be obscure to the general audience, but are important to this particular user (set). The result is better access to marginal sources for people with an assumed interest in that subject area.
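The balancing act just described – global popularity tempered by individual interest – can be sketched as a simple re-ranking step. The weighting scheme and all scores below are illustrative assumptions, not any real engine’s formula:

```python
# Sketch of personalized re-ranking: blend a global, topology-based score
# with a per-user affinity score so that sources obscure to the general
# audience can surface for users with a matching interest profile.

def rerank(results, user_affinity, alpha=0.6):
    """results: list of (url, global_score); user_affinity: url -> [0, 1].
    alpha weighs global popularity against personal interest."""
    def blended(item):
        url, global_score = item
        return alpha * global_score + (1 - alpha) * user_affinity.get(url, 0.0)
    return sorted(results, key=blended, reverse=True)

results = [("popular-site.example", 0.9),
           ("niche-blog.example", 0.3)]
# This user has a strong history of visiting the niche source.
affinity = {"niche-blog.example": 1.0, "popular-site.example": 0.1}
ranking = rerank(results, affinity, alpha=0.4)
```

With a low alpha the niche source overtakes the globally popular one, which is precisely the double-edged effect discussed next: the same mechanism that elevates marginal sources also lets someone else’s weighting decide what each user sees.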
So far so good. But where is the boundary between supporting a particular special interest and deliberately shaping a person’s behavior by presenting him or her with a view shaped by criteria not his or her own? As with social sorting procedures in general, the question here is also whether personalization increases or decreases personal autonomy. Legal scholar Frank Pasquale frames the issue in the following way:
Meaningful autonomy requires more than simple absence of external constraint once an individual makes a choice and sets out to act upon it. At a minimum, autonomy requires a meaningful variety of choices, information of the relevant state of the world and of these alternatives, the capacity to evaluate this information and the ability to make a choice. If A controls the window through which B sees the world—if he systematically exercises power over the relevant information about the world and available alternatives and options that reaches B—then the autonomy of B is diminished. To control one’s informational flows in ways that shape and constrain her choice is to limit her autonomy, whether that person is deceived or not.38
Thus, even in the best of worlds, personalization enhances and diminishes the autonomy of the individual user at the same time. It enhances it because it makes information available that would otherwise be harder to locate. It improves, so it is claimed, the quality of the search experience. It diminishes it because it subtly locks the users into a path-dependency that cannot adequately reflect their personal life story, but reinforces those aspects that the search engines are capable of capturing, interpreted through assumptions built into the personalizing algorithms. The differences between the personalized and the non-personalized version of the search results, as Google reiterates, are initially subtle, but likely to increase over time. A second layer of opacity, that of the personalization algorithms, is placed on top of the opacity of the general search algorithms. Of course, it is always possible to opt out of personalization by simply signing out of the account. But the ever-increasing range of services offered through a uniform log-in actively works against this option. As in other areas, protecting one’s privacy is rendered burdensome and thus something few people are actively engaged in, especially given the lack of direct negative consequences for not doing it.
And we hardly live in the best of all worlds. The primary loyalty of search engines is – it needs to be noted again – not to users but to advertisers. Of course, search engines need to attract and retain users in order to remain attractive to advertisers, but the case of commercial TV provides ample evidence that this does not mean that user interests are always foregrounded.
The boundary between actively supporting individual users in their particular search history and manipulating users by presenting them with an intentionally biased set of results is blurry, not least because we actually want search engines to be biased and make sharp distinctions between relevant and irrelevant information. There are two main problems with personalization in this regard. On the one hand, personalization algorithms will have a limited grasp of our lives. Only selective aspects of our behavior are collected (those that leave traces in accessible places), and the algorithms will apply their own interpretations to this data, based on the dominant world-view, technical capacities and the particular goals pursued by the companies that are implementing them. On the other hand, personalization renders search engines practically immune to systematic, critical evaluation because it is becoming unclear whether the (dis)appearance of a source is a feature (personalization done right) or a bug (censorship or manipulation).
Comparing results among users and over time will do little to show how search engines tweak their ranking technology, since each user will have different results. This will exacerbate the problem already present at the moment: that it is impossible to tell whether the ranking of results changes due to aggregate changes in the network topology, or due to changes in the ranking algorithms, and, should it be the latter, whether these changes are merely improvements to the quality, or attempts to punish unruly behavior.39 In the end, it all boils down to trust and expediency. The problem with trust is that, given the essential opacity of the ranking and personalization algorithms, there is little basis to evaluate such trust. But at least it is something that is collectively generated. Thus, a breach of this trust, even if it affects only a small group of users, will decrease the trust of everyone towards the particular service. Expediency, on the other hand, is a practical measure. Everyone can decide for themselves whether the search results delivered are relevant to them. Yet, in an environment of information overload, even the most censored search will bring up more search results than anyone can handle. Thus, unless the user has prior knowledge of the subject area, she cannot know what is not included. As a result, search results always seem amazingly relevant, even if they leave out a lot of relevant material.40
With personalization, we enter uncharted territory, in both its enabling and its constraining dimensions. We simply do not know whether it will allow a greater variety of information providers to enter the realm of visibility, or whether it will subtly but profoundly shape the world we can see, should search engines promote a filtering agenda other than our own. Given the extreme power differential between individual users and the search engines, there is no question as to who will be in the better position to advance their agenda at the expense of the other party.
The reactions to and debates around these three different kinds of surveillance are likely to be very different. As David Lyon remarked,
Paradoxically, then, the sharp end of the panoptic spectrum may generate moments of refusal and resistance that militate against the production of docile bodies, whereas the soft end seems to seduce participants into stunning conformity of which some seem scarcely conscious.41
In our context, state surveillance facilitated by search engine data belongs to the sharp end, and personalization to the soft end where surveillance and care, the disabling and enabling of social sorting are hard to distinguish and thus open the door to subtle manipulation.
Search engines have embarked on an extremely ambitious project, organizing the world’s information. As we move deeper into dynamic, information-rich environments, their importance is set to increase. In order to overcome the limitations of a general topological organization of information and to develop a personalized framework, a new model of the world is assumed: everyone’s world is different. To implement this, search engines need to know an almost infinite amount of information about their users. As we have seen, data is gathered systematically to profile individuals on three interrelated levels – as a knowledge person, as a social person, and as an embodied person. New services are developed with an eye on adding even more data to these profiles and filling in the remaining gaps. Taken together, all this information creates what we call the second index, a closed, proprietary set of data about the world’s users of information.
While we acknowledge the potential of personalized services, which is hard to overlook considering how busy the service providers are in touting the golden future of perfect search, we think it is necessary to highlight four deeply troublesome dimensions of this second index. First, simply storing such vast amounts of data creates demands to access and use it, both within and outside the institutions that hold it. Some of these demands are made publicly through the court system, some are made secretly. Of course, this is not an issue particular to search engines, but one that concerns all organizations that process and store vast amounts of personal information, as well as the piecemeal privacy and data protection laws that govern them. It is clear that in order to deal with this expansion of surveillance technology and infrastructure, we will need to strengthen privacy and data protection laws and the means of parliamentary control over the executive organs of the state. Second, personalization is inherently based upon a distorted profile of the individual user. A search engine can never “know” a person in the social sense of knowing. It can only compile data that can be easily captured through its particular methods. Thus, rather than drawing a comprehensive picture of the person, the search engine’s picture is overly detailed in some aspects and extremely incomplete in others. This is particularly problematic given that this second index is compiled to serve the advertisers’ interests at least as much as the users’. Any conclusion derived from this incomplete picture must be partially incorrect, potentially reinforcing those sets of behaviors that lend themselves to data capturing and discouraging others. Third, the problem of algorithmic bias in the capturing of data is exacerbated by bias in the interpretation of this data. The mistakes range from the obvious to the subtle.
For example, in December 2007 Google Reader released a new feature which automatically linked all one’s shared items from Reader with one’s contact list from Gmail’s chat feature, Google Talk. As user banzaimonkey stated in his post, “I think the basic mistake here [...] is that the people on my contact list are not necessarily my ‘friends’. I have business contacts, school contacts, family contacts, etc., and not only do I not really have any interest in seeing all of their feed information, I don’t want them seeing mine either.”42 Thus, the simplistic interpretation of the data – all contacts are friends and all friends are equal – turned out to be patently wrong. Indeed, the very assumption that data is easy to interpret can be deeply misleading, starting with the most basic assumption that users search for information they are personally interested in (rather than on someone else’s behalf). We often do things for other people. The social boundaries between persons are not so clear-cut.
Finally, personalization further skews the balance of power between the search engines and the individual user. If future search engines operate under the assumption that everyone’s world is different, this will effectively make it more difficult to create shared experiences, particularly as they relate to the search engine itself. We will all be forced to trust a system that is becoming even more opaque to us, and we will have to do this based on the criterion of expediency of search results which, in a context of information overload, is deeply misleading.
While it would be as short-sighted as it is unrealistic to forgo the potential of personalization, it seems necessary to broaden the basis on which we can trust these services. Personalized, individualized experience is not enough. We need collective means of oversight, and some of them will need to be regulatory. In the past, legal and social pressure were necessary to force companies to grant consumers basic rights. Google is no exception: currently, its privacy policies are not transparent and do not restrict Google in substantial ways. The fact that the company’s various privacy policies are hard to locate, and that it is extremely difficult to retrace their development over time, cannot be an oversight – especially since Google’s mission is ostensibly to organize the world’s information and to make it more easily accessible to its users. But regulation and public outrage can only react to what has already happened, which is difficult given the dynamism of this area. Thus, we also need means oriented towards open-source processes, in which actors with heterogeneous value sets can fully evaluate the capacities of a system. Blind trust and the nice-sounding slogan “don’t be evil” are not enough to ensure that our liberty and autonomy are safeguarded.
1 The author would like to thank the organizers and participants of the Deep Search conference in Vienna for the opportunity to present and discuss the ideas that led to this contribution. All online sources were last consulted on 5 January 2009.
2 See Ronald Deibert, John Palfrey et al. (ed.), Access Denied: The Practice and Policy of Global Internet Filtering (Cambridge MA: The MIT Press) 2007. See also Nart Villeneuve, “Search Engine Monitor”, Citizen Lab Occasional Paper #1, 2008, http://www.citizenlab.org/papers/searchmonitor.pdf.
3 See for instance Clive Thompson, “Google’s China Problem (and China’s Google Problem)”, The New York Times, April 23, 2006, http://www.nytimes.com/2006/04/23/magazine/23google.html; Wolfgang Pomrehn, “Das Dilemma der Zensur in China”, Telepolis, February 12, 2006, http://www.heise.de/tp/r4/artikel/22/22091/1.html; Dan Sabbagh, “No Tibet or Tiananmen on Google’s Chinese site”, Times Online, January 25, 2006, http://business.timesonline.co.uk/tol/business/markets/china/article719192.ece; Bruno Philip, “Google se plie à la censure de Pékin pour percer sur le marché de l’Internet chinois”, Le Monde, January 27, 2008.
4 See Christopher Soghoian & Firuzeh Shokooh Valle, “Adiós Diego: Argentine judges cleanse the Internet”, OpenNet Initiative, November 11, 2008, http://opennet.net/blog/2008/11/adi%C3%B3s-diego-argentine-judges-cleanse-internet; Uki Bonim, “Can A Soccer Star Block Google Searches?”, Time Magazine, November 14, 2008, http://www.time.com/time/world/article/0,8599,1859329,00.html.
5 See Pedro Less Andrade, “La censura previa nunca es un buen modelo”, El Blog Oficial de Google para América Latina, October 8, 2008, http://googleamericalatinablog.blogspot.com/2008/10/la-censura-previa-nunca-es-un-buen.html.
6 “Google Loses Court Battle Over Image Searches”, Deutsche Welle, October 14, 2008, http://www.dw-world.de/dw/article/0,2144,3710342,00.html. See Landgericht Hamburg, 26 September 2008 – Az.: 308 O 42/06, and Landgericht Hamburg, 26 September 2008 – Az.: 308 O 248/07. Available at http://www.suchmaschinen-und-recht.de/urteile/Landgericht-Hamburg-20080926.html.
7 Tribunal de Commerce de Paris, September 6, 2007, Attractive Ltd. v. Google France, Google Ireland. Available at http://www.legalis.net/jurisprudence-decision.php3?id_article=206. It seems that Google itself proposed to hand over the IP addresses of users as a second line of defense.
8 For an overview, see Alex Halavais, Search Engine Society, Digital Media and Society Series (Cambridge: Polity) 2008, 5-31; Joris van Hoboken, “Legal Space for Innovative Ordering: On the need to Update Selection Intermediary Liability in the EU”, International Journal for Communications Law & Policy, 2009.
9 Compare Eszter Hargittai, “Open Portals or Closed Gates? Channeling content on the World Wide Web”, Poetics, 27, 2000, 233-253; Eszter Hargittai, “Content Diversity Online: Myth or Reality?”, in: Media Diversity and Localism: Meaning and Metrics. Ed. Philip Napoli. (Mahwah, NJ: Lawrence Erlbaum, 2007), 349-362.
10 This is not to say that the enforcement of national laws is impossible. See Jack L. Goldsmith, Tim Wu, Who Controls the Internet?: Illusions of a Borderless World, (New York: Oxford University Press US, 2006).
11 Ofcom – Office of Communications, Ofcom’s Response to the Byron Review, March 27, 2008, 50-51. Available at http://www.ofcom.org.uk/research/telecoms/reports/byron/byron_review.pdf.
12 “British ISPs restrict access to Wikipedia amid child pornography allegations”, Wikinews, 7 December 2008, http://en.wikinews.org/wiki/British_ISPs_restrict_access_to_Wikipedia_amid_child_pornography_allegations; Richard Clayton, “Technical aspects of the censoring of Wikipedia”, Light Blue Touchpaper, December 11, 2008, http://www.lightbluetouchpaper.org/2008/12/11/technical-aspects-of-the-censoring-of-wikipedia.
13 Internet Watch Foundation, IWF statement regarding Wikipedia webpage, December 9, 2008, last modified 18 December 2008, http://www.iwf.org.uk/media/news.251.htm.
14 David Chartier, “alt.blocked: Verizon blocks access to whole USENET hierarchy”, Ars Technica, June 16, 2008, http://arstechnica.com/news.ars/post/20080616-alt-blocked-verizon-blocks-access-to-whole-usenet-hierarchy.html; Karin Spaink, “Child pornography: fight it or hide it?”, February 19, 2008, http://www.spaink.net/2008/02/19/child-pornography-fight-it-or-hide-it/; “Familienministerin will Kinderporno-Sperren bald umsetzen”, Heise Online, November 30, 2008, http://www.heise.de/newsticker/Familienministerin-will-Kinderporno-Sperren-bald-umsetzen--/meldung/119663.
15 See J.M. Urban & L. Quilter, “Efficient Process or ‘Chilling Effects’? Takedown Notices Under Section 512 of the Digital Millennium Copyright Act”, Santa Clara Comparative & High Technology Law Journal, 22 (2006): 621.
16 Court of Appeals Amsterdam, June 15, 2006, TechnoDesign v. BREIN. Translation available at http://www.book9.nl/getobject.aspx?id=2778Zoekmp3. For a discussion in Dutch, see Joris van Hoboken, De aansprakelijkheid van zoekmachines. Uitzondering zonder regels of regels zonder uitzondering?, Computerrecht 1 (2008): 15-22.
17 For the German search engine self-regulation, see “Subcode of Conduct for Search Engine Providers of the Association of Voluntary Self-Regulating Multimedia Service Providers” (“Freiwillige Selbstkontrolle Multimedia-Diensteanbieter” – FSM) (VK-S), April 21, 2004, http://www.fsm.de/en/Subcode_of_Conduct_for_Search_Engine_Providers. For a discussion of the agreement see Wolfgang Schulz & Thorsten Held, “Der Index auf dem Index? Selbstzensur und Zensur bei Suchmaschinen”, in Marcel Machill & Markus Beiler (ed.), Die Macht der Suchmaschinen / The Power of Search Engines (Cologne: Halem, 2007), 71-87.
18 Ingrid Melander, “Web search for bomb recipes should be blocked: EU”, Reuters, September 10, 2007, http://www.reuters.com/article/internetNews/idUSL1055133420070910.
19 Article 19 of the United Nations Universal Declaration of Human Rights, adopted and proclaimed by General Assembly resolution 217 A (III) of 10 December 1948; Article 19(2) The United Nations International Covenant on Civil and Political Rights, adopted and opened for signature, ratification and accession by General Assembly resolution 2200A (XXI) of 16 December 1966.
20 Council of Europe, Convention for the Protection of Human Rights and Fundamental Freedoms, 4 November 1950, Europ.T.S. No. 5 (hereinafter ECHR).
21 For an overview, see Frederick Schauer, Free Speech: A Philosophical Enquiry (Cambridge: Cambridge University Press, 1982).
22 See generally Van Dijk, Van Hoof, et al. (eds), Theory and Practice of the European Convention on Human Rights, 4th edition (Antwerpen: Intersentia, 2006).
23 European Court of Human Rights 24 Nov. 1993, Lentia v. Austria (labeling the state as the “ultimate guarantor of pluralism”). See generally Aernout Nieuwenhuis, “The Concept of Pluralism in the Case-Law of the European Court of Human Rights”, European Constitutional Law Review, 3 (2007): 367–384.
24 For a detailed discussion see Van Hoboken, 2009.
25 Decision no.: R/01046/2007 of the Spanish Data Protection Authority (AEPD) (Mr. X.X.X v. Google Spain, S.L.) dealing with a request of removal from Google’s index results in the following administrative order: “calling on Google to adopt the necessary measures to withdraw the data from its index and block future access to it.” The order is remarkable because the source of the information, an official publication by a local Spanish authority, is considered to be lawful.
26 See Van Hoboken, 2009.
27 See Urban & Quilter 2006. Interestingly, Scientology’s takedown notices to Google have been cited as a reason for Google to participate in the Chilling Effects Clearinghouse, a United States based initiative that tries to make the takedown practices by online intermediaries more transparent.
28 See Soghoian & Valle, 2008.
29 Major search engines and other Internet companies have recently entered into a Global Network Initiative, which contains a set of principles for dealing with requests to censor information. It is possible that this self-regulatory framework will also prove to be of value in the European context. For more information, see the Global Network Initiative at http://www.globalnetworkinitiative.org.
30 See Van Hoboken, 2009. The review of the E-commerce directive that is supposed to address the issue is long overdue. Search engine liability and intermediary liability more generally are highly controversial issues, so from the perspective of the European Commission it is not very tempting to address the current status quo. In fact, there is a considerable effort to shift intermediary responsibility in the other direction, making it mandatory for ISPs to help to enforce copyright laws. See e.g. Monica Horten, “2009 – the year of the throttled user”, iptegrity.com, available at http://www.iptegrity.com/index.php?option=com_content&task=view&id=226&Itemid=9.
31 See for instance Court of First Instance Brussels, February 13, 2007, Copiepresse v. Google; LG Hamburg, 26 September 2008 – Az.: 308 O 42/06 and LG Hamburg, September, 26 2008 – Az.: 308 O 248/07.
32 European Commission, Green Paper on Copyright in the Knowledge Economy, Brussels, COM(2008) 466/3, http://ec.europa.eu/internal_market/copyright/docs/copyright-infso/greenpaper_en.pdf.
33 See European Commission 2008, p. 9.
34 Interestingly, in a recent competitive move against Google, Microsoft has taken sides with the publishers. See Scott Fulton, “Microsoft’s IP chief: ‘Information wants to be free’ is a ‘disaster’”, BetaNews, November 26, 2008, http://www.betanews.com/article/Microsofts_IP_chief_Information_wants_to_be_free_is_a_disaster/1227206366.
35 comScore, “comScore Releases March 2008 European Search Rankings”, March 2008, http://www.comscore.com/press/release.asp?press=2208.
36 “Attack of the Eurogoogle”, The Economist, March 9, 2006.
37 Michael Barbaro and Tom Zeller Jr., “A Face Is Exposed for AOL Searcher No. 4417749”, New York Times, August 9, 2006, http://www.nytimes.com/2006/08/09/technology/09aol.html.
38 Article 29 Data Protection Working Party, “Opinion on Data Protection Issues Related to Search Engines”, April 4, 2008, http://ec.europa.eu/justice_home/fsj/privacy/docs/wpdocs/2008/wp148_en.pdf.
39 See Google Inc., “Log Retention Policy FAQ”, March 2007. Interestingly, Google has tried to defend its user data collection policies with a reference to the usefulness of these data for the prevention and prosecution of crime, thereby effectively asking for data retention obligations.
40 For an excellent defense of this premise, see Jack Balkin, “Digital Speech and Democratic Culture: A Theory of Freedom of Expression for the Information Society”, New York University Law Review 79, 1 (2004).