Google is the most important company to the Internet. Hyperbole? I think not! Without Google, the Internet that we all know and love would be a very different place, as would the business of IT. Along with Microsoft and the supporting community around LAMP, Google is the very foundation of modern computing. But the foundation of Google itself, its ability to rank Internet content and present relevant information to its users, is at risk. What will they do to fix it?
Note: This post is about Google, because it is by far the dominant search engine, advertiser, and “portal” in the English-speaking world. Nearly everything mentioned here applies equally to other search engines and advertising providers.
Ranking Pages
Google’s relevance comes from their historical ability to present a quality searchable portal to the entire Internet. The majority of Google’s revenue is also derived from quality information, giving them the ability to present more-compelling advertising to web users.
Google’s core success is based on its ability to discover and rank the quality of Internet content. Gmail, Reader, Picasa, Apps, and the rest of the Google properties are surely excellent sources of information on the preferences of individual users, but they contribute only slightly to the other side of the coin: Information about Internet content. For that, they still rely on the core technology invented at Stanford a decade ago: PageRank.
Every time it encounters a link, Google’s software “spider” follows it, adding the content of the linked web page to an index. Google, like other early search engines, counts each link as a vote for the quality of the page. The genius of PageRank is that Google weights each vote based on the quality of the page it comes from. Although PageRank is not the entirety of Google, it is a singular key element.
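To make the weighted-vote idea concrete, here is a minimal sketch of the iterative calculation in Python. The tiny three-page link graph, the 0.85 damping factor, and the fixed iteration count are illustrative assumptions, not Google's actual parameters:

```python
# A toy PageRank iteration: each page's score is split among the pages it
# links to, so a "vote" from a high-ranking page carries more weight.
# The link graph, damping factor, and iteration count are illustrative only.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
ranks = {page: 1.0 / len(links) for page in links}

for _ in range(50):
    new_ranks = {}
    for page in links:
        # Sum the shares of rank flowing in from every page that links here.
        incoming = sum(ranks[src] / len(outs)
                       for src, outs in links.items() if page in outs)
        new_ranks[page] = (1 - damping) / len(links) + damping * incoming
    ranks = new_ranks

print({page: round(score, 3) for page, score in ranks.items()})
```

After a few dozen iterations the scores converge, and a link from a page that itself attracts many links is worth more than a link from an obscure one.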
Put simply, Google’s success depends on its ability to gather and rank the links we all make and match them to the data we provide about ourselves. Without this, Google will fail.
The Changing Web
The graphical Web is not the Internet. My first experiences online came well before graphical hypertext clients (what we now call browsers) dominated the user experience and became the web. Although the network we call the Internet now supports a very wide variety of traffic, Google’s preeminence comes only from the Web. They have little or no reach into the massive streams of corporate data, multimedia, and other non-hypertext content streaming across the ‘net.
When it was first developed, the web was manual: links were hand-selected and carefully put into context. It was difficult to put together a web page, and those pages that were developed were static. The social networks of the time (USENET, IRC, and email, mostly) were not integrated into the web and did not generally include links. So the first search engines, and later ones like Google, focused on this relatively small pool of pages and links.
But the web soon became automated, subsuming most other interactive services. Social (user-generated) interaction moved into the web in a big way, with blogs, wikis, and discussion forums enabling rapid content creation and reference by users. Sharing links in the social web, and through social bookmarking services, generally replaced the manual pages of old.
At first, this explosion of user-generated content was a dream scenario for Google. They could harvest the collective intelligence of us all to identify and rank content. But as the number of pages and links exploded, the notion of a “web page” was radically shifted from a stable and predictable set of data to a dynamic portal into a vast store of content. Where everyone once saw the same content at a given URL, now each of us has his own experience.
Spammers and scammers realized the value of Google placement and flooded this dynamic social web with links. This threatened not only to undermine the relevance that supports Google’s search (and advertising) business, but it also threatened these new social services themselves. Each honest, relevant link added to a Wikipedia article, included in a Slashdot comment, or shared on a service like Digg was dwarfed by the thousands or millions of spam links injected to boost the PageRank of “client” sites.
I Don’t Follow
Google and the social net fought valiantly against this wave of link spam, but it became clear that something more radical was needed. The only way to fight spam was to make it useless to the spammers. Thus was born a simple but highly effective tool: Nofollow.
Webmasters have long had the ability to tell the Google spider to ignore a certain set of hosted pages through a server-side list called robots.txt. But spammers wanted the exact opposite: their pages spidered and ranked as widely as possible. What was needed was a client-side way to declare that a particular link was not worthy of being spidered and ranked by the search engines. This would eliminate the primary benefit of link spam.
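For comparison, this is roughly what that server-side mechanism looks like from a crawler's point of view. A minimal Python sketch using the standard-library robotparser, with a hypothetical crawler name and URLs:

```python
from urllib import robotparser

# A well-behaved crawler checks the site's server-side robots.txt before
# fetching. The user-agent string and URLs below are hypothetical examples.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleBot", "https://example.com/private/report.html"):
    print("allowed to spider this page")
else:
    print("robots.txt says to ignore this page")
```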
Implementing client-side spider blocking was trivial: a simple attribute, rel="nofollow", was added to the link tag alongside the URL. This way, Google's spider would simply ignore every "nofollow" link it encountered, and those links would never be followed or counted in the ranking index.
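A spider that honors the attribute needs nothing more than a check while extracting links. The sketch below, using Python's standard HTML parser, is an illustration of the idea rather than Google's actual crawler code:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hrefs, skipping any link marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followable = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel:
            return  # the spider ignores this link for ranking purposes
        href = attrs.get("href")
        if href:
            self.followable.append(href)

parser = LinkExtractor()
parser.feed('<a href="https://example.com/a">vetted</a> '
            '<a rel="nofollow" href="https://spam.example/b">user-submitted</a>')
print(parser.followable)  # only the first link survives
```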
But spammers would never put the nofollow tag in their own links. So sites quickly began implementing blanket nofollow policies: every link submitted by users in any form would receive the tag by default. The idea was that links that had not yet been vetted by users would get the nofollow tag, and those that were deemed acceptable would not. But most sites never figured out the right process to allow the nofollow tag to be removed. Today, nearly every social service, from Facebook to Twitter to Digg to StumbleUpon, permanently marks nearly every link this way. Even Wikipedia, a long-time holdout, finally switched to a default nofollow on all but the English site.
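A vetting process of the kind most sites never built could be as simple as the sketch below: user-submitted links get rel="nofollow" by default, and only links that have been reviewed (or that point to a hypothetical whitelist of trusted domains) are rendered without it:

```python
from html import escape
from urllib.parse import urlparse

# Hypothetical whitelist of domains whose links are considered pre-vetted.
TRUSTED_DOMAINS = {"wikipedia.org", "example.edu"}

def render_user_link(url: str, text: str, vetted: bool = False) -> str:
    """Render a user-submitted link, adding rel="nofollow" unless it is vetted."""
    host = urlparse(url).hostname or ""
    trusted = vetted or any(host == d or host.endswith("." + d)
                            for d in TRUSTED_DOMAINS)
    rel = "" if trusted else ' rel="nofollow"'
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(text)}</a>'

# Unvetted submissions get nofollow by default; vetted ones pass link value.
print(render_user_link("https://spam.example/win-big", "click me"))
print(render_user_link("https://en.wikipedia.org/wiki/Nofollow", "nofollow", vetted=True))
```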
The Nofollow War
What does this mean for Google? If the vast majority of user-generated links are tossed into the spam category as far as the search engine is concerned, it means that their entire system of discovering and ranking links is in jeopardy. The major social services, which now attract the majority of end-user traffic, content, and links, are rendered useless for generating relevancy.
But these are the exact sources that Google ought to be focusing on the most. Many have noted that they hear about news more rapidly through real-time sources like Twitter than through less-dynamic traditional news sites and blogs. Even if Google had the ability to spider a service like Twitter in real time, which is doubtful, they would gain no insight from the links included in these sources. Social bookmarking sites like Digg are chock full of user-vetted links and should be gold mines for Google, but the nofollow tag makes them invisible.
This scarcity of user-generated links has made the links that are followable even more valuable. Scammers constantly create fake blogs of scraped (read “stolen”) content and users are paid to include followable links anywhere they can. Sites with a high PageRank value are constantly inundated with offers and attacked by hackers to siphon off high-value “votes”.
High-profile content providers are circling their wagons, drastically cutting down on outside links in order to focus PageRank on their own properties. Smaller publishers and blogs are striking back at the big guys, decrying their dearth of external links. Some even go so far as to initiate blanket nofollow policies against these big, respected, but non-linking sites.
This leaves Google with even fewer useful links with which to examine the Web. It also leaves the biggest content providers and networks and the savviest search engine optimization (SEO) pros with a bigger slice of the valuable top-of-Google result real estate.
The Fix Is In
Google is left with a looming nightmare scenario: as smaller, alternative, social, and real-time content providers disappear from the search engine, its overall relevance and value decline. Soon, a tipping point will be reached when users would rather rely on Twitter, Facebook, and the rest for their Internet interactions than on the old-fashioned search engine, email, and RSS readers that Google currently dominates. This house-of-cards collapse can only be avoided by including user-generated content in the Google index.
Search engines could simply ignore the nofollow tag, wading into the social stream and combatting spam in other ways. But this would lead to another rapid upswing of link spam, shifting the burden to content providers once again. And it might also expose links that actually should not be followed, leading to technical and even legal trouble.
The best solution would see the social networks designing in some method of removing the nofollow attribute once links are verified to be relevant and correct. But there is no incentive for them to help drive Google traffic to other sites. Indeed, Twitter recently took the next step, rearranging the titles of user pages in an attempt to SEO their way to the top page of Google searches for users' names. Only altruistic systems like Wikipedia are likely to design in this type of response.
Another possible scenario (to be explored another day) is the usurpation of today’s social web and its content by a new next-generation service. A web-based social client like FriendFeed could rapidly siphon away both existing and net-new content and users in the guise of openness and interoperability. Although new web spiders like Cuil have failed, perhaps old-fashioned crawling capability is no longer all that valuable in the social web.
The most likely fix is both predictable and pragmatic: Google must buy every successful source of social links (Twitter, Bit.ly, StumbleUpon, and even Facebook) and integrate them into their search system. Owning Twitter would enable Google to decide which links to follow and which to ignore. The reward of improved search results would be the incentive needed to add "re-follow" capability. Buying these services would also give Google an open pipe into the real-time traffic flowing through them, a critical resource that they currently lack.
Google simply cannot afford not to own the real-time web, and they must continue to buy up similar sources of content as they appear. Yahoo was unable to extract value from StumbleUpon, but Google's other competitors will certainly try to undermine the search giant. Frankly, I'm shocked that Microsoft, Facebook, or even Baidu has not yet snapped up services like Twitter, LinkedIn, and Digg, even if only to keep them and the information they contain out of Google's hands.
If you enjoyed reading this, you’ll probably also like my Foskett Services blog!
stonewang says
I don't agree with you. The main reason is that the value of real-time content is not as high as you think.