A surprising opinion on the ‘Bing is copying Google’ controversy (fury.com)
24 points by bensummers on Feb 4, 2011 | hide | past | favorite | 26 comments


"Bing isn’t mining the first pages on Google search result pages as Google claims; they’re mining the pages that users click on the most."

Hasn't this been at the core of the controversy all along?

Counterargument:

It's user action on a page created by Google. Say I had nothing but this user data - and say I had enough of a sample. I could build a pretty damn good search engine out of that. But I know nothing about how to provide search results. I haven't even got a crawler. Yet here I am with a search engine.

edit (to finish the follow through):

If I can, with this method, build something about which I have no actual knowledge - then it proves the method is piggybacking on someone else's hard work. And THAT is what matters. It doesn't matter if it is only one of many signals - it doesn't matter if it's not copying Google's algo directly. What matters is that it is still relying on someone else's knowledge to improve Bing's product. That's shitty.

This argument clinches it for me.

The fact that Bing can't even acknowledge the opposing side - given that at the very least intuitions are going both ways here - is a clear sign of bad faith to me.


Oh, come on! The whole idea of search engines is to rely on someone else's knowledge to make a product, isn't it? Isn't this how Google makes, like, all its money?

What's skeevy in the MS case is that the relied-upon-knowledge and the product are so similar...but still, Google's uber-umbrage strikes me as odd and tone-deaf, especially given its dominant position in the market.


Re: using someone else's knowledge - Google honors robots.txt, so, if you don't want them to use your link knowledge, you can tell them not to.

I'm not sure there is any way for Google to tell MSFT to not use their search results in the Bing search engine.

If Microsoft had just been a bit more upfront about the fact that they were using IE users' Google clicking behavior to improve their search engine, I suspect there would have been much less furor.


I don't know, doesn't Google harvest the content of anything I send to an @gmail.com address? Or how about the massive scanning of books over the objections of book publishers a while back (though I think this ultimately got resolved)? My general impression of Google, and part of why I find their crying foul so jarring, is that they will harvest as much information as they possibly can, to whatever end, until restrained by public outcry. (Which, BTW, I'm fine with!)

But it seems very hypocritical to get butt-hurt on behalf of Bing toolbar users who are having their movements tracked.

(BTW, I don't find Google's observance of robots.txt to be particularly compelling or telling, because they've never been in a position where ignoring it would be significantly beneficial to them, as far as I know.)


Google was pretty clear that they would be harvesting the content of your Gmail to target ads. The difference here is that MSFT did this "Google-search-click-tracking" thing on the down low.

I agree with you though: if I, as an IE9 user, wish to submit my click-track results to MSFT for analysis so they can improve search results - that's fair game.

But it's not clear to me that MSFT should be able to review what the user was searching on before they clicked. Now they are actually using Google's search data + the user's click traffic. I think they cross a line there, particularly if they aren't willing to come clean and admit that's what they are doing, and make it clear that they are sending your Google search queries + your click traffic back to Redmond.

How many people on HN were aware that Microsoft was doing that with IE? Click traffic, sure - but I didn't know they were sending my Google search queries back to HQ.


Bing didn't crawl Google's pages (so it never reads a robots.txt file); it is collecting click-stream data from users via the Bing toolbar as they browse a page.

It's voluntary, you can opt out of the anonymous data reporting. The data they are using belongs to the user, so it's the user that can opt out, not Google.


"I'm not sure there is any way for Google to tell MSFT to not use their search results in the Bing search engine."

Google could send a DMCA Takedown or other Cease-and-Desist letter.


Address the argument.

The intuition is that if a method allows you to build a search engine without actually knowing anything (besides the method itself) - then that method is piggybacking off someone else's work.

Yes - Google piggybacks off other people's work - i.e. linking to other sites to indicate quality. But this is not piggybacking off a search engine. It's not piggybacking off the technical work that someone else developed for a search engine product. So it's not a counterexample to my argument.


We agree, in a way, but disagree about the severity of the distinction. Google can't build a good search engine w/o mining an existing network of interlinks between sites to build a good measure of quality. Bing can't provide results as good (or perhaps "good" -- "torsoraphy", "hiybbprqag"...) as they currently do w/o watching what users click on, which includes Google search results. You're saying (if I follow you) that what Google does is fair because they always transform the form of their input, whereas Bing, in some instances, does not transform their input -- they use search engine results to provide search engine results. OK, fine. But I still don't really see the outrage.

I don't care about piggy-backing or fairness; I care about maximum public good.

So here's the situation. Google wants Bing to stop harvesting its results. Why? Very broadly, there are two scenarios:

- Bing is adding value - Bing is not adding value

In the latter scenario, Google should not care. No one will use Bing, and it's a non-issue.

In the former scenario, it's not clear at all to me that Google should have a monopoly on the information entered and sites visited by people who happen to visit its site. Google is trying to assert ownership, in a fashion, of the link between the user's query and the site they visited, because the site they visited happened to be shown to them by Google. I see no reason to grant them this right.

The fact that Google acts outraged by Bing's behavior strikes me as particularly rich, given that Google itself is a notorious and cavalier harvester of data others would consider "theirs".

(Edit: This is pure, uninformed speculation, but I wonder if Google's (to me) odd outrage is because they place such a premium on algorithms over human-generated content. I.e., the idea is it's fine to harvest other people's human-generated work ("information wants to be free"-style), but harvesting others' algorithm-generated content is verboten, because that's essentially stealing the algorithm. I have no inside knowledge, but this would agree w/ the pop-culture characterizations of Google.)


It doesn't improve the common good.

Bing is not adding value with this method - they are thieving value. By thieving it they reduce the reward for the value that google provides. As such they reduce the incentive to provide genuine innovation.

Having said that - I don't necessarily disagree with the view that Google's position is a bit rich given the data that they do harvest from us folk without remunerating us for that work. But that doesn't make the argument against Bing any weaker. Two wrongs don't make a right, and all that.


What data does Google harvest without remuneration?

When it comes to PageRank, the implicit offer is: Allow google to use your links (intra-site and extra-site) in its algorithm, and it will make your site searchable via Google.

If you don't want Google to use your 'hard work' of collating and vetting links to other sites, then disallow the Googlebot via robots.txt
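As a concrete illustration of that opt-out (the User-agent token is Google's real crawler name; everything else here is just the standard robots.txt form), a site could shut Googlebot out entirely while leaving other crawlers alone:

```
# Keep Google's crawler out of the whole site
User-agent: Googlebot
Disallow: /

# All other crawlers: no restrictions
User-agent: *
Disallow:
```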

On the other hand, Microsoft refuses to allow any way to disallow click-traffic patterns involving your site from being used in its algorithm. They are thus mining the links from your site to another without even having to remunerate you by giving you a chance to be indexed in their engine.


That's a good point. I didn't really have a clear view on that matter. I just wanted to emphasise that however that discussion bears out, it is irrelevant to the argument against Bing.


> Microsoft refuses to allow any way to disallow click traffic patterns involving your site

Because that information doesn't belong to the site to which it is associated, but to the user.


In my opinion there is a difference.

A webpage author links to another as an act of conscious recommendation. The one who added the link added it for the sole purpose of letting others make use of it. It would be a stretch to claim that Google serves search results for its competitor to make use of.

Next is the issue that Bing is not scraping Google but using user clicks. But here users are just a means to the end of scraping. Doing something by way of third parties does not take away from the fact that one is still doing it.

I am all for having someone give Google a run for its money, but I want that to be driven by genuine technological innovation, not by being an el-cheapo knockoff of the market leader. Some search engines are beating Google in niche markets by being better than Google, which is excellent.

I don't think Bing re-serving Google results brings any innovative pressure to the market, especially when you know that whatever innovation one brings will get replicated by piggybacking. I wouldn't want to be in a business where this is true.

What worries me is that the main players will start engaging more in how to inconvenience each other rather than building better products. I have been fearful that Microsoft would someday tweak IE so that Google does not work well on it. They have done this for a few other sites but haven't done it to Google - I think initially out of fear of starting a race, but now, with other browsers beginning to rule the roost, that's moot.


Agreed. The case still will never be a slam dunk for many, because user-generated data is a full partner in the signal, but to me the author's epiphany just means a shift to thinking of Bing asking a user which of these 10 Google results they like best.


The fact that Microsoft is able to take only the relevant results from Google doesn't change the fact that the results are coming from Google. The problem is that if search engines start using the technique that Microsoft is using, then there is substantially less incentive for Google, Microsoft and other search engines to innovate on rare-search-term relevancy, since the other search engines get the benefits for free.


Meta search depending entirely on other search engines' output has been around in the form of Dogpile since the days Google was a student project.

During that time Google has managed to innovate fast enough and implement well enough to grow into a multi-billion dollar corporation.


Monitoring users' clicks over the web is not the same as monitoring users' clicks on a competitor's web site (Google search results).

The whole point of generating 100 unlikely search terms is that these wouldn't be items that exist on any web page. Hence, searching for them should not return any results!
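For concreteness, here's one way such nonsense terms could be generated - a minimal Python sketch under my own assumptions (random lowercase gibberish; Google hasn't said how terms like "hiybbprqag" were actually produced):

```python
import random
import string

def make_honeypot_terms(n=100, length=9, seed=42):
    """Generate nonsense terms that are vanishingly unlikely to appear
    on any real web page, so searching for them should return nothing.
    (Illustrative only - the generation method is a guess.)"""
    rng = random.Random(seed)  # fixed seed so the term list is reproducible
    return [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        for _ in range(n)
    ]

terms = make_honeypot_terms()
print(len(terms), terms[0])
```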

How is monitoring people's clicks on Google's search results not copying?

On a different note, I am of the opinion that this practice is acceptable, but only as long as one acknowledges it. The fact that M$ is not is, to me, appalling.


> This morning I had a different thought though, one that’s completely reversed my thinking on the issue: People aren’t robots, and they don’t always click on the first result.

But people do have a bias to click on the first result because it's the first result. From http://www.useit.com/alertbox/defaults.html :

> 42% of users clicked the top search hit, and 8% of users clicked the second hit. ...

> [When the researchers] swapped the order of the top two search hits ... users still clicked on the top entry 34% of the time and on the second hit 12% of the time.
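A rough back-of-the-envelope on those quoted numbers (my own toy model, not from the study): if click share is approximated as position bias times result quality, the figures imply the position effect is roughly 3.5-4x, dwarfing a quality effect of only about 1.2-1.5x - so "people aren't robots" only goes so far:

```python
# Toy multiplicative model: clicks ~ bias(position) * quality(document).
# Figures are from the useit.com study quoted above.

top_normal, second_normal = 0.42, 0.08    # best result shown first
top_swapped, second_swapped = 0.34, 0.12  # top two results swapped

# Position-bias ratio (same document, position 1 vs position 2):
bias_ratio_a = top_normal / second_swapped   # best result in each slot
bias_ratio_b = top_swapped / second_normal   # 2nd-best result in each slot

# Quality ratio (same position, best vs 2nd-best document):
quality_ratio_a = top_normal / top_swapped        # position 1
quality_ratio_b = second_swapped / second_normal  # position 2

print(round(bias_ratio_a, 2), round(bias_ratio_b, 2))
print(round(quality_ratio_a, 2), round(quality_ratio_b, 2))
```

Under those (admittedly crude) assumptions, being in the top slot boosts clicks far more than actually being the better result does, which is exactly the bias the parent comment describes.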


"If Bing is using the techniques described in this Microsoft technical paper then they should stop. Full stop." http://aclweb.org/anthology/P/P10/P10-1028.pdf

I didn't read the paper (just the abstract, I know...) - could anyone explain what they are doing in it that's so bad? Thanks.


The whole point of crawling and indexing websites to begin with is to simulate what users can and do click on. The fact that it is now technologically possible to actually track users navigation of the web doesn't change the intent of the robots.txt convention in controlling and limiting the way that search engines collect data from a site.

Bing/MS can choose to dishonor that convention, either directly or indirectly as they are now with the Bing toolbar, but I think that's a mistake on their part. Google (and Bing.com for that matter) has used the most widely accepted mechanism, robots.txt, for ensuring that content it does not believe should be indexed is not. It abides by that convention with other sites and expects other search engines to abide by that convention with respect to google.com in return.

From here the following things can happen:

1. Microsoft agrees that robots.txt should govern clickstream data from the Bing toolbar. The world returns to sanity once again.

2. The search engine community (MS, Google, etc.) agrees on a new standard similar to robots.txt, specifically for governing the use of clickstream data; sites are updated with new directives allowing/disallowing such use, and the world returns to sanity.

3. Microsoft specifically denies that it should be limited in using clickstream data from any source, by convention or by any other means. Search companies and other sites are forced to fall back on other means to achieve the same results, and things get messy (for example, Google blocks all users who have the Bing toolbar installed, prevents the Bing toolbar from being installed in Chrome, MS retaliates, etc.).

For the life of me I cannot think of a sane reason why MS would choose option number 3 other than that they are insane, horribly myopic, or just plain dumb.


I think it would be rather short-sighted of Google to start getting into wars over permissions for mining clickstream data from the source page, given the amount of data mining they do on Gmail links and search result links etc.

I can imagine a reverse sting operation Bing could run:

1) Set up similar faked honeypot pages that only Bing knows about.

2) Send emails to Gmail accounts owned by the Bing management team.

3) Click on a few of those links.

4) Watch as Google mines the clicks and ends up with the links in its search index.

5) Sit back and accuse Google of reading private emails from Bing management to improve its search results.

It's exactly the same thing: the data is owned by the user, and in both cases the user will have agreed to terms and conditions that allow the companies to collect stats from that data.


I was always under the impression that robots.txt was intended not to control search results, but to help prevent overzealous crawlers from taking down your website.

If you want to control what a search engine does with a page, you should use <meta name="ROBOTS"> and rel=nofollow, neither of which Google includes in its results pages. It would be reasonable for Bing to look for and honor these directives when collecting their clickstream data, and Google isn't availing itself of the option.
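For reference, the page-level directives mentioned above look like this (illustrative markup; the URL is made up):

```html
<!-- In the page <head>: ask engines not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- Per link: ask engines not to follow or credit one specific link -->
<a href="http://example.com/some-result" rel="nofollow">a search result</a>
```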


It's for controlling search results; blocking overzealous crawlers is a convenient side-effect.

robots.txt also blocks indexing of non-html files.


rel="nofollow" is only useful for pages which are intended to be indexed; Google's search results pages are not, which is why they don't use it for the links in search results.

The point is: up until now, search indexing has had a robust opt-out mechanism. Bing has changed that and muddied the waters with vague hand-waving. The question remains: should there still be a way for sites to opt out of search indexing? If so, why can't robots.txt continue to be that mechanism? And if robots.txt can't be that mechanism, then what should the new standard be? If not, then that opens an entirely new can of worms - a can that I think we're better off not opening.


Kevin Fox, the author, makes the interesting point that what Microsoft is mining via clicks isn't directly Google's results, but the users' estimation of those results, when they choose which ones to visit. This data is to some extent a new creation, although definitely derived from the Google results. Custom analysis of such user activity doesn't necessarily just port over results, but could also result in an even better ranking than the original Google presentation.

Fox also suggests there could be a robots.txt-like standard where sites declare they want to opt their users' activity out of any such analysis. That strikes me as a bad idea: users ought to own their own interaction trails.



