search tool data analysis

by Bryan Blum (bryblumbryblum in BIT330, Fall 2008)

Questions and queries

Web search engines

Text description of query: I am looking into starting a new web-based business. In short, the business will essentially streamline the Ann Arbor off-campus housing market, and improve communication and payment processes between landlords and tenants. I am looking to do some market research into this idea. I am wondering whether a business like this already exists or has already been attempted in Ann Arbor or elsewhere, and how successful it is or was.

Search queries

  • Google: college housing
  • Yahoo: college housing
  • Windows Live: college housing

Blog search engines

Text description of query: I am looking to find information about Sarah Palin, the newly announced Republican Vice-Presidential candidate. Since I have heard a lot of criticism about her economic views, I would like to find out her views on the economy and why they have been criticized.

Search queries

  • Technorati: Sarah Palin economy
  • Google Blog: Sarah Palin economy
  • Bloglines: Sarah Palin economy

Data that I collected

Search engine overlap data

Web search | Live | Google | Yahoo Web
Live | 10 | 10 | 10
Google | 10 | 20 | 10
Yahoo Web | 10 | 10 | 10
All | 5

Blog search | Technorati | Google Blog | Bloglines
Technorati | 35 | 0 | 10
Google Blog | 0 | 40 | 5
Bloglines | 10 | 5 | 50
All | 0

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.
GY | Yahoo top 5 | Yahoo top 10 | Yahoo top 20
Google top 5 | 0 | 1 | 1
Google top 10 | 0 | 1 | 1
Google top 20 | 0 | 1 | 2
This table provides a measure of how much of Yahoo's responses are reproduced by Google.
YG | Google top 5 | Google top 10 | Google top 20
Yahoo top 5 | 0 | 0 | 0
Yahoo top 10 | 1 | 1 | 1
Yahoo top 20 | 2 | 2 | 2
This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
BG | Google Blog top 5 | Google Blog top 10 | Google Blog top 20
Bloglines top 5 | 1 | 1 | 1
Bloglines top 10 | 1 | 1 | 1
Bloglines top 20 | 1 | 1 | 1
This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
GB | Bloglines top 5 | Bloglines top 10 | Bloglines top 20
Google Blog top 5 | 1 | 1 | 1
Google Blog top 10 | 1 | 1 | 1
Google Blog top 20 | 1 | 1 | 1

Results

Web search

Data Set #1 (Overlap of Search Engines)

This table provides key statistical measures regarding the first set of data in the web search analysis.
Statistical Measure | Precision: Live | Precision: Google | Precision: Yahoo | Overlap: L/G | Overlap: L/Y | Overlap: G/Y | All: L/G/Y
Sample Size | 19 | 19 | 19 | 19 | 19 | 19 | 19
Mean | 42.74 | 54.58 | 51.68 | 18.47 | 20.37 | 20.84 | 10.21
Median | 42 | 57 | 52 | 20 | 20 | 20 | 10
Mode | 15 | 70 | 70 | 10 | 10 | 25 | 10
High | 80 | 90 | 85 | 35 | 45 | 35 | 25
Low | 10 | 20 | 10 | 0 | 5 | 5 | 0
Range | 70 | 70 | 75 | 35 | 40 | 30 | 25
Standard Deviation | 22.13 | 19.51 | 21.79 | 9.30 | 11.17 | 7.72 | 7.32

Several observations can be made from this summary of the class data. First, the "precision" columns tell us how applicable and useful each search engine's results are. Google appears to be the most precise search engine, with a mean of 54.58%, compared to Yahoo!'s 51.68% and Live's 42.74%. That is, members of the class found Google's results to be relevant and useful more often than Yahoo!'s and Live's. Further supporting this is the standard deviation of precision: Google's (19.51) is lower than both Yahoo!'s (21.79) and Live's (22.13), meaning that Google's precision varied less from one participant's query to the next. The "overlap" statistics are interesting as well. For example, the L/Y mean of 20.37 means that, on average in the class data, 20.37% of the relevant results found in Live were also found in Yahoo!. Since Google and Yahoo! have the highest overlap mean (20.84), they produce the most similar relevant top 20 results. At 18.47, Google and Live produce the least similar top 20 results, on average in our class. Lastly, the "all" column shows that only 10.21% of results, on average, were found in all three search engines.
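
The summary rows in the table above (mean, median, mode, high, low, range, standard deviation) can be reproduced for any single column with Python's standard `statistics` module. The sketch below is only illustrative: the `summarize` helper and the sample values are hypothetical, not the class data, and it assumes the sample (n - 1) standard deviation, which the source does not specify.

```python
import statistics


def summarize(values):
    """Return the summary measures used in the tables for one column of data."""
    return {
        "sample_size": len(values),
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "high": max(values),
        "low": min(values),
        "range": max(values) - min(values),
        # Assumption: sample standard deviation (n - 1 denominator).
        "std_dev": round(statistics.stdev(values), 2),
    }


# Hypothetical precision percentages, NOT the actual class data.
example_precision = [20, 45, 70, 55, 70, 60]
summary = summarize(example_precision)
print(summary)
```

Running the same helper on each engine's column would give one row of values per measure, which is how the three precision columns can be compared side by side.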

Data Set #2 (Overlap of Search Rankings)

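This table provides key statistical measures regarding the second set of data in the web search analysis.
GY
Statistical Measure | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20)
Sample Size | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17
Mean | 1.06 | 1.35 | 1.65 | 1.29 | 2.00 | 2.65 | 1.65 | 2.47 | 3.71
Median | 1 | 1 | 2 | 1 | 2 | 3 | 1 | 3 | 4
Mode | 1 | 0 | 0 | 1 | 1 | 4 | 1 | 3 | 5
High | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 5 | 7
Low | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Range | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 5 | 7
Standard Deviation | 1.20 | 1.32 | 1.41 | 1.21 | 1.32 | 1.73 | 1.22 | 1.55 | 2.11
YG
Statistical Measure | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20)
Sample Size | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17
Mean | 1.06 | 1.18 | 1.65 | 1.47 | 1.94 | 2.47 | 1.88 | 2.65 | 3.76
Median | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 3 | 4
Mode | 1 | 0 | 1 | 1 | 3 | 3 | 1 | 4 | 5
High | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 5 | 7
Low | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Range | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 5 | 7
Standard Deviation | 1.20 | 1.29 | 1.37 | 1.23 | 1.39 | 1.59 | 1.27 | 1.73 | 2.08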

This table expands on the previous set of information by taking into account the position of the search results within the top 20. From this data, we can determine how much Google's and Yahoo!'s top 5, top 10, and top 20 search results overlap with each other's. For example, in the GY portion of the chart, the mean for o(10,5) is 1.35. This means that of the top 10 results in Google, only 1.35, on average, appear in the top 5 Yahoo! results. In the YG portion of the chart, however, the mean for o(10,5) is 1.18, which means that of the top 10 results in Yahoo!, only 1.18, on average, appear in the top 5 Google results. An interesting pair of statistics to compare is GY's o(20,10) of 2.65 and YG's o(20,10) of 2.47. This tells us that Google has more of Yahoo!'s top 10 results in its top 20 than Yahoo! has of Google's top 10 in its top 20. Thus, Google's top 20 search results are slightly more encompassing than Yahoo!'s.
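
The o(n,m) measure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure the class actually used: the function names `rank_overlap` and `overlap_table`, and the two ranked lists, are hypothetical, and each list is assumed to contain no duplicate URLs.

```python
def rank_overlap(first, second, n, m):
    """Count how many of the top n results in `first` appear in the top m of `second`."""
    top_second = set(second[:m])
    return sum(1 for url in first[:n] if url in top_second)


def overlap_table(first, second, cutoffs=(5, 10, 20)):
    """Build every o(n, m) value, in the same column order as the tables above."""
    return {
        (n, m): rank_overlap(first, second, n, m)
        for m in cutoffs
        for n in cutoffs
    }


# Hypothetical ranked result lists with two shared URLs.
google = [f"g{i}" for i in range(1, 21)]
yahoo = [f"y{i}" for i in range(1, 21)]
google[2] = yahoo[7] = "shared-a"   # Google rank 3, Yahoo rank 8
google[14] = yahoo[3] = "shared-b"  # Google rank 15, Yahoo rank 4

gy = overlap_table(google, yahoo)  # the "GY" direction
yg = overlap_table(yahoo, google)  # the "YG" direction
print(gy[(20, 10)], yg[(20, 10)])
```

Here `gy[(10, 5)]` corresponds to GY's o(10,5): the number of Google's top 10 results that also appear in Yahoo's top 5.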

Blog search

Data Set #1 (Overlap of Blog Search Engines)

This table provides key statistical measures regarding the first set of data in the blog search analysis.
Statistical Measure | Precision: Technorati | Precision: GBlog | Precision: Bloglines | Overlap: T/G | Overlap: T/B | Overlap: G/B | All: T/G/B
Sample Size | 19 | 19 | 19 | 19 | 19 | 19 | 19
Mean | 33.42 | 52.63 | 44.63 | 3.89 | 9.53 | 7.21 | 1.58
Median | 30 | 45 | 48 | 0 | 10 | 5 | 0
Mode | 35 | 40 | 50 | 0 | 5 | 5 | 0
High | 85 | 100 | 75 | 25 | 25 | 20 | 10
Low | 5 | 25 | 20 | 0 | 0 | 0 | 0
Range | 80 | 75 | 75 | 25 | 25 | 20 | 10
Standard Deviation | 20.62 | 21.56 | 13.96 | 6.94 | 7.66 | 6.37 | 3.36

This data can be interpreted in the same way as the web search data above. In this case, Google Blog is the most precise, as its results are relevant 52.63% of the time, on average, followed by Bloglines (44.63%) and then Technorati (33.42%). However, Google Blog's standard deviation is the highest of the three, at 21.56. Therefore, although its results are the most relevant on average, they are also the most variable, and perhaps not the most reliable. It should also be noted that Technorati and Bloglines are the most similar in their results, as indicated by the T/B overlap mean of 9.53. Overall, however, the three blog search engines produce very dissimilar results, as indicated by the "all" column, which shows that only 1.58% of the relevant results are found in all three blog search engines, on average.
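
As a rough illustration of how one participant's precision and overlap percentages could be computed, the sketch below works from sets of relevant results. It rests on assumptions the source does not state explicitly: that each engine returned 20 results and that every percentage is taken out of those 20. The `overlap_summary` helper and the URLs are hypothetical.

```python
from itertools import combinations

RESULTS_PER_ENGINE = 20  # assumed number of results examined per engine


def overlap_summary(relevant, total=RESULTS_PER_ENGINE):
    """`relevant` maps an engine name to the set of its relevant result URLs."""
    precision = {name: 100 * len(urls) / total for name, urls in relevant.items()}
    pairwise = {
        (a, b): 100 * len(relevant[a] & relevant[b]) / total
        for a, b in combinations(sorted(relevant), 2)
    }
    # Results judged relevant in every engine (the "All" column).
    shared_by_all = 100 * len(set.intersection(*relevant.values())) / total
    return precision, pairwise, shared_by_all


# Hypothetical relevant results for one query, NOT data from the study.
relevant = {
    "Technorati": {"t1", "t2", "x1", "x2"},
    "GBlog": {"g1", "g2", "g3", "x1"},
    "Bloglines": {"b1", "b2", "b3", "x1", "x2"},
}
precision, pairwise, shared_by_all = overlap_summary(relevant)
print(precision, pairwise, shared_by_all)
```

Collecting these per-participant percentages across the class would yield the columns that the summary table above condenses into means and standard deviations.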

Data Set #2 (Overlap of Blog Search Rankings)

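This table provides key statistical measures regarding the second set of data in the blog search analysis.
GB
Statistical Measure | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20)
Sample Size | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17
Mean | 0.29 | 0.35 | 0.47 | 0.41 | 0.47 | 0.82 | 0.71 | 0.76 | 1.06
Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
High | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 4
Low | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Range | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 4
Standard Deviation | 0.47 | 0.61 | 0.62 | 0.62 | 0.72 | 1.01 | 0.92 | 1.09 | 1.20
BG
Statistical Measure | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20)
Sample Size | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17
Mean | 0.29 | 0.35 | 0.59 | 0.41 | 0.53 | 0.82 | 0.53 | 0.88 | 1.12
Median | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1
Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
High | 1 | 2 | 3 | 2 | 2 | 4 | 2 | 3 | 4
Low | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Range | 1 | 2 | 3 | 2 | 2 | 4 | 2 | 3 | 4
Standard Deviation | 0.47 | 0.61 | 0.87 | 0.62 | 0.72 | 1.07 | 0.62 | 0.99 | 1.17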

The data from this table can be interpreted like the ranking overlap data for the web search engines. In this case, the mean is rarely above one. This means that the majority of the relevant results, regardless of where they are in the order, are not found in the other blog search engine's results. The mode for all but one of these measures is 0 (the exception is BG's o(20,20), with a mode of 1). This further supports the point that the relevant results in one search engine are rarely found in the results of another search engine.

Discussion

Web search

  • Discuss the meaning of the two sets of data, especially when considered together.
    • By studying the two sets of data, several conclusions can be drawn. First, Google is the most precise search engine of the three, as its results are the most relevant to the user, on average. It is also the most consistent, as its precision has the lowest standard deviation. We can also conclude that Google and Yahoo! are the most similar, since their mean overlap percentage is the highest. However, even though they are the most similar, their results still differ greatly: on average, fewer than 4 of the top 20 results are shared between them. Surprisingly, fewer than 2 results from Google's top 20 are found in Yahoo!'s top 5, on average.
  • What recommendation(s) can you make to a person searching for information?
    • The obvious recommendation after studying this data is to use multiple search engines. For whatever reason, search engines rarely produce similar search results, especially in the top 5 and top 10. Someone searching for information is unlikely ever to look at results 11-20. Thus, it would be very useful to look at the top 10 of several search engines in order to maximize the number of relevant results you receive.
  • What did you learn from this, either from the data itself, the process of submitting the queries, or whatever?
    • From this process, I learned many things. First, I learned that I have no idea how search engines decide on which results to display. If you had asked me before I performed this exercise, I would have guessed that 80-90% of the results would be the same. I was shocked how rare it was to have relevant results shared by different search engines. I also never realized how low the precision percentages would be for a certain search engine. However, to me it seems that as long as there is ONE relevant result, it shouldn't matter how many of the others are relevant, provided you find what you are looking for.
  • Are there any further questions that you might want to investigate? By this, I don't mean what further queries would you like to submit to the search engines. I mean what research questions would make sense as a follow-up? What methodological changes might need to be made to make these results better?
    • The results from this study are far from conclusive. First, there is a confounding variable in the subjectivity of what "relevant" actually is. Participants in this exercise may have had different standards of relevancy, skewing the data one way or another. Also, the queries may have varied in how specific or broad they were. This could skew the results as well. Also, one might argue that it shouldn't matter how many of the results are relevant, as long as ONE of the results is relevant, provided the searcher finds what they are looking for. Thus, another question I would want to investigate is "how many relevant results does a searcher click on before finding the information they are looking for?" This would help us to understand the usefulness of this data. To make these results better, perhaps the type of data being searched for should be specific.
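
The recommendation above, to consult the top 10 of several search engines instead of the top 20 of one, could be sketched as a simple merge. This is a hypothetical illustration: `merge_top_results` and the result lists are invented for the example, and interleaving by rank is just one reasonable way to order the combined list.

```python
def merge_top_results(results_by_engine, k=10):
    """Combine the top k results of each engine, keeping each URL only once."""
    merged = []
    seen = set()
    # Interleave by rank: every engine's first result comes before any
    # engine's second result, and so on.
    for rank in range(k):
        for engine, results in results_by_engine.items():
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append((results[rank], engine, rank + 1))
    return merged


# Hypothetical ranked results, NOT taken from the study.
results_by_engine = {
    "Google": ["a", "b", "c"],
    "Yahoo": ["b", "d"],
    "Live": ["e", "a"],
}
merged = merge_top_results(results_by_engine, k=2)
print(merged)
```

Because the engines share so few results, a merged list like this would usually be almost as long as the individual lists combined.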

Blog search

  • Discuss the meaning of the two sets of data, especially when considered together.
    • In studying the blog search data, one can conclude that blog search engines all find relevant results, but that the results they find are not at all similar to one another. The most important statistic, in my opinion, is that only 1.58% of relevant results were found in all three blog search engines, on average. Why is this? I think that it is a combination of there being so many blogs on the internet and of blog search engines not being as sophisticated as web search engines. When compared to the web search engine results, one can see that the blog search engines were less precise overall, and that they shared far fewer results with each other. This suggests that each blog search engine searches a very different pool of blogs.
  • What recommendation(s) can you make to a person searching for information?
    • Again, I would recommend that a person use multiple blog search engines when searching for information. This would clearly increase the chance of finding relevant results, given that the search results from the three blog search engines were so different. I would also recommend starting with broad search queries and slowly getting more specific.
  • What did you learn from this, either from the data itself, the process of submitting the queries, or whatever?
    • Before doing this exercise, I had never used a blog search engine. In fact, I had no idea they even existed. Now, I know that using a blog search engine is a great way to find more opinionated and subjective information.
  • Are there any further questions that you might want to investigate? By this, I don't mean what further queries would you like to submit to the search engines. I mean what research questions would make sense as a follow-up? What methodological changes might need to be made to make these results better?
    • In the case of the blog search engines exercise, I think it would be useful to exclude search results which quote the same source. In my results, there were sometimes multiple websites that were quoting the same person, book, etc. This may have inflated the number of relevant results included in my data. Again, I think it would be useful to be more specific with the type of information being searched for in the blogs. This would make the study much more scientific and less subjective. Also, I feel that there is a distinct difference in the type of information being searched for by a blog searcher as opposed to the type of information being sought after by a web searcher. A blog searcher is likely just looking to read someone's opinion about a topic. Therefore, many results can be relevant, and there may never be a certain number of relevant results that "satisfy" the searcher's need for information. A web searcher, however, may be looking for a more objective, fact-based piece of information, making the relevancy of results more important.
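
The suggestion above to exclude results that quote the same source could be sketched as follows. This assumes each relevant result has been annotated by hand with the source it quotes; the `quoted_source` field, the `collapse_by_source` helper, and the sample records are all hypothetical.

```python
def collapse_by_source(results):
    """Keep only the first result for each quoted source.

    Results with no quoted source (None) are always kept.
    """
    kept = []
    seen_sources = set()
    for result in results:
        source = result.get("quoted_source")
        if source is None:
            kept.append(result)
        elif source not in seen_sources:
            seen_sources.add(source)
            kept.append(result)
    return kept


# Hypothetical relevant blog results, NOT taken from the study.
relevant_results = [
    {"url": "blog-1", "quoted_source": "interview-A"},
    {"url": "blog-2", "quoted_source": "interview-A"},  # same source as blog-1
    {"url": "blog-3", "quoted_source": None},           # original commentary
    {"url": "blog-4", "quoted_source": "book-B"},
]
unique_results = collapse_by_source(relevant_results)
print(len(relevant_results), len(unique_results))
```

Counting `unique_results` instead of `relevant_results` would keep repeated quotations from inflating the precision figure.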