RELATED APPLICATIONS
This application claims the benefit of U.S. provisional Patent Application Nos. 60/603,140 entitled “Method and Apparatus for Responding to End-User Request for Information” by Westover et al. filed on Aug. 19, 2004; 60/637,684 entitled “Method and Device Publishing Cross-Network User Behavioral Data” by Wohlers et al. filed on Dec. 20, 2004; 60/662,680 entitled “Method and Device for Publishing Behavioral Observations to Customers” by Eagle et al. filed on Mar. 17, 2005; and 60/660,798 entitled “Method and Apparatus for Responding to End-User Requests for Information” by Westover et al. filed on Mar. 11, 2005. This application continues-in-part the disclosure of U.S. patent application Ser. No. 11/015,583 entitled “Search Engine for a Computer Network” by Anthony G. Martin filed on Dec. 17, 2004. These identified applications are incorporated by reference for all purposes.
This application is related to the contemporaneously filed U.S. patent application Ser. Nos. ______ and ______ entitled “Method and Apparatus for Responding to End-User Request for Information—Ranking” and “Method and Apparatus for Responding to End-User Request for Information—Personalization”, both filed on Aug. 19, 2005.
FIELD OF THE INVENTION
The present invention relates to an advanced search engine. The advanced search engine may include a client component for monitoring an end-user's browsing activity, a remote server (may comprise one or more computers) for storing and processing data received from the client component, and a module that process web pages and serves search results to end-users. The advanced search engine may collect web pages for keywords of proven interest, fetch web pages requested by end-users, generate snippets or abstracts of the web pages, eliminate duplicate web pages, rank the importance of the web pages, and provide relevant web pages or links to web pages in response to an end-user search request for information regarding one or more keywords, for example. Technical problems solved, measures used and results obtained are discussed below.
BACKGROUND
One approach to search engines, taken by Google, is to organize the world's information and make it universally accessible and useful. Another approach, once taken by Dogpile, is to have a meta-search engine aggregate the results of other search engines. These approaches create a great haystack of results. For instance, the keyword “cheap travel” returns about 18,000,000 results from Google, about 85,800,000 from Yahoo and 68,377,619 from MSN, as of summer 2005!
Much work has been done to float the “needles” to the top of the results haystack, to devise methods of ranking links returned in response to a query. Google's published patent applications propose, in their titles, to use local inter-connectivity, article information, location awareness and other factors to decide on the position of results. Yahoo, Overture and Microsoft also have worked to refine their presentation of results.
In any set of information, a search term is sometimes not enough to determine what results are sought. In one sense, the search term may be ambiguous, as extensively discussed in Bharat et al., “Generating User Information for Use in Targeted Advertising”, US 205/0131762 A1 published Jun. 16, 2005 and in Carrasco et al., “Disambiguation of Search Phrases Using Interpreation Clusters”, US 2005/0015366 A1 published Jan. 20, 2005. The term “jaguar” might refer to cars, animals, a football team, or an operating system. Even if a term were unambiguous, different users might prefer to access different information. For instance, teenage travelers, business travelers and luxury travelers look for different travel arrangements and accommodations, potentially using similar search terms.
It is desirable to return the most relevant results, whether in response to a search or, more generally, on an information feed. The growing number of documents published on web sites (and of documents accessible on private servers) invites development of alternative or improved technology to quickly return relevant results responsive to users' queries. In effect, to find the 50 or 100 most relevant web sites for a particular user whose keyword is “cheap travel” and effectively summarize them for the user. This further invites development of technologies that personalize the information returned, whether content, sponsored content or advertising, based on the interests of the user.
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
The following detailed description is made with reference to the figures. Preferred embodiments are described to illustrate, not to limit the scope of the claims. Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows. Persons of ordinary skill in the art will recognize, however, that the embodiments described can be practiced without one or more of the specific details. In other instances, well-known details are not shown or described to avoid obscuring aspects of the embodiment.
Being computer-related, it can be appreciated that the components disclosed herein may be implemented in hardware, software, or a combination of hardware and software (e-g., firmware). Software components may be in the form of computer readable program code stored in a computer-readable storage medium, such as memory, mass storage device, or removable storage device. For example, a computer readable medium may comprise computer-readable program code for performing the function of a particular component. Likewise, computer memory may be configured to include one or more components, which may then be executed by a processor. Components may be implemented separately in multiple modules or together in a single module.
Embodiments and aspects of embodiments described below can be applied to solve various technical problems. One problem applies to a large network: how to monitor and usefully aggregate patterns of communication among users, search engines and documents accessed. In one scenario, the users are surfing the Internet at widely disbursed personal computers; the search engines include Baidu, Google, Yahoo! and MSN; the documents are pages posted on web sites around the world. In another, employees use an enterprise intranet with an enterprise search engine to locate reference documents exposed on workgroup servers. Addressing these problems may involve positioning a behavior observing module where it can monitor the communication channels in use and report observations to a server, preferably without disrupting the users' routines. In one embodiment, the behavior observing module may run on the user's personal computer (such as a desktop, laptop or handheld computer or media center device). The module can observe both communications and the status of the user's machine when the communications take place. For example, what search terms and results did a user follow to a particular web site? The module can achieve distributed processing and substantially reduce the resources required to aggregate communication behavior by filtering reports and categorically encoding activity. It may report observations to the server as resources are available or when a user browses to an affiliated domain. As part of the ordinary uploading of cookies to domains accessed, observation cookies can be transmitted to the server. The server can aggregate reported patterns of communication. One technical result is receiving reports from communication monitors positioned across a large network and aggregating patterns, including the status or state of individual computers when communications took place. In turn, the aggregated patterns of communications can be used to pre-organize information for retrieval or publication, in anticipation of a query or contact.
A related technical problem is how to organize over-abundant electronic records based on the current and recent status of a particular terminal connected to a network, to enhance the relevance of the first records presented to a user. The over-abundant electronic records may come from web sites world wide, such as the web sites for “jaguar.” Or, they may be documents stored on workgroup servers. They are over-abundant in the sense that they are too numerous to display on a user's screen without repeatedly pressing “page down” or the like. The current and recent status of the terminal, preferably associated with a particular user, may include web sites visited within the last 24 hours (or some other period) before a search query was submitted. Or, it may include a history of documents retrieved from workgroup servers. Either type of activity may be preprocessed and categorically classified. The period for reporting terminal status may precede a query or request for a personalized electronic journal that selects and filters the information based on the current and recent status of the terminal. The technical method again involves a behavior watching module running on the user's computer. In this embodiment, the module summarizes terminal status, publishes the summary to an electronic record (e.g., a cookie) and communicates the record to a search engine or other site that the user contacts. Reporting categorical summaries of status or activity distributes processing and reduces the need for server-based resources. The search engine uses the information, for instance, to determine what sense of “jaguar” is of interest? A highly involved auto category user who searches for “jaguar” would get Jaguar auto related links, while a person with no relevant category involvement would get a mix of auto, animal, etc. links. Category profiles may be developed to further categorize a user's interests. A new parent in the suburbs might be recognized from content accessed and be assigned to a different category for home accessories or cars than a single gen-X 20-something person. Life change events also might be recognized, such as marriage, home buying or parenthood. Like the “jaguar” example, for a “travel Italy” inquiry, the search engine might determine the style of travel that interests the user and organize the over-abundant electronic records accordingly. The technical result is respond to a query or contact based on an electronic report of the status or recent activity of a terminal, selecting from over-abundant electronic records a particular set of records that are most likely relevant to the current and recent status of the terminal.
Web-wide behavioral targeting differs substantially from site-side behavioral targeting. Practicing site-side behavioral targeting, a group of affiliated sites attempt to identify commercial behaviors. The sites typically serve ads, such as portals and news sites. Messages are displayed on the affiliated sites, responsive to behavior that is recognized from visits to the affiliated sites. The results of site-side behavioral targeting are better than non-behaviorally targeted campaigns, but depend on insight that can be gained from a narrow portion of user's behavior, as illustrated by
Referring now to
A client computer 110 is typically, but not necessarily, a personal computer such as those running the Microsoft Windows™ operating system, far example. A consumer may employ a suitably equipped client computer 110 to get on the Internet and access computers coupled thereto. For example, a client computer 110 may be used to access web pages from a web sever computer 160.
A web server computer 160 may be a server computer hosting a website, which comprises web pages designed to attract consumers surfing on the Internet. A web server computer 160 may include web pages supporting advertisem*nts, downloadable computer programs, products available for online purchase, and so on. As can be appreciated, a website may be on one or more server computers.
A message server computer 140 may include the functionalities of a web server computer 160. In one embodiment, a message server computer 140 further includes a database 171. Database 171 may be a commercially available database, such as those available from the Oracle Corporation. Database 171 may store client data received from behavior watching and message delivery programs 120 running in client computers 110. The client data may be transmitted from a client computer 110 to message server computer 140 in a data packet 121. The client data may include navigation and behavioral data obtained by a behavior watching and message delivery program 120 by monitoring a consumer's on-line activities. In the example of
Web server computers 160 and message server computers 140 are typically, but not necessarily, server computers such as those available from Sun Microsystems, Hewlett-Packard, or International Business Machines. A client computer 110 may communicate with a web server computer 160 or a message server computer 140 using client-server protocol. It is to be noted that client-server computing is well known in the art and will not be further described here.
As shown in
In one embodiment, behavior watching and message delivery program 120 is downloadable from a message server computer 140 or a web server computer 160. Behavior watching and message delivery program 120 may be downloaded to a client computer 110 in conjunction with the downloading of another computer program. For example, behavior watching and message delivery program 120 may be downloaded to client computer 110 along with a utility program 181 that is provided free of charge or at a reduced cost. Utility program 181 may be a wallet or calendar program, for example. Utility program 181 may be provided to a consumer in exchange for the right to deliver advertisem*nts to that consumer's client computer 110 via behavior watching and message delivery program 120. In essence, revenue from advertisem*nts delivered to the consumer helps defray the cost of creating and maintaining the utility program. Behavior watching and message delivery program 120 may also be provided to the consumer along with free or reduced cost access to an online service, for example.
Behavior watching and message delivery program 120 is a client-side program in that it is stored and run in a client computer 110. Behavior watching and message delivery program 120 may comprise computer readable program code for displaying advertisem*nts in a client computer 110 and for monitoring the online activity of a consumer on the client computer 110. It is to be noted that the mechanics of monitoring a consumer's online activity, such as determining where a consumer is navigating to the URL of web pages received in client computer 110, the domain names of websites visited by the consumer, what the consumer is typing on a web page, what keyword the consumer is providing to a search engine, whether the consumer clicked on a link or an advertisem*nt, when the consumer activates a mouse or keyboard, and the like, is, in general, known in the art and not a further described here. For example, behavior watching and message delivery program 120 may learn of consumer online activities by receiving event notifications from web browser 112.
Behavior watching and message delivery program 120 may record the consumer's online activity for reporting to message server computer 140. The recorded consumer online activity is also referred to as “client data,” and provided to message server computer 140 using data packets 121. Message server computer 140 may use the client data to provide targeted advertisem*nts to the consumer. Message server computer 140 may include the advertisem*nt or data for displaying the advertisem*nt in a message unit 141. In the example of
As will be more apparent below, behavior watching and message delivery programs are primarily used to obtain client data far building a search engine index, not necessarily to display presentation vehicles in a client computer 110, That is, a behavior watching and message delivery program does not necessarily have to display advertisem*nts in a client computer 110. This is advantageous in that consumers may be allowed to obtain a free or reduced cost utility program 181 (or other benefits) without having to see advertisem*nts from the provider or sponsor of the utility program.
In the example of
Process Flow
Data Collection
Web usage statistics are collected (407) using behavior watching modules (120) for users searching on selected search engines across the Internet or an enterprise intranet. The modules send back information related to their searches and how effective their searches were on each particular search engine for each particular keyword. The information is captured in a database, which is loaded daily or at some other concurrent frequency. Information available includes what the user saw and how they responded. URLs seen by the user may be displayed in algorithmic or natural sections the results. Pertinent information available for a single search term includes machine id, keyword, search engine where search was performed, resultant URLs, algorithmic URLs, bidded URLs, paid inclusion URLs, whether or not a URL was clicked, number of pages viewed, dwell time, repeat visits and user metrics such as category involvement and search engine sophistication.
Some statistics from US users can bring to life the analytical reach of data collection and ranking. The one million most frequently used keywords presently account for just more than half of the searches conducted on the major search engines, 53% of the searches. The 10,000 most frequent searches account for 38%. The distribution of keywords entered as searches can be represented by a Zipf distribution, which plots as a straight line on a graph with logarithmic scale on both axes. See, e.g., Jacob Nielson, “Diversity is Power for Specialized Sites”, Alertbox (Jun. 16, 2003) accessed Aug. 13, 2005 at http://www.useit.com/alertbox/20030616.html. It is estimated that 2,000 behavior watching modules will generate enough data to rank the 10,000 most frequent searches. [* * * Dominic, need to fill in the next blank. Would like a number in the 50,000 to 250,000 user range that connects with a round number of searches or a round percentage of searches.] A larger group of behavior watching modules will cover the most frequent searches. A base of 40 million behavior watching modules may capture 30 billion rows of data per month, filtered from 150 billion page views observed. Among the page views observed, on the order of 8 billion commercial events per month are noted, including more than 2.5 million purchases. These numbers and even a passing familiarity with statistics should excite the reader to aggregating the intelligence and behavior of a multiplicity of searchers, and presenting by popular acclaim the most significant web sites on the first page of results from a search engine. A search engine that uses aggregated consumer behavior is well-positioned to rate the authority and popularity of pages/documents as responsive to a search request.
With the categorical coding of recency and frequency in mind, we turn to
The behavior watcher module 120 preferably sorts the category history information in
The statistics returned regarding web usage may or may not distinguish between commercial and non-commercial keyword searches. This distinction is made at the time of loading into the database where keywords are checked against the ever-changing list of commercial terms, for instance, terms bidded by Overture. If a keyword is determined to be commercial, then it is assigned a keyword id, which may be compatible with the Overture keyword id list. If the keyword is not deemed commercial, then another id is assigned in the data loading process.
These two types of data (commercial and non-commercial) are loaded into separate sections of a data warehouse. At the time of a new search engine loading run, keyword data is extracted from both the commercial and non-commercial tables. The results are joined and unique keyword ids are assigned from a master table. A keyword can sometimes be found in both the commercial and non-commercial contexts. For instance, if the commercial nature of a keyword is tied to whether it's being bidded upon, a keyword which is not commercial today may become a commercial tomorrow, if it receives a bid. This duality of keywords creates non-unique keyword ids as the same keyword may have both a keyword id created by a bidding process and a second one created through the load process. To eliminate this, it is preferred to create and maintain a single unique keyword id for every keyword. This list is updated when new keywords are discovered, and assigns existing ids to keywords already in the system.
Some preprocessing may be performed by the behavior watching module to simplify the URLs reported. URLs are unwrapped and cleaned in a separate process. URLs are often wrapped by search engines to enable the serving search engine to track clicks on served URLs. There are many different forms of URL wrapping. For example, a wrapped URL from yahoo.com might be: http://rds.yahoo.com/S=2766679/K=bmw/v=2/SID=w/1=WS1/R=2/SS=100040736/H=1/SHE=0/*-http://www.bmwmotorcycles.com/. Unwrapping the URL produces http://www.bmwmotorcycles.com
From the server's perspective, the process begins with receipt (402) of behavioral information. The server uses whatever information it receives. From the user's perspective, the behavior watching module will report its observations and the user will receive search results ordered aggregating the user's information with others and/or will receive personalized to the user's recent behavior.
De-duping may also occur at the URL level. Information for two URLs which are identical is aggregated into one single URL. If two URLs differ even slightly however, (e.g., by a slash) then the two distinct versions are kept and another attempt at de-duplication is made as described below, for instance, using a combination of title and generated snippet.
Conversion data can be associated to a specific search by an algorithm that ties a search URL click to a specific conversion event, which occurred within a predetermined window. Usage and conversion data are matched for advertiser domains (URLs) that have clicks at the machine id, query time, advertiser domain level. For a particular machine with a click on a particular advertiser domain, if a conversion stat is observed within a predetermined window, then the conversion is attributed to that search click. If the conversion falls outside of the predetermined window, then the search click is not attributed.
For machine id-advertiser domain pairs that have a conversion stat attributed, subsequent future conversions are attributed as repeat conversions. These attributes also may be carried along and are available for use by a ranking algorithm (404). Metrics included with search data include number of visits, time spent (dwell time) and pages viewed.
Domain event data is joined to user data (with conversion metrics) at the machine id-advertiser domain level across sources, for combining search behavior for U.S. machines at google.com, msn.com, and yahoo.com. These results are put into time series order within machine id and advertiser domain. Domain events which occur within a predetermined time period following a search click are assigned as post-click metrics for that search click on that particular advertiser domain. If post-click metrics cannot be assigned to a particular search-click, the record is thrown out.
Several of the domain event data elements are subject to inaccuracies manifested in the client-sent stats. Both time spent and pages viewed are occasionally misreported, and at other times, accurately reported, but in need of logically driven limits to be imposed.
Time spent can be misrepresented by machines having bad or inaccurate clocks. It can be accurately represented but in need of caps in such a case when a machine is left on a particular domain for an extended period of time. In order to cap outliers and to maintain a reasonable threshold for time spent on a site post click (403), a time limit of 30 minutes has been employed. A cap for pages viewed has also been implemented and set at 5. Other time and pages viewed caps may be substituted. These caps can be implemented on the behavior watching client side or after data is received at a host. In addition, user activity can be monitored by the behavior watching module so that extended periods of inactivity are not counted as dwell time.
When data is joined and aggregated at the machine-id, keyword, and URL level, the resulting data structure may include: machine-id, keyword, keyword-id, URL, URL_ID, domainid (corresponds to the domain of the URL), clicks, dwell_per_click, pages_per_click, conversions_per_click, rank_position (from search results list viewed by the user). Optionally, only keywords of predetermined interest (404) may be processed. For instance, keywords having commercial interest, such as bidded keywords, may be processed.
Outliers optionally may be removed, to avoid scoring anomalies. One example of an outlier is a link that is returned only once by a search engine and followed with enthusiasm by the user when was returned. A single strong sample point can give a link an unbeatable average score. A link that appeared just before a keyword was rescored could potentially be ranked in the top position for that keyword on the basis of a single sample! Accordingly, one example of optionally removing outliers is to not rank links unless they have been followed a predetermined number of times. For instance, if a link has not been selected by users and followed at least 10 or 100 times, it might remain unranked until its activity level reached the predetermined level or threshold.
Ranking Algorithm
A URL ranking algorithm (404) has been developed to identify and rank links for any given keyword. Many variations on combining the aggregated observations have merit, as discussed below. One combination uses URL click rate and dwell time metrics (time spent at the domain and/or number of pages viewed), to select and rank URLs.
Optionally, user responses can be normalized for the position of a URL in a search result set. Position produces an inherent bias in URL click data for search results, which may be desirable or not. URLs occupying higher ranks garner higher clicks. In order to account for this bias, a normalization algorithm was developed to put clicks on links in disparate positions on equal footing.
Data is aggregated for each position and average click through rates, average time spent, and average pages viewed are calculated. For each rank position (1-n), there are at least three average aggregate measures of that position's importance: click through rate, time spent and pages viewed. Normalization of these measures can be expressed as:
This embodiment calculates and ranks top URLs for any keyword (404) based upon observed user metrics.
This embodiment may re-rank results based upon a time share metric, which corresponds to an individual machine's percentage vote. The algorithm takes into account user web surfing patterns and effectively places users on equal voting for relevant links. The premise is that a user has a certain amount of time which is spent on a site post a search click. These times are totaled to form the individual user's total time value which was spent viewing sites post search clicks. Percentages of the total time are then calculated for each URL click made by that particular user, resulting in a time fraction vote.
In using this methodology, users who in general spend less time surfing the Web have the same voting power as users who tend to spend longer amounts of time. This evens the playing field across all categories. Alternatively, other embodiments may take into account other factors which make up an individual user's profile. Users identified as category experts may have a higher vote. For example, a user highly involved in the electronics category may have his vote count more for links clicked than a user who is new to the category. The voting blocks may take place within a particular category, and not across all categories as a whole.
The following detailed computational example applies to a specific keyword-URL pair through the aggregation and cleansing process.
Links in higher positions garner higher clicks and hence possess higher click through rates. It follows that users also tend to spend longer amounts of time and view more pages at URLs occupying these higher ranks. In order to account for this bias, a normalization process is applied.
Average clicks, average dwell time, and average pages viewed are calculated for each position regardless of keyword or URL combinations. These numbers are shown below as Position Averages. Average fraction clicks, dwell time and page views are also shown for this keyword, URL combination. These average fractions correspond to the percentage of each metric devoted by all machines to each keyword, URL, position grouping.
Inflation factors are calculated for each position and applied to the appropriate observed metrics, normalizing them for position. The normalizing equation:
Clicks, dwell time, and page views in lower positions are factored up by the appropriate inflation factor observed for that particular position. In the example below, the average fraction dwell time for position 7 pre-normalization was 0.32. After the application of the inflation factor, the average fraction dwell time was 0.53. This number is now a normalized dwell time.
Weighted averages are calculated for each metric at each position, and totals are calculated across the positions for clicks, and all of the weighted average fraction measurements; clicks, time and page views.
Over time, with new reports of web usage from behavior watching modules, URLs for a particular keyword will adjust their positions. A URL in position 1 this week may be in position 7 the next. Adjusting for position is therefore a cleanup and adjustment process.
The final result is the Weighted Average of Normalized Totals for Keyword 01, URL 101, at any position. Computed for every keyword, URL combination a single score is calculated for each metric needed for the v17.1 algorithm. These metrics are now normalized for position, and for multiple rank occurrences.
Ranks for scoring can be based on time that a user spends viewing pages on the domain. The information received from the behavior watching module may limit the maximum amount of time that will be assigned for any viewing session or it may track the user's behavior, such as window navigation between programs, mouse clicks or mouse movement, and disregard periods of inactivity when calculating dwell time.
Ranking may ignore links that were selected by users less than a predetermined number of times, which may be predetermined as a fixed number or a function of traffic for the keyword or category. Ignoring outlier links (406) may avoid giving a high ranking to a link that was rarely presented by the search engines and followed only once or twice.
Ranks for scoring also can be based on a combination of click through rate, dwell time and the number of pages or documents viewed after following the link. Combining these factors, in some instances one of the factors will dominate: all or more than two thirds of the ranking weight may be assigned to just one of click through rate, dwell time or number of pages or documents viewed. Alternatively, they may be equally weighted, plus or minus 10%, or the factors may be assigned weighting ratios of approximately 2-1-1, plus or −10%, so that one factor is given approximately half of the combined weighting.
Another factor that can be used in ranking is return visits. If the user returns to the domain within a predetermined time after leaving it or within a predetermined number of navigation events, the user's return to the site can be assigned significance. Return to the site may reflect a favorable impression after considering other sites.
Conversion from browser to buyer or registered lead can considered to be particularly worthwhile as a factor. Again, conversion may include both a purchase in the domain and a registration. In some instances, such as car or home purchases, registration may be more realistic measure, because the purchase may be impractical or infrequently completed at a web site. Return conversion also may be taken into account.
Results may be segregated for analysis by search engine and ranks scored. Then, the separate rank scores may be combined into an overall ranking.
Statistical or other analysis can be applied within categories or keywords to determine which combination of ranking factors best attracts users to follow a link responsive to a search. It is anticipated that ranking information will be used differently among categories of keywords. Time spent will be important in the auto is category. Conversions will be much more important music downloads category.
Optionally, click segmentation bands may be applied. These bands give precedence to URLs with high numbers of clicks. Employing these bands may improve the resultant links on selected algorithms.
Segmentation bands are identified based upon total clicks received by a particular URL. For instance:
Tier 1: URLs with 100+ clicks
Tier 2: URLs with between 50 and 99 clicks
Tier 3: URLs with between 10 and 49 clicks
Tier 4: URLs with less than 10 clicks
URLs for a particular keyword are first put into the appropriate segmentation band. Once the band is identified, these URLs are set in descending order by rank score.
A predetermined number of links, such as the top 15 links (4XX), may be selected for data collection, to be followed by a spider engine (4XX).
Three tables are generated as output from the rank process:
Following Links
Traditional crawling programs at other search engines (ex: Slurp at Yahoo!, Googlebot at Google, MSNBot at MSN) crawl the entire web in search of relevant pages to index to be used in determining the rank order of links to display for a given keyword. The embodiment disclosed here; in contrast, is given a succinct number of URLs to crawl, which may optionally be selected (405) from links reported (407) by the behavior watching module. These links are pre-ranked, hence this information retrieval process needs not determine the relative importance of a given URL from its connections to others, but rather to obtain the best possible descriptive information from the URL.
This embodiment takes a specific set of URLs and performs several specific tasks: It strips out all HTML tags and returns first 100 k or another predetermined chunk of the text on the page to a file. It takes and stores a mapping from the text object's value into a uniform scalar space to be used as a text signature or text fingerprint. It calculates an MD5 or other fingerprint of the document (with or without html tags). It calculates a summary count of the characters within the text extracted from the document.
This method may be implemented by a Java application which operates in a Linux environment as illustrated by
Total threads working for a single broker can be arrived at by the following equation: With i number of spiders each having j number of workers (threads):
The dual-broker model (1821, 1822) can segregate keywords by keyword velocity. General keywords are funneled through a robust, heavy duty version of the ranking algorithm. Fast moving keywords (e.g., news, current events) can be processed through a nimble, express version of the ranking algorithm, which uses less history. Keyword velocity is a measure of how quickly the popularity of a keyword changes. The highest velocity keywords can be selected by comparing the number of keyword searches in the last 24 hours (day 0) against the 24 hours before that (day-1). A different time span, such as four or eight hours, can be used, of course. How far the ration day 0/day-1 varies from “1” is the keyword velocity. If the ration is less than 1, the keyword is becoming less popular, “old news.” If the ratio is much more than 1, the keyword may relate to a new story. Generally, a predetermined number of relatively high velocity keywords are re-indexed at a predetermined interval or as resources permit. In one embodiment, the top 10,000 keywords are re-indexed each day. While one metric of keyword velocity or volatility has been described, variations are anticipated.
The heavy duty version handles the ranked keyword URL pairs. These ranked keywords URL pairs are made available through an Oracle table on a database. The URL_TABLE includes: DOMAINID, URLID, URL, LENGTH, SIGNATURE_H, SIGNATURE_T, SPIDER_DATE and HOST
The DOMAINID, URLID, and URL fields are populated from a reference database prior to following the links. After the link-following process for a specific URL, the LENGTH, SIGNATURE_H, SIGNATURE_T, SPIDERDATE, and HOST fields are written back to the database.
Brokers use Java Database Connectivity (JDBC) to connect in to the Oracle database. The broker accesses the URL_TABLE from the ranking process. The broker makes a request for 1/100th of the total number of domains which are available in the URL table for which SPIDER_DATE is null. All URLs associated with these domains are extracted by the broker where they are grouped by domain. Individual spider boxes talk to the Broker via Remote Method Invocation (RMI) requesting URLs for domains 1,000 domains at a time. Domains are then passed from the spider to a worker who takes all of the URLs associated with its domain and operates upon those URLs.
URLs are passed to the workers grouped by domain in order to accommodate generally accepted crawling or link following practices so as not to swamp domains with thousands of requests simultaneously. It is a generally accepted practice to not access a single domain with more than one request at a time. The link following process (406) respects this generally accepted principle by assigning each worker all URLs associated with a given domain.
The link following process (406) is a robust, scalable, application which fetches content and calculates statistics from a specific URL. Once a worker receives a domain and its associated URLs, it accesses that URL using HTTP protocols. If a good response code is received, a link following worker goes to work on that page. The worker receives a 200 response code (status OK) more than 98% of the time. If the page returns an HTTP code indicating a redirect (codes 301, 302, 303 and 307), further action must be taken by the worker or system in order to obtain information about that URL. A worker will follow up to 5 redirects from an initial URL before abandoning. Once the worker reaches an end point, the following tasks take place: Acquire HTTP return code from the URL. If a good response code is acquired: Identify title meta tag if available; calculate an MD5 fingerprint of the entire document (both HTML and text); parse HTML from the page; and write back first 1,000 k of text to disk.
Once the content is parsed and written back to the disk, a subsequent operator takes over. This operator makes several calculations used for the document fingerprint and writes those and other statistics back to the Oracle database. The system writes back the following fields to Oracle: URL_TABLE, DOMAINID, URLID, URL, LENGTH, SIGNATURE_H, SIGNATURE_T, SPIDER_DATE and HOST.
LENGTH is a count of characters in the text of the document (first 1,000 k). This feature can be used for de-duping URLs later in the process (408). SIGNATURE_H is the MD5 hash code signature. SIGNATURE_T is a CRC32 checksum code of the text (first 1000 k). SPIDER_DATE indicates the date and time that the particular URL was accessed. HOST pertains to which spider machine stored the text of the URL.
The following system may create three different measures designed to aid in document de-duplication (409). This de-duplication process aims at identifying documents that are identical or very similar within a given keyword result set. In a prior step not separately shown, URLs are de-duped at the URL level. Easily identified duplicates such as two occurrences of the exact same URL are eliminated. The system attempts to eliminate URLs that do not appear to point to the same page, but in fact do. In one embodiment, mathematical signatures (fingerprints) are taken for each URL and compared to other URLs within a given keyword result set. Three exemplary signatures are a length signature, an MD5 signature and a CRC32 checksum. Other signatures may be substituted.
For the length signature, the character length of the text document is calculated. This measure aids in the de-duping process to aid in giving context to a page which has been identified as a duplicate. For instance, if two sites show identical MD5 and CRC32 signatures, but have very disparate URLs, the signature is analyzed. If the length signature is low, meaning the page is small, it is likely that these two URLs share, for instance, a standard warning screen as would be found prior to entering an adult content site.
An MD5 signature typically is a way to verify data integrity. The MD5 algorithm takes as input a message of arbitrary length and produces as output a 128-bit “fingerprint” or “message digest” of the input. The MD5 algorithm is intended for digital signature applications, where a large file must be “compressed” in a secure manner. The system computes an MD5 signature for the entire document, reducing the identity comparison process to a 128-bit comparison, for instance.
A CRC32 checksum generates the cyclic redundancy checksum polynomial of 32-bit lengths. This is usually used to validate the integrity of data being transmitted. The CRC is a “digital fingerprint” of a file, With CRC32 you can “melt down” a huge 20 MB (or even much bigger) file to have a small, handy reference to it, a single 32-bit number like 7d9c42fb (hexadecimal notation) which reflects the entire contents of this huge file. The system computes a CRC32 signature of the text of the document, giving insights into the text content of the page.
Another signature that can be calculated and used is Rabin's fingerprinting algorithm, for instance Broeder's implementation, which produces a compact checksum.
Any of the checksums or fingerprints can be applied to the whole document, the whole document less HTML tags stripped away, the selected chunk of the document that is cached, the title and snippets or some other predetermined excerpt from the document. More or less than 1,000 k of the document can be used.
The process completes a run for a particular URL with data being written to an Oracle database and a spider box. The Oracle database receives fingerprint information (length, MD5, crc32), spider date/time, and host location information written to URL_TABLE and a spider date/time stamp written to KEYWORD table. The spider box receives files for data links that it followed: URL, title (if it was obtained during the initial fetch from the URL) and text of the document (first 1,000 k) to be used for snippet generation. The text contains elements of the meta description and the body of the document
Snippet Generation
A snippet generation process generates titles and snippets for display (407). The snippet process takes a keyword phrase and URL combination, comes up with the best title describing that URL, and creates the best snippet (i.e., abstract, description) for that URL outlining in a 200 character space the information contained in the URL that pertains to the keyword. Snippet generation follows the link following process. Snippets are created from the text of the document retrieved from the chosen URL.
The keyword “somec bicycles” produces the following sample text for display:
In this example, the title is “Upland Sports . . . Frames”. The snippet is the two lines following the title. The URL is on the bottom line.
Titles are usually generated from the title of the page retrieved when a link is followed. Most sites annotate the title of the page for search engines through the use of HTML meta tags. A tag identifying the title is present on over 97% of all URLs.
In the 3% of URLs for which the HTML tags do not supply a title, the process composes a title. If there is text available for the URL, the process takes the first approximately 70 characters of text (respecting word boundaries) and creates a title. If there was no text generated from the URL, the domain name is stripped from the URL (all information between www and .com) and displayed as the title.
Snippet generation is a mix of art and science. The process creates snippets leveraging mathematical equations and linguistic science. In one embodiment, snippets can be comprised of 1 single sub-snippet, or up to 3 sub-snippets separated by ellipses ( . . . ). A scoring algorithm decides which sub-snippets when combined (or not in the case of a single sub-snippet) produce the best score.
The snippet scoring algorithm is a multi-step process which scores various portions of the document's text. In four parts, it includes keyword tokenization, window scoring, window trading and final determination.
Keyword tokenization is applied because keywords are not always single words. Keywords are often multi-word phrases. The process tokenizes or single outs individual words within a phrase. Identifying individual word tokens typically includes searching for word separators such as spaces, periods, commas, or colon/semicolons. Once the tokenization of the keyword phrase is complete, the window scoring routine can commence.
In one version of window scoring, windows of three different sizes are calculated within the text of the document, for instance, for sub-snippet of lengths 200 characters, 100 characters and 66 characters.
When the process is complete, there may exist:
i windows of length 200 (where i=document length−200)
j windows of length 100 (where j=document length−100)
k windows of length 66 (where k=document length−66)
Window scoring may be based on one or more metrics, such as the number of unique tokens found within the window, the total number of tokens found within the window, and/or the longest string of tokens found within the window. A formula for each window is computed from a combination of these metrics and assigned to that window.
In the case where there is one 200 character snippet, the window with the highest score is chosen. The two highest scoring windows of length 100 are chosen for the two sub-snippet model. The three highest scoring windows of length 66 are chosen for the three sub-snippet model.
The best scores are calculated for each model (1, 2, or 3 sub-snippets). A final algorithm may be applied when 2 or 3 windows are eligible for a snippet. If the global window score can be increased by one window giving up characters to another, then that action is seen as a gain and it is taken. If the global window score cannot be raised in this manner, the snippets are used without trading.
The output from snippet generation may include 5 different scores: Score of single sub-snippet model; score of non-traded two sub-snippet model; score of traded two sub-snippet model; score of the non-traded three sub-snippet model; and/or score of the traded three sub-snippet model. Of these, the single highest score is chosen and that sub-snippet model is applied to that keyword, URL combination.
For a sample keyword=“red dog run”, the following steps may be followed:
Step 1: Tokenize keyword into three tokens:
Step 2a: Locate instances of the tokens within the text document
Step 2b: Score the windows and identify the top ones. In this example, the three sub-snippet model, the best 3 windows were calculated.
Step 3: Allow for trading to occur. In this case, if window 1 can give up some of the non-token containing characters within it's left edge to window 3. This allows window 3 to expand and include the final token ‘run’, increasing the overall global score of the snippet.
In an alternative embodiment, the snippet generation process may involve the creation of an approximately 200 byte field used as a descriptor for the associated link. Snippet generation takes place post spidering and is created from the complete text of the document associated with the chosen URL or at least the portion of the document stored.
Personalization and Ranking
Within the ranking algorithm, there is the ability to select anonymous users who, based upon their behavioral profile, would have their votes for particular categories of links count more than other users.
Users who are heavy searchers (based upon their observed search behavior) would have their votes count more on links that they click more than the votes of novice searchers on that same link. In this way, the search experts would help produce more relevant ranking results.
Similarly, users who are highly involved in a particular category would have their votes count higher in that category than users who have no involvement in that category. Using behavior watching modules, one can identify users who are highly involved in various categories such as digital cameras, fantasy sports, or automobiles. For example, a user identified as being highly involved in the digital camera category would have his vote count more for links he clicked after a search for ‘Cannon G3’ than a user who is new to the category searching on that same keyword.
Identification of a user's category involvement status also drives personalization. A user with a high degree of involvement in a particular category would get different results from a user identified as less involved. This personalized results serving would require the presence of a cookie like object available on a particular machine. This lifestyle cookie would provide the search engine with a behavioral profile of the user, obtained from the users category navigational patterns. These category specific navigational patterns would be obtained from information contained in a categorization structure that also can be used for targeted advertising. For commercial purposes, a budget category or likely budget can be inferred from sites visited. Visitors to IKEA and Target are likely to have a different budget for apparel than visitors to Sachs Fifth Avenue or Bloomingdale's. Similarly, Hyatt Hotels are in a different budget category than youth hostels.
Personalization based on observed communications is much more powerful than user-entered customization, because research shows that only 8-14 percent of users manually personalized their content. Personalization highly correlates with pages viewed at a domain: users who personalize have been reported to view 130 percent more pages at the domain than users who do not personalize.
Sometimes different behavioral profiles can be leveraged to make a difference in search results. Other times, differences between two users' behavioral profiles does not help in the context of a particular search keyword.
Some examples are helpful. First, an ambiguous search terms example: A highly involved auto category user who searches for “jaguar” would get more Jaguar auto related links than jaguar animal related links as compared to a normal mix of auto and animal related links for someone with no identifiable category involvement. Identification and usage of these behaviorally profiles could slant results, without completely replacing results. In the example above, the auto category involved user could get 100% auto results, or just a larger percentage of auto results than found among popular websites.
Next, a sub-category identification example: Three users search for the keyword “rental car”. Three separate sets of results come up, each personalized for the users. Each user has a particular behavioral profile obtained from their past navigational patterns observed within the travel category. These behaviors are readily identifiable from the observed communications.
User 1: Frequent business traveler—his rental car results would be slanted toward the business traveler car rental results, possibly more about frequent rental points, etc.
User 2: Budget traveler—his rental car results would be slanted toward the budget traveler; rent-a-wreck type results, specials on sub-compact cars etc.
User 3: Luxury Traveler—his rental car results would be slanted toward the high-end luxury traveler; sports car rentals, classic car rentals, etc.
Return visit data from the behavior watching module can assist an advertiser in measuring the effectiveness of a particular ad. User differentiation by box can further be associated with selection of ads and evaluation of ad effectiveness.
Cross-browsing of users also can be reported. Users can be selected by follow-through, for instance all click-throughs or all users with conversions. The users with a conversion at a particular domain (or vendor or brand, for instance) can be rated by the frequency of their visits to competitors' domains (or vendors or brands).
Some Particular Embodiments
The present invention may be practiced as a method or device adapted to practice the method. The same method can be viewed from the perspective of a user' at their terminal or personal computer or on the server side, collecting information from users. The invention may be an article of manufacture such as media impressed with logic to carry out computer-assisted method.
A device embodiment, from the user perspective, may be embodied in a module running on the user's computer and collecting behavioral observations, coupled to a server that responds to the behavioral observations with information personalized to the user.
While the present invention is disclosed by reference to the preferred embodiments and examples detailed above, it is understood that these examples are intended in an illustrative rather than in a limiting sense. Computer-assisted processing is implicated in the described embodiments. Accordingly, the present invention may be embodied in methods aggregating of communication patterns, pre-processing links responsive to keyword searches, responding to keyword searches using aggregated communication patterns to rank the responsive links, and responding to keyword searches using recent and current navigation information systems to resolve ambiguities and/or personalize responses based on user characteristics. Other embodiments, as devices, include logic and resources to carry out thes methods. As systems, still other embodiments include behavior watching modules on terminals, servers that process or respond to the behavioral data, or both. Other embodiments include media impressed with logic to carry out the methods, data streams impressed with logic to carry out the methods, or computer-accessible services that carry out the methods. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.
The embodiments disclosed may be practiced as a method or device adapted to practice the method. They may be practiced as a system. The same method can be viewed from the perspective of a server collecting data or a behavior watching module generating data. The invention may include an article of manufacture such as machine readable memory including instructions to carry out the method or a data stream including instructions to carry out the method.
One embodiment is a method of selectively collecting web pages that may be returned to users in response to search requests. This method includes receiving from behavior watching modules operating on a multiplicity of users' computers, information regarding the users' search engine usage across one or more search engines. This information includes at least keywords (including phrases) submitted by particular users to the search engines, links selected by the particular users from results returned by the search engines, and at least one of dwell time or documents viewed by the particular users when following the selected links. The method further includes using the search engine usage information to choose keywords of interest and to choose links selected by the users corresponding to the keywords of interest.
The multiplicity of users in this embodiment may exceed 2000 or 100,000 users, depending on the desired coverage of frequently submitted search terms. The information received from behavior watching modules may span a plurality of unaffiliated search engines, providing data that no single search engine could collect.
This embodiment further may include ranking, for the chosen keywords, the corresponding links based on user access rates calculated from the search engine usage information and at least one of the dwell time or the documents viewed. The method further may include limiting the collection of content based on a predetermined application of the ranking, such as following only the top 10 or top 15 links for a particular keyword.
In some embodiments, the search engine usage information is known to have been qualified before receipt to discount dwell time during periods when a particular user is likely to have been inattentive.
This embodiment further may include selecting snippets from the content of a document reached by following one of the chosen links. This snippet selecting may include applying a window of predetermined length to the document, repeatedly shifting the window through the document. The keyword of interest, if it is a phrase, is divided into words. The method further includes repeatedly calculating one or more window scores for the words in the window, including a count of instances of the words in the window and a measure of adjacency of the words in the window. With the window scores, a plurality of non-overlapping window positions are chosen as snippets. The snippets or references to the snippets are stored in a machine readable memory. This method may be refined in many ways. The repeated shifts of the window may align the window so that its start or end matches a word boundary. As the window is repeatedly shifted, the score within the window may be calculated by reducing the score for characters removed from the window and increasing the score for characters added to the window.
Snippet generation may be refined by trading off lengths of the snippets, approximately maintaining a combined length of all the snippets. As the lengths of the snippets are traded off, the one or more window scores are recalculated. The method further includes choosing a particular trade-off of snippet links using the recalculated window scores, optionally merging the snippets, and storing the chosen snippets or references to the chosen snippets in a machine readable memory.
Viewed from the perspective of a network of behavior watching modules, the method includes using a multiplicity of behavior watching modules operating on a multiplicity of users' computers, the behavior watching modules collecting information regarding the users' search engine usage across one or more search engines. The information collected includes at least keywords (including phrases) submitted by particular users to the search engines, links selected by the particular users from results returned by the search engines, and at least one of dwell time or documents viewed by the particular users when following the selected links. The method further includes the behavior watching modules electronically reporting the search engine usage information to one or more affiliated servers. As above, the multiplicity of users in this embodiment may exceed 2000 or 100,000 users, depending on the desired coverage of frequently submitted search terms. The information received from behavior watching modules may span a plurality of unaffiliated search engines, providing data that no single search engine could collect.
The behavior watching modules may qualify the search engine usage information before reporting, to discount dwell time during periods when a particular user is likely to have been inattentive.
The behavior watching module embodiment may further include a particular user invoking a search engine with a keyword and receiving results from the search engine organized to reflect the reported search engine information.
The embodiments and various aspects of the embodiments described above may be practiced as a machine readable memory including instructions to carry out the methods and aspects of methods described or a data stream including the machine-readable instructions. Further, a device may include one or more servers, personal computers or other computer devices having logic and resources adapted to practice the methods and aspects of methods described.
A system embodiment includes at least one listener module and at least one computer or cluster of computers in communication with the listener module. The listener module includes a network interface that receives from behavior watching modules operating on a multiplicity of users' computers connected by network, information regarding the users' search engine usage across one or more search engines. The information includes at least keywords (including phrases) submitted by particular users to the search engines, links selected by the particular users from results returned by the search engines, and at least one of dwell time or documents viewed by the particular users when following the selected links. The computer or cluster of computers in communication with the listener module operate one or more modules that include logic and resources adapted to process the search engine usage information and choose keywords of interest and the links selected by the users corresponding to the keywords of interest, to follow the chosen links to collect at least part of the content of documents addressed by the chosen links, and to associate the collected content with the corresponding keyword of interest.
The multiplicity of users in this embodiment may exceed 2000 or 100,000 users, depending on the desired coverage of frequently submitted search terms. The information received from behavior watching modules may span a plurality of unaffiliated search engines, providing data that no single search engine could collect.
Another embodiment is a method of selecting snippets from content of a document that may be returned to users in response to search requests. The snippets selecting includes applying a window of predetermined length to the document, repeatedly shifting the window through the document. If the keyword of interest is a phrase, is divided into words. The method further includes repeatedly calculating one or more window scores for the words in the window, including a count of instances of the words in the window and a measure of adjacency of the words in the window. With the window scores, a plurality of non-overlapping window positions are chosen as snippets. The snippets or references to the snippets are stored in a machine readable memory. This method may be refined in many ways. The repeated shifts of the window may align the window so that its start or end matches a word boundary. As the window is repeatedly shifted, the score within the window may be calculated by reducing the score for characters removed from the window and increasing the score for characters added to the window.
Snippet generation may be refined by trading off lengths of the snippets, approximately maintaining a combined length of all the snippets. As the lengths of the snippets are traded off, the one or more window scores are recalculated. The method further includes choosing a particular trade-off of snippet links using the recalculated window scores, optionally merging the snippets, and storing the chosen snippets or references to the chosen snippets in a machine readable memory.
Snippet generation may be combined with duplicate elimination. The method may further include evaluating to more documents based on their titles and the calculated snippets. If the titles and snippets or fingerprints or checksums calculated from the titles and snippets match or are very similar, the documents may be declared and handled his duplicates.
In snippet generation, the count of instances of the words in the window may include both counting how many of the words in the keyword appear in the window and how many times the words appear in the window.
The embodiments and various aspects of the embodiments described above may be practiced as a machine readable memory including instructions to carry out the methods and aspects of methods described or a data stream including the machine-readable instructions. Further, a device may include one or more servers, personal computers or other computer devices having logic and resources adapted to practice the methods and aspects of methods described.