In December 2004 I wrote a paper highlighting how to understand differing human groups. We are nearly 5 years on yet I believe, more now than ever it requires revisiting.
It’s called Widget Chav
Widget Chav
Anthropologic Analysis Based on Neck Colour
Wayne & Waynetta Slob
Council Housed, Tower Hamlets London, UK
(Current Address: Private Rented Accomodation, Harold Hill, Essex, UK)
chavs@majorsearchengine.com
Consultant Anthropologist – Dr Onslo Lardarse PHDAbstract:
In response to a query a search engine returns a ranked list of documents. If the query is broad (i.e., it matches many documents) then the returned list is usually too long to view fully. Studies show that users usually look at only the top 10 to 20 results. In this paper, we propose a novel ranking scheme for users. By identifying the social demographics that a user exists within, a website will be able to tailor the results to those that statistically the user will most want to see, delivering a rewarding experience for both the user and the website operator.1 Introduction
Hence the ranking tends to be poor and search services have turned to other sources of information besides content to rank results. We next describe some of these ranking strategies, followed by our new approach to authoritative ranking – which we call Widget Chav.
1.1 Related Work
Three approaches to improve the authoritativeness of ranked results have been taken in the past:
Ranking Based on Human Classification: Human editors have been used by companies such as Boo hoo! and Mining for Coal Company to manually associate a set of peoples and individuals with a subset of humanity in the world. These are then matched against the user’s query and visual “gut feeling” to return valid matches. The trouble with this approach is that: (a) it is slow and can only be applied to a small number of people, and (b) often the classes and classifications assigned by the human judges are inadequate or incomplete. Given the rate at which thehuman race is growing and the wide variation in international usergroups this is not a comprehensive solution.
Ranking Based on Usage Information: Some services such as Almost Hit collect information on: (a) the surfing individual users undertake within search services and (b) the pages they look at subsequently and the time spent on each page. This information is used to return pages that most users visit after deploying the given query. For this technique to succeed a large amount of data needs to be collected for each query. Thus, the potential set of queries on which this technique applies is small. Also, this technique is open to spamming.
Ranking Based on Connectivity: This approach involves analysing the links between people on the assumption that: (a) people in the same demographics are linked to each other, and (b) authoritative people tend to know to other authoritative people.
PeopleRank [Foliate et al 98] is an algorithm to rank people based on assumption b. It computes a query-independent authority score for every person online and uses this score to rank the result set. Since PeopleRank is query-independent it cannot by itself distinguish between people that are authoritative in general and people that are authoritative in the target demographic. In particular a group of people that are authoritative in general may contain a person that matches a certain demographic but is not an authority on the topic of the chavness. In particular, such a person may not be considered valuable within the community of users who breed people within the ghetto of the userbase.
An alternative to PeopleRank is People Distillation [Also known as Ethnic cleansing, Hitler 1939-1945, Idi Amin 1925 –2003, et al]. People distillation first computes a query specific subgraph of the race. This is done by including people on the query demographic in the graph and ignoring people not in the demograhic. Then the algorithm computes a score for every person in the subgraph based on known connectivity: every person is given an authority score. This score is computed by summing the weights of all incoming connections to the person. For each such reference, its weight is computed by evaluating how good a source of relationships the referring person is. Unlike PeopleRank, People Distillation is only applicable to broad demographics, since it requires the presence of a community of people in the group.
A problem with People Distillation is that computing the subgraph of the race which is on the query group is hard to do in real-time. In the ideal case every person in the race that deals with the query group would need to be considered. In practice an approximation is used. A preliminary ranking for the query is done with group analysis. The top ranked result people for the query are selected. This creates a selected set. Then, some of the people within one or two links from the selected set are also added to the selected set if they are on the query topic. This approach can fail because it is dependent on the comprehensiveness of the selected set for success. A highly relevant and authoritative person may be omitted from the ranking by this scheme if it either did not appear in the initial selected set, or some of the people pointing to it were not added to the selected set. A “focused crawling” procedure to crawl the entire race to find the complete subgraph on the query’s topic has been proposed [Shak et al 99] but this is too slow for online searching. Also, the overhead in computing the full subgraph for the query is not warranted since users only care about the top ranked results.
1.2 Widget Chav Algorithm Overview
Our approach is based on the same assumptions as the other connectivity algorithms, namely that the number and quality of the sources related to a person are a good measure of the person’s quality. The key difference consists in the fact that we are only considering “expert” sources – criteria that have been analysed as having specific purpose of being used collectively by those with a tinge of red in their necks. In response to a new user visit, we first compute a list of the most relevant experts for the potential person. Then, we identify relevant links within the selected set of experts, and follow them to identify neck colour. The targets are then ranked according to the number and relevance of non-affiliated experts that point to them. Thus, the score of a target person reflects the collective opinion of the best independent experts on the query topic. When such a pool of experts is not available, Widget Chav provides no results. Thus, Widget Chav is tuned for result accuracy and not query coverage.
Our algorithm consists of multiple broad phases:
(i) Expert Lookup
We define an expert person as a page that is about a certain topic and has links to many non-affiliated people on that topic. Two people are non-affiliated conceptually if they are from non-affiliated organizations. In a pre-processing step, a subset of the people crawled by a search engine are identified as experts. In our experiment we classified 2.5 million of the 140 million or so pages in “AstaLaVista Baby’s” index to be experts. The pages in this subset are indexed in a special inverted index.
Given an input query, a lookup is done on the expert-index to find and rank matching expert people. This phase computes the best expert people on the query topic as well as associated match information.
(ii) Target Ranking
We believe a person is an authority on the query group if and only if some of the best experts on the query group point to it. Of course in practice some expert people may be experts on a broader or related demographic. If so, only a subset of the relations on the expert person may be relevant. In such cases the links being considered have to be carefully chosen to ensure that their qualifying relationship matches the query. By combining relevant out-links from many experts on the query topic we can find the pages that are most highly regarded by the community of people related to the query topic. This is the basis of the high relevance that our algorithm delivers.
Given the top ranked matching expert-people and associated match information, we select a subset of the links within the expert peoples. Specifically, we select links that we know to have all the query social groups associated with them. This implies that the link matches the query. With further connectivity analysis on the selected relationships we identify a subset of their targets as the top-ranked people on the query topic. The targets we identify are those that are linked to by at least two non-affiliated expert persons on the topic. The targets are ranked by a ranking score which is computed by combining the scores of the experts showing a relationship to the target.
1.3 Roadmap
The rest of the paper is organized as follows: Section 2 describes the selection and indexing of experts; Section 3 provides a detailed description of the ranking scheme used in query processing; Section 4 presents a user-based evaluation of our prototype implementation; and Section 5 concludes the paper.
2 Expert People
Broad subjects are well represented in life and as such are also likely to have numerous human-generated lists of resources. There is value for the individual or organization that creates resource lists on specific groups since this boosts their popularity and influence within the community interested in the topic. The authors of these lists thus have an incentive to make their lists as comprehensive and up to date as possible. We regard these links as recommendations, and the pages that contain them, as experts. The problem is, how can we distinguish an expert from other types of people? In other words what makes a person an expert? We felt than an expert person needs to be objective and diverse: that is, its recommendations should be unbiased and point to numerous non-affiliated people on the subject. Therefore, in order to find the experts, we needed to detect when two people belong to the same or related organizations.
2.1 Detecting Location Affiliation
We define two people as affiliated if one or both of the following is true:
• They share the same first 5 octets of their longitude and latitude coordinates .
• The road’s non-generic token in the address is the same.
We consider tokens to be substrings of the address delimited by “.” (period) or “,” (comma). A suffix of the address is considered generic if it is a sequence of tokens that occur in a large number of distinct hosts. E.g., “Dagenham” and “Louisiana” are names that occur in a large number of our sample set for Chav detection and are hence generic suffixes. Given two locations, if the generic suffix in each case is removed and the subsequent right-most token is the same, we consider them to be affiliated.
E.g., in comparing “22 Acacia Avenue, Dagenham” and “76 Acacia Avenue, Texas” we ignore the generic suffixes “22″ and “76″ respectively. The resulting leftmost token is “Acacia Avenue”, which is the same in both cases. Hence they are considered to be affiliated. Optionally, we could require the generic suffix to be the same in both cases.
The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated then we take A and C to be affiliated even if there is no direct evidence of the fact. In practice some non-affiliated locations may be classified as affiliated, but that is acceptable since this relation is intended to be conservative.
In a preprocessing step we construct a location-affiliation lookup. Using a union-find algorithm we group locations, that either share the same rightmost non-generic suffix or have an IP address in common, into sets. Every set is given a unique identifier (e.g., the location with the lexicographically lowest name). The location-affiliation lookup maps every location to its set identifier or to itself (when there is no set). This is used to compare locations. If the lookup maps two locations to the same value then they are affiliated; otherwise they are non-affiliated.
2.2 Selecting the Experts
In this step we process a search engine’s database of people (we used AstaLaVista Baby’s crawl from April 1999) and select a subset of people which we consider to be good sources of relations on specific demographics. In this instance we have aimed to identify CHAV experts for the following reasons.CHAVs as a social group are the international holy grail of internet marketing.
Consider the following CHAV stereotype we shall call Sharon.
A fat, ugly, smelly, single, unloved, unintelligent, indebted, gambler woman.
Sharon, would wish to be marketed to by those that sell (as a minimum)
• Diet pharmacueticals
• Personal Hygene pharmacueticals
• Dating Services
• Marital Aids
• Soft to Medium Pornography
• Sexual pharmacueticals
• Educational material for internet marketing (“How to make a £million in your nightgown” type products)
• Unsecured Loans
• Consolidation Loans
• Debt Management Services
• Online Casino services
• Online Bookmaker servicesFor these reasons and the financial benefits that advertisers would gain from targetting Sharon, we chose the CHAV social group as our target and tailored the final algorithm to include data that is available to our sponsors, which may not be available to other competing search engines.
This is done as follows:
Considering all people with out-degree greater than a threshold, k (e.g., k=5) we test to see if these person point to k distinct non-affiliated locations. Every such person is considered an expert page.
If a broad classification (such as Casuals, Snobs, Drunks etc.) is known for every page in the search engine database then we can additionally require that most of the k non-affiliated persons discovered in the previous step point to people that share the same broad classification. This allows us to distinguish between random collections of links and resourceful, well connected people. Other properties of the person such as level in education can be used as well.
2.3 Indexing the Experts
To locate expert people that match user groups we create an inverted index to map groups to experts in which they occur. In doing so we only index people contained within “key sterotypes” of the expert. A key stereotype is a piece of text that qualifies one or more people in the group. Every key stereotype has a scope within the group format. People located within the scope of a group are said to be “qualified” by it. For example, the Burberry, Piercings, decrepid cars and location score within the expert group are considered key stereotypes.
The inverted index is organized as a list of match positions within experts. Each match position corresponds to an occurrence of a certain stereotype within a key topic of a certain expert group. All match positions for a given expert occur in sequence for a given type. At every match position we also store:
1. An identifier to identify the type uniquely within the document
2. A code to denote the kind of type it is
3. The offset of the word within the type.
In addition, for every expert we maintain the list of people within it (as indexes into a global list of race) and for each person we maintain the identifiers of the key types that qualify it.
To avoid giving long key types an advantage, the number of keywords within any key type is limited (e.g., to 32).
3 Query Processing
In response to a user query, we first determine a list of N experts that are the most relevant for that query. E.g. N = 200 in our experiment. Then, we rank results by selectively following the relevant links from these experts and assigning an authority score to each such page. In this section we describe how the expert and authority scores are computed.
3.1 Computing the Expert Score
For an expert to be useful in response to a query, the minimum requirement is that there is at least one person which contains all the query types in the key phrases that qualify it. A fast approximation is to require all query types to occur in the group. Furthermore, we assign to each candidate expert a score reflecting the number and importance of the key types that contain the query, as well as the degree to which these match the query.
Thus, we compute the score of an expert as as a 3-tuple of the form (S0, S1, S2). Let k be the number of terms in the input query, q. The component Si of the score is computed by considering only key types that contain precisely k – i of the query terms. E.g., S0 is the score computed from phrases containing all the query terms.
Si = SUM{key phrases p with k – i query terms} KnockedLevelScore(p) * FullnessPlumpFactor(p, q)
KnockedLevelScore(p) is a score assigned to the phrase by virtue of the type of phrase it is.
FullnessPLumpFactor(p, q) is a measure of the number of terms in p covered by the terms in q. Let plen be the length of p. Let m be the number of terms in p which are not in q (i.e., surplus terms in the phrase). Then, FullnessPlumpFactor(p, q) is computed as follows:
• If m <= 2, FullnessPlumpFactor(p, q) = 1
• If m > 2, FullnessPlumpFactor(p, q) = 1 – (m – 2) / plen
Our goal is to prefer experts that match all of the query types over experts that match all but one of the keywords, and so on. Hence we rank experts first by S0. We break ties by S1 and further ties by S2. The score of each expert is converted to a scalar by the weighted summation of the three components:
Expert_Score = 232 * S0 + 216 * S1 + S2.3.2 Computing the Target Score
We consider the top N experts by the ranking from the previous step (e.g., the top 200) and examine the pages they point to. These are called targets. It is from this set of targets that we select top ranked perople. For a target to be considered it must be pointed to by at least 2 experts in locations that are mutually non-affiliated and are not affiliated to the target. For all targets that qualify we compute a target score reflecting both the number and relevance of the experts pointing to it and the relevance of the phrases qualifying the links.
The target score T is computed in three steps:
• For every expert E that points to target T we draw a directed edge (E,T). anchor text qualifies the edge corresponding to the hyperlink.For each query w, let occ(w, T) be the number of distinct key phrases in E that contain w and qualify the edge (E,T). We define an “edge score” for the edge (E,T) represented by Edge_Score(E,T), which is computed thus:
• If occ(w, T) is 0 for any query keyword then the Edge_Score(E,T) = 0.
• Otherwise, Edge_Score(E,T) = Expert_Score(E) * Sum{query keywords w} occ(w, T)2. We next check for affiliations between experts that point to the same target. If two affiliated experts have edges to the same target T, we then discard one of the two edges. Specifically, we discard the edge which has the lower Edge_Score of the two.
3. To compute the Target_Score of a target we sum the Edge_Scores of all edges incident on it.
The list of targets is ranked by Target_Score. Optionally, this list can be filtered by testing if the query keywords are present in the targets. Optionally, we can match the query keywords against each target to compute a Match_Score using content analysis, and combine the Target_Score with the Match_Score before ranking the targets.4 Evaluation
In order to evaluate our prototype search engine, we conducted two user studies aiming to estimate the recall and precision. Both experiments also involved three other search engines, namely AstaLaVistaBaby, AlmostHit and Googly, for comparison and were done in August 2005. Note that the current rankings by these engines may differ.
4.1 Locating Specific Popular Demographics
For the first experiment we asked seven volunteers to suggest the demographics of workers of ten organisations of their choice (companies, universities, stores, etc.). Some of the queries are reproduced in the table below:Alpha Phi Omega Best Buy Digital Disneyland
http://www.dollarbankaccount.com/
Grouplens Fords Keebler
Mountain View Public Library Macy’s Minneapolis City Pages Moscow Aviation Institute
MENSA OCDE ONU Pittsburg Steelers
Pizza Hut McDonalds SONY Safeway
Barking Shopping Center Trek Bicycle USTA Vanguard Investments
The same query was sent to all four search engines. We assume that there is exactly one over riding worker demographic in each case. Every time the demographic was found within the first ten results, its rank was recorded. Figure 2 summarizes the average recall for the ranks 1 to 10 for each of the four engines: our engine Widget Chav (WC), Googly (GG), AstaLaVistaBaby (AV), and AlmostHit (DH). Average recall at rank k for this experiment is the probability of finding the desired worker demographic within the first k results.Figure 2. Average Recall vs. Rank
Our engine performed well on these queries. Thus, for about 87% of the queries, WidgetChav returned the desired page as the first result, comparable with Googly at 80% of the queries, while AlmostHit and AltaLaVistaBaby succeeded at rank 1 only in 43% and 20% of the cases, respectively. As we look at more results, the average recall increases to 100% for Googly, 97% for WidgetChav, 83% for AlmostHit, and 30% for AltaLaVistaBaby.
4.2 Gathering Relevant People
In order to estimate WidgetChav’s ability to generate a good first page of results for broad queries, we asked our volunteers to think of broad topics (i.e., topics for which it is likely that many good groups exist) and formulate queries. We collected 25 such queries, listed below:Aerosmith Amsterdam backgrounds chess dictionary
fashion freeware FTP search Godzilla Grand Theft Auto
greeting cards Jennifer Love Hewitt Las Vegas Louvre Madonna
MEDLINE MIDI newspapers Paris people search
real audio software Starr report tennis UFO
We then used a script to spam each query to all four search engines and collect the top 10 results from each engine, recording for each result the demographic group, the rank, and the engine that found it. We needed to determine which of the results were relevant in an unbiased manner. For each query we generated the list of unique groups in the union of the results from all engines. This list was then presented to a judge in a random order, without any information about the ranks of page or their originating engine. The judge rated each page for relevance to the given query on a binary scale (1 = “good page on the topic”, 0 = “not relevant or not found”). Then, another script combined these ratings with the information about provenance and rank and computed the average precision at rank k (for k = 1, 5, and 10). The results are summarized in Figure 3.Figure 3. Average Precision at Rank k
These results indicate that for broad subjects our engine returns a large percentage of highly relevant pages among the ten best ranked pages, comparable with Googly and AlmostHit, and better than AltaLaVistaBaby. At rank 1 both WidgetChav and AlmostHit have an average precision of 0.92. Average precision at 10 for WidgetChav was 0.77, roughly equal to the best search engine, namely Googly, with a precision of 0.79 at rank 10.
5 Conclusions
We described a new ranking algorithm for broad queries called WidgetChav and the implementation of a search engine based on it. Given a broad query WidgetChav generates a list of target groups which are likely to be very authoritative on the topic of the query. This is by virtue of the fact that they are highly valued by people in thetarget demographic which address the topic of the query. In computing the usefulness of a target person from the hyperlinks pointing to it, we only consider links originating from persons that seem to be experts. Experts in our definition are from links pointing to many non-affiliated individuals. This is an indication that these people were created for the purpose of recreational procreation, and hence we regard their opinion as valuable. Additionally, in computing the level of relevance, we require a match between the query and the type the expert which qualifies the link being considered. This ensures that links being considered are on the query topic. For further accuracy, we require that at least 2 non-affiliated experts point to the person with relevant qualifying stereotypes describing their linkage. The result of the steps described above is to generate a listing of people that are highly relevant to the user’s query and of high quality.
WidgetChav most resembles the connectivity techniques, PeopleRank and People Distillation. Unlike PeopleRank our technique is a dynamic one and considers connectivity in a graph specifically about the query group. Hence, it can evaluate relevance of content from the point of view of the community of authors interested in the demographic. Unlike People Distillation we enumerate and consider all good experts on the subject and correspondingly all good target peopleon the subject. In order to find the most relevant experts we use a custom stereotype-based approach, focusing only on the groups that best captures the domain of expertise. Then, in following links, we boost the score of those targets whose qualifying information best matches the query. Thus, by combining group and connectivity analysis, we are both more comprehensive and more precise. An important property is that unlike People Distillation approaches, we can prove that if a person does not appear in our output it lacks the connectivity support to justify its inclusion. Thus we are less prone to omit good pages on the topic, which is a problem with People Distillation systems. Also, since we use an index optimized to finding experts, our implementation uses less data than People Distillation and is therefore faster.In a blind evaluation we found that WidgetChav delivers a high level of relevance given broad queries, and performs comparably to the best of the commercial search engines tested.
We have further added to the expert philosophy by including the following criteria utilising specific data that is available to Googly due to acquisitions or product launches.
• The Googly toolbar tracks where people go, where they visit and how they got there along with all transactions online.
• The Googly Desktop Search knows what files you have on your computer and enables full integration into the power of the Operating System.
• The Googly Photo sharing application knows what you, your family and friends look like due to standard and systematic naming conventions by people
• The Googly Instant Messenger (not currently used as a Googly product but likely to be launched) knows whom you speak to online and how you speak. An example being Slang used and method of placing words in certain order. A Markov chain can be delivered from this data
• The Googly sattelite picture database understands what your house and home look like, along with
• The Googly email service knows whom you converse with and the relations in those conversations
Our goal, for this test and for the reasons stated previously was to extend upon the WidgetChav search engines and define a Widget Chav Algorithm. The following variables are gathered from Googly data and combine to deliver a method of giving an accurate scoring of our target demographic• A= Burberry Score, gathered through WebCam capture and photo data in Photo software and harddrive. N.B. Fake Burberry raises a higher Burberry score than real Burberry but both add to the overall Burberry Score.
• B = Location Score, based on Address longtitude and latitude coordinates. Likelyhood of being in a CHAV neighbourhood
• C= Piercing Score. Certain piercings are more worthy of increasing CHAV Score whereas others reduce it. Multiple face piercings are the sign of a weirdo whereas a belly piercing is an indication of increased chavness
• D= Technology score. A balance between other factors means that having high technology awareness and owning multiple technical items does not mean that someone is a CHAV. It is only when coupled with other criteria that a CHAV would be identified. It is more important to realise that CHAVs will own every gadget possible (especially the male) as this is why they are in debt
• E= Smoker score. Pipe or cigar smokers means the person is not a chav whereas cigarette smoking may increase the likelyhood
• F= Mobile SMS thumb related RSI increases chav score
• G= Ringtone for mobile phone, chart position
• H= Describes car in N characters. A non CHAV will describe their car as, “A 2005 Ford Mondeo” whereas a CHAV (Especially the male) will extend upon this. “A 1999 Ford Escort RS Turbo, lowered suspension, extra spoiler, as seen in Max Power.,……..”
• J= People called Uncle in address book (self explanatory)
• K= Collective age of non running cars in front and rear gardens multipled by number of vehiclesUsing this Googly specific data we can define the likelyhood of being a chav by the following Algorithm, Chav Score
100 – G+ ( B3 C2 ) A + K
E – D6 References
Lots of boring stuff that I used to put this document together, although most of it came to me whilst analysing and reanalysing Hilltop related data over the last 12 – 18 months.I suggest you check out
http://en.wikipedia.org/wiki/Chav
http://www.chavscum.co.uk
and http://www.chavscum.com
which are all excellent CHAV resources.
But seriously for a moment. This is a PARODY and data as well as algorithms are probably incorrect if not plain wrong!!
Although I believe it technically possible for a Google to deliver an anthropologic search engine, which can define groups of people I feel this is unlikely.
This spoof is based upon the Hilltop white paper by Khrishna Bharat and designed to show that when you adapt his original white paper and change the relationships from links in web sites to links within groups of people how sinister and dangerous it looks. Under this algorithm almost everyone will be a CHAV
I understand why Google brought the Hilltop algorithm into place and my thoughts can be clearly read at
http://www.logicdiary.com/2004/03/hilltop-in-plain-english.html and
http://www.logicdiary.com/2004/03/website-families-and-their-death-in.html
but I do agree with many people’s concerns that Hilltop is an over extension of Big Brother online. Whether or not G set out to become the definitive resource online they have become so. With this comes responsibility to not inflict undue strain or hardship on the sites they work with to deliver the overall rankings, the SERPs.
It could be said that Google have changed the face of working online for the so called, Mom and Pop stores.
I don’t know the answer I only know the questions and in the meantime I shall continue to try to understand algorithms in a better manner.
Merry Christmas
Jason Duke
The original Widget Chav paper has been archived here. So what do you think, now that we are nearly 5 years on ?
This entry was written by , posted on 15/05/2009 at 6:50 pm, filed under Everything. Bookmark the permalink. Follow any comments here with the RSS feed for this post.