Cookie Laws – What it means for your business.

An amendment to the Privacy and Electronic Communications Regulations (PECR) comes into force in the UK on the 26th May 2011. The amended section is commonly called “The Cookie Law” as it affects how your online business, and the businesses that you work with, use cookies and other “cookie like” stores of information such as Adobe Flash’s Local Shared Objects and browser web storage.

The legislation dates back to 2003 but has been revised, with changes covering cookie like data.

Previous PECR Section 6

6.   (1) Subject to paragraph (4), a person shall not use an electronic communications network to store information, or to gain access to information stored, in the terminal equipment of a subscriber or user unless the requirements of paragraph (2) are met. (2) The requirements are that the subscriber or user of that terminal equipment -

  • (a) is provided with clear and comprehensive information about the purposes of the storage of, or access to, that information; and
  • (b) is given the opportunity to refuse the storage of or access to that information

is being changed to read:

6 (1)     Subject to paragraph (4), a person shall not store or gain access to information stored, in the terminal equipment of a subscriber or user unless the requirements of paragraph (2) are met. (2)The requirements are that the subscriber or user of that terminal equipment–

  • (a)     is provided with clear and comprehensive information about the purposes of the storage of, or access to, that information; and
  • (b)     has given his or her consent.

(3)     Where an electronic communications network is used by the same person to store or access information in the terminal equipment of a subscriber or user on more than one occasion, it is sufficient for the purposes of this regulation that the requirements of paragraph (2) are met in respect of the initial use. (3A) For the purposes of paragraph (2), consent may be signified by a subscriber who amends or sets controls on the internet browser which the subscriber uses or by using another application or programme to signify consent. (4) Paragraph (1) shall not apply to the technical storage of, or access to, information–

  • (a)     for the sole purpose of carrying out the transmission of a communication over an electronic communications network; or
  • (b)     where such storage or access is strictly necessary for the provision of an information society service requested by the subscriber or user.

This change means that a website visitor must give consent before information is stored on, or accessed from, their device. The ICO has released guidance notes and advice on the implementation of this change. What is not clear, however, is what consent means and how to gain it in a manner that is both compliant and allows a business to function effectively.

Despite the department’s name, it seems that the Information Commissioner’s Office has not delivered meaningful and detailed information. Even so, there are many actionable points that have been shown and multiple “clues” as to how the legislation will be enforced going forward.

Please notice that I said enforcement above, as it is always the interpretation of the law, and how it will be enforced, that is the defining factor as to whether something is legitimate or not. The Information Commissioner has made it clear that enforcement action will NOT begin on the 26th May, when the legislation comes into effect, but that businesses will be expected to have firm and sound plans in place to comply.

But what does it mean to “comply with the act” and how will that affect your business?

If you were to aim for perfection in your data compliance exercise then your site would end up looking like the excellent example that David Naylor gives on his blog. I do not believe that this is how the UK Government intends web businesses to operate, nor do I believe it is how businesses will enact the legislative change.

Ed Vaizey is the Minister for Culture, Communications and Creative Industries, a Parliamentary Under-Secretary of State post with responsibilities in both the Department for Culture, Media and Sport (DCMS) and the Department for Business, Innovation and Skills (BIS). In a speech to the Confederation of British Industry (the CBI) he said:

It seems to me that consumers have two key concerns around privacy. The first is about what happens to the data that we upload: the bank details we submit when we buy our groceries online; the family video on myspace; the photo on Facebook. The second concern is more complicated and relates to what others know about us and where we have been, to the fear of the online big-brother; a debate which in the US has come to be known as “do not track”.

Let’s be clear about where we are today. Many people voluntarily give up their privacy when they go online.  But they still want a number of rules to apply.

They want the sites they use to be secure; they want to be sure that their data is kept securely; and they want internet companies to be transparent in how their data is used in terms of tracking their activity on the web.

There are many benefits to internet sites knowing who you are, or indeed where you are, in terms of providing tailored information.  People just want to have the option to say ‘yes’ or ‘no’ before allowing it to happen.

Now let me be clear. When it comes to addressing these concerns, I am not a big fan of regulation.

When Government steps into the fast moving world of technology we risk creating more problems than we can solve.   If industry can bring in its own measures to reassure customers – such as clear guidelines in plain English and greater transparency – not only will they win customers, they will avoid regulation.

The red emphasis above (and below) is mine as I feel these are important areas to focus on.

Ed Vaizey went on to say:

Of these, it is the cookies provision that is the biggest change, and therefore of most concern to business.  It’s a good example of a well-meaning regulation that will be very difficult to make work in practice. If we get the implementation wrong, it will seriously hamper the smooth running of the internet, and so it’s therefore a provision that should concern the consumer as well.

That’s why our approach to this very challenging provision is a sensible and pragmatic one.  We have made it clear, for example, that the consent of the user is not needed where a cookie is essential for a service that has been requested by the user. The use of cookies for shopping baskets on websites, for example.

We are also supporting cross-industry work on the use of third party cookies in behavioural advertising.

and then also

We are also working with the browser manufacturers to see if browsers can be enhanced to provide relevant information about cookies, as well as easy to use settings. Because we want users to be able to make informed decisions about what they do or don’t allow on to their machines.

However, a one size fits all solution will not cover everything. There will, inevitably, be legitimate uses of cookies that fall through the cracks.

That’s why it is so important for us to adopt a flexible approach – so that new business models and innovations that no one has yet thought of are not held back.

We don’t want to be prescriptive. We want business, regulators and consumers to continue to work together to provide solutions as problems arise.  And we want to see sensible solutions that balance privacy and innovation.

Ed Vaizey’s speech was very welcome and gave some very strong clues as to what is OK and what isn’t, along with the problems associated with trying to tar all cookies or cookie like mechanisms with the same brush.

Some points in the speech were backed up by sections in today’s ICO advice.

I have heard that browser settings can be used to indicate consent – can I rely on that?

One of the suggestions in the new Directive is that the user’s browser settings are one possible means to get user consent.  In other words, if the user visits your website, you can identify that their browser is set up to allow cookies of types A, B and C but not of type D and as a result you can be confident that in setting A, B and C you have his consent to do so.  You would not set cookie D.   At present, most browser settings are not sophisticated enough to allow you to assume that the user has given their consent to allow your website to set a cookie.  Also, not everyone who visits your site will do so using a browser.  They may, for example, have used an application on their mobile device.  So, for now we are advising organisations which use cookies or other means of storing information on a user’s equipment that they have to gain consent some other way.

We also know that the UK Government has been speaking with (the Govt. use the term “working with”) browser manufacturers. Some of the developments that have come out of these discussions, either directly or not, are features such as Do Not Track in Mozilla browsers such as Firefox. Similar features are due to be enabled within Internet Explorer and Apple’s Safari. Google has so far decided not to include the feature by default within their browser but has released an extension to Chrome that delivers similar functionality.

Concurrently with the EU and UK changes there is much discussion occurring over on the other side of the Atlantic, a lot of which is very similar to that happening within Europe. I am far from best placed to comment on US legislation and the processes that bills go through before they become law, but I do believe that many of the US changes that are likely to occur are extremely similar to those within the UK. It is this that I believe Ed Vaizey was referencing when he said:

Creating an international standard for on-line privacy will ensure businesses compete on a level playing field while web users enjoy the same protections wherever a website is based.  This may seem like a lofty ambition. But I think that looking at trends in the US and intentions in Europe, it is clear that the two are not poles apart.  Indeed, both the Commission and the American Administration recognise that this is a problem that needs to be addressed.

It is probably no coincidence that yesterday the Internet Advertising Bureau, the trade body for the UK online advertising industry, announced the release of Pan European guidelines on behavioural targeting and published the full guidelines.

It should probably be noted that the IAB defines behavioural advertising as:

Online Behavioural Advertising is the collection of online information to facilitate the delivery of display advertising based upon the potential preferences or interests of web users or to advertise a product consumers have shown an interest in previously. For example: a consumer searching online for a new car and therefore visiting vehicle websites may – in a later online session – be shown car advertisements.

In your own business it is unlikely that you classify as a business that delivers behavioural advertising. Not many of the people reading this will be Google, Yahoo or one of the myriad other retargeting companies. What some of you may well do, though, is sell advertising to networks such as Google, Yahoo or others. It is not clear whether, if you sell advertising via a broker like Google (say via AdSense), you will need to explicitly tell your visitors that you do so.

The IAB would say that Google, or the relevant advertising brokerage that you work with, will show a symbol on your site which links to meaningful information explaining what kind of data is gathered and how, as well as giving the visitor the ability to opt out. It is this that the advertising industry believes complies with EU (and likely US) law.

The clues in the ICO information and the CBI speech imply this would be the case. When the IAB show the i symbol in a prominent position, they could argue that this is similar to, and arguably better than, the ICO’s use of the term “scrolling piece of text”!

Where we stand at the moment is in an area without clear, specific guidelines but with many clues as to what the ICO will and will not accept as we go forward.

If you are a business that runs an e-commerce store and, as part of the e-commerce transaction, need to store products that the user has put in their cart, then in my interpretation of the legislation you will NOT require explicit pre informed consent, as that is the primary focus of your business.

If, though, you show advertising on your site via a third party, then make sure you work with a partner that delivers functionality similar to that set out by the IAB. They make clear that adverts which are shown have a link to a detailed explanation of what data is gathered and how, always with a link allowing people to opt out. It is NOT clear if this will be enough to placate the ICO or the US equivalent organisations, but it is definitely better than not telling your visitors anything. If you are unsure then you can make sure you are compliant by explicitly telling your visitors that you display advertising via a 3rd party and that the advertiser may use behavioural retargeting. If the visitor agrees then great; if not then you may wish to turn advertising off or refuse entry to your site. Advertising is your business model after all and you do NOT need to allow anyone onto your site if you don’t wish to!

The ICO also made it clear that there are many different levels of cookie and data collection, and there could well be a huge difference between a visitor who has chosen to come to your site and is then shown content that is more relevant to them to assist in the normal use of that site, and behavioural advertising on Site A that promotes Site B.
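To make those “different levels” a little more concrete, here is a minimal Python sketch of how a site might decide which cookies it sets for a given visitor. The cookie names and category labels are hypothetical, and the idea that the shopping basket is exempt while analytics and advertising need consent is simply my reading of the legislation as described above, not anything the ICO has published in this form.

```python
# Hypothetical cookie names mapped to hypothetical categories, for illustration only.
COOKIE_CATEGORIES = {
    "basket": "strictly_necessary",          # cart contents requested by the visitor
    "analytics_id": "analytics",             # e.g. a visitor id for an analytics package
    "ad_retargeting": "behavioural_advertising",
}


def cookies_allowed(consented_categories):
    """Return the sorted names of cookies that may be set for this visitor.

    Strictly necessary cookies are always allowed (assumed exempt);
    everything else needs the visitor to have consented to its category.
    """
    allowed = []
    for name, category in COOKIE_CATEGORIES.items():
        if category == "strictly_necessary" or category in consented_categories:
            allowed.append(name)
    return sorted(allowed)
```

A visitor who has agreed to nothing would only get the basket cookie, while one who has agreed to analytics would get the basket and analytics cookies but still no retargeting cookie.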

I believe that once all major browsers enable Do Not Track or equivalent features, and as long as you treat the differing implementations and specs of those systems effectively, always honouring them when someone has these settings turned on, then you will be compliant with the implementation. I should stress though that you should ALWAYS detail in your privacy policy what you do with data, how you collect it and what you collect it for, and give everyone the ability to say “no thanks”. Don’t forget that this includes things such as analytics or other packages. Data is data, it doesn’t have to be solely behavioural advertising.
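As a rough illustration of honouring that setting, the sketch below checks for the `DNT: 1` request header that Firefox’s Do Not Track feature sends. The function names and the plain dictionary of headers are my own assumptions, other browsers may signal the preference differently, and passing this check is not a guarantee of compliance.

```python
def do_not_track_enabled(headers):
    """Return True when the request headers include 'DNT: 1'.

    `headers` is assumed to be a plain dict of header names to values;
    header names are compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def may_set_tracking_cookie(headers, visitor_has_consented):
    """Track only when the visitor consented AND has not asked not to be tracked."""
    if do_not_track_enabled(headers):
        return False
    return visitor_has_consented
```

The design choice here is that a Do Not Track signal always wins, even over an earlier “yes”, which errs on the side of the visitor.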

What I think is a very good clue as to how strongly the ICO will act in enforcing the new regs is shown by the ICO web site itself. Currently the ICO uses 3rd party tracking via Google Analytics. How do I know this? Well, the ICO privacy policy tells me so, as well as showing me how to opt out.

But …. I wasn’t asked before I reached the ICO site and didn’t give explicit pre informed consent before I was recorded as a visitor in the Analytics package. I had to land on the site, search for the privacy policy (or privacy notice as the ICO call it) and then click through and read it there.

If it’s good enough for the goose……..

N.B. Please do not take any of the above post as representing anything other than my personal views. They do not express anything other than some ramblings from a British online marketer, do not express the views of any companies I work for, with, have previously worked with, will work with in the future nor any other thing that I have ever spoken to or with. The words above are my views only and if you are in any doubt then please take independent legal advice, preferably from a competent information and data specialist lawyer. Having said that, I doubt they’ll say much different than I have above :)

This entry was written by JasonD, posted on 09/05/2011 at 7:59 pm, filed under Everything.



I may have a mouth the size of the Blackwall tunnel….

… but I am awful at using it on my own site.

It’s been quite some time since I last posted anything here and it’s been very very busy. I’ll try to bring everything up to speed over the next day or so.


This entry was written by JasonD, posted on 30/12/2010 at 10:57 pm, filed under Everything.



SEO Job Vacancy – Come work with me and “The Lawyer”!!!!

Do you want to work with me? You’re a mad(man/woman) if you do, but you’ll get to learn from one of “the more experienced and proven” Search guys in the business and will see what enterprise level marketing in the most competitive of industries is all about.

I’m not just looking for the best but for people that are passionate about Search.

You need to have SEO experience, be able to show a proven track record and make a great cuppa!

If you fit the bill, get in touch with me and let’s see what we can work out.

Jason

P.S. “The Lawyer!” is part of the office furniture but the upside is he is bloody good to have in an emergency and his tea isn’t too bad either, other than being rarer than hen’s teeth :)

This entry was written by JasonD, posted on 10/06/2009 at 5:16 pm, filed under Everything.



How to build a Brand Based Search Engine

Are these brands?

Let’s presume that I was running a search engine and wanted to deliver results that the majority of searchers are happy with.

  • Would I run a search engine based on internal links?
  • Would I run a search engine based on page content?
  • Would I run a search engine based on external links?
  • Would I run a search engine based on brand?

Hmmm, I think I would do all of the above. The 1st 3 points are all relatively simple to implement, as great people have gone before us and shown us the way to do it, but how on earth do you measure “brand”?

It’s a soft term, something that is hard to define, yet, as a business, you know it when you have it and aspire to it when you don’t. As a consumer you are aware of it when a product has it and quite often shy away from a product if it is missing.

When you search for Cola what would you expect to see?

When you search for cars which would be a “better result” for you?

When you seek information on spectacles what would you like to see?

Maybe “brand” is the wrong term to use, maybe we would be better off defining it as…

“a product or company that is already within our psyche?”

The one thing that Coke, Rola Cola, Aston Martin, Trabant, Rayban and Dame Edna all have in common with each other is that you mention their name and I am already aware of them, I have a mental image and I have pre determined views about them.

But then I have to work out what is the difference between say Rola Cola and Coke? I don’t know about you but I like Coke and could never stand Rola Cola when it was sold over here in the UK yet I am sure there are many people who prefer Rola Cola’s flavour over Coke. Either way you had to become aware of the product, it had to become ingrained in your psyche, it had to become a “brand” for you to make that 1st purchase.

Let’s look a little deeper, let’s go back in time and ask how these companies built their position into our psyches.

  • They were in newspapers
  • They were in magazines
  • They were on TV
  • They were in the cinema
  • They were on billboards
  • They undertook public relations

In essence, everywhere you looked, everywhere you went, whatever you viewed on whatever screen, you saw them, and slowly over time they worked their way into your psyche.

But how do you measure those offline unconnected media and incorporate the data to work WITH online links and content as a measure of “brand”?

The first issue would be to ascertain how many people get to see the messages within the respective media. The second issue would be to recognise what messages and by whom are being pushed forward by the marketers that place the advertising.

Let’s break them down by general category

  • Print Media – All major print based media (at least in the UK, and I am sure similar exist elsewhere in the world) have their readership and sales audited. This delivers real and meaningful data to show the number of eyeballs that see any mentions of a product or company within them.
  • Screen Based Media – These media also have their viewer numbers audited, with the popularity of shows and films being available, enabling price differences for advertisers between a main commercial channel with a popular show, say ITV during Britain’s Got Talent, and say Discovery Shed’s Wheeler Dealers.
  • Billboard Advertising – This one is trickier to measure, not so much in how many eyeballs get to see them but in where and what message is on every billboard out there. But let’s presume for a moment that I had lots of resources available to me, and also let’s presume I wanted to deploy a team of cars going around the country with cameras perched atop. If I were trying to take pictures of the whole country I might well do this…. Hold on, wait a minute, hasn’t a household search engine brand done this already? Those multi camera wearing, life spying vehicles will be capturing billboards as well as the streets and the people within them. WOOOHOOO – we have just solved a crucial problem, we now know when and where Billboard advertising changes and we have pictures showing what is on those billboards each time they change.

So now we know who gets to see marketing messages; we simply don’t know what messages are out there.

OCR is your friend and so are subtitles (closed captioning) within TV programmes.

TV and Cinema are nothing but a stream of images and sound. Subtitling makes life even easier as raw text data is present and accessible. Print media is nothing but images, and optical character recognition is pretty damn good.

Let me ask you, isn’t it possible to put your theoretically infinite and scalable resources into running a server farm to process all these images and closed captions looking for mentions of web properties?

When THE becomes WTF

I admit that OCR isn’t perfect, you sometimes get false positives (WTF or THE?), but overall it is pretty damn good and, more importantly, I believe it is more than “good enough”. Perfection may be sought but good enough is exactly that: it delivers a solution to a problem that is accurate enough to deliver acceptable data for our goals. It is “Good Enough”.
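To show what “looking for mentions of web properties” might involve once you have the raw text, here is a minimal Python sketch that counts things that look like domain names in a blob of OCR output or closed captions. The pattern, the short list of domain endings and the function name are all my own assumptions, and it is deliberately “good enough” rather than perfect.

```python
import re
from collections import Counter

# Hypothetical pattern: an optional "www." followed by a name and one of a few
# common endings. A real system would need a far longer list of endings.
DOMAIN_PATTERN = re.compile(
    r"\b(?:www\.)?([a-z0-9-]+\.(?:co\.uk|com|org|net))\b",
    re.IGNORECASE,
)


def count_web_mentions(text):
    """Count mentions of each domain-like string in OCR or caption text."""
    return Counter(match.group(1).lower() for match in DOMAIN_PATTERN.finditer(text))
```

For example, feeding it the made-up caption `"Visit www.Coke.com now. COKE.COM is it! See rolacola.co.uk"` would count two mentions of `coke.com` and one of `rolacola.co.uk`.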

So let’s recap, in theory ……

  • We can be aware of the number of people that view offline media
  • We can be aware of mentions of web properties and “brands” in offline media

But how do we incorporate that raw data so that it can deliver a Brand or Inner Psyche based search engine?

Let’s work the simple angle 1st.

Viewers * Brand Mentions = Psyche Score

Quite simply we could combine the onpage analysis engines to ascertain the theme of a page & site, then rank them according to the Psyche Score. In this manner Coke would rank above Rola Cola and Dame Edna would always come below RayBan sunglasses.
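As a quick sketch of that simple angle, the Python below multiplies viewers by brand mentions and sorts brands by the result. The function names are hypothetical and any figures you feed it would be invented for illustration, not real audience data.

```python
def psyche_score(viewers, brand_mentions):
    """Psyche Score as defined above: Viewers * Brand Mentions."""
    return viewers * brand_mentions


def rank_by_psyche(brands):
    """Rank brand names by Psyche Score, highest first.

    `brands` maps a brand name to a (viewers, brand_mentions) pair.
    """
    return sorted(brands, key=lambda name: psyche_score(*brands[name]), reverse=True)
```

With made-up inputs such as `{"Rola Cola": (10000, 2), "Coke": (1000000, 50)}` this would put Coke first, which is the behaviour described above.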

The next challenge would be to integrate this with link based algorithms. We would want to do this as we know they work. Yup, they’ve been spammed to buggery and back but overall, spam aside, they work and they work extremely well.

Let’s presume that we have lots and lots and lots and lots of user data. We could gather this via advertising programmes, toolbar usage, clickstream data based on analytics or any number of other data collection methods, including running an existing and large search engine. In theory this would allow us to see if there is a correlation between user activity and Psyche mining (Brand) related advertising and touchpoint opportunities (I had to get one bullshit bingo phrase in this post!)

If there was a correlation we would know the power of offline media in comparison to clickthroughs from search, specific search phrases, time on page and many other metrics.

Couldn’t we then weight the power of the offline media accordingly?

Shirley Crabtree - The Big Daddy !

Do you think the following might be worth testing?

Viewers * Brand Mentions = Psyche Score

Ranking Score = Link Score ^ (Psyche Score * Industry Related Offline Media Weighting)

I believe that this would ensure that larger businesses, essentially BRANDS that have successfully made it into our inner psyche, rank extremely well, and would reduce the opportunity for small businesses who operate and market almost exclusively online to compete with the “big daddies” of their industry.

It would ensure that, as long as there was some relevant content on page, any Big Daddy domain name would be able to rank for almost anything. It would also reduce the amount of visible spam as most Big Daddy domain names don’t directly spam. The link spam and content spam would still be there; it would simply be invisible to the searcher as so few click past the 1st or 2nd page of results.
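To see the “Big Daddy” effect in numbers, here is a minimal Python sketch of the Ranking Score formula. One loud assumption: the Psyche Score passed in has already been scaled down to a small number (somewhere around 0 to a few), because raising a link score to the power of a raw Viewers * Brand Mentions figure would produce absurdly large values. That scaling step, the function name and any example values are mine, not part of the formula above.

```python
def ranking_score(link_score, psyche, media_weighting):
    """Ranking Score = Link Score ^ (Psyche Score * Industry Related Offline Media Weighting).

    `psyche` is assumed to be a pre-scaled Psyche Score (a small number),
    not the raw Viewers * Brand Mentions product.
    """
    return link_score ** (psyche * media_weighting)
```

With invented values, a brand with a link score of 10 and a scaled Psyche Score of 2.0 gets a ranking score of 100, while an online-only business with twice the link score (20) but a scaled Psyche Score of 0.5 gets roughly 4.5, so the brand wins comfortably despite having fewer links.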

So let’s recap over what would be required to implement this…..

  • A large page scanning operation to digitise print based media
  • A large base of camera equipped vehicles to capture billboard advertising
  • A large server processing pool to work your way through TV, Cinema, print images and subtitles
  • A great team of thinkers to convert the raw data to meaningful search results

Hmmmm, who do you know that might have that?

This entry was written by JasonD, posted on 29/05/2009 at 1:06 pm, filed under Everything.



The Guardian Newspaper (and me) on Comment Spam

I am pleased to say that in today’s (Thursday 28th May 2009) Guardian Newspaper (print and online editions) I am interviewed regarding a comment spam run by a firm of solicitors.

It highlights the dangers of working with a search marketing company that does not FULLY explain risks – It’s an interesting read and great work by Michael Pollit, the journo who wrote the piece. Check it out at the Guardian website where you will find Are Comment Links Just a Form of Spam

This entry was written by JasonD, posted on 28/05/2009 at 10:39 am, filed under Everything.



Anthropological Search Algorithm

In December 2004 I wrote a paper highlighting how to understand differing human groups. We are nearly 5 years on, yet I believe that now, more than ever, it requires revisiting.
It’s called Widget Chav :)

Widget Chav
Anthropologic Analysis Based on Neck Colour
Wayne & Waynetta Slob
Council Housed, Tower Hamlets London, UK
(Current Address: Private Rented Accommodation, Harold Hill, Essex, UK)
chavs@majorsearchengine.com
Consultant Anthropologist – Dr Onslo Lardarse PHD

Abstract:
In response to a query a search engine returns a ranked list of documents. If the query is broad (i.e., it matches many documents) then the returned list is usually too long to view fully. Studies show that users usually look at only the top 10 to 20 results. In this paper, we propose a novel ranking scheme for users. By identifying the social demographics that a user exists within, a website will be able to tailor the results to those that statistically the user will most want to see, delivering a rewarding experience for both the user and the website operator.

1 Introduction
Hence the ranking tends to be poor and search services have turned to other sources of information besides content to rank results. We next describe some of these ranking strategies, followed by our new approach to authoritative ranking – which we call Widget Chav.
1.1 Related Work
Three approaches to improve the authoritativeness of ranked results have been taken in the past:
Ranking Based on Human Classification: Human editors have been used by companies such as Boo hoo! and Mining for Coal Company to manually associate a set of peoples and individuals with a subset of humanity in the world. These are then matched against the user’s query and visual “gut feeling” to return valid matches. The trouble with this approach is that: (a) it is slow and can only be applied to a small number of people, and (b) often the classes and classifications assigned by the human judges are inadequate or incomplete. Given the rate at which the human race is growing and the wide variation in international user groups this is not a comprehensive solution.
Ranking Based on Usage Information: Some services such as Almost Hit collect information on: (a) the surfing individual users undertake within search services and (b) the pages they look at subsequently and the time spent on each page. This information is used to return pages that most users visit after deploying the given query. For this technique to succeed a large amount of data needs to be collected for each query. Thus, the potential set of queries on which this technique applies is small. Also, this technique is open to spamming.
Ranking Based on Connectivity: This approach involves analysing the links between people on the assumption that: (a) people in the same demographics are linked to each other, and (b) authoritative people tend to know other authoritative people.
PeopleRank [Foliate et al 98] is an algorithm to rank people based on assumption b. It computes a query-independent authority score for every person online and uses this score to rank the result set. Since PeopleRank is query-independent it cannot by itself distinguish between people that are authoritative in general and people that are authoritative in the target demographic. In particular a group of people that are authoritative in general may contain a person that matches a certain demographic but is not an authority on the topic of the chavness. In particular, such a person may not be considered valuable within the community of users who breed people within the  ghetto of the userbase.
An alternative to PeopleRank is People Distillation [Also known as Ethnic cleansing, Hitler 1939-1945, Idi Amin 1925 –2003, et al]. People distillation first computes a query specific subgraph of the race. This is done by including people on the query demographic in the graph and ignoring people not in the demographic. Then the algorithm computes a score for every person in the subgraph based on known connectivity: every person is given an authority score. This score is computed by summing the weights of all incoming connections to the person. For each such reference, its weight is computed by evaluating how good a source of relationships the referring person is. Unlike PeopleRank, People Distillation is only applicable to broad demographics, since it requires the presence of a community of people in the group.
A problem with People Distillation is that computing the subgraph of the race which is on the query group is hard to do in real-time. In the ideal case every person in the race that deals with the query group would need to be considered. In practice an approximation is used. A preliminary ranking for the query is done with group analysis. The top ranked result people for the query are selected. This creates a selected set. Then, some of the people within one or two links from the selected set are also added to the selected set if they are on the query topic. This approach can fail because it is dependent on the comprehensiveness of the selected set for success. A highly relevant and authoritative person may be omitted from the ranking by this scheme if it either did not appear in the initial selected set, or some of the people pointing to it were not added to the selected set. A “focused crawling” procedure to crawl the entire race to find the complete subgraph on the query’s topic has been proposed [Shak et al 99] but this is too slow for online searching. Also, the overhead in computing the full subgraph for the query is not warranted since users only care about the top ranked results.
1.2 Widget Chav Algorithm Overview
Our approach is based on the same assumptions as the other connectivity algorithms, namely that the number and quality of the sources related to a person are a good measure of the person’s quality. The key difference consists in the fact that we are only considering “expert” sources – criteria that have been analysed as having specific purpose of being used collectively by those with a tinge of red in their necks. In response to a new user visit, we first compute a list of the most relevant experts for the potential person. Then, we identify relevant links within the selected set of experts, and follow them to identify neck colour. The targets are then ranked according to the number and relevance of non-affiliated experts that point to them. Thus, the score of a target person reflects the collective opinion of the best independent experts on the query topic. When such a pool of experts is not available, Widget Chav provides no results. Thus, Widget Chav is tuned for result accuracy and not query coverage.
Our algorithm consists of multiple broad phases:
(i) Expert Lookup
We define an expert person as a page that is about a certain topic and has links to many non-affiliated people on that topic. Two people are non-affiliated conceptually if they are from non-affiliated organizations. In a pre-processing step, a subset of the people crawled by a search engine are identified as experts. In our experiment we classified 2.5 million of the 140 million or so pages in “AstaLaVista Baby’s” index to be experts. The pages in this subset are indexed in a special inverted index.
Given an input query, a lookup is done on the expert-index to find and rank matching expert people. This phase computes the best expert people on the query topic as well as associated match information.
(ii) Target Ranking
We believe a person is an authority on the query group if and only if some of the best experts on the query group point to it. Of course in practice some expert people may be experts on a broader or related demographic. If so, only a subset of the relations on the expert person may be relevant. In such cases the links being considered have to be carefully chosen to ensure that their qualifying relationship matches the query. By combining relevant out-links from many experts on the query topic we can find the pages that are most highly regarded by the community of people related to the query topic. This is the basis of the high relevance that our algorithm delivers.
Given the top ranked matching expert-people and associated match information, we select a subset of the links within the expert peoples. Specifically, we select links that we know to have all the query social groups associated with them. This implies that the link matches the query. With further connectivity analysis on the selected relationships we identify a subset of their targets as the top-ranked people on the query topic. The targets we identify are those that are linked to by at least two non-affiliated expert persons on the topic. The targets are ranked by a ranking score which is computed by combining the scores of the experts showing a relationship to the target.
1.3 Roadmap
The rest of the paper is organized as follows: Section 2 describes the selection and indexing of experts; Section 3 provides a detailed description of the ranking scheme used in query processing; Section 4 presents a user-based evaluation of our prototype implementation; and Section 5 concludes the paper.
2 Expert People
Broad subjects are well represented in life and as such are also likely to have numerous human-generated lists of resources. There is value for the individual or organization that creates resource lists on specific groups since this boosts their popularity and influence within the community interested in the topic. The authors of these lists thus have an incentive to make their lists as comprehensive and up to date as possible. We regard these links as recommendations, and the pages that contain them, as experts. The problem is, how can we distinguish an expert from other types of people? In other words, what makes a person an expert? We felt that an expert person needs to be objective and diverse: that is, its recommendations should be unbiased and point to numerous non-affiliated people on the subject. Therefore, in order to find the experts, we needed to detect when two people belong to the same or related organizations.
2.1 Detecting Location Affiliation
We define two people as affiliated if one or both of the following is true:
•    They share the same first 5 octets of their longitude and latitude coordinates.
•    The road’s non-generic token in the address is the same.
We consider tokens to be substrings of the address delimited by “.”  (period) or “,” (comma). A suffix of the address is considered generic if it is a sequence of tokens that occur in a large number of distinct hosts. E.g., “Dagenham” and “Louisiana” are names that occur in a large number of our sample set for Chav detection and are hence generic suffixes. Given two locations, if the generic suffix in each case is removed and the subsequent right-most token is the same, we consider them to be affiliated.
E.g., in comparing “22 Acacia Avenue, Dagenham” and “76 Acacia Avenue, Texas” we ignore the generic suffixes “Dagenham” and “Texas” and the house numbers “22” and “76” respectively. The remaining token is “Acacia Avenue”, which is the same in both cases. Hence they are considered to be affiliated. Optionally, we could require the generic suffix to be the same in both cases.
The affiliation relation is transitive: if A and B are affiliated and B and C are affiliated then we take A and C to be affiliated even if there is no direct evidence of the fact. In practice some non-affiliated locations may be classified as affiliated, but that is acceptable since this relation is intended to be conservative.
In a preprocessing step we construct a location-affiliation lookup. Using a union-find algorithm we group locations, that either share the same rightmost non-generic suffix or have an IP address in common, into sets. Every set is given a unique identifier (e.g., the location with the lexicographically lowest name). The location-affiliation lookup maps every location to its set identifier or to itself (when there is no set). This is used to compare locations. If the lookup maps two locations to the same value then they are affiliated; otherwise they are non-affiliated.
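The lookup described above can be sketched with a small union-find. This is only an illustrative sketch, not the implementation behind the paper: the function names, the input format (a list of location pairs already judged affiliated) and the sample names in the usage note are assumptions.

```python
def build_affiliation_lookup(locations, affiliated_pairs):
    """Map every location to the identifier of its affiliation set.

    locations: iterable of location names (hypothetical input format).
    affiliated_pairs: pairs of locations already judged affiliated.
    The set identifier is the lexicographically lowest name in the set.
    """
    parent = {loc: loc for loc in locations}

    def find(loc):
        # Walk up to the root, shortening the path along the way.
        while parent[loc] != loc:
            parent[loc] = parent[parent[loc]]
            loc = parent[loc]
        return loc

    for a, b in affiliated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically lowest name as the root.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    # A location with no affiliations simply maps to itself.
    return {loc: find(loc) for loc in parent}


def are_affiliated(lookup, a, b):
    """Two locations are affiliated when they map to the same identifier."""
    return lookup[a] == lookup[b]
```

Because sets are merged rather than compared pairwise, the transitive rule (A with B, B with C, therefore A with C) falls out of the data structure without extra work.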
2.2 Selecting the Experts
In this step we process a search engine’s database of people (we used AstaLaVista Baby’s crawl from April 1999) and select a subset of people which we consider to be good sources of relations on specific demographics. In this instance we have aimed to identify CHAV experts for the following reasons.

CHAVs as a social group are the international holy grail of internet marketing.

Consider the following CHAV stereotype we shall call Sharon.

A fat, ugly, smelly, single, unloved, unintelligent, indebted, gambler woman.

Sharon, would wish to be marketed to by those that sell (as a minimum)

•    Diet pharmaceuticals
•    Personal hygiene pharmaceuticals
•    Dating Services
•    Marital Aids
•    Soft to Medium Pornography
•    Sexual pharmaceuticals
•    Educational material for internet marketing (“How to make a £million in your nightgown” type products)
•    Unsecured Loans
•    Consolidation Loans
•    Debt Management Services
•    Online Casino services
•    Online Bookmaker services

For these reasons and the financial benefits that advertisers would gain from targeting Sharon, we chose the CHAV social group as our target and tailored the final algorithm to include data that is available to our sponsors, which may not be available to other competing search engines.

This is done as follows:
Considering all people with out-degree greater than a threshold, k (e.g., k=5), we test to see if these people point to k distinct non-affiliated locations. Every such person is considered an expert page.
If a broad classification (such as Casuals, Snobs, Drunks etc.) is known for every page in the search engine database then we can additionally require that most of the k non-affiliated persons discovered in the previous step point to people that share the same broad classification. This allows us to distinguish between random collections of links and resourceful, well connected people. Other properties of the person such as level in education can be used as well.
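The basic selection test (without the optional broad-classification check) might look like the sketch below. The data layout is an assumption: `out_links`, `location_of` and `lookup` are hypothetical dictionaries, and `lookup` is taken to be the location-affiliation lookup from Section 2.1.

```python
def select_experts(out_links, location_of, lookup, k=5):
    """Return the people who qualify as experts.

    out_links: dict mapping a person to the list of people they point to.
    location_of: dict mapping a person to a location name.
    lookup: dict mapping a location to its affiliation-set identifier;
            locations missing from it are treated as their own set.
    """
    experts = []
    for person, targets in out_links.items():
        if len(targets) <= k:
            continue  # out-degree must be greater than the threshold k
        # Collapse affiliated locations into one group before counting.
        groups = set()
        for target in targets:
            loc = location_of[target]
            groups.add(lookup.get(loc, loc))
        if len(groups) >= k:
            experts.append(person)
    return experts
```

Counting affiliation sets rather than raw locations is what keeps a long list of links to one neighbourhood from passing as a diverse list.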
2.3 Indexing the Experts
To locate expert people that match user groups we create an inverted index to map groups to experts in which they occur. In doing so we only index people contained within “key stereotypes” of the expert. A key stereotype is a piece of text that qualifies one or more people in the group. Every key stereotype has a scope within the group format. People located within the scope of a group are said to be “qualified” by it. For example, the Burberry, Piercings, decrepit cars and location score within the expert group are considered key stereotypes.
The inverted index is organized as a list of match positions within experts. Each match position corresponds to an occurrence of a certain stereotype within a key topic of a certain expert group. All match positions for a given expert occur in sequence for a given type. At every match position we also store:
1.    An identifier to identify the type uniquely within the document
2.    A code to denote the kind of type it is
3.    The offset of the word within the type.
In addition, for every expert we maintain the list of people within it (as indexes into a global list of race) and for each person we maintain the identifiers of the key types that qualify it.
To avoid giving long key types an advantage, the number of keywords within any key type is limited (e.g., to 32).
3 Query Processing
In response to a user query, we first determine a list of N experts that are the most relevant for that query. E.g. N = 200 in our experiment. Then, we rank results by selectively following the relevant links from these experts and assigning an authority score to each such page. In this section we describe how the expert and authority scores are computed.
3.1 Computing the Expert Score
For an expert to be useful in response to a query, the minimum requirement is that there is at least one person which contains all the query types in the key phrases that qualify it. A fast approximation is to require all query types to occur in the group. Furthermore, we assign to each candidate expert a score reflecting the number and importance of the key types that contain the query, as well as the degree to which these match the query.
Thus, we compute the score of an expert as a 3-tuple of the form (S0, S1, S2). Let k be the number of terms in the input query, q. The component Si of the score is computed by considering only key types that contain precisely k – i of the query terms. E.g., S0 is the score computed from phrases containing all the query terms.
Si = SUM{key phrases p with k – i query terms} KnockedLevelScore(p) * FullnessPlumpFactor(p, q)
KnockedLevelScore(p) is a score assigned to the phrase by virtue of the type of phrase it is.
FullnessPlumpFactor(p, q) is a measure of the number of terms in p covered by the terms in q. Let plen be the length of p. Let m be the number of terms in p which are not in q (i.e., surplus terms in the phrase). Then, FullnessPlumpFactor(p, q) is computed as follows:
•    If m <= 2, FullnessPlumpFactor(p, q) = 1
•    If m > 2, FullnessPlumpFactor(p, q) = 1 – (m – 2) / plen
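As a quick illustration, the two cases above translate directly into a few lines of Python. The function name and the representation of a phrase as a list of terms are assumptions made for this sketch.

```python
def fullness_plump_factor(phrase_terms, query_terms):
    """FullnessPlumpFactor(p, q) for a phrase p and query q given as term lists."""
    plen = len(phrase_terms)
    query = set(query_terms)
    # m counts the surplus terms: those in the phrase but not in the query.
    m = sum(1 for term in phrase_terms if term not in query)
    if m <= 2:
        return 1.0
    return 1.0 - (m - 2) / plen
```

So a phrase with up to two surplus terms is not penalised at all, and the penalty then grows with every further surplus term relative to the phrase length.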
Our goal is to prefer experts that match all of the query types over experts that match all but one of the keywords, and so on. Hence we rank experts first by S0. We break ties by S1 and further ties by S2. The score of each expert is converted to a scalar by the weighted summation of the three components:
Expert_Score = 2^32 * S0 + 2^16 * S1 + S2.
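Reading the two weights as powers of two (2^32 and 2^16), the conversion to a scalar can be sketched as below. The function name is an assumption. Note also that the scalar only reproduces the "S0 first, then S1, then S2" ordering while S1 and S2 each stay below 2^16; that bound is an observation about the arithmetic, not something the text states.

```python
def expert_score(s0, s1, s2):
    """Collapse the (S0, S1, S2) tuple into one number, S0 weighted highest."""
    return 2**32 * s0 + 2**16 * s1 + s2


# Ranking by the scalar agrees with ranking by the tuple for small components.
candidates = [(0, 5, 9), (1, 0, 0), (0, 5, 10)]
by_scalar = sorted(candidates, key=lambda s: expert_score(*s), reverse=True)
by_tuple = sorted(candidates, reverse=True)
```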

3.2 Computing the Target Score
We consider the top N experts by the ranking from the previous step (e.g., the top 200) and examine the pages they point to. These are called targets. It is from this set of targets that we select top ranked people. For a target to be considered it must be pointed to by at least 2 experts in locations that are mutually non-affiliated and are not affiliated to the target. For all targets that qualify we compute a target score reflecting both the number and relevance of the experts pointing to it and the relevance of the phrases qualifying the links.
The target score is computed in three steps:
1.    For every expert E that points to target T we draw a directed edge (E,T). Anchor text qualifies the edge corresponding to the hyperlink. For each query keyword w, let occ(w, T) be the number of distinct key phrases in E that contain w and qualify the edge (E,T). We define an “edge score” for the edge (E,T), represented by Edge_Score(E,T), which is computed thus:
•    If occ(w, T) is 0 for any query keyword then the Edge_Score(E,T) = 0.
•    Otherwise, Edge_Score(E,T) = Expert_Score(E) * Sum{query keywords w} occ(w, T)
2.    We next check for affiliations between experts that point to the same target. If two affiliated experts have edges to the same target T, we then discard one of the two edges. Specifically, we discard the edge which has the lower Edge_Score of the two.
3.    To compute the Target_Score of a target we sum the Edge_Scores of all edges incident on it.
The list of targets is ranked by Target_Score. Optionally, this list can be filtered by testing if the query keywords are present in the targets. Optionally, we can match the query keywords against each target to compute a Match_Score using content analysis, and combine the Target_Score with the Match_Score before ranking the targets.
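The three steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function names and the tuple layout of `edges` are invented here, `lookup` stands in for the affiliation lookup from Section 2.1, and the check that experts are not affiliated with the target itself is left out.

```python
def edge_score(expert_score, occ):
    """occ: dict mapping each query keyword w to occ(w, T) for this edge."""
    if any(count == 0 for count in occ.values()):
        return 0  # every query keyword must qualify the edge
    return expert_score * sum(occ.values())


def target_scores(edges, lookup):
    """Rank targets by the sum of their surviving edge scores.

    edges: list of (expert, target, score) tuples, score from edge_score.
    lookup: dict mapping an expert to its affiliation-set identifier;
            experts missing from it are treated as their own set.
    """
    best = {}
    for expert, target, score in edges:
        key = (target, lookup.get(expert, expert))
        # Among affiliated experts pointing at one target, keep the best edge.
        if score > best.get(key, 0):
            best[key] = score

    totals, supporters = {}, {}
    for (target, _group), score in best.items():
        totals[target] = totals.get(target, 0) + score
        supporters[target] = supporters.get(target, 0) + 1

    # A target needs at least two mutually non-affiliated experts.
    ranked = [(t, s) for t, s in totals.items() if supporters[t] >= 2]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Keeping only the strongest edge per affiliation set is what stops a cluster of related experts from voting for the same target several times over.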

4 Evaluation
In order to evaluate our prototype search engine, we conducted two user studies aiming to estimate the recall and precision. Both experiments also involved three other search engines, namely AstaLaVistaBaby, AlmostHit and Googly, for comparison and were done in August 2005. Note that the current rankings by these engines may differ.
4.1 Locating Specific Popular Demographics
For the first experiment we asked seven volunteers to suggest the demographics of workers of ten organisations of their choice (companies, universities, stores, etc.). Some of the queries are reproduced in the table below:

Alpha Phi Omega    Best Buy    Digital    Disneyland
http://www.dollarbankaccount.com/    Grouplens    Fords    Keebler
Mountain View Public Library    Macy’s    Minneapolis City Pages    Moscow Aviation Institute
MENSA    OCDE    ONU    Pittsburg Steelers
Pizza Hut    McDonalds    SONY    Safeway
Barking Shopping Center    Trek Bicycle    USTA    Vanguard Investments
The same query was sent to all four search engines. We assume that there is exactly one over riding worker demographic in each case. Every time the demographic was found within the first ten results, its rank was recorded. Figure 2 summarizes the average recall for the ranks 1 to 10 for each of the four engines: our engine Widget Chav (WC), Googly (GG), AstaLaVistaBaby (AV), and AlmostHit (DH). Average recall at rank k for this experiment is the probability of finding the desired worker demographic within the first k results.
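Average recall at rank k, as defined here, is simply the fraction of queries whose desired result turned up at rank k or better. A minimal sketch, assuming the recorded ranks are held in a list with None for "not found in the first ten" (that layout and the function name are assumptions):

```python
def average_recall_at_k(ranks, k):
    """ranks: one entry per query, the 1-based rank at which the desired
    result was found, or None if it was not among the returned results."""
    found = sum(1 for r in ranks if r is not None and r <= k)
    return found / len(ranks)
```

With recorded ranks of 1, 3, not found and 2 for four queries, the recall is 0.25 at rank 1 and 0.75 at rank 3.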

Figure 2. Average Recall vs. Rank

Our engine performed well on these queries. Thus, for about 87% of the queries, WidgetChav returned the desired page as the first result, comparable with Googly at 80% of the queries, while AlmostHit and AstaLaVistaBaby succeeded at rank 1 only in 43% and 20% of the cases, respectively. As we look at more results, the average recall increases to 100% for Googly, 97% for WidgetChav, 83% for AlmostHit, and 30% for AstaLaVistaBaby.
4.2 Gathering Relevant People
In order to estimate WidgetChav’s ability to generate a good first page of results for broad queries, we asked our volunteers to think of broad topics (i.e., topics for which it is likely that many good groups exist) and formulate queries. We collected 25 such queries, listed below:

Aerosmith    Amsterdam    backgrounds    chess    dictionary
fashion    freeware    FTP search    Godzilla    Grand Theft Auto
greeting cards    Jennifer Love Hewitt    Las Vegas    Louvre    Madonna
MEDLINE    MIDI    newspapers    Paris    people search
real audio    software    Starr report    tennis    UFO
We then used a script to spam each query to all four search engines and collect the top 10 results from each engine, recording for each result the demographic group, the rank, and the engine that found it. We needed to determine which of the results were relevant in an unbiased manner. For each query we generated the list of unique groups in the union of the results from all engines. This list was then presented to a judge in a random order, without any information about the rank of each page or its originating engine. The judge rated each page for relevance to the given query on a binary scale (1 = “good page on the topic”, 0 = “not relevant or not found”). Then, another script combined these ratings with the information about provenance and rank and computed the average precision at rank k (for k = 1, 5, and 10). The results are summarized in Figure 3.
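The final averaging step could look like the sketch below. It assumes the judge's binary ratings have already been joined back to one engine's results in rank order, and it treats a missing result as not relevant (dividing by k regardless), in line with the "not relevant or not found" rating. The function name and input layout are assumptions.

```python
def average_precision_at_k(ratings_per_query, k):
    """ratings_per_query: for each query, the binary ratings (1 or 0)
    of one engine's results, listed in rank order."""
    per_query = [sum(ratings[:k]) / k for ratings in ratings_per_query]
    return sum(per_query) / len(per_query)
```

Running it once per engine for k = 1, 5 and 10 gives the kind of per-engine figures that the precision comparison is built from.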

Figure 3. Average Precision at Rank k
These results indicate that for broad subjects our engine returns a large percentage of highly relevant pages among the ten best ranked pages, comparable with Googly and AlmostHit, and better than AstaLaVistaBaby. At rank 1 both WidgetChav and AlmostHit have an average precision of 0.92. Average precision at 10 for WidgetChav was 0.77, roughly equal to the best search engine, namely Googly, with a precision of 0.79 at rank 10.
5 Conclusions
We described a new ranking algorithm for broad queries called WidgetChav and the implementation of a search engine based on it. Given a broad query WidgetChav generates a list of target groups which are likely to be very authoritative on the topic of the query. This is by virtue of the fact that they are highly valued by people in the target demographic which address the topic of the query. In computing the usefulness of a target person from the hyperlinks pointing to it, we only consider links originating from persons that seem to be experts. Experts in our definition are those with links pointing to many non-affiliated individuals. This is an indication that these people were created for the purpose of recreational procreation, and hence we regard their opinion as valuable. Additionally, in computing the level of relevance, we require a match between the query and the type on the expert which qualifies the link being considered. This ensures that links being considered are on the query topic. For further accuracy, we require that at least 2 non-affiliated experts point to the person with relevant qualifying stereotypes describing their linkage. The result of the steps described above is to generate a listing of people that are highly relevant to the user’s query and of high quality.
WidgetChav most resembles the connectivity techniques, PeopleRank and People Distillation. Unlike PeopleRank our technique is a dynamic one and considers connectivity in a graph specifically about the query group. Hence, it can evaluate relevance of content from the point of view of the community of authors interested in the demographic. Unlike People Distillation we enumerate and consider all good experts on the subject and correspondingly all good target people on the subject. In order to find the most relevant experts we use a custom stereotype-based approach, focusing only on the groups that best capture the domain of expertise. Then, in following links, we boost the score of those targets whose qualifying information best matches the query. Thus, by combining group and connectivity analysis, we are both more comprehensive and more precise. An important property is that unlike People Distillation approaches, we can prove that if a person does not appear in our output it lacks the connectivity support to justify its inclusion. Thus we are less prone to omit good pages on the topic, which is a problem with People Distillation systems. Also, since we use an index optimized to finding experts, our implementation uses less data than People Distillation and is therefore faster.

In a blind evaluation we found that WidgetChav delivers a high level of relevance given broad queries, and performs comparably to the best of the commercial search engines tested.

We have further added to the expert philosophy by including the following criteria utilising specific data that is available to Googly due to acquisitions or product launches.

•    The Googly toolbar tracks where people go, where they visit and how they got there along with all transactions online.
•    The Googly Desktop Search knows what files you have on your computer and enables full integration into the power of the Operating System.
•    The Googly Photo sharing application knows what you, your family and friends look like due to standard and systematic naming conventions by people
•    The Googly Instant Messenger (not currently used as a Googly product but likely to be launched) knows whom you speak to online and how you speak. An example being Slang used and method of placing words in certain order. A Markov chain can be derived from this data
•    The Googly satellite picture database understands what your house and home look like, along with
•    The Googly email service knows whom you converse with and the relations in those conversations
Our goal, for this test and for the reasons stated previously was to extend upon the WidgetChav search engines and define a Widget Chav Algorithm. The following variables are gathered from Googly data and combine to deliver a method of giving an accurate scoring of our target demographic

•    A= Burberry Score, gathered through WebCam capture and photo data in Photo software and harddrive. N.B. Fake Burberry raises a higher Burberry score than real Burberry but both add to the overall Burberry Score.
•    B = Location Score, based on Address longitude and latitude coordinates. Likelihood of being in a CHAV neighbourhood
•    C= Piercing Score. Certain piercings are more worthy of increasing CHAV Score whereas others reduce it. Multiple face piercings are the sign of a weirdo whereas a belly piercing is an indication of increased chavness
•    D= Technology score. A balance between other factors means that having high technology awareness and owning multiple technical items does not mean that someone is a CHAV. It is only when coupled with other criteria that a CHAV would be identified. It is more important to realise that CHAVs will own every gadget possible (especially the male) as this is why they are in debt
•    E= Smoker score. Pipe or cigar smoking means the person is not a chav whereas cigarette smoking may increase the likelihood
•    F= Mobile SMS thumb related RSI increases chav score
•    G= Ringtone for mobile phone, chart position
•    H= Describes car in N characters. A non CHAV will describe their car as, “A 2005 Ford Mondeo” whereas a CHAV (Especially the male) will extend upon this. “A 1999 Ford Escort RS Turbo, lowered suspension, extra spoiler, as seen in Max Power.,……..”
•    J= People called Uncle in address book (self explanatory)
•    K= Collective age of non running cars in front and rear gardens multiplied by number of vehicles

Using this Googly specific data we can define the likelihood of being a chav by the following Algorithm, Chav Score

Chav Score = ( 100 – G + ( B^3 * C^2 ) * A + K ) / ( E – D )

6 References
Lots of boring stuff that I used to put this document together, although most of it came to me whilst analysing and reanalysing Hilltop related data over the last 12 – 18 months.

I suggest you check out

http://en.wikipedia.org/wiki/Chav

http://www.chavscum.co.uk

and http://www.chavscum.com

which are all excellent CHAV resources.

But seriously for a moment. This is a PARODY and data as well as algorithms are probably incorrect if not plain wrong!!

Although I believe it technically possible for a Google to deliver an anthropologic search engine, which can define groups of people I feel this is unlikely.

This spoof is based upon the Hilltop white paper by Krishna Bharat and designed to show that when you adapt his original white paper and change the relationships from links in web sites to links within groups of people how sinister and dangerous it looks. Under this algorithm almost everyone will be a CHAV.

I understand why Google brought the Hilltop algorithm into place and my thoughts can be clearly read at

http://www.logicdiary.com/2004/03/hilltop-in-plain-english.html and

http://www.logicdiary.com/2004/03/website-families-and-their-death-in.html

but I do agree with many people’s concerns that Hilltop is an over extension of Big Brother online. Whether or not G set out to become the definitive resource online they have become so. With this comes responsibility to not inflict undue strain or hardship on the sites they work with to deliver the overall rankings, the SERPs.

It could be said that Google have changed the face of working online for the so called, Mom and Pop stores.

I don’t know the answer I only know the questions and in the meantime I shall continue to try to understand algorithms in a better manner.

Merry Christmas
Jason Duke

The original Widget Chav paper has been archived here. So what do you think, now that we are nearly 5 years on ?

This entry was written by JasonD, posted on 15/05/2009 at 6:50 pm, filed under Everything.



Why Jay Deiboldt is a Donkey & GoogleSlapper is a sham!

Jay Deboilt the Google Slapper Donkey!

Is Jay Deboilt a Google Slapping Donkey!

Imagine that you are happily spending the evening with your family, the kids are getting ready for bed and you’re looking forward to a nice dinner once the little ones are in the land of nod. Then, out of the blue, a shrill ring pierces the air – it’s your business phone, the number of which very few people have as it is rarely used and is especially for business emergencies.

You jump up, passing your daughter, with her heavy eyes and bottle of warm bed time milk, to your wife so you can grab the phone, expecting all hell to have broken loose.

What follows is an almost verbatim transcript of the conversation that followed:

Me: Hello.

Person : Is that Jason Dyke

I must say that being called a Dyke isn’t the best way to enamour me – I much prefer my real surname of Duke over Dyke every day but let me go on….

Me: Who is this?

Person: I am calling from Jay Deboilt’s office and you are having difficulties making money

I was pissed off with a call starting like this let alone saying I had difficulties making money. I’ll be straight here I could always do with more money but as to difficulties earning it all I can say is my personal tax bill in the 2008/2009 tax year was 6 figures. That’s £ GBP  not $ USD and most of that was paid when £1 was worth $2! Also remember this was TAX, not income and all of that was made due to me taking a website to the very top, around the world for a few single word phrases.

What phrases Jason? I hear you ask – well how does the phrase “poker” grab you? It’s safe to say I have probably “slapped” Google harder than most!

Me: Who is this, where did you get this number from?

Person: You are having trouble making money offline..

Me Interjecting: Please stop talking and tell me where you got this number from

Person: I am from Jay Deboilt’s office and I got your details from our file…. You are having trouble making money..

Me: Woahhhh there. What file?

I was about to remind this “person” that under European Data Protection Laws anyone who keeps personally identifiable information about an individual is required to register with the Information Commissioner’s Office, a UK government department. I live in the UK, am a UK citizen, was born and bred here and I don’t care if someone who has my personally identifiable information without my consent is calling from the US.

Person: The file we have on you that says, You are having trouble making money

Me: Stop talking right now and listen to me for a moment

Person: I don’t like your tone.

Me: You bloody called me and YOU’RE complaining to me?

Person: You need to buy GoogleSlapper

Me: (I literally Laugh out Loud)

Person: Are you being rude to me?

Me: Who are you, what is your name?

Person: Hangs up…..

Now I have no idea if Google Slapper is any good or not (I guess not) nor if Jay Deboilt is truly a donkey but  I do know a few things about Jay Deboilt, his business organisation and his business’ processes.

  • I know that Jay employs donkeys to make cold calls
  • I know that the person that called isn’t clever enough to understand time zones exist
  • I know that Jay’s business employs illegal data gathering
  • I know that Jay’s business needs help with its cold calling sales script
  • I know that Jay’s business may preach online marketing but attempts (badly) offline telesales
  • I know Jay doesn’t research his prospect list.

And most of all I know that Jay Deboilt doesn’t understand much about Google, despite selling a product that purports to dominate and slap Google’s search engine. I say that Jay doesn’t know much about Google as I am relatively confident that if you search for his name he will need some reputation management soon!

UPDATE – Number 1 for Jay Deboilt – http://www.google.com/search?q=jay+deboilt

Yup Jay, you have a reputation management issue!

END UPDATE

Jay, if you read this, you have my number. I suggest you check the time in the UK then if it’s during office hours you give me a call. I will accept your apology!

This entry was written by JasonD, posted on 07/05/2009 at 8:16 pm, filed under Everything.



Amsterdam Affiliate Conference

I’ve just got back from the Amsterdam Affiliate Conference (CAP) and am playing catch up but to keep your interest until I do a more detailed post I thought this might make you smile

(Thanks to Becky Naylor for taking the photo)

To Get the Pot of Gold you can either catch a leprechaun or work with a great SEO!


This entry was written by JasonD, posted on 06/05/2009 at 9:21 am, filed under Everything.



AVA RS5 – A Gadget for Geeks…

I am about to spend £1000 on a computer. A simple little computer, with no screen, no keyboard, no mouse – just a little slot to take DVDs and CDs.

It isn’t REALLY a server although it calls itself one, as it has a pathetic (in server terms) Atom processor. But what is good about it is that it does one job and one job well.

It’s the only machine I have found so far that allows you to stick a DVD or CD into it and then automatically rips the movie or music and streams it across the network. It runs Windows Home Server (Boooooo!!!) but I feel confident I can change that to an operating system worth having (Debian!). But what other physical hardware, if any, exists that rips DVDs and CDs like the AVA RS5, with a front mounted slot, multiple hard disk caddies etc?

This entry was written by JasonD, posted on 24/04/2009 at 8:07 am, filed under Everything.



Urgent Assistance – Are you a graphics dude or dudette?

If you can assist I will be very grateful.

I am many things but a graphic designer I am not. I have put together a PDF for some printing I need done urgently. It is extremely low quality, but it does show the general thought process and layout I require.

I would like someone to take that and urgently turn it around so it’s high quality, print ready. It’s a couple of images and loads of text.
Can you help ?
Thanks

Jason

This entry was written by JasonD, posted on 23/04/2009 at 9:51 am, filed under Everything.


