Thursday, July 26, 2012

Web: A Primer

What is the Web?

“Web” is a collective noun for all the pages that can be viewed in a web browser (Firefox, Chrome, IE) and that have an address in the form of a Uniform Resource Locator, or URL, such as http://www.google.com. The “http” (and, by convention, the “www”) in the URL tell you that this resource or page is part of the “Web”.

The following services are part of the Internet but not part of the Web: VoIP, FTP and SFTP. These services do not use HTTP, and that is what distinguishes them from the Web. The question “Do you have Google access?” really means “Do you have Web access?”. In layman’s terms, “having Internet access” is treated as the same thing as having Web access - that is, being able to connect to the network and use one or all of its services.

The Internet is the name of the global interconnected network - the hardware and software over which the World Wide Web runs. Remember that several non-Web services and applications also run on the Internet.

The World Wide Web is no longer a novelty. It is now used more frequently than newspapers and printed books - for entertainment, news and business. A company’s website is its public face, and it is more influential than the company’s physical aspects (office, brochures, personnel) in converting visitors into customers. People form judgements about, and make financial decisions concerning, a company’s people, products and services based on its website.


How is the Web useful?

A) Bookmarking:
That is, by remembering specific website addresses and going directly to them.

This is somewhat like treating the Web as a newspaper.

When you go to a page, you have to be satisfied with whatever is on it - take it, or leave it - you can’t change it. That is fine for routine tasks like the daily news, today’s crossword, jokes, horoscopes, etc.

B) Searching:
If you are looking for specific, specialized information, and if your information needs change frequently, you have to search the Web for it. Enter search engines. To search, it is necessary to communicate with your search engine (which is itself a website) by typing out your thoughts.


Searching uses language, but it is not "natural language" yet - it's getting there. For now, use the Search Rule of Thumb to find what you're looking for. If you cannot find something you're looking for among 7.8 billion pages, chances are that you are not asking for it in a way the program understands - or you're an outlier genius who is doing a lot of original work.

Search Rule of Thumb - Search Iteratively


That is, instead of using long-form natural language questions such as "How do cats' eyes work?" (a bad example - that one will probably work!), start by typing in the terms that most broadly define what you want. Then add words and keep refining them until you see the results converge with your mental image of what you are looking for. Of course, natural language query is the holy grail and we’ll have it someday - just not yet.

How do people find your company’s website or your products and services?

By enabling discovery. By preparing and serving your website in the optimal way, or by buying search keywords and showing your offering as a sponsored search result. By building a database of email addresses and sending email, or by forming business partnerships that link back to your site so that the popularity of the partner rubs off on your site.

As of 25 April the Web had 7.84 billion pages, and the number is rising fast. Assuming your website has been made adequately, a user who knows words relevant to your business or product should be able to find it. This is what a search engine does. Google is a search engine; it helps us get what we want from the Web.

Why Search Engine Optimization:

Johann, a middle-school student in Sweden, is struggling to understand how fluid displacement works and what Archimedes' principle is. The textbook explanation does not make sense to him, so he goes to Google and searches. The words and phrases that he uses are critical to the quality of the search results he gets.

Being part of this product group, we know that Exploriments is a provider of the kind of information Johann is looking for, but he has never heard of Exploriments and he does not know its URL - so he cannot key in the URL to get to our site. He can only use natural language to describe what he is looking for.

To be found by Johann, we have to make Exploriments.com appear as a search result on the first page, or at least in the initial pages. This is only possible if we ensure that the web page that deals with our Buoyancy application includes the words and phrases he is likely to search for, and that we present the page in a way that Google can find it, read it, and index it. The technique, process, or method of preparing web pages to do this is called Search Engine Optimization, and it involves several techniques.
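As a purely hypothetical illustration, the head of such a Buoyancy page might be written around the phrases a student like Johann would actually type. The title, description and keywords below are invented for this sketch, not taken from the real site:

<!-- Hypothetical <head> for the Buoyancy page, built around likely search phrases -->
<head>
  <title>Buoyancy and Fluid Displacement - Archimedes' Principle | Exploriments</title>
  <meta name="description"
        content="Interactive simulation showing how fluid displacement works and why
                 objects float or sink, based on Archimedes' principle.">
  <meta name="keywords"
        content="buoyancy, fluid displacement, Archimedes principle, floating, sinking">
</head>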

What is SEO?

Computer programs called robots work tirelessly on behalf of search engines to monitor, track and scan web pages. The robots go through all the text, make what sense they can of it, and store this information in proprietary repositories called indexes, which function much like the index at the back of a book. To understand a search engine robot, imagine a very smart and efficient, but sightless, human being who, after entering your home, will try to locate all your doors and windows, enter every one he can reach, and store every scrap of information he finds.

A search engine robot “sees” only the underlying HTML markup of a page, not its visual design. View the source of an HTML page to understand what the engine sees, and to appreciate how difficult (or easy) you are making it for the robot to find content.
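For example, where a sighted visitor sees a styled, designed page, the robot sees only something like the markup below - text, headings and links, nothing more. The fragment is invented for illustration:

<!-- What the robot "sees": plain markup, not the rendered design (hypothetical fragment) -->
<h1>Buoyancy Simulator</h1>
<p>Explore how fluid displacement determines whether an object floats or sinks.</p>
<a href="/experiments/fluids/">More fluid mechanics experiments</a>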

To a search engine, the visual appearance of a website does not matter. It ranks a page by a) the quality of the content (that is, its adherence to topic, by keyword) and b) the ease of finding that content. If the robot has to traverse four links within a website to reach a piece of content, it ranks that content lower than a piece of content only two clicks away.

Similarly, if the robot sees a keyword halfway down the page, it gives that keyword less importance than one found easily, high up on the page. The ranking is also weighted by how relevant these keywords are to the title of the page and to its meta tags. All these parameters are churned by the secret ranking algorithms until the page rank for a particular term is determined.

All the tweaks you can make to your own web site can be called “Internal SEO”.

“External SEO” is quite a different matter. If more and more users click on a search result and come to a page, its page rank will increase because of its external popularity; this is the downstream aspect of Internal SEO. If one website links to a piece of content on another, the first website’s page rank will rub off on the second. If CNN.com links to my post somewhere, my page will start ranking high for those keywords, even if it was unknown before and no SEO best practices were followed.

Technical teams can control only Internal SEO - it should be done as due diligence. Internal SEO increases the chances of organic discovery through search engines, and the chances of hitting an External SEO jackpot.

For Internal SEO, logical content organisation, ease of navigation, and ease of content discovery are of paramount importance. The main techniques are listed below; a markup sketch illustrating several of them follows the list.

- Logical URL design: every piece of content should be equidistant from the root.
- Page title: the HTML title of a page should be unique and should identify the page accurately. The page title is displayed in the search results; if it does not match the content, the Google rank for the page goes down because the algorithm treats the mismatch as incorrect.
- Meta tags: the meta keywords and meta description should be relevant to the content of the page.
- Image alt attributes should be filled in.
- Logical use of H tags to reflect the content hierarchy.
- Keyword density: how often do the important keywords appear on the page?
- Link farming: avoid having more than 100 links on a single page - it can look fishy.
- Availability of page content as high up in the markup as possible.
- Blindness to frames: a search engine agent (or robot) cannot navigate into a frame, or into a Flash file to index the content inside the binary object. These are lost opportunities.
- Link text is ranked higher if it is relevant to the page being linked to.
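A minimal sketch of a page body that follows several of these points: a single H1 with H2s beneath it, descriptive alt text on images, and link text that says what the target is about. The content and file names are invented for illustration:

<!-- Hypothetical page body illustrating several Internal SEO points -->
<body>
  <h1>Buoyancy and Archimedes' Principle</h1>

  <h2>What is fluid displacement?</h2>
  <p>When an object is placed in a fluid, it pushes aside (displaces) a volume of
     fluid equal to its own submerged volume...</p>

  <!-- Alt text describes the image for robots (and screen readers) -->
  <img src="/images/buoyancy-experiment.png"
       alt="Diagram of a block floating in water, showing the displaced volume">

  <h2>Try the simulation</h2>
  <!-- Descriptive link text, rather than "click here" -->
  <p><a href="/experiments/buoyancy/">Open the interactive buoyancy experiment</a></p>
</body>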

Anatomy of a page view:

When you enter a URL in the browser and hit Enter, or click a link in your search results, your browser is requesting one particular file from the available 7.8 billion. You are most probably asking for the file using its friendly name and not its computer location.

1) Assume that the URL below is the request you made. This is the friendly form, since “developer”, “yahoo”, “performance” and so on are all English words that you can read and understand - the file location is specified in natural language.

http://developer.yahoo.com/performance/rules.html

2) Natural language is fine for human beings, but no computer understands it - for now. The browser needs to spend some time getting from the natural-language location to the IP address location.

For this, the browser first sends the natural-language request to an Internet directory called a DNS server. The DNS server looks up the domain (this part of the address: developer.yahoo.com) and returns the underlying IP address. This typically takes between 20 and 120 ms.

3) The browser takes the IP address from the response and makes another request, this time to that IP address. The IP address corresponds to an actual computer (or a set of load-balanced computers) somewhere in the world. The numbers follow an international convention that all networked computers obey, so each knows where the other is.

4) By default, computers that are part of the www (such as developer.yahoo.com) have a specific program always “listening” for requests.

This program is called the web server. Behind it is a directory structure, like a filing cabinet, in which different files are stored - images, code and data. The web server is a specialised program that can read the request and understand how to assemble the file that was requested. Assembly is required because the requested file is usually not all in one place - images, database information and text all go into the requested page, and they are stored in different locations.

5) The web server puts the page together, wraps it up as HTML so that a browser can handle it, and streams it back to the computer that requested it.

6) A web browser does not receive the visual version of the web page down the wire. It receives only the markup (what you see if you view source) from the responding computer. The browser has to parse each line of the markup, then either send requests over the wire for the referenced files - images, JavaScript files, stylesheets, icons and so on - or take them from the browser cache if they were downloaded recently. When all the files are received, the browser uses its “rendering engine” to compute colors, placements, fonts and images, and displays the web page.
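A simplified, invented example of what comes down the wire: every src or href in the markup below (stylesheet, favicon, image, script) triggers a further request, or a cache lookup, before the page can be fully rendered.

<!-- Simplified response markup: each reference below is another round trip
     (or a cache hit) before the page is complete. File names are invented. -->
<html>
  <head>
    <title>Performance rules</title>
    <link rel="stylesheet" href="/css/site.css">      <!-- request 2 -->
    <link rel="icon" href="/favicon.ico">             <!-- request 3 -->
  </head>
  <body>
    <img src="/images/banner.png" alt="Banner">       <!-- request 4 -->
    <script src="/js/site.js"></script>               <!-- request 5 -->
  </body>
</html>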

Realise that several round trips of data communication, spanning thousands of miles, occur for a single page view - and on a badly built page they can add up to tens of seconds of waiting. Not taking this mechanism into account results in poorly built web pages. Such pages are slow and frustrating, and eventually lose business for the company.

Browser page rendering is an involved area of study, and the constraints below need to be understood to produce the best possible experience for users. They are listed here; a markup sketch applying several of them follows the list -

- Reduce HTTP requests
- Use a CDN for localized delivery
- Reduce the number of unique DNS lookups
- Deal with the restriction on parallel downloads (two per domain)
- Set cache headers
- Compress components
- Put CSS at the top
- Put scripts at the bottom
- Make CSS and JS external
- Minify CSS and JS
- Make AJAX responses cacheable
- Use GET for AJAX requests
- Lazy-load or post-load images below the fold
- Pre-load components that will be needed next, during the current request
- Minimize the number of frames - each one is a separate web page
- Split components across domains
- Reduce the number of DOM elements
- No 404s - each one is a wasted round trip that puts stress on the server as well as wasting front-end time
- Optimize images
- Optimize sprites
- favicon.ico - make it available, small and cacheable; all browsers request it
- Don’t have any empty img src attributes - this is very expensive
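A sketch of a page skeleton that respects several of the items above: external, minified CSS at the top, external scripts at the bottom, a small cacheable favicon, no empty image src, below-the-fold images left to a post-load script, and static assets split onto a separate hostname. The file names, hostname and lazy-loading approach are assumptions for illustration, not a prescription.

<!-- Hypothetical skeleton applying several of the rules above -->
<html>
  <head>
    <title>Example article</title>
    <!-- External, minified CSS at the top so the page can render progressively -->
    <link rel="stylesheet" href="http://static.example.com/css/site.min.css">
    <link rel="icon" href="/favicon.ico">  <!-- present, small, cacheable -->
  </head>
  <body>
    <h1>Example article</h1>
    <!-- Above-the-fold image: optimized, with alt text, never an empty src -->
    <img src="http://static.example.com/images/lead-photo.jpg" alt="Lead photo">

    <!-- Below-the-fold images are injected later by script (one common approach) -->
    <div id="gallery" data-images="gallery-1.jpg,gallery-2.jpg"></div>

    <!-- External, minified scripts at the bottom so they do not block rendering -->
    <script src="http://static.example.com/js/site.min.js"></script>
  </body>
</html>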

It is estimated that about 80% of the time between clicking a link and seeing the page is spent by the browser, fetching the referenced files and making sense of the markup it receives.

It is obvious, then, that smart, efficient markup can make the biggest difference for the end user. Errors in the markup, redundant instructions, incorrect or malformed instructions - all result in wasted time and a web page that performs “badly”.

Usability:

If the principles of page loading and of the Web are applied properly when constructing a page, it will load quickly and respond to subsequent clicks promptly, without making the user impatient.

It is estimated that users need to see signs of a page forming (progressive loading) within 10 seconds, and the full page loaded within 20 seconds. Otherwise they close the page, leave, and simply go on to the next page that responds fast.

Some elements of usability are listed below (a small markup sketch follows the list):

- Pages load in the browser regardless of whether I use Firefox, Chrome or IE.
- Page-load time is within the 10- and 20-second thresholds.
- The color scheme lets me easily read the content.
- There is a site search, so I can search the website for what I want if I cannot locate it using the navigation.
- I am able to view the website on different devices using the same URL I learned.
- If I am visually impaired, I am still able to use my screen reader to navigate the website.
- There is simple, logical navigation that lets me quickly locate the content I am looking for.
- There is a sitemap that gives a roadmap of the site and its complete organisation.
- If I turn off images in my browser to conserve bandwidth, the web page still makes sense and I can use the navigation.
- If I turn off JavaScript, the site still functions - it lets me navigate, search and click the way I need to.
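A small sketch of markup that keeps working when images or JavaScript are turned off: the navigation is plain links, the alt text carries the meaning of the image, and a noscript block offers an alternative to a script-only feature. All names and paths are invented.

<!-- Hypothetical markup that degrades gracefully -->
<ul id="main-nav">
  <!-- Plain links: navigation works with scripts and images turned off -->
  <li><a href="/news/">News</a></li>
  <li><a href="/experiments/">Experiments</a></li>
</ul>

<!-- With images off, the alt text still tells the visitor what this is -->
<img src="/images/buoyancy-chart.png" alt="Chart: buoyant force versus submerged volume">

<!-- With JavaScript off, offer a working alternative -->
<noscript>
  <p>The interactive simulation needs JavaScript.
     <a href="/experiments/buoyancy/text/">Read the text version instead.</a></p>
</noscript>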


Accessibility:

The W3C has laid down guidelines for creating pages in a way that facilitates web browsing for people with special needs. Visually impaired users typically browse with screen readers - software that parses the markup and announces the titles and navigation options out loud; the user then selects an option, and the corresponding link is followed.
If accessibility norms are not followed, your website will be a meaningless jumble to a visually impaired visitor. Happily, though, a website made accessible also becomes a logically structured, easily navigable website - and that makes the search engine robots love it.
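A minimal sketch of the kind of markup a screen reader can work with: a declared language, a skip link, a sensible heading order, and labelled form controls. The details are invented for illustration; the authoritative reference is the W3C's accessibility guidelines.

<!-- Hypothetical accessible fragment -->
<html lang="en">
  <body>
    <!-- Lets a screen-reader user jump past repeated navigation -->
    <a href="#content">Skip to main content</a>

    <h1>Exploriments: Physics Experiments</h1>

    <div id="content">
      <h2>Search the experiments</h2>
      <!-- The label is read aloud, so the user knows what the field is for -->
      <label for="q">Search term</label>
      <input type="text" id="q" name="q">
    </div>
  </body>
</html>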

Metrics:

Business decisions need data to drive them. For pages to provide data, special JavaScript code is added to them that registers a hit in a database every time someone views the page. This is called metrics, and it includes terms like -

pageviews, channel views, uniques, engagement, pathing, exits, bounces, referrals, campaigns, downloads.

Metrics are made possible by adding this special code to the source of every page. Google Analytics, Omniture and HitBox are among the leading analytics providers, and such code is clearly visible in the page source.
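For example, the Google Analytics tracking code of this era looks roughly like the snippet below, pasted into every page; the account ID is a placeholder.

<script type="text/javascript">
  // Queue of tracking commands; the account ID below is a placeholder
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXXX-X']);
  _gaq.push(['_trackPageview']);   // registers one page view

  // Load the analytics library asynchronously so it does not block rendering
  (function() {
    var ga = document.createElement('script');
    ga.type = 'text/javascript';
    ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
             + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(ga, s);
  })();
</script>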


Live examples:

Some live websites with most of the best practices applied:

www.people.com
www.cnn.com
www.time.com

People.com has over 1 billion page views per month. As you have seen, each page view in turn makes hundreds of HTTP calls, and if quality control is not tight this can prove very expensive. Unnecessary or incorrect calls all add to the load on the server and reduce performance for all users at once.

For such high-availability, high-volume websites, some special techniques are used.
1) Load balancing
2) A special heuristic pre-cache layer: statistically popular files and pages are cached in a special RAM layer for faster serving than normal web servers can manage.
3) Pre-caching of content: new articles, and updates to articles, would normally be generated only when a user comes in and requests them. But if a server attempts that in real time while also serving thousands of other pages, it will slow down and eventually fail.

Instead, a separate pre-cache farm is used. Servers artificially request and generate all new and updated pages, then a special process FTPs them to all the servers behind the load balancer. At time.com there were 16 servers behind the load-balanced domain www.time.com.

Test Plan

A good test plan ensures that the important aspects of the webpage have been tested before it goes live.

The important aspects are SEO, usability, accessibility and metrics. If a design template is made available, basic adherence to it can also be tested.

The Big Four of Testing:

1) Browser compatibility (Usability clause)
2) Confirm that the title tag and meta tags are unique and relevant (SEO clause)
3) Google Analytics code is present (Metrics clause)
4) Logical use of H tags (Basic Accessibility and SEO clause)


Lower-priority test cases:

1) CSS and JS are externalized
2) No missing alt attributes
3) No empty img src="" attributes
4) Quality of meta keywords and meta description
5) Site navigation is not broken if images are turned off
6) Site functioning is not broken if JavaScript is turned off
7) The same URL functions on different devices - use tablet, mobile and desktop to compare