Monday, June 2, 2014

Shoppertom.com - Python Issue

Parts of the Shoppertom website are implemented in Python, which is used for some CGI scripts and for build-process automation.

During development I encountered a problem with the urllib.urlretrieve call, which caused a delay of several seconds when retrieving data from remote URLs. The problem was easily fixed by switching to pycurl, following the instructions at http://www.angryobjects.com/2011/10/15/http-with-python-pycurl-by-example/
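For reference, here is a minimal sketch of the pycurl approach, following the pattern in the article above. The function name and timeout values are my own illustrative choices, not Shoppertom's actual code:

```python
import pycurl
from io import BytesIO

def fetch(url, connect_timeout=5, total_timeout=15):
    # Explicit timeouts avoid the multi-second stalls seen with urllib.urlretrieve
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)            # collect the response body
    c.setopt(pycurl.CONNECTTIMEOUT, connect_timeout)
    c.setopt(pycurl.TIMEOUT, total_timeout)
    c.setopt(pycurl.FOLLOWLOCATION, True)      # follow redirects like a browser
    try:
        c.perform()
        return c.getinfo(pycurl.RESPONSE_CODE), buf.getvalue()
    finally:
        c.close()
```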




Wednesday, April 30, 2014

Shoppertom.com - Understanding the Growth and Basic SEO 


In this post I will review the different periods in the four months since the rollout of Shoppertom (a new e-commerce comparison engine). Shoppertom is still small and growing, but we can already see distinct phases of activity in a website launch.

Below you can see a screenshot from Google Analytics (displaying the number of hits to the site per day).



Let's review and explain the different periods of the website's activity.

  1. Shoppertom was launched on 01 Jan 2014. The honeymoon period (Google temporarily ranks a new site's pages very high) lasted 1-2 weeks. See http://www.ecreativeim.com/blog/2011/04/how-long-does-google-punish-a-new-site-google-sandbox/
  2. The sandbox period followed immediately.
    1. During the sandbox period I tried multiple changes to the website and its sitemap, but all pages remained ranked extremely low on Google.
  3. The standard operation period started around 04 March 2014 (roughly two months after the website was launched).
  4. The website stopped getting hits due to a DDoS attack on our DNS provider, which prevented people from reaching it.
  5. Trying to fix the DNS issue, we moved the website to a different DNS provider with a CDN - the CDN reduced our crawl rate. See my previous blog post.
    1. Later, the website was moved to a standard DNS solution, returning it to standard operation.

Tuesday, April 29, 2014

Backlog - Shoppertom.com 

This post contains our backlog.

Pending

  1. Finish better categorization
  2. Add Retail RTI
  3. Add more stores - start with Golf
  4. Add a sitemap by facets
  5. Write an article about crawling with Nutch

  6. Investigate why this product contains the store name in its own name: http://shoppertom.com/products/Business_Source_Ring_Binder_Nordisco_com_/all/all/searchResults
  7. Investigate why DHgate and Newegg product names contain numbers only



Done:

  1. Results are now sorted correctly
  2. Blog now appears in the Shoppertom sitemap
  3. Better-looking pagination
  4. Better exception handling (logging)
  5. Better sitemaps - include the blog in the sitemap and adjust the sitemap's timestamps


Monday, April 28, 2014

ShopperTom.com - Architectural Overview

Shoppertom is an e-commerce comparison site. In order to understand its architecture, we should first understand its business requirements:

1. Collect data from multiple websites daily (using crawling and API methods)
2. Store data in a highly efficient, queryable format for future display
3. Be highly self-maintaining (the team maintaining Shoppertom is small and cannot spend much time on maintenance)
4. Display web pages to users
5. Collect images from websites, create thumbnails from them, and serve them to users


To meet the above business requirements I am using the following tools and technologies.


  1. A heavily modified version of Nutch 1.6 is used to collect the data from the different stores.
    The main reasons for using Nutch are:
    1. Nutch is a mature environment
    2. Nutch has built-in support for plugins
    3. Nutch has built-in support for robots.txt - which is critical for respecting the law and common practice regarding crawling
  2. A MySQL database is used to host all the website crawling metadata and some of the collected crawling data
  3. Solr is used to host the main crawling data
    1. Faceting is used to support the web search categories
    2. Extended DisMax (eDisMax) is used to support complicated query syntax and to return similar results from Solr
  4. A combination of Python scripts and Jenkins is used to run the Nutch jobs and track their progress.
  5. ASP.NET MVC 4 is used to render the web pages displayed to the users
    1. I chose ASP.NET MVC because a simple implementation is easily built and deployed - though the TCO is higher.
  6. Entity Framework and Hibernate are used to query data from the MySQL database (from C# and Java respectively)
  7. Image collection is done in two ways:
    1. Some images are collected automatically using multiple Nutch plugins
    2. Some images are collected by a CGI script
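To give a flavor of the faceted eDisMax queries mentioned in item 3, here is a minimal sketch of how such a Solr query URL could be built from Python. The field names, core name, and base URL are hypothetical; Shoppertom's actual schema is not shown here:

```python
from urllib.parse import urlencode

def build_facet_query(base_url, text, category_field="category", rows=20):
    # eDisMax search with faceting on the category field (field names
    # are illustrative, not the real schema)
    params = {
        "q": text,
        "defType": "edismax",        # extended DisMax query parser
        "qf": "name^2 description",  # boost matches in the product name
        "facet": "true",
        "facet.field": category_field,
        "facet.mincount": 1,         # hide empty categories
        "rows": rows,
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)
```

A query like `build_facet_query("http://localhost:8983/solr/products", "golf clubs")` would then return both the matching products and the per-category counts used to render the search categories.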

In future posts I will elaborate on some of these technologies (while still keeping some of the IP confidential).



DNS issues and SEO

Shoppertom.com is an e-commerce comparison site. In this blog I review the technical challenges in setting it up.

Correct DNS setup is critical to your website. In March 2014 I encountered downtime in Shoppertom's availability.

Debugging this downtime was straightforward:

1. I couldn't browse to shoppertom.com from anywhere
2. Connecting to my hosting service and browsing the website by IP address did work

My conclusion was that DNS problems were occurring. Contacting my registrar, I found out that their DNS server was under a DDoS attack and hence could not reply to DNS queries. The registrar recommended an alternative DNS hosting service.
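The two checks above can be scripted. This is a minimal sketch (stdlib only; the function name and return values are my own illustration) that distinguishes a DNS failure from a server outage by comparing name resolution against a direct connection to the known IP:

```python
import socket

def diagnose(hostname, ip, port=80, timeout=5):
    """If the name does not resolve but the raw IP accepts connections,
    the problem is DNS, not the host itself."""
    try:
        socket.gethostbyname(hostname)
        resolves = True
    except socket.gaierror:
        resolves = False
    try:
        s = socket.create_connection((ip, port), timeout=timeout)
        s.close()
        reachable = True
    except OSError:
        reachable = False
    if not resolves and reachable:
        return "dns"       # name broken, server fine - the case described here
    if resolves and reachable:
        return "ok"
    return "server"        # the host itself is down or unreachable
```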

I switched Shoppertom.com to this DNS service and was back online within a few minutes (there are many online DNS testing tools you can use for a similar issue).

After a few days I noticed a big decline in the crawl rate in Google Webmaster Tools. The importance of a high crawl rate is explained in multiple SEO blogs and articles, for example:

http://www.shoutmeloud.com/top-10-killer-tips-to-increase-google-crawl-rate.html
http://blissseo.com.au/reduced-crawl-rate-reduced-traffic/

Checking the crawl rate in Google Webmaster Tools (site settings) showed that I could no longer control my crawl rate and that it had been fixed at a set rate because I was on a CDN:

"Your site has been assigned special crawl rate settings. You will not be able to change the crawl rate"

Changing my DNS to a simple DNS solution solved this and returned things to where they were: Google no longer limits my crawl rate, and my daily visits are recovering as well.

Bottom line: I agree with http://ignitevisibility.com/cdn-can-crash-google-crawl-rate-sales/

Amit
Building an E-Commerce Comparison Site


A few months ago I decided to build an e-commerce comparison site.
The main goal was to provide price comparison across multiple retailers.
The website will be free of charge for both retailers and users, and will therefore be able to give web users more shopping choices. Being free will also allow the website, in the long term, to reduce costs for retailers, potentially allowing them to reduce prices for end shoppers.

In this blog I will detail the technical challenges (and other challenges) I face while building Shoppertom.com, and how I resolve them.

Future discussion pages:





....