Monday, April 28, 2014

ShopperTom.com - Architectural Overview

ShopperTom is an e-commerce comparison site. To understand its architecture, we should first understand its business requirements:

1. Collect data from multiple websites daily (using crawling and API methods)
2. Store the data in a highly efficient, queryable format so it can be displayed later
3. Be largely self-maintaining (the team maintaining ShopperTom is small and cannot spend a lot of time on maintenance)
4. Display web pages to users
5. Collect images from websites, create thumbnails from them and serve them to users


To meet the above business requirements I am using the following tools and technologies.


  1. A heavily modified version of Nutch 1.6 is used to collect the data from the different stores.
    The main reasons for using Nutch are:
    1. Nutch is a mature environment
    2. Nutch has built-in support for plugins
    3. Nutch has built-in support for robots.txt, which is critical for respecting the law and common practice around crawling
  2. A MySQL database is used to host all the website crawling metadata and some of the collected crawl data
  3. Solr is used to host the main crawl data
    1. Faceting is used to support the web search categories
    2. Extended DisMax (eDisMax) is used to support complicated query syntax and to return similar results from Solr
  4. A combination of Python scripts and Jenkins is used to run the Nutch jobs and track their progress.
  5. ASP.NET MVC 4 is used to return the web pages displayed to the users
    1. I chose ASP.NET MVC because a simple implementation is easy to build and deploy, though the total cost of ownership (TCO) is higher.
  6. Entity Framework and Hibernate are used to query data from the MySQL database (from C# and Java respectively)
  7. Image collection is done in two ways:
    1. Some images are collected automatically by several Nutch plugins
    2. Some images are collected by a CGI script
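
To make the Solr part of the list more concrete, here is a minimal Python sketch of how a search request combining eDisMax and faceting could be built. The parameter names (defType, qf, facet, facet.field, facet.mincount, wt) are standard Solr request parameters; the core name "products", the field names and the helper build_search_url are assumptions made up for illustration and are not taken from the actual ShopperTom code.

```python
from urllib.parse import urlencode


def build_search_url(base_url, user_query, facet_fields, query_fields):
    """Build a Solr select URL that uses eDisMax and field faceting."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),          # select the Extended DisMax query parser
        ("qf", " ".join(query_fields)),  # fields to search, with optional boosts
        ("facet", "true"),               # turn faceting on
        ("facet.mincount", "1"),         # skip facet values with no matching documents
        ("wt", "json"),
    ]
    # One facet.field parameter per field used to drive the category navigation.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical core and field names, for illustration only.
url = build_search_url(
    "http://localhost:8983/solr/products/select",
    "red running shoes",
    facet_fields=["category", "store"],
    query_fields=["title^3", "description"],
)
print(url)
```

The facet counts returned for each field can then be rendered as the category links on a search page, while qf controls which fields the free-text query is matched against.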

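The robots.txt point can be illustrated the same way. Nutch performs this check internally in Java; the sketch below only shows the idea using Python's standard urllib.robotparser module. The rules, the bot name "ExampleBot" and the store URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt_lines):
    """Parse the lines of a robots.txt file into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


# Hypothetical robots.txt content for an example store.
rules = load_rules([
    "User-agent: *",
    "Disallow: /checkout/",
    "Disallow: /account/",
])

product_ok = rules.can_fetch("ExampleBot", "http://store.example.com/products/123")
checkout_ok = rules.can_fetch("ExampleBot", "http://store.example.com/checkout/cart")
print(product_ok, checkout_ok)
```

A crawler would run a check like this before every fetch and skip any URL the store has disallowed.
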
In future posts I will elaborate on some of these technologies (while still keeping some of the intellectual property private).


