Apache Nutch Online Training

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr, adding web-specific components such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

Day 1

  • Installing and configuring Nutch
  • Verify your Nutch installation
  • Crawl your first website
  • Crawling the web, the CrawlDb, and URL filters
  • Parsing and parse filters
  • Nutch plugins and plugin architecture
  • Analysis, link analysis, and scoring

Day 2

  • Indexing and custom fields
  • Deployment and shard architecture
  • Writing custom tools for Nutch
  • Set up Solr for search
  • Integrate Solr with Nutch
  • Hadoop architecture

© 2016 Laliwala IT. All rights reserved.