We all know Google’s early success came from its search algorithm, and that’s what we still talk about most today. Behind the scenes, though, is Googlebot, scanning the internet and indexing URLs at a furious pace. For search engines, the ability to crawl the web, following links from one site to the next, is at the core of what they do. Why would the rest of us need a crawler? Well, let’s look at BuiltWith.com as an example of using a web crawler to extract valuable research data. With fees ranging from $500 to several thousand, BuiltWith.com has made a lucrative business out of selling access to data that it acquires through web crawling. At least we can presume they use web crawling; I doubt they are running a telemarketing boiler room and calling website owners all day to ask what CMS and plugins they are using. And that’s the point: without web crawling, getting large-scale data from thousands or millions of websites is nearly impossible.
Is Web Crawling Ethical?
There seems to be some debate about this. Generally speaking, the ethics of web crawling depend entirely on how it is done and in what context. Are you crawling public sites? Are you abiding by robots.txt directives? Robots files are designed specifically to give developers and site owners a mechanism for controlling how robots behave. Googlebot and other major crawlers will generally follow the directions you set up in your robots file. You can easily block Googlebot from crawling your site, for instance.
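To make the robots.txt check concrete, here is a minimal sketch in Python (standard library only, chosen for brevity rather than PHP). The robots.txt contents, the "MyCrawler" user agent, and the example.com URLs are all made up for illustration. It shows a polite crawler asking whether it may fetch a URL before requesting it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, not taken from any real site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot is blocked from the whole site by the first group of rules.
googlebot_ok = parser.can_fetch("Googlebot", "https://example.com/index.html")

# "MyCrawler" (an assumed name) falls under the wildcard group instead.
mycrawler_ok = parser.can_fetch("MyCrawler", "https://example.com/index.html")
mycrawler_private = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")

print(googlebot_ok, mycrawler_ok, mycrawler_private)
```

In a real crawler you would download each site’s robots.txt once, cache the parsed result, and call can_fetch before every request to that host.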
More Importantly to Some, is Web Crawling Powerful?
Web crawling robots search, find, and compile data in a world where data can, and does, translate into dollars. So yes, web crawling is, on some level, a way to create money automatically. Obviously there is an investment, perhaps a significant one, in both building and operating a web crawler. Once built, though, it can work 24/7/365 with as much accuracy and speed as you can build into it. Which leads to the next question…
How on Earth do you Build a Web Crawler Robot?
The simple answer is you don’t, because so many already exist that there is actually a top 50 web crawlers page over at Big Data Made Simple. The first thing I noticed about the list was, holy Java beans, that’s a lot of Java apps, with a healthy portion of C/C++ to go with it. That’s useless to me personally as a PHP dev, presuming I want to be able to hack and customize the deployment, which I probably do. That makes PHPCrawl the obvious choice. Another PHP option is OpenWebSpider. With OpenWebSpider, I found the documentation looked a bit sparse. It seems cool for indexing pages like a search engine, but there’s no sign of how we might customize the indexing to store custom data.
And then Along Comes a Sphider
I felt Sphider deserved its own header because, let’s face it, when you think of crawling the web and secretly stashing away data for your own purposes, there is a certain Machiavellian air to it. And none of the other crawlers captures that better than Sphider. It is designed primarily for search indexing, and by default it seems Sphider won’t jump from one domain to another. That can be changed in its options. I like that it can be run from the command line as well.
Web Crawling so Easy a 16-Year-Old Can Do It?
One of the few code-based, PHP-only examples of web crawling I could find was How to Create a Simple Web Crawler on Subin’s Blog. Congrats to him for that, and the Simple HTML DOM library he’s using there is actually a really good choice for scraping. I’ve used it before myself for both scraping and data importing/manipulation. The key idea is scraping out the URLs, getting them into the format you need, and then following those URLs. As suggested in the article, this can be very resource intensive, so you’ll need to put some limits on depth or on how many URLs you cover in a given crawl.
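The scrape-normalize-follow loop, with limits on depth and page count, might look roughly like the sketch below. It is written in Python with only the standard library, not PHP, and it is not the code from Subin’s article. The function names, the default limits, and the in-memory FAKE_SITE are all assumptions made for illustration; the fetch function is passed in so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in html, with #fragments removed."""
    extractor = LinkExtractor()
    extractor.feed(html)
    links = []
    for href in extractor.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_depth=2, max_pages=50, same_domain=True):
    """Breadth-first crawl; fetch(url) must return HTML text or None."""
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in extract_links(url, html):
            if same_domain and urlparse(link).netloc != start_host:
                continue  # stay on the starting domain
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.example/">Out</a>',
    "https://example.com/about": '<a href="/team#top">Team</a> <a href="/">Home</a>',
    "https://example.com/team": '<a href="/deep">Deep</a>',
    "https://example.com/deep": "no links here",
}

pages = crawl("https://example.com/", FAKE_SITE.get, max_depth=2)
print(pages)
```

The seen set is what keeps the crawler from looping when pages link back to each other, and the same_domain flag mirrors the stay-on-one-domain default that search-oriented crawlers tend to use.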
Designing the Data Scrape
There wouldn’t be much point in crawling unless we scraped data and stored it away. The question is how, and what? Well, both PHPCrawl and Simple HTML DOM make the job fairly easy. The latter uses a parsing approach similar to jQuery, where you can traverse the DOM. I’m not too familiar with PHPCrawl, but reading through the docs, it seems to mainly just focus on getting the URL data. Perhaps using them together would work best. It helps to know what information you want. For instance, we know we want to test what framework or CMS the site uses. One approach is to try to load /wp-admin: if it’s there, it’s safe to say it’s a WP site, and if it’s not, that usually means it isn’t. Some exceptions apply. For Drupal there are a few other ways to test: a version text file, the admin path, etc. Other systems leave other kinds of traces. What we really want to know is whether the site is WP, and if so, what plugins and theme it has installed.
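One possible way to answer the plugin and theme question from a page we have already fetched is to look for the /wp-content/plugins/ and /wp-content/themes/ paths that WordPress asset URLs normally contain. The sketch below is in Python (standard library only, not PHP) and is a heuristic, not a complete detector: it only sees plugins that load files on the front end, and sites can hide or rename these paths. The function name and the sample HTML, including the theme and plugin names, are invented for illustration.

```python
import re

# Asset URLs on WordPress sites normally contain these path segments.
PLUGIN_RE = re.compile(r"/wp-content/plugins/([A-Za-z0-9_-]+)/")
THEME_RE = re.compile(r"/wp-content/themes/([A-Za-z0-9_-]+)/")


def fingerprint_wordpress(html):
    """Guess whether a page is WordPress and which theme/plugins it exposes."""
    plugins = sorted(set(PLUGIN_RE.findall(html)))
    themes = sorted(set(THEME_RE.findall(html)))
    return {
        # Any of these traces is treated as a sign of WordPress (a heuristic).
        "is_wordpress": bool(plugins or themes or "/wp-includes/" in html),
        "themes": themes,
        "plugins": plugins,
    }


# Hypothetical page source; the theme and plugin names are made up.
SAMPLE_HTML = """
<link rel="stylesheet" href="/wp-content/themes/mytheme/style.css">
<script src="/wp-content/plugins/contact-form/js/app.js"></script>
<script src="/wp-content/plugins/seo-helper/seo.js"></script>
<script src="/wp-content/plugins/contact-form/js/extra.js"></script>
"""

report = fingerprint_wordpress(SAMPLE_HTML)
print(report)
```

Running this over the HTML of each crawled page and storing the resulting dictionary per domain would give the kind of dataset described above, with the /wp-admin check as a second signal for sites that expose no asset paths.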
Crawling Right Along Then
More to come later on how this project unfolds. What this initial research session showed is that it is possible, and even fairly well supported (especially for Java/C programmers), to build a web crawler robot using existing libraries. There is not as much information out there right now about the design or approach as I would have liked.