29.07.2012

Scraping using XPath or Regular Expressions

There is an either/or debate, which some might even call a war, over how to get exact values out of specific web pages. This is a vital part of the field of retrieving data from the Internet, a field that still goes by many names: spidering, site scraping, Internet harvesting, data mining, and web extraction. Many in the regular expressions camp have been harvesting data for years, know their tools well, and are competent at scraping complex sites. The XPath troop, meanwhile, claims to have the more modern, more flexible, and more powerful tool. There is some truth in both claims.

In the beginning, pattern matching was the only way to gather specific data from web pages. Once a website was spidered and an HTML page retrieved, the parsing would begin. Some used Perl, Python later became a favorite for this purpose, and many other languages were used as well. Regular expressions (often simply called RegEx) are a syntax for describing patterns in text, and they can be used from many programming languages.
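As a minimal sketch of this pattern-matching approach, the Python snippet below treats a page as one long string and pulls values out with a regular expression. The HTML fragment, the `price` class name, and the variable names are assumptions made up for illustration.

```python
import re

# Hypothetical HTML fragment standing in for a page that was already retrieved.
html = '<ul><li class="price">19.99</li><li class="price">5.49</li></ul>'

# The page is treated purely as text: capture whatever sits between
# the opening and closing tag of each price item.
PRICE_RE = re.compile(r'<li class="price">([^<]+)</li>')

prices = PRICE_RE.findall(html)
print(prices)  # ['19.99', '5.49']
```

The pattern knows nothing about the document tree. It works only as long as the markup around each value keeps exactly this textual shape.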

As XML became more prominent, tools began to be developed for analyzing and processing it. XPath is one of those. Its major advantage is that it sees a web page not just as one long string of data but as a structured document. It can use this structure not only to analyze strings for patterns but also to take advantage of the order and hierarchy of the document. Because of this, nearly all web scraping tools and Internet data mining software use this method. It not only allows this multilayered approach, but it also lends itself better to simple user interfaces for configuring the jobs that spider sites and scrape data from pages.
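To illustrate how the document hierarchy helps, here is a small sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree` module. The page content, the `id` values, and the class names are invented for this example. The snippet also assumes well-formed markup; real-world HTML often is not well-formed and would typically need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with two prices: one in a sidebar
# and one in the product block we actually care about.
page = (
    '<html><body>'
    '<div id="sidebar"><span class="price">1.00</span></div>'
    '<div id="product"><h1>Widget</h1><span class="price">19.99</span></div>'
    '</body></html>'
)

root = ET.fromstring(page)  # parse the page into a tree first

# The path expression uses the hierarchy: only the price that is a
# child of the div with id="product" is selected.
product_name = root.findtext(".//div[@id='product']/h1")
product_price = root.findtext(".//div[@id='product']/span[@class='price']")

print(product_name, product_price)  # Widget 19.99
```

The sidebar price is ignored because of where it sits in the tree, not because of how the surrounding text looks.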

There is one significant downside to XPath, though: speed. Each individual page must be parsed into a document tree before XPath can be applied. This may take only a second or less, but it adds up quickly on large sites. By contrast, running a regular expression over an HTML page takes a small fraction of the time and memory, particularly if the XPath-based engine also renders the pages. Hence industrial web scraping or enterprise web extraction works much better with the regular expression method. The fact remains, though, that RegEx sometimes simply hits its limit on what it can parse, because it does not take the structure inherent in HTML into account.
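The structural limit can be shown with a short sketch. The nested fragment and all names below are assumptions for illustration, and the tree-based half again uses Python's standard `xml.etree.ElementTree` on well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: the product block contains a nested div.
html = (
    '<div id="product">'
    '<div class="name">Widget</div>'
    '<span class="price">19.99</span>'
    '</div>'
)

# A non-greedy pattern stops at the FIRST closing </div>, which belongs
# to the nested name block, so the capture is cut short.
regex_capture = re.search(r'<div id="product">(.*?)</div>', html).group(1)
print(regex_capture)  # <div class="name">Widget

# A parsed tree knows which closing tag belongs to which element,
# so both children of the product block are available.
root = ET.fromstring(html)
child_texts = [child.text for child in root]
print(child_texts)  # ['Widget', '19.99']
```

Making the pattern greedy does not really help either, since it would then run to the last `</div>` on the page. The price of the correct answer is the parsing step, which is exactly the cost described above.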

For a long time, 30 Digits simply stuck to the RegEx method in its Web Extractor solution because the majority of its data harvesting jobs dealt with millions of page requests a day. With top engineers and many tools at its disposal, this approach handled nearly any website. Now the Web Extractor has gone to the next level: it has incorporated XPath as an option at any level. This means the war is over. It is no longer an either/or question but a both/and. The benefits and advantages of regular expressions and XPath are now combined in one Web Extractor solution.
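What a both/and approach could look like in principle is sketched below. This is not the Web Extractor's actual implementation or configuration format. The rule tuples, the `extract` function, and the sample page are all hypothetical, and the XPath part uses the limited subset in Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule format: (field name, method, expression).
RULES = [
    ("sku", "regex", r'data-sku="(\d+)"'),
    ("price", "xpath", ".//div[@id='product']/span[@class='price']"),
]

def extract(page, rules):
    """Apply each rule with the method it names and return a dict of fields."""
    result = {}
    tree = None  # parsed lazily, so regex-only jobs never pay the parsing cost
    for field, method, expr in rules:
        if method == "regex":
            match = re.search(expr, page)
            result[field] = match.group(1) if match else None
        elif method == "xpath":
            if tree is None:
                tree = ET.fromstring(page)  # assumes well-formed markup
            node = tree.find(expr)
            result[field] = node.text if node is not None else None
        else:
            raise ValueError(f"unknown method: {method}")
    return result

# Hypothetical page used only to exercise the sketch.
page = (
    '<html><body>'
    '<div id="product" data-sku="4711"><span class="price">19.99</span></div>'
    '</body></html>'
)

record = extract(page, RULES)
print(record)  # {'sku': '4711', 'price': '19.99'}
```

The design choice worth noting is the lazy parse: a job that uses only regular expressions keeps its speed, and the tree is built only when a rule actually needs the document structure.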
