Pragmatic Prose
Sunday, July 13, 2008
Web Scraping is Hard Because Sites Are Not Valid
Scraping the web is hard.
Matt Cutts says so:
http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-your-browser-and-your-web-server/
I've found this to be true.
This has a couple of implications.
It's hard to build a web crawler that can reliably pull information out of pages.
Validation doesn't matter, because Google doesn't penalize sites for invalid markup. And if Google doesn't care, you shouldn't either.
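To make the crawler point concrete, here is a minimal sketch of pulling links out of deliberately invalid markup with Python's standard-library `html.parser`, which is a tolerant parser rather than a validating one. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup are all made up for illustration; this is one way to cope with broken pages, not a complete scraper.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags without requiring valid markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in the given HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample: unclosed <p> and <b>, an unquoted attribute value,
# a missing </a>, uppercase tag names, and no closing </html>.
messy = '<html><body><p>Intro<b>bold <a href=/one>first<p><A HREF="/two">second</a></body>'
print(extract_links(messy))
```

The parser only reports the tags it sees and never checks that they nest or close properly, so the same code runs on valid and invalid pages alike. The hard part of scraping is everything after this step: deciding which of those tags actually hold the information you want when every site breaks the rules differently.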
Web
Sunday, July 13, 2008 7:54:28 PM (Central Standard Time, UTC-06:00)
© Copyright 2008 Chris Weber