Scraping the web is hard.
Matt Cutts says so:
http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-your-browser-and-your-web-server/
I've found this to be true.
A couple of implications:
It's hard to build a web crawler that can suck information out of pages reliably.
Validation doesn't matter, because Google doesn't penalize pages for invalid markup. And if Google doesn't care, you shouldn't either.
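
To make the first implication concrete, here is a minimal sketch of pulling links out of a page that would never pass a validator. It assumes Python's standard-library html.parser, which is lenient about broken markup; the names LinkExtractor, extract_links, and MESSY_HTML are made up for this illustration, and a real crawler would need far more than this.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in the given markup, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately invalid markup: unclosed tags, an uppercase tag with an
# unquoted attribute, stray closing tags, and an anchor with no href.
MESSY_HTML = (
    '<html><body><p>Unclosed paragraph'
    '<a href="/one">first'
    '<A HREF=two.html>second</b></div>'
    '<a>no href</a>'
)

if __name__ == "__main__":
    print(extract_links(MESSY_HTML))
```

The parser never raises on the broken input; it just reports whatever tags it can recognize, which is about the best a scraper can hope for on real pages.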