by Vested Id 04/25/2012, 10:57pm PDT
I'm a big fan of the Chicago Reader capsule reviews, and I'd like to download all of them and have them on my hard drive. I know a little Python and it seems like it would be easy to do using Beautiful Soup, but in the process of investigating how to go about this, a few questions have come up.
First, the obvious one: what sort of countermeasures do servers use to prevent people from stealing their content, and what should I do to circumvent them?
Two, each capsule has an "OID", so my original plan was to start at 1 and iterate, but they don't seem to be numbered consistently. For example, there's a 3677240 and a 3677244, but nothing in between. I know I can go through each page of capsules and store the URLs; I was just wondering if there is any kind of convention behind how they're numbering the capsules. I realize it's probably stupid to think there might be.
Three, they use a CMS called Gyrobase, and I wondered if there was a way to manipulate that directly.
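For reference, here is a rough sketch of the fallback from question two: walking a listing page and storing the capsule URLs instead of guessing OIDs. It uses the standard library's `html.parser` rather than Beautiful Soup so it runs without extra installs. The assumption that capsule links carry an `oid` query parameter, the sample HTML, and the `example.com` base URL are all hypothetical placeholders, not the site's actual markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qs


class CapsuleLinkParser(HTMLParser):
    """Collect links whose query string carries an 'oid' parameter."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they came from.
        url = urljoin(self.base_url, href)
        # Assumption: a capsule link looks like ...?oid=3677240
        if "oid" in parse_qs(urlparse(url).query) and url not in self.links:
            self.links.append(url)


def extract_capsule_urls(html, base_url):
    """Return the unique capsule URLs found in one listing page."""
    parser = CapsuleLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical listing-page fragment, for illustration only.
sample = """
<a href="/chicago/some-film/Film?oid=3677240">Some Film</a>
<a href="/chicago/other-film/Film?oid=3677244">Other Film</a>
<a href="/about">About</a>
"""
urls = extract_capsule_urls(sample, "http://example.com")
for url in urls:
    print(url)
```

Collecting links this way sidesteps the gaps in the OID sequence, since only IDs that actually appear on a listing page are ever requested.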
Thanks for your help.