How Can Scraped HTML Be Different From The Source Code?
Solution 1:
In a word: JavaScript. You are downloading the basic HTML page, but your scraper is not a browser, so it does not download or run any of the JavaScript that a browser would run. Many sites these days start with a very small HTML page and use scripting to load and display additional data from the server dynamically. What you see in the browser (for example in the element inspector) is the page after those scripts have run, which is why it can differ from what you scraped.
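A rough way to check whether the page you downloaded is such a script-driven shell is to look at how much text the raw HTML actually contains compared with how many scripts it loads. The sketch below uses only Python's standard html.parser module. The names StaticPageSummary, summarize and SAMPLE_SHELL are made up for illustration, the sample markup is hypothetical, and "lots of scripts, almost no text" is only a heuristic, not a reliable test.

```python
from html.parser import HTMLParser


class StaticPageSummary(HTMLParser):
    """Counts <script> tags and collects text present without running JavaScript."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore script/style bodies; they are code, not page content.
        if text and self._skip_depth == 0:
            self.text_parts.append(text)


def summarize(html):
    """Return (number of <script> tags, text found in the raw markup)."""
    parser = StaticPageSummary()
    parser.feed(html)
    parser.close()
    return parser.script_count, " ".join(parser.text_parts)


# Hypothetical page: the markup a scraper receives before any script has run.
SAMPLE_SHELL = """
<html><head><title>Shop</title><script src="/static/app.js"></script></head>
<body><div id="root"></div><script>loadProducts();</script></body></html>
"""

scripts, text = summarize(SAMPLE_SHELL)
print(scripts, repr(text))  # prints: 2 'Shop'
```

Here the only text in the download is the title, and the empty div is presumably filled in later by the scripts. That is the situation described above: the data you want is simply not in the HTML you fetched.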
Solution 2:
You can use Selenium for this purpose. It renders the web page at runtime, just as your browser does. You can use Selenium with Firefox, Chrome, or PhantomJS.
Selenium
We use Selenium mainly to render a web page completely, since many sites are built with modern JavaScript frameworks. It is mostly used for building crawlers and scrapers that gather data from different pages of a website, and it is also used for web automation.
For more on Selenium, read the documentation at http://selenium-python.readthedocs.io/. I also have a blog post on Selenium for beginners: http://blog.hassanmehmood.com/creating-your-first-crawler-in-python/
Example
from selenium import webdriver

profile_link = 'http://hassanmehmood.com'


class TitleScrapper(object):
    def __init__(self):
        # Note: the firefox_profile keyword is deprecated in Selenium 4;
        # newer releases expect these preferences on an Options object.
        fp = webdriver.FirefoxProfile()
        # Avoid the startup screen
        fp.set_preference("browser.startup.homepage_override.mstone", "ignore")
        fp.set_preference("startup.homepage_welcome_url.additional", "about:blank")
        self.driver = webdriver.Firefox(firefox_profile=fp)
        self.driver.set_window_size(1120, 550)

    def scrape_profile(self):
        # Load the page in a real browser so that its JavaScript runs
        self.driver.get(profile_link)
        print(self.driver.title)
        self.driver.quit()  # end the browser session

    def scrape(self):
        self.scrape_profile()


if __name__ == '__main__':
    scraper = TitleScrapper()
    scraper.scrape()
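One thing to watch for: driver.get() returns once the initial page has loaded, but data fetched by scripts may arrive a little later, so you often have to wait before reading the rendered HTML. Selenium ships its own WebDriverWait helper for this. The sketch below deliberately swaps in a hand-written polling loop so that it has no dependencies and can be tried without a browser. wait_for_marker and its parameters are hypothetical names; the only Selenium members it relies on are driver.get() and driver.page_source.

```python
import time


def wait_for_marker(driver, url, marker, timeout=10.0, poll_interval=0.5):
    """Load url, then poll the rendered DOM until marker appears.

    Returns the rendered HTML, or raises TimeoutError if marker has not
    shown up within timeout seconds.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source  # the DOM as rendered so far
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "%r not found in rendered page within %s s" % (marker, timeout)
            )
        time.sleep(poll_interval)


class FakeDriver(object):
    """Stand-in for a Selenium driver: the content 'loads' on the third read."""

    def __init__(self):
        self.reads = 0

    def get(self, url):
        self.url = url

    @property
    def page_source(self):
        self.reads += 1
        if self.reads >= 3:
            return "<div id='root'>loaded</div>"
        return "<div id='root'></div>"


html = wait_for_marker(FakeDriver(), "http://example.invalid/", "loaded",
                       timeout=2.0, poll_interval=0.01)
print(html)  # prints: <div id='root'>loaded</div>
```

With a real browser you would pass an actual driver instead of FakeDriver, for example wait_for_marker(webdriver.Firefox(), profile_link, "some text you expect on the page"), and then parse the returned HTML as usual.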