Web Scraping Using Beautiful Soup - How Can I Get All Categories
Solution 1:
You need to extract the permalink values from the script using a regex and join each one with the base URL. Here is a sample:
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = 'https://www.sfma.org.sg/member/category/manufacturer'
script_txt = """<script>
var tmObject = {'tmember':[{id:'1',begin_with:'0-9',name:'1A Catering Pte Ltd',category:'22,99',mem_type:'1',permalink:'1a-catering-pte-ltd'},{id:'330',begin_with:'A',name:'A-Linkz Marketing Pte Ltd',category:'3,4,10,14,104,28,40,43,45,49,51,52,63,66,73,83,95,96',mem_type:'1',permalink:'a-linkz-marketing-pte-ltd'},{id:'318',begin_with:'A',name:'Aalst Chocolate Pte Ltd',category:'30,82,83,84,95,97',mem_type:'1',permalink:'aalst-chocolate-pte-ltd'},{id:'421',begin_with:'A',name:'ABB Pte Ltd',category:'86,127,90,92,97,100',mem_type:'3',permalink:'abb-pte-ltd'},{id:'2',begin_with:'A',name:'Ace Synergy International Pte Ltd',category:'104,27,31,59,83,86,95',mem_type:'1',permalink:'ace-synergy-international-pte-ltd'}
</script>"""
soup = BeautifulSoup(script_txt, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
txt = soup.script.get_text()
pattern = re.compile(r'permalink:\'(.*?)\'}')
permlinks = re.findall(pattern, txt)
for i in permlinks:
    # the page's link template; keep the part before the placeholder and append the permalink
    href = "../info/{{permalink}}"
    href = href.split('{')[0]+i
    print(urljoin(base, href))
Output:
https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd
Solution 2:
To get the correct total of 240 for manufacturer (and to get the totals for all categories, or the count for any given category):
If you want just the manufacturer listings, first look at the page and check how many links there should be. By ensuring the CSS selector includes the class of the parent ul, i.e. .w3-ul, we limit the match to just the appropriate links when we add the child class selector .plink. So, we have 240 links on the page.
If we simply used that selector on the HTML returned by requests, we would find we are far short of this, as many links are added dynamically and thus are not present with requests, where JavaScript doesn't run.
However, all links (for all dropdown selections, not just manufacturing) are present in a JavaScript dictionary within a script tag, assigned to a variable named tmObject.
We can regex out this object using the following expression:
var tmObject = (.*?);
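As a minimal sketch of that expression in use, the snippet below runs it against a short stand-in string rather than the live page. The html value is a made-up, shortened sample (an assumption, not the real response), and re.DOTALL is added only in case the object spans several lines. Because the match is non-greedy, it stops at the first semicolon, so this assumes no value inside the object contains one.

```python
import re

# Hypothetical, shortened stand-in for r.text from the real page
html = """<script>
var tmObject = {'tmember':[{id:'1',permalink:'1a-catering-pte-ltd'}]};
var other = 1;
</script>"""

# Non-greedy capture: everything after "var tmObject = " up to the first ';'
pattern = re.compile(r'var tmObject = (.*?);', re.DOTALL)
match = pattern.search(html)
raw = match.group(1) if match else None
print(raw)
```

Using search and checking for None avoids the IndexError that findall(...)[0] would raise if the page layout ever changes.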
Now, when we inspect the returned string, we can see that it has unquoted keys (e.g. id:'1'), which may pose problems if we wish to read this dictionary with a JSON library.
We can use the hjson library for parsing, as this allows unquoted keys:
pip install hjson
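If installing hjson is not an option, one standard-library fallback is to quote the keys with a regex and hand the result to json. This is a sketch of a different technique, not what this answer uses: loads_unquoted is a hypothetical helper, and it assumes that no value contains a quote character or a ",word:" sequence, either of which would corrupt the output.

```python
import json
import re


def loads_unquoted(js_object):
    """Parse a JS object literal with bare keys and single-quoted strings.

    Assumption: values never contain quotes or a ',word:' sequence.
    """
    # Quote bare keys: {id: -> {"id":   and   ,name: -> ,"name":
    quoted_keys = re.sub(r'([{,])([A-Za-z_]\w*):', r'\1"\2":', js_object)
    # Turn the single-quoted strings into JSON double-quoted strings
    return json.loads(quoted_keys.replace("'", '"'))


sample = ("{'tmember':[{id:'1',begin_with:'0-9',name:'1A Catering Pte Ltd',"
          "category:'22,99',mem_type:'1',permalink:'1a-catering-pte-ltd'}]}")
data = loads_unquoted(sample)
print(data['tmember'][0]['permalink'])
```

For real pages, a tolerant parser such as hjson is the safer choice, since it does not depend on these assumptions about the values.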
Finally, we know we have all listings and not just manufacturers; inspecting the tags in the original HTML, we can determine that the manufacturers tag is associated with group code 97.
So, I extract both the links and the groups from the parsed object as a list of tuples. I split the groups on "," so I can use the in operator to filter for the manufacturing code:
all_results = [(base + item['permalink'], item['category'].split(',')) for item in data['tmember']]
manufacturers = [item[0] for item in all_results if '97' in item[1]]
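The same split can be generalised to every category at once. The sketch below groups links by category code; links_by_category is a hypothetical helper, and members is a hand-copied excerpt of the first five records, standing in for data['tmember'].

```python
from collections import defaultdict

base = 'https://www.sfma.org.sg/member/info/'

# Excerpt of the tmember records, standing in for data['tmember']
members = [
    {'permalink': '1a-catering-pte-ltd', 'category': '22,99'},
    {'permalink': 'a-linkz-marketing-pte-ltd',
     'category': '3,4,10,14,104,28,40,43,45,49,51,52,63,66,73,83,95,96'},
    {'permalink': 'aalst-chocolate-pte-ltd', 'category': '30,82,83,84,95,97'},
    {'permalink': 'abb-pte-ltd', 'category': '86,127,90,92,97,100'},
    {'permalink': 'ace-synergy-international-pte-ltd',
     'category': '104,27,31,59,83,86,95'},
]


def links_by_category(members, base):
    """Map each category code to the list of member links tagged with it."""
    grouped = defaultdict(list)
    for item in members:
        for code in item['category'].split(','):
            grouped[code].append(base + item['permalink'])
    return grouped


grouped = links_by_category(members, base)
print(len(grouped['97']))  # members in the excerpt tagged with code 97
print(grouped['97'])
```

Splitting on "," before comparing matters: a plain substring test such as '97' in item['category'] would also match codes like 197 or 970.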
Checking the final len of the list, we get our target of 240.
So, we have all_results (all categories), a way to split by category, and a worked example for manufacturer.
import re
import requests
import hjson
base = 'https://www.sfma.org.sg/member/info/'
# capture the object literal assigned to tmObject in the page's script tag
p = re.compile(r'var tmObject = (.*?);')
r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
# hjson accepts the unquoted keys that the json module would reject
data = hjson.loads(p.findall(r.text)[0])
# one (link, [category codes]) tuple per member
all_results = [(base + item['permalink'], item['category'].split(',')) for item in data['tmember']]
# manufacturer is category 97
manufacturers = [item[0] for item in all_results if '97' in item[1]]
print(manufacturers)
Solution 3:
The links you are looking for are populated by a script (look at the response for https://www.sfma.org.sg/member/category/manufacturer in Chrome -> Inspect -> Network). If you look at the page source, you'll see the script that loads them. Instead of scraping the links, scrape the scripts and you'll have the list. Then, since the link format is known, plug in the values from the JSON. Et voilà! Here is some starter code to work with; you can infer the rest.
import requests
from bs4 import BeautifulSoup
page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
scripts = soup.find_all('script')  # find_all already returns a list of <script> tags
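One way the remaining steps might look, as a sketch: pick out the script text that mentions tmObject, pull the permalinks with a regex, and prepend the known link prefix. member_links and INFO_BASE are hypothetical names, and the scripts list below is a made-up stand-in for the text of each script tag.

```python
import re

# Link prefix taken from the member URLs shown on the page
INFO_BASE = 'https://www.sfma.org.sg/member/info/'


def member_links(script_texts):
    """Build member links from whichever script text defines tmObject."""
    links = []
    for text in script_texts:
        # Skip empty scripts (None) and scripts unrelated to the member list
        if text and 'tmObject' in text:
            links += [INFO_BASE + p
                      for p in re.findall(r"permalink:'(.*?)'", text)]
    return links


# Made-up stand-in for the contents of the page's <script> tags
scripts = [
    None,
    "var x = 1;",
    "var tmObject = {'tmember':[{id:'1',permalink:'1a-catering-pte-ltd'},"
    "{id:'421',permalink:'abb-pte-ltd'}]};",
]
print(member_links(scripts))
```

With the soup from the starter code, the input could be built as [s.string for s in soup.find_all('script')]; .string is None for script tags that have no inline content, which is why the helper skips empty entries.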