
Web Scraping Using Beautiful Soup - How Can I Get All Categories

How can I get all the categories mentioned on each listing page of the same website, 'https://www.sfma.org.sg/member/category'? For example, when I choose the alcoholic beverage categ

Solution 1:

You need to extract the permalink values from the script with a regex and join them with the base URL. Here is a sample:

import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.sfma.org.sg/member/category/manufacturer'

script_txt = """<script>
        var tmObject = {'tmember':[{id:'1',begin_with:'0-9',name:'1A Catering Pte Ltd',category:'22,99',mem_type:'1',permalink:'1a-catering-pte-ltd'},{id:'330',begin_with:'A',name:'A-Linkz Marketing Pte Ltd',category:'3,4,10,14,104,28,40,43,45,49,51,52,63,66,73,83,95,96',mem_type:'1',permalink:'a-linkz-marketing-pte-ltd'},{id:'318',begin_with:'A',name:'Aalst Chocolate Pte Ltd',category:'30,82,83,84,95,97',mem_type:'1',permalink:'aalst-chocolate-pte-ltd'},{id:'421',begin_with:'A',name:'ABB Pte Ltd',category:'86,127,90,92,97,100',mem_type:'3',permalink:'abb-pte-ltd'},{id:'2',begin_with:'A',name:'Ace Synergy International Pte Ltd',category:'104,27,31,59,83,86,95',mem_type:'1',permalink:'ace-synergy-international-pte-ltd'}
        </script>"""

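# Name the parser explicitly so bs4 does not warn about guessing one.
soup = BeautifulSoup(script_txt, 'html.parser')

# .string returns the raw text held by the <script> tag.
txt = soup.script.string
# Capture the value of every permalink:'...' entry in the script text.
pattern = re.compile(r"permalink:'(.*?)'}")

permalinks = pattern.findall(txt)
for permalink in permalinks:
    # Links follow the relative template "../info/{{permalink}}".
    print(urljoin(base, '../info/' + permalink))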

https://www.sfma.org.sg/member/info/1a-catering-pte-ltd
https://www.sfma.org.sg/member/info/a-linkz-marketing-pte-ltd
https://www.sfma.org.sg/member/info/aalst-chocolate-pte-ltd
https://www.sfma.org.sg/member/info/abb-pte-ltd
https://www.sfma.org.sg/member/info/ace-synergy-international-pte-ltd

Solution 2:

To get the correct total of 240 for manufacturer (and the totals for all categories, or for any given category):

If you want just the manufacturer listings, first look at the page and check how many links there should be.


By including the class of the parent ul (.w3-ul) in the CSS selector and then adding the child class selector .plink, we limit the match to just the relevant links. That gives 240 links on the page.


If we simply ran that selector against the HTML returned by requests, we would fall far short of this number: many links are added dynamically, and requests does not run JavaScript, so they are absent from the response.

However, all links (for every dropdown selection, not just manufacturer) are present in a JavaScript object, tmObject, inside a script tag.



We can extract this object with the following regular expression:

var tmObject = (.*?);
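
As a minimal sketch of how that expression behaves, the snippet below applies it to a hypothetical one-member page fragment. The fragment is shortened for illustration, and its closing "]};" is an assumption about how the real object ends.

```python
import re

# Hypothetical, shortened page fragment; the closing "]};" is assumed.
html = "<script>var tmObject = {'tmember':[{id:'1',permalink:'1a-catering-pte-ltd'}]};</script>"

# Non-greedy capture: everything after "var tmObject = " up to the first ";".
match = re.search(r'var tmObject = (.*?);', html)
payload = match.group(1) if match else None
print(payload)
```

Because the capture stops at the first semicolon, a semicolon inside any value would cut the object short, so it is worth checking that the captured string ends with a closing brace.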



Inspecting the returned string shows that the keys are unquoted, which can cause problems if we try to parse it with a json library.


We can use the hjson library for parsing, as it allows unquoted keys. Install it with: pip install hjson
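
To illustrate the problem, the sketch below shows the standard json module rejecting a string of this shape. It then shows a best-effort, standard-library-only fallback for cases where hjson cannot be installed. This fallback is not what the answer uses. The helper js_object_to_json is hypothetical, and it assumes that values contain no single quotes and no ",word:" sequences.

```python
import json
import re

# Hypothetical, shortened sample in the same shape as the page's tmObject.
raw = "{'tmember':[{id:'1',name:'1A Catering Pte Ltd',category:'22,99',permalink:'1a-catering-pte-ltd'}]}"


def js_object_to_json(text):
    """Best-effort rewrite of a simple JS object literal into strict JSON."""
    # Wrap bare keys (an identifier right after '{' or ',') in double quotes.
    text = re.sub(r"([{,])\s*([A-Za-z_]\w*)\s*:", r'\1"\2":', text)
    # Re-quote single-quoted strings as JSON double-quoted strings.
    return re.sub(r"'([^']*)'", lambda m: json.dumps(m.group(1)), text)


# Strict JSON parsing fails on the raw string.
try:
    json.loads(raw)
    strict_json_ok = True
except json.JSONDecodeError:
    strict_json_ok = False

data = json.loads(js_object_to_json(raw))
print(strict_json_ok)
print(data['tmember'][0]['permalink'])
```

For real pages hjson remains the more robust choice, since a regex rewrite like this can misfire on unusual member names.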


Finally, remember that we have all listings, not just manufacturers. Inspecting the tags in the original HTML shows that the manufacturer tag is associated with group code 97.



So, I extract both the links and the groups from the parsed object as a list of tuples. I split the groups on "," so that I can use in to filter for the manufacturer code:

all_results = [(base + item['permalink'], item['category'].split(',')) for item in data['tmember']]
manufacturers = [item[0] for item in all_results if '97' in item[1]]

Checking the len of the final list gives our target of 240.

So, we have all_results (all categories), a way to split by category, as well as a worked example for manufacturer.


import re

import requests
import hjson

base = 'https://www.sfma.org.sg/member/info/'
# Non-greedy capture of everything between "var tmObject = " and the first ";".
p = re.compile(r'var tmObject = (.*?);')
r = requests.get('https://www.sfma.org.sg/member/category/manufacturer')
# hjson tolerates the unquoted keys that the json module would reject.
data = hjson.loads(p.findall(r.text)[0])
# Pair each member link with its list of category codes.
all_results = [(base + item['permalink'], item['category'].split(',')) for item in data['tmember']]
# manufacturer is category 97
manufacturers = [item[0] for item in all_results if '97' in item[1]]
print(manufacturers)
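
The same filter can be generalised to any category code. The sketch below runs offline on three members copied from the sample data in Solution 1. The helper name links_for_category and the use of Counter for per-category totals are illustrative choices, not part of the original answer.

```python
from collections import Counter

base = 'https://www.sfma.org.sg/member/info/'

# Three members taken from the sample data in Solution 1 (fields trimmed).
members = [
    {'permalink': '1a-catering-pte-ltd', 'category': '22,99'},
    {'permalink': 'aalst-chocolate-pte-ltd', 'category': '30,82,83,84,95,97'},
    {'permalink': 'abb-pte-ltd', 'category': '86,127,90,92,97,100'},
]


def links_for_category(members, code):
    """Return the info links of members whose category list contains code."""
    return [base + m['permalink'] for m in members if code in m['category'].split(',')]


# Count how many members fall under each category code.
counts = Counter(code for m in members for code in m['category'].split(','))

print(links_for_category(members, '97'))
print(counts['97'])
# Splitting matters: a substring test would wrongly match '9' inside '99'.
print('9' in '22,99', '9' in '22,99'.split(','))
```

The last line shows why the category string is split before testing membership: without the split, a short code would match as a substring of a longer one.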

Solution 3:

The links you are looking for are populated by a script (look at the response for https://www.sfma.org.sg/member/category/manufacturer in Chrome -> Inspect -> Network). If you look at the page, you'll see the script that loads them. Instead of scraping the links, scrape the scripts and you'll have the list. Then, since the link format is known, plug in the values from the JSON. Et voilà! Here is some starter code to work with; you can infer the rest.

import requests 
from bs4 import BeautifulSoup

page = "https://www.sfma.org.sg/member/category/manufacturer"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
# find_all already returns a list-like ResultSet, so no extra brackets are needed.
scripts = soup.find_all('script')
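
One possible next step is to pick out the script whose text mentions tmObject. The sketch below does this on a hypothetical inline HTML string rather than the live page. The app.js script tag and the one-member object are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """<html><head><script src="app.js"></script></head>
<body><script>var tmObject = {'tmember':[{id:'1',permalink:'1a-catering-pte-ltd'}]};</script></body></html>"""

soup = BeautifulSoup(html, 'html.parser')

# Keep the text of the first inline script that mentions tmObject.
# Scripts loaded via src have no inline text, so .string is None for them.
target = next(
    (s.string for s in soup.find_all('script') if s.string and 'tmObject' in s.string),
    None,
)
print(target)
```

From there, the object can be cut out of target with a regex and parsed as the other solutions describe.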
