
Using Beautifulsoup To Extract A Table In Python 3

I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I require is something that can be exported to a .csv with a h

Solution 1:

So you already have this:

datasets = [
  (('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')),
  (('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))
]

Here's how you can transform it. Assuming every row has the same headers in the same order, you can extract the headers from the first row:

headers_row = [hdr for hdr, data in datasets[0]]

Now, extract the second field of each tuple like ('Tests', '103') in each row:

processed_rows = [
  [data for hdr, data in row]
  for row in datasets
]
# [['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'], ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms']]

Now you have the header row and a list of processed_rows. You can write them to a CSV file with the standard csv module.
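
As a minimal sketch of that last step, the header row and the processed rows can be handed to csv.writer. The file name data.csv is only a placeholder, and the datasets literal repeats the sample data from above so the snippet runs on its own.

```python
import csv

# Sample data in the same shape as above: one tuple of (header, value) pairs per row.
datasets = [
    (('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'),
     ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')),
    (('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'),
     ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms')),
]

headers_row = [hdr for hdr, data in datasets[0]]
processed_rows = [[data for hdr, data in row] for row in datasets]

# 'data.csv' is a placeholder name; newline='' lets the csv module control line endings.
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(headers_row)       # header line first
    writer.writerows(processed_rows)   # then one line per data row
```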

A better solution may be to keep your original format and use csv.DictWriter.

  1. Extract the headers into headers_row, as shown above.

  2. Write the data:

    import csv
    with open('data.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers_row)
        writer.writeheader()
        for row in datasets: # your original data
            writer.writerow(dict(row))

Here dict(datasets[0]), for example, is:

{'Tests': '103', 'Failures': '24', 'Success Rate': '76.70%', 'Average Time': '71 ms', 'Min Time': '0 ms', 'Max Time': '829 ms'}

Solution 2:

In Python 3, zip returns a lazy iterator rather than a list, so convert it to a list before appending it:

for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(list(dataset))  # process iterator to list


Final Output:

[[('Tests', '103'), 
('Failures', '24'), 
('Success Rate', '76.70%'), 
('Average Time', '71 ms'), 
('Min Time', '0 ms'), 
('Max Time', '829 ms')], 

[('Tests', '109'), 
('Failures', '35'), 
('Success Rate', '82.01%'), 
('Average Time', '12 ms'), 
('Min Time', '2 ms'), 
('Max Time', '923 ms')]]

If you want to convert the dataset to a csv string, use this code:

# convert to csv string

hdrline = ','.join(e[0] for e in datasets[0]) + "\n"
data = ""
for rw in datasets:
    data += ','.join([e[1] for e in rw]) + "\n"
csvstr = hdrline + data



Tests,Failures,Success Rate,Average Time,Min Time,Max Time
103,24,76.70%,71 ms,0 ms,829 ms
109,35,82.01%,12 ms,2 ms,923 ms
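
Joining with ',' by hand works for this data, but it does not quote values that themselves contain a comma or a quote character. If that could happen, one option is to let the csv module build the string through an in-memory buffer. A minimal sketch, where the rows are shortened and the value '1,829 ms' is invented purely to show the quoting:

```python
import csv
import io

# Hypothetical rows in the same shape as datasets; the value '1,829 ms'
# is made up here to show how a comma inside a field is handled.
datasets = [
    [('Tests', '103'), ('Max Time', '829 ms')],
    [('Tests', '109'), ('Max Time', '1,829 ms')],
]

buffer = io.StringIO()
# lineterminator='\n' matches the "\n" used in the manual version;
# the csv module's default is '\r\n'.
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow([hdr for hdr, value in datasets[0]])
for row in datasets:
    writer.writerow([value for hdr, value in row])

csvstr = buffer.getvalue()
print(csvstr)
```

The field containing a comma comes out wrapped in double quotes, so a CSV reader will still see two columns in that row.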

Solution 3:

If you are using the standard csv module, you don't need to associate values with their labels.

Instead, create a csvwriter and write the headings and each row of cell values directly:

import csv

with open('file.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile) # You may add options here to format your csv file as needed

    headings = [th.get_text() for th in table.find("tr").find_all("th")]
    csvwriter.writerow(headings)

    for row in table.find_all("tr")[1:]:
        data = (td.get_text() for td in row.find_all("td"))
        csvwriter.writerow(data)
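
For reference, here is a self-contained sketch of the same approach from HTML to file. The HTML snippet is made up to stand in for the real page, file.csv is a placeholder name, and the code assumes the beautifulsoup4 package is installed; the actual page's markup may well differ.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page; only the table shape matters.
html = """
<table>
  <tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr><td>103</td><td>24</td><td>76.70%</td></tr>
  <tr><td>109</td><td>35</td><td>82.01%</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# The first row holds the <th> header cells.
headings = [th.get_text() for th in table.find('tr').find_all('th')]

with open('file.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(headings)
    # Every remaining row holds <td> data cells; writerow accepts any iterable.
    for row in table.find_all('tr')[1:]:
        csvwriter.writerow(td.get_text() for td in row.find_all('td'))
```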
