
python - How to scrape website which has hidden data inside table? - Stack Overflow


I am trying to scrape the Screener.in website to extract stock information. In the Quarterly Results section some fields are hidden: clicking the + button next to a parent header reveals additional rows for that header. I need those rows as well.

I am using the Python code below, which gives me a DataFrame but without the additional information:

from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.screener.in/company/TATAPOWER/consolidated/'
print(url)
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read()
soup = BeautifulSoup(page, 'html.parser')
# Take the first table that matches the class filter
table = soup.find_all("table", {"class": "data-table responsive-text-nowrap"})[0]
df = pd.read_html(StringIO(str(table)))[0]
df

The code above works, but I cannot pull the additional information.

Can somebody help me with this?

asked Feb 15 at 18:01 by Data-7scientist; edited Feb 15 at 19:52 by HedgeHog
  • Many sites are driven by JavaScript, and the HTML is not generated until JS events fire and add elements to the DOM. You can't parse HTML that isn't there, so in such cases you may need a web automation library instead (there are many; one example, which we were introduced to in a short web development course I took, is Splinter, a Python framework that I believe is based on Selenium). You can have the automation code "click" the right button and still use BeautifulSoup to parse the new HTML. – topsail, commented Feb 15 at 18:07

1 Answer


As already noted in the comment, the extra rows are loaded on demand, but those same requests can be replicated to obtain the content.

To do this, iterate over the rows of the table and make the request for each row that has an expand button.

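import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.screener.in/company/TATAPOWER/consolidated/'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

# Column names: 'Item' plus the period labels from the table header
keys = ['Item'] + list(soup.select_one('#quarters thead tr').stripped_strings)

data = []

# The last row of the table body is skipped
for row in soup.select('#quarters tbody tr')[:-1]:
    if row.td.button:
        # Expandable parent row: keep it, then request its schedule
        data.append(dict(zip(keys, [c.text for c in row.select('td')])))
        parent = row.td.button.text.strip(" +")
        # 3371 is the company id hard-coded in the API path for this page
        d = requests.get(
            'https://www.screener.in/api/company/3371/schedules/',
            params={'parent': parent, 'section': 'quarters', 'consolidated': ''},
            headers=headers,
        ).json()
        # Only the first entry of the response is added as a child row
        first_key = next(iter(d))
        data.append({"Item": first_key, **d[first_key]})
    else:
        data.append(dict(zip(keys, row.stripped_strings)))

pd.DataFrame(data)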

Result:

| Item | Dec 2021 | Mar 2022 | Jun 2022 | Sep 2022 | Dec 2022 | Mar 2023 | Jun 2023 | Sep 2023 | Dec 2023 | Mar 2024 | Jun 2024 | Sep 2024 | Dec 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sales + | 10,913 | 11,960 | 14,495 | 14,031 | 14,129 | 12,454 | 15,213 | 15,738 | 14,651 | 15,847 | 17,294 | 15,698 | 15,391 |
| YOY Sales Growth % | 43.63% | 15.41% | 43.06% | 43.02% | 29.47% | 4.13% | 4.95% | 12.17% | 3.69% | 27.24% | 13.67% | -0.26% | 5.05% |
| Expenses + | 9,279 | 10,091 | 12,812 | 12,270 | 11,810 | 10,526 | 12,500 | 12,967 | 12,234 | 13,540 | 14,232 | 12,427 | 12,312 |
| Material Cost % | 8.67% | 13.38% | 6.74% | 4.04% | 6.55% | 12.13% | 6.00% | 6.09% | 9.29% | 13.86% | 5.50% | 3.59% | 6.75% |
| Operating Profit | 1,634 | 1,869 | 1,683 | 1,760 | 2,319 | 1,928 | 2,713 | 2,771 | 2,417 | 2,307 | 3,062 | 3,271 | 3,079 |
| OPM % | 15% | 16% | 12% | 13% | 16% | 15% | 18% | 18% | 16% | 15% | 18% | 21% | 20% |
| Other Income + | 865 | 62 | 1,227 | 1,502 | 1,497 | 1,352 | 877 | 567 | 1,092 | 1,407 | 578 | 632 | 589 |
| Exceptional items | 0 | -618 | 0 | 0 | 0 | 0 | 235 | 0 | 0 | 39 | 0 | -140 | 0 |
| Interest | 953 | 1,015 | 1,026 | 1,052 | 1,098 | 1,196 | 1,221 | 1,182 | 1,094 | 1,136 | 1,176 | 1,143 | 1,170 |
| Depreciation | 758 | 846 | 822 | 838 | 853 | 926 | 893 | 926 | 926 | 1,041 | 973 | 987 | 1,041 |
| Profit before tax | 788 | 71 | 1,062 | 1,373 | 1,864 | 1,158 | 1,476 | 1,231 | 1,489 | 1,537 | 1,490 | 1,773 | 1,457 |
| Tax % | 30% | -794% | 17% | 32% | 44% | 19% | 23% | 17% | 28% | 32% | 20% | 38% | 18% |
| Net Profit + | 552 | 632 | 884 | 935 | 1,052 | 939 | 1,141 | 1,017 | 1,076 | 1,046 | 1,189 | 1,093 | 1,188 |
| Profit after tax | 552 | 632 | 884 | 935 | 1,052 | 939 | 1,141 | 1,017 | 1,076 | 1,046 | 1,189 | 1,093 | 1,188 |
| EPS in Rs | 1.33 | 1.57 | 2.49 | 2.56 | 2.96 | 2.43 | 3.04 | 2.74 | 2.98 | 2.80 | 3.04 | 2.90 | 3.23 |
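
The answer's code keeps only the first entry of each schedules response, which is why every parent in the result has exactly one child row. If a response holds several entries, a variation along the following lines would keep all of them. This is only a sketch under assumptions: the helper names `schedule_url`, `expand_rows` and `fake_fetch` are hypothetical, the endpoint path and query parameters are copied from the answer's URL, and the response is assumed to be a JSON object that maps each child name to its per-period values (the shape the answer's `d[first_key]` already relies on). The network call is replaced by a stand-in with made-up values so the example runs offline.

```python
from urllib.parse import urlencode

# Endpoint root copied from the URL in the answer (not verified independently)
API_ROOT = "https://www.screener.in/api/company"


def schedule_url(company_id, parent, section="quarters", consolidated=""):
    """Build the schedules URL for one expandable parent row.

    urlencode escapes parent names that contain spaces, e.g. "Other Income".
    """
    query = urlencode(
        {"parent": parent, "section": section, "consolidated": consolidated}
    )
    return f"{API_ROOT}/{company_id}/schedules/?{query}"


def expand_rows(rows, fetch_children):
    """Return rows with every child row inserted right after its parent.

    rows: list of dicts with an "Item" key; expandable parents end in "+".
    fetch_children: callable that takes the parent name and returns a dict
    shaped like {child_name: {period: value, ...}, ...} (assumed shape).
    """
    expanded = []
    for row in rows:
        expanded.append(row)
        label = row["Item"].strip()
        if label.endswith("+"):
            parent = label.rstrip("+").strip()
            # Keep every child entry, not only the first one
            for child_name, values in fetch_children(parent).items():
                expanded.append({"Item": child_name, **values})
    return expanded


def fake_fetch(parent):
    """Offline stand-in for the HTTP request; the values are made up."""
    return {f"{parent} part A": {"Q1": "1"}, f"{parent} part B": {"Q1": "2"}}


rows = [{"Item": "Other Income +", "Q1": "3"}, {"Item": "Interest", "Q1": "4"}]
for r in expand_rows(rows, fake_fetch):
    print(r["Item"])
print(schedule_url(3371, "Other Income"))
```

Against the live site, `fake_fetch` would be swapped for a function that requests `schedule_url(3371, parent)` with the same `User-Agent` header and returns the decoded JSON. Passing the fetcher in as an argument keeps the row-merging logic testable without network access.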