python - Problem scraping table row data into an array - Stack Overflow

Background

I am looking to measure statistics from the website wotstars for an Xbox and PlayStation game, World of Tanks Console.

Initially I tried just using Excel to scrape the site directly into Power Query. The immediate issue was that only 5 rows of recent match data (5 games) are available as the page loads; a button needs to be clicked to view the last 100 matches.

This seemed like a good opportunity to learn more about web scraping with Python.

I have successfully managed to use Selenium to activate the "View More" button so that all 100 rows are shown, and I can return the table of interest.

Problem

I am currently stuck on writing the td data to a 2D array that I can then dump to a CSV file; I seem to be writing 100 columns rather than 100 rows.

I have looked at guides such as A Guide to Scraping HTML Tables with Pandas and BeautifulSoup for approaches (but that particular example doesn't show the source table).

Other

If I look in Excel, the query connection that attaches to this table is shown below. (Screenshot of the Power Query connection; image not reproduced.)

Code

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import pandas as pd

# Set up the WebDriver
driver = webdriver.Chrome()
# Famous player
url = 'https://www.wotstars.com/xbox/6757320'

# Open the login page
driver.get(url)
time.sleep(5)

login_button = driver.find_elements(By.CLASS_NAME, "_button_1gcqp_2")
for login in login_button:
    print (login.text)

if login_button[3].text == 'Start tracking':
    login_button[4].click()

print("button activated")

time.sleep(2)
#login_button[3].click()

soup = BeautifulSoup(driver.page_source, 'html.parser')

rows = []
for child in soup.find_all('table')[3].children:
    row = []
    for td in child:
        #print (td.text)
        try:
            row.append(td.text.replace('\n\r', ''))
        except:
            continue
        if len(row) > 0:
            #print (row)
            rows.append(row)
print (len(rows[0]))

#not working
#df = pd.DataFrame(rows[1:], columns=rows[0])
#display(HTML(df.to_html()))
        
print("done")

# Close the browser
driver.quit

Object Inspector

(Object Inspector snapshot of the table's HTML; image not reproduced.)

  • Please add the HTML code where the table is created. It will help. – Lewis Commented 6 hours ago
  • @Lewis Object Inspector snapshot of the table's code at wotstars.com/xbox/6757320 now shown at bottom – Bob Commented 5 hours ago

1 Answer


Since you have to use Selenium for the "View more" button click anyway, you don't need to use BeautifulSoup.
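(As an aside, the root cause of the "100 columns" symptom is visible in the question's loop: rows.append(row) is indented inside the inner for td loop, so the same growing row list is appended once per cell rather than once per row, and iterating table.children walks the thead/tbody wrappers instead of the rows themselves. If you did want to stay with BeautifulSoup, a minimal corrected loop would look like the sketch below; the table index [3] is carried over from the question and assumed to still select the right table.)

rows = []
# Iterate <tr> elements directly; table.children yields thead/tbody
# wrappers (and whitespace strings), not individual rows.
for tr in soup.find_all('table')[3].find_all('tr'):
    row = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if row:
        rows.append(row)  # append once per row, outside the cell loop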

Using only class names to select elements is difficult on this page, as it has multiple tables and buttons with the same class.

Instead, use XPath to find the BattleTracker section by its text; that way you can access the right table every time.
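The selector decomposes like this (a minimal sketch; the sessionWrap class fragment and the BattleTracker label are the ones used in the full code below):

# Anchor on the section wrapper by its visible text, then take the
# table nested inside it.
section_xpath = "//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]"
table = driver.find_element(By.XPATH, section_xpath + "//table")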

To scrape the table using pandas, you have to provide its HTML. You can get that by selecting the table element you want, reading its outerHTML, and passing that string to pandas.

Check the following code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd

# Set up the WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless")  # headless mode
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 30)

# Famous player
url = 'https://www.wotstars.com/xbox/6757320'

# Open the player page
driver.get(url)

# wait for the view more button to be available and click
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//button[text()='View more']"))).click()

# wait for the table to be expanded
wait.until(EC.visibility_of_element_located((By.XPATH,"//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//table/tbody/tr[6]")))

# getting the table which will be passed to pandas
table = driver.find_element(By.XPATH,"//div[contains(@class,'sessionWrap') and contains(.,'BattleTracker')]//table")

# parse the table using pandas
df = pd.read_html(table.get_attribute('outerHTML'))

print(df)

# Close the browser
driver.quit()

OUTPUT:

[                         Tank               Result  Kills  Dmg+A   Dmg  Received  ...  Hits     Pens  Spots    WN8  Duration                  When  
0   IXFraternité Char Mle. 75   Victory (survived)      2   3089  2469         0  ...    15   86.67%      2   3768     5m 1s             2h 4m ago   
1   IXFraternité Char Mle. 75  Victory (destroyed)      2   3217  2649      1200  ...    16   87.50%      2   4144     5m 5s            2h 12m ago   
2   IXFraternité Char Mle. 75   Victory (survived)      1   3556  3123       176  ...    17   94.12%      1   3921    3m 58s            2h 19m ago   
3   IXFraternité Char Mle. 75   Victory (survived)      0   3055  2803       691  ...    16   87.50%      2   2731    4m 31s            2h 26m ago   
4   IXFraternité Char Mle. 75   Victory (survived)      2   4434  4157         0  ...    22  100.00%      0   6526    5m 41s            2h 36m ago   
..                        ...                  ...    ...    ...   ...       ...  ...   ...      ...    ...    ...       ...                   ...   
95    IXObject 777 Version II   Defeat (destroyed)      0   2948  1357      1850  ...     5   60.00%      3    952    4m 13s  2/3/2025, 2:30:38 AM   
96    IXVK 45.02 (P) Ausf. B7   Defeat (destroyed)      1   1417  1417      1950  ...     3  100.00%      0   1344        3m  2/3/2025, 2:28:04 AM   
97                 IXMäuschen   Victory (survived)      3   8778  6818       453  ...    31   63.33%      1  14097    5m 52s  2/3/2025, 2:17:47 AM   
98                 IXMäuschen   Defeat (destroyed)      0   5529  3846      2150  ...     8   87.50%      1   3256     6m 5s  2/3/2025, 2:09:19 AM   
99                 VIIType 62   Defeat (destroyed)      0    542   542       880  ...     3  100.00%      3    691    1m 33s  2/3/2025, 2:04:01 AM   

[100 rows x 16 columns]]
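Note that pd.read_html returns a list of DataFrames, which is why the output above is wrapped in brackets; take the first element before working with the data. To finish with the CSV dump the question was after, a minimal follow-up could look like this (the filename is illustrative, and wrapping the HTML in io.StringIO avoids the FutureWarning that pandas 2.1+ emits for literal HTML strings):

from io import StringIO

# read_html returns a list of DataFrames; this section contains one table.
df = pd.read_html(StringIO(table.get_attribute('outerHTML')))[0]

# Write the 100 matches to disk (illustrative filename).
df.to_csv('wotstars_matches.csv', index=False)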