Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.

0
votes
0 answers
7 views

Rvest - Can't read content of website - Don't know which nodes to select

I am scraping a web page to retrieve relevant data. As an example I'll use this URL: https://isbnsearch.org/search?s=THE+GODFATHER+%2C+Mario+Puzo My first iteration is ...
0
votes
0 answers
6 views

Web scraping cheerio and node js

Is it possible to embed cheerio code inside an HTML "script" tag? Example: <script> "cheerio code" </script>
0
votes
1 answer
12 views

MaxRetryError when scraping with selenium: happens when I quit and relaunch the browser

This is the snippet of the code. The error occurs in the last try-except block, where I attempt to quit and relaunch the browser in case I encounter a TimeoutException. Here is the error I get: ...
0
votes
0 answers
15 views

write-output returning nothing with a PSCustomObject on first launch

I have an automated IE script (PowerShell 2.0) to web-scrape a serial number. The script bypasses a login page and scrapes a serial number off a certain page. However, I am having issues in write-...
0
votes
1 answer
18 views

How to build a web-scraping function for a subreddit?

Summary: I want to web scrape a subreddit and then turn the data into data frames. I know how to do them individually, but I am stuck on using a function. Here is how I do it one by one. url = 'https://...
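A minimal sketch of the pattern the question is after: wrap the per-subreddit logic in a function that takes the subreddit name as a parameter. The function names and the reddit JSON endpoint are assumptions, not taken from the question.

```python
def subreddit_url(name, limit=100):
    """Build the JSON listing URL for a subreddit (endpoint is an assumption)."""
    return f"https://www.reddit.com/r/{name}/.json?limit={limit}"

def scrape_subreddit(name):
    # Hypothetical fetch step: in a real run this would be something like
    # requests.get(subreddit_url(name), headers={"User-Agent": "demo"}).json(),
    # followed by flattening the posts into a pandas DataFrame.
    return subreddit_url(name)
```

With the URL construction parameterized, the "one by one" copies collapse into a loop such as `[scrape_subreddit(s) for s in ["python", "rstats"]]`.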
0
votes
0 answers
52 views

CRONTAB executing Python which executes Node with Puppeteer for web scraping not working

I made this web scraper to get tables from web pages. I used Puppeteer (not knowing that it has issues with crontab) and Python for the cleaning and to handle the output to a DB. But to my surprise, when I ...
0
votes
0 answers
8 views

Excel web query mixing up data between multiple sources

Excel Web Query. First time asking a question here, so apologies in advance. I am creating a spreadsheet that pulls data from multiple links, although they are technically the same source. The sources ...
0
votes
0 answers
19 views

Cannot find nested ul/div using BeautifulSoup

I am trying to extract all the links to the store locations on a web site: https://www.ulta.com/stores/directory The structure of the page looks like this. I want to extract all the links under the ul ...
-4
votes
0 answers
12 views

Web scraping oddsportal [closed]

Kindly help out here, I'm in a fix. I'm studying this complex code by QHarr under this link to scrape only the Home, Draw & Away average odds from Oddsportal.com: How to solve "error 70 permission ...
2
votes
2 answers
32 views

How to target a specific Wikipedia table element for bs4 scrape?

Here is my code so far: from bs4 import BeautifulSoup soup = BeautifulSoup(website_url,'lxml') my_table = soup.find('table',{'class':'wikitable sortable'}) from urllib.request import urlopen as uReq ...
-2
votes
1 answer
33 views

Webscraping in R From Dataframe

From the following data frame I am trying to use the package rvest to scrape each word's part of speech and synonyms from the website https://www.thesaurus.com/browse/research?s=t into a CSV. I am ...
0
votes
3 answers
45 views

Why is selenium only picking up the first 12 items?

I'm trying to create a web scraper for a website (https://pokemondb.net/pokedex/national) that copies a list of images and saves them in a directory. Everything seems to work, except that instead of ...
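A common cause of "only the first 12 items" is lazy loading: the page only adds more entries to the DOM as you scroll. A hedged sketch of the usual fix — scroll and re-count until the item count stops growing — using a fake driver class (my invention) in place of a real Selenium driver:

```python
class FakeDriver:
    """Stand-in for a Selenium driver; simulates a page that reveals
    `per_scroll` more items on each scroll, up to `total`."""
    def __init__(self, total=151, per_scroll=12):
        self.total = total
        self.per_scroll = per_scroll
        self.loaded = per_scroll
    def execute_script(self, script):
        # A real driver would run the JS; here we just "load" more items.
        self.loaded = min(self.loaded + self.per_scroll, self.total)
    def find_items(self):
        return list(range(self.loaded))

def load_all_items(driver):
    # With real Selenium this would be
    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    # and driver.find_elements(By.CSS_SELECTOR, "...") instead of find_items().
    previous = -1
    while True:
        items = driver.find_items()
        if len(items) == previous:   # no new items appeared: we have them all
            return items
        previous = len(items)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
```

With a real page you would also add a short wait (e.g. `WebDriverWait`) between scrolls so the new items have time to load.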
-1
votes
0 answers
39 views

Python doesn't parse full txt file

I'm new to Python and I'm trying to parse a website with MBA data. I downloaded the HTML into a txt file (1.1 MB) containing 3 pages (actually, there are many more pages I want to parse). So, when I parse ...
0
votes
0 answers
19 views

How to return array from same level of elements in Cheerio?

How would you return an array from elements on the same level? A pointer in the right direction would be great. TIA. Here's the HTML code that I'm working on: &...
0
votes
1 answer
17 views

Error looping through all the pages in Python web scraping

I am trying to scrape a webpage, looping through all the pages within a link. When I loop through all the pages, the code below gives many duplicates. lst = [] urls = ['https://www.f150forum.com/...
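Duplicates in a page loop usually come from re-appending already-collected results inside a nested loop. A small sketch of the two usual remedies — build one URL per page, and dedupe while preserving order; the forum URL pattern here is a hypothetical example, not taken from the question:

```python
base = "https://www.f150forum.com/f118/some-thread-1234/"  # hypothetical thread URL

# One URL per page; many forums use "indexN.html" suffixes for page 2 onward.
page_urls = [base if i == 1 else f"{base}index{i}.html" for i in range(1, 4)]

def dedupe(items):
    """Drop duplicates while keeping first-seen order (dict preserves order)."""
    return list(dict.fromkeys(items))
```

If the duplicates persist after restructuring the loop, applying `dedupe` to the final list is a cheap safety net.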
-1
votes
1 answer
25 views

Save rows of a text in a dictionary

I'm web scraping to get some text from a website. For that text, I want to save the header of the text and the description as key and value in a dictionary. When I run the code, I get an error ...
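The header-as-key, description-as-value shape can be built in one step with `zip`, which also sidesteps the index errors that appear when the two scraped lists differ in length (the sample data below is illustrative, not from the question):

```python
# e.g. scraped <h2> texts and their matching paragraph texts
headers = ["Intro", "Methods", "Results"]
descriptions = ["overview text", "how it was done", "what was found"]

# zip pairs items positionally and stops at the shorter list,
# so a missing description cannot raise an IndexError.
text_by_header = dict(zip(headers, descriptions))
```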
0
votes
0 answers
16 views

Is there a way to scrape a “LinkedIn Member” name from search results?

I am trying to scrape employee names from a specific company on LinkedIn. Let's take this search example: https://www.linkedin.com/search/results/people/?facetCurrentCompany=%5B%2210424439%22%5D. My ...
0
votes
2 answers
52 views

How can I write a never-ending job in Rails (web scraping)?

Goal: I want to make a web scraper in a Rails app that runs indefinitely and can be scaled. Current stack the app is running on: ROR/Heroku/Redis/Postgres. Idea: I was thinking of running a Sidekiq job ...
0
votes
1 answer
19 views

Selenium Long Page Load in Chrome [duplicate]

I have built a scraper in python 3.6 using selenium and scrapinghub crawlera. I am trying to fetch this car and download its photos. https://www.cars.com/vehicledetail/detail/800885995/overview/ but ...
0
votes
0 answers
28 views

Key Error: “None of — are in the columns”

I wrote a script to scrape Yahoo Finance stock data using the Yahoo_Fin package. The aim of the script is to grab company financials to be able to perform some calculations. The input to the script is ...
0
votes
0 answers
17 views

R Parsing multiple charts from the same webpage through xpath

I am trying to parse the www.tradingeconomics.com page. @AllanCameron has been very helpful in this. Anyway, when I try to get all the charts from a page (data for 1 year, for 5 year and for 10 year, ...
0
votes
2 answers
30 views

How to scrape information about a specific product using search bar

I'm making a system - mostly in Python with Scrapy - in which I can, basically, find information about a specific product. But the thing is that the request URL is massively long; I got a clue that I ...
1
vote
2 answers
27 views

Scraping URLs in a webpage using BeautifulSoup

Below is the code to scrape this webpage. Out of all the URLs on the page, I need only those which have further information about the job postings, for example, the URLs to company pages like "...
0
votes
1 answer
15 views

Use Cheerio to get a variable value inside script tag

Good afternoon! I am trying to get the value of the variable "var JS_WCACHE_CK =" inside the script tag, but I have already tested and tried to adapt some codes without success. <script>...
0
votes
1 answer
27 views

Web scraping from a div tag is returning a random product's title, whereas it should return the first one

I'm trying to scrape data from a website, using the following code: containers = page_soup.findAll("div", {"class": "item-info"}) container = containers[0] output: <div class="item-info"> <...
1
vote
1 answer
34 views

Using Selenium and Python to scrape Morningstar website. Selenium doesn't download the full webpage

Here's my code: from selenium import webdriver import pandas as pd from lxml import etree url = 'https://www.morningstar.com/stocks/xbsp/UGPA3/quote' browser = webdriver.Chrome() browser.get(url) ...
0
votes
0 answers
9 views

How to hook up scrapy-splash with aquarium

I am trying to crawl a website using Scrapy and Aquarium; the latter is a load balancer that handles multiple Splash instances for rendering JavaScript. I'm running Aquarium using docker-compose up and ...
0
votes
0 answers
30 views

requests is very slow and sometimes returns an error

I run requests: url = 'https://www.yellowpages.com/boston-ma/mip/the-oceanaire-seafood-room-455904020' r = requests.get(url) but sometimes it takes a long time to return the Response object, and ...
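Slow or intermittently failing requests are usually handled with two things: a timeout (so a stalled server cannot hang the call) and retries with backoff. A sketch of the retry wrapper, demonstrated against a fake flaky function since the real call would need network access; the helper names are my own:

```python
import time

def fetch_with_retry(fetch, attempts=3, base_delay=0.01):
    """Retry a flaky zero-argument callable with exponential backoff.

    In practice `fetch` would be something like
    lambda: requests.get(url, timeout=10) -- the timeout is what stops
    a slow server from blocking the call indefinitely.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Flaky stand-in for the network call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```

For heavier use, `requests` users often mount `urllib3.util.Retry` on a `Session` instead of hand-rolling the loop.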
0
votes
2 answers
25 views

How do you write German text into a CSV file?

I'm trying to write text that was scraped from a German website into a CSV file. I tried using UTF-8 encoding as such: with open('/Users/filepath/result.csv', 'a', encoding='utf8') as f: f.write(...
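A runnable round-trip sketch: `utf-8-sig` writes a byte-order mark so tools like Excel recognize the umlauts, and `newline=''` is what the `csv` module expects when writing. The file path and sample rows are illustrative:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "result_demo.csv")  # demo path
rows = [["Straße", "München"], ["Größe", "Übersicht"]]

# Write with a BOM ('utf-8-sig') so spreadsheet apps detect the encoding.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Read back with the same encoding; the BOM is stripped automatically.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
```

Plain `utf8` also round-trips correctly in Python; the `-sig` variant only matters when the CSV is opened by other software.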
-1
votes
1 answer
18 views

Web parsing using BeautifulSoup with the same div - Can't return 'N/A' if not found on page

So I'm trying to scrape this entire website, but the issue is that the page uses the same div for the entries I want. So this is why I am doing the findAll for that same div, then looking for the individual ...
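The "N/A when absent" part boils down to one guard: a lookup returns `None` when the field is missing, so fall back explicitly. Sketched here with a stdlib regex so it runs standalone; with BeautifulSoup the same idea is `tag.get_text() if tag else 'N/A'`. The class names in the sample HTML are hypothetical:

```python
import re

def first_or_na(pattern, text):
    """Return the first regex capture group, or 'N/A' when the field is absent."""
    m = re.search(pattern, text)
    return m.group(1) if m else "N/A"

html = '<div class="entry"><span class="price">19.99</span></div>'
price = first_or_na(r'class="price">([^<]+)<', html)   # present -> "19.99"
phone = first_or_na(r'class="phone">([^<]+)<', html)   # absent  -> "N/A"
```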
0
votes
0 answers
30 views

BeautifulSoup4 - findAll gets only 10 occurrences

I am trying to scrape some information from a website using BeautifulSoup4. The html looks like this : <ul class=results__list-container"> <li class="results__list-container-item"&...
0
votes
1 answer
19 views

Pass two parameters into a url element while running the loop to webscrape data

import requests for i in range(len(lat_lon_df)): lat,lon = lat_lon_df.iloc[i] try: page = requests.get("https://forecast.weather.gov/MapClick.php?lat={}&lon={}&unit=0&lg=...
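The loop above interpolates `lat` and `lon` into a format string; `urllib.parse.urlencode` does the same job while escaping the values safely. The parameter names mirror the excerpt's URL:

```python
from urllib.parse import urlencode

def forecast_url(lat, lon):
    # Same query parameters as the excerpt's format string, built with
    # urlencode so the values are escaped safely.
    params = {"lat": lat, "lon": lon, "unit": 0, "lg": "english"}
    return "https://forecast.weather.gov/MapClick.php?" + urlencode(params)
```

In a real run, each `forecast_url(lat, lon)` would then be passed to `requests.get(...)` inside the loop over `lat_lon_df`.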
0
votes
1 answer
40 views

How to download multiple files with for loop

I'm stuck on what should be a fairly simple problem. But I'm a beginner coder so it's not obvious to me. I'm trying to download images from a website using dynamic names. I think what happens is that ...
-2
votes
0 answers
37 views

BeautifulSoup4 - findAll won't get all the occurrences right

I am trying to scrape some information from a website using BeautifulSoup4. The html looks like this : <ul class=results__list-container"> <li class="results__list-container-item"> ...
0
votes
2 answers
37 views

Xpath functions not working in playwright

Playwright is not working as expected when I try to use XPath functions. This is the code that I wrote to scrape the text inside the <h1> tag of https://example.org. const pw = require('...
-2
votes
0 answers
20 views

Intelligence web scraping [closed]

I'm new to Scrapy and I want to crawl an e-commerce website; it should crawl products, and I want an algorithm that crawls products frequently based on price and seller changes. For example, the ...
0
votes
0 answers
18 views

Get a specific tag with a part of the string beautiful Soup

I have many web pages to scrape. Some of them have 4/5 <p> tags, but they have the same string in their <p> tags. How can I access the whole contents of the <p> tags? (newbie)
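Collecting every `<p>` regardless of how many a page has can be done with the stdlib `html.parser` so the sketch runs standalone; with BeautifulSoup the equivalent is simply `[p.get_text() for p in soup.find_all("p")]`. The sample HTML is illustrative:

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> tag, however many the page has."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")   # start a fresh paragraph buffer
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data  # accumulate text inside the open <p>

parser = ParagraphCollector()
parser.feed("<p>one</p><div><p>two</p></div><p>three</p>")
```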
-2
votes
0 answers
29 views

How to extract and insert content into a MySQL database from multiple URLs in PHP [closed]

I'm making a PHP web scraper and I want to follow links from the input URL and save the extracted data into a database. This code follows links from one URL. Here my ...
1
vote
0 answers
16 views

Puppeteer delete DOM element and children and free memory

In Puppeteer: right now I'm removing an element from the DOM with the following function (I'm removing items from an infinite scroll): remove_element = async (element) => { //removes element ...
1
vote
2 answers
44 views

Webscraping inconsistently built tables using BeautifulSoup [gurufocus site]

I'm trying to get three indicators from the gurufocus site and encountered an issue I'm not sure how to address properly - the thing is, the tables I'm scraping are inconsistent regarding how many rows they ...
-3
votes
0 answers
23 views

Scrape more than 5000 pages at a time? [closed]

I am scraping a real estate website. I want to scrape more than 5000 pages. I just wrote this code: ''' import pandas import bs4 import requests MainURL = "https://www.aarz.pk/buy-property" req = ...
1
vote
1 answer
40 views

How can I optimize a web-scraping code snippet to run faster?

I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data. Generally ...
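For a loop like this that is network-bound, the usual first optimization is to run the fetches concurrently rather than one at a time. A hedged sketch with a thread pool, using a stand-in `fetch` function and hypothetical URLs since the original code isn't shown:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real network call (e.g. requests.get(url).text).
    # Network-bound loops usually gain the most from parallel requests
    # plus reusing one requests.Session instead of reconnecting each time.
    return f"page:{url}"

urls = [f"https://example.com/item/{i}" for i in range(20)]  # hypothetical

# pool.map preserves input order, so results line up with urls.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))
```

Keep `max_workers` modest and respect the target site's rate limits; hammering 16,000 URLs in parallel is a good way to get blocked.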
0
votes
0 answers
30 views

Scraping from site that requires login, how to access the contents?

So I am trying to scrape a website that requires a login. I have used requests and submitted my login details, although when I try to extract the data from the website, I am not getting the website I ...
2
votes
0 answers
43 views

Javascript asynchronous behavior with Mysql

I am extremely confused about the async behavior of JS. I have a CSV file that contains a list of URL which needs to be scraped. And the result should then be added to a database row by row in a for ...
0
votes
1 answer
29 views

Can only write one result to CSV file

I am trying to write my first Python script, which scrapes jobs and their ads for specified companies. However, I can only write the last result to the CSV. What am I doing wrong? I have put my 'with ...
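"Only the last result" almost always means the file is being opened with mode `'w'` inside the loop, truncating it on every pass. A sketch of the fix — open once, outside the loop — with illustrative sample data:

```python
import csv
import os
import tempfile

# Hypothetical scraped results: (company, job) pairs.
results = [["Acme", "Engineer"], ["Globex", "Analyst"], ["Initech", "Tester"]]
path = os.path.join(tempfile.gettempdir(), "jobs_demo.csv")

# Open ONCE before the loop; 'w' here truncates only this one time.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["company", "job"])   # header
    for row in results:
        writer.writerow(row)              # every row survives

with open(path, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))
```

(Appending with mode `'a'` also "works", but re-running the script then duplicates old rows; opening once in `'w'` is the cleaner fix.)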
0
votes
0 answers
17 views

Home Depot Purchase History Download

I work for a construction company and have started making dashboards for our Key Metrics. I do this using Python. One of the most tedious parts of the process is downloading the daily Home Depot ...
0
votes
1 answer
28 views

Object exists in the HTML but I'm unable to select it

I'm writing a scraper. When I use Inspect Element in Chrome I see the following: but when I run my code Elements data = doc.select("div.item-header"); and print the object data, I see that the ...
1
vote
1 answer
44 views

Extract email address from a website for each link inside DOM of page

I want to develop an app: I give it the URL of a specific website, and it extracts all links from that web page. For each extracted link I want to get the HTML content. I am basing this on the concept of deep ...
0
votes
1 answer
38 views

Web scraping with BeautifulSoup Python returns None

I'm trying to get some text from http://rss.cnn.com/rss/money_markets.rss and when I run the code I keep getting a None output. If it helps, I am trying to get all the small headlines from the web page and ...
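RSS is plain XML, so the headlines can be pulled with the stdlib `ElementTree`; a BeautifulSoup `find()` returning `None` usually just means the selector doesn't match the feed's actual tag names. The sample feed below mirrors the standard RSS `<item><title>` layout rather than the real CNN feed:

```python
import xml.etree.ElementTree as ET

sample = """<rss version="2.0"><channel>
  <item><title>Markets rally</title></item>
  <item><title>Stocks slip</title></item>
</channel></rss>"""

root = ET.fromstring(sample)
# Every <item> holds one headline in its <title> child.
titles = [item.findtext("title") for item in root.iter("item")]
```

For a live feed, fetch the URL first (e.g. with `urllib.request.urlopen`) and pass the bytes to `ET.fromstring`.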
1
vote
1 answer
31 views

Python script - Web Scraping

I'm doing a script that gets some data from a URL (http://www.pmo.cz/portal/nadrze/cz/mereni_1_mes.htm). All I need is to get the data (and the date + time) from this chart: Chart. The problem is I ...