Questions tagged [web-scraping]

Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.
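As a minimal sketch of what "extracting specific information" can look like in custom software, the example below uses only Python's standard-library `html.parser` on an inline HTML string. The markup, the `title` class, and the names `ReviewTitleParser`, `SAMPLE_HTML`, and `extract_titles` are all hypothetical and chosen for illustration; a real page would need its own selectors and would first have to be downloaded.

```python
from html.parser import HTMLParser


class ReviewTitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching heading
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


# Hypothetical sample page; no network access is involved
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First review</h2><p>Great service.</p>
  <h2 class="title">Second review</h2><p>Slow delivery.</p>
  <h2 class="other">Not a review</h2>
</body></html>
"""


def extract_titles(html_text):
    """Return the stripped text of each matching heading, in document order."""
    parser = ReviewTitleParser()
    parser.feed(html_text)
    parser.close()
    return [title.strip() for title in parser.titles]


print(extract_titles(SAMPLE_HTML))
```

Third-party libraries such as Beautiful Soup or Scrapy, which many of the questions below use, wrap this kind of parsing in a higher-level API.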

0
votes
1answer
26 views

I want to scrape user and location information from Trustpilot reviews

I have working code to scrape information from Trustpilot. I'm successfully scraping the review text, heading, timestamp and ranking for all pages. I also want to scrape reviewer details and ...
0
votes
0answers
9 views

Is it mandatory to call .prettify() in Beautiful Soup?

For the past 2 days I have been trying to understand what exactly .prettify() does in the code. With .prettify(), humans can see properly aligned code, but is it really important for the computer as well? I ...
0
votes
0answers
14 views

How to scrape the data of a specific HTML table (by class) from a URL in Python?

I want to scrape data from a website by using its class. The table data is generated at run time. I tried the code below, but it doesn't work. tables = soup.findAll("table", { "class" : "tab05" }) ...
-3
votes
0answers
22 views

Instagram web page is not loading all the posts

I am trying to scrape some data from Instagram. I wrote Java code using the Selenium library. The code works as follows: 1- Go to login page 2- Login 3- Go to this url (instagram location) 4- ...
1
vote
1answer
23 views

I am trying to scrape a website using Selenium and Beautiful Soup

How could I get all the categories mentioned on each listing page of the same website, i.e. the code as well as the title? I am trying to scrape the website through Selenium and using Beautiful Soup to scrape each ...
0
votes
1answer
18 views

Scrapy output empty

I am trying to use Scrapy to extract paper titles from IEEE Xplore by scrapy shell 'https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5962385' For the first paper title, I used copy ...
0
votes
1answer
11 views

Scraping text between pseudo elements

I am trying to crawl an auction website (https://onlineonly.christies.com/s/first-open-post-war-contemporary-art/massimo-vitali-b-1944-203/43092). I want to use css.selector to select the price of the ...
1
vote
1answer
20 views

Trying to web-scrape through hundreds of thousands of PDFs on a government website. Want to do it as fast as possible

I'm trying to search through the U.S. technical manuals for anything relating to levees and flooding events. I think there are about 400,000 files hosted by them, and I'm trying to write code to ...
-2
votes
0answers
12 views

Web Scraping - Beautiful Soup: 'NoneType' object has no attribute 'text' [duplicate]

I am trying to write a web scraping application. I get values without a for loop; however, inside a for loop it gives an error. for x in range (615, 620): url = "http://xperteleven.com/gameDetails.aspx?GameID=317180"...
1
vote
0answers
32 views

I want to reference specific text on a website, copy it, and paste it into Excel

I've written VBA code to open IE and enter a specific query based on Excel data. I now want to copy a specific piece of text, paste it back into Excel, and repeat this for 15000 rows. This VBA code works ...
0
votes
0answers
15 views

Excel VBA getting to .ASPX page [on hold]

I am using Excel VBA to populate selection criteria on a web page (the first, obtained with a URL that contains XXX.com/YYY/ZZZ.ASPX), click on a few choices, and then click to submit the page. So far ...
0
votes
0answers
12 views

Web scraping overlay boxes

I want to scrape a grocery store ad like the "weekly ad" found here. Some of the information I want is available when inspecting the html of the elements, but the full details of what I want aren't ...
0
votes
1answer
30 views

Creating a dataframe from paragraph text scraped from website in R

I'm trying to scrape a website that has many different pieces of information I want, in paragraphs. I got this to work perfectly... However, I don't understand how to break the text up and create a dataframe. ...
-1
votes
1answer
43 views

How to remove a None type from a string

I am scraping a webpage using this code import requests import bs4 res=requests.get(URL) res.text soup=bs4.BeautifulSoup(res.text, 'lxml') lis=[] for k in soup.find_all('a'): Fin=(k.get('href')) ...
0
votes
1answer
20 views

Scrapy: concatenate array elements inside a div in Python

I need to concatenate some text inside a <div> with xpath in Scrapy. The div has the next structure: <div class="col-12 e-description" itemprop="description"> "-Text1" <br> &...
0
votes
0answers
24 views

My code gets stuck at some point when using lapply in R to scrape multiple links

I am scraping the content of multiple links. However, in the middle of the run, the code stops working (I do not get any more prints in the console). Here is the code I am using. The same code worked ...
0
votes
2answers
29 views

Getting empty list while using xpath with html.fromstring

I am trying to extract text from a webpage using the code below. It works fine for other websites, but here I am getting an empty list. import requests from lxml import html siteurl = 'https://...
0
votes
1answer
23 views

Can anyone please explain in plain English what exactly this code means: soup.find_all("p", class_="strikeout")

I want to understand in plain English what exactly this code means. I have tried learning from the BeautifulSoup documentation and I got the gist, but I am not confident. soup.find_all("p", class_="strikeout"...
0
votes
2answers
35 views

How to save a PDF from a link that automatically starts to download?

I am trying to scrape and SAVE pdf files that automatically start to download once you click on the URL, such as: https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=...
-1
votes
0answers
21 views

Render JS scripts on the server and scrape the fully generated DOM

I am using the Goutte package in Laravel to scrape sites. I found that it doesn't receive the fully generated DOM, since most elements are rendered later with JS etc. I have this snippet: $crawler = ...
0
votes
1answer
26 views

Downloading JavaScript-loaded audio using Python

I'm trying to write a script to automate the downloading of English audio files from a website, using Python. The audio plays/loads on click, but I don't know how to "capture" the file as it loads ...
-1
votes
0answers
26 views

Is there a function in rvest that can help me pull the information from the tables on this website?

I want to pull the data from this website URL: https://www.pro-football-reference.com/boxscores/201410120cle.htm#all_player_offense into RStudio but can't seem to find a correct way to do so. I'm ...
0
votes
1answer
37 views

Scrapy gets a 302 redirect and does not crawl the site

Scrapy gets a 302 redirect to another link. In the link 'https://xxxxxx.queue-it.net?c.....com' Scrapy does not add the '/'. It should be 'https://xxxxxx.queue-it.net/?c.....com'. I have tried adding '...
-2
votes
0answers
13 views

Scrape Twitter tweets using request-promise and cheerio [on hold]

I want to make a Twitter scraper using the request-promise library, but Twitter uses JavaScript, so I can't use it. I don't want to use a headless browser. Is there any other way I can make it using ...
0
votes
1answer
36 views

Issue with regular expressions while parsing source code

I'm trying to get some information from a page's source code. For example, let's take this Amazon product. https://www.amazon.com/gp/product/B07PWCJZJ6?pf_rd_p=2d1ab404-3b11-4c97-b3db-48081e145e35&...
0
votes
1answer
47 views

Can't parse weird looking website addresses from some identical links

I'm trying to fetch the website address out of some identical webpages. I've created a regular expression to parse it, but the pattern I've defined is undoubtedly the worst one. How can I get only ...
-2
votes
0answers
8 views

I am not able to scrape some content with Python [on hold]

My goal is to find the number of times the word "trump" occurs at a given URL "https://www.nytimes.com/2019/08/18/us/politics/trump-economy-recession.html?rref=collection%2Ftimestopic%2FTrump%2C%...
3
votes
1answer
27 views

Unable to get some items scattered across a webpage

I'm trying to get four fields from a webpage using Python, but the problem is that the data I'm after is not within any structured HTML, so I can't find any way to get the fields individually. webpage address ...
1
vote
2answers
36 views

Python web-scraping and downloading specific zip files in Windows

I'm trying to download and stream the contents of specific zip files on a web page. The web page has labels and links to zip files that use a table structure and appear like this: Filename Flag ...
-1
votes
0answers
18 views

Scraping Twitter for tweets that are similar to the corpus of tweets that I have already gathered?

What would be the best approach for scraping tweets that are similar to the tweets that I have already gathered? For example if my corpus generally says something like "I believe x will lead to y" ...
3
votes
1answer
45 views

VBA: How to select a specific webpage div based on class

I'm looking to select part of a table (basically I don't want the title, and would prefer to leave off the header row too), but I can't seem to get it to work. HTML: <table id="mainContent" Class="MainContent-...
0
votes
1answer
27 views

How to log in and proceed with scraping in Java?

My problem is that I must be able to extract certain information such as the price, quantity and name of each product on a website selling electronic products and devices (this website), but the ...
-3
votes
0answers
22 views

I want to get a URL that includes a specific URL path [on hold]

Assumption: the URL is a dummy. There is a URL in the format https://xxxxxxxxx.co.jp/yyyy/. What I want to do: I want to get all URL strings that include https://xxxxxxxxx.co.jp/yyyy/. Is ...
-3
votes
1answer
31 views

Web scraping text that is missing in the HTML

I am attempting to collect some information from a series of forms. The majority of the online forms have the response text coded into the HTML, however, there is one section where this does not seem ...
-1
votes
1answer
46 views

Scraper to extract all emails in Outlook inbox with a certain subject [on hold]

I am trying to develop a scraper that can check the entirety of my Outlook inbox for emails with a specific subject line, and extract the data/body from those emails. Specifically, the format of the ...
0
votes
1answer
24 views

Scraping Google Results

I am trying to scrape Google results using Beautiful Soup. The results I get back are not what is displayed on the screen. What is needed to convert the results to the real text I see on the screen? ...
0
votes
0answers
27 views

Is there a way to scrape a specific table row or table cell with PowerShell?

I'm coding a web-scraper with PowerShell and want to select a specific value in a table or the whole row. What I've managed already is to print out my values but it selects the column names and the ...
1
vote
1answer
29 views

How to extract text from an online PDF using pdfminer in Python

I want to extract text from an online PDF with pdfminer using the code below. It shows no error, but the output is empty. from pdfminer.pdfpage import PDFPage from urllib import request from pdfminer....
-1
votes
0answers
15 views

I need to mirror a website from the Mac Terminal using HTTrack, but it's a website with form authentication (I have the log-in details)

I'm trying to download the content from a specific website and, according to what I've read, I need to mirror it by using HTTrack. Now, I have macOS, so I downloaded HTTrack through the Homebrew package manager and ...
0
votes
2answers
33 views

Scraping Inspect Element and Dynamic Webpage using Python

I am trying to get the news content from https://www.thehindu.com/life-and-style/travel/the-embers-of-war/article29202579.ece Actually, I am looking for a pattern to get the news content only. I use the ...
1
vote
3answers
49 views

Lengths of lists are not the same when appending items

I am working on a web-scraping project, in which I have to search for a product on a website and append all details of the product to their respective lists. For example, the first page of this URL lists ...
0
votes
0answers
14 views

Trying to access a MongoDB collection, use multiprocessing to web scrape, then insert into a different collection

Currently I am using a script which accesses a MongoDB collection using pymongo, creates a list of URLs I would like to scrape, scrapes them, then adds the results to another collection. Since this takes ...
0
votes
1answer
35 views

R - rvest - scrape all data from p (directors on IMDb page)

I'm trying to scrape film details from an IMDb webpage. The problem is with the directors data. I'm able to scrape only the first director, but would like to scrape all of them for each film. On the page mentioned below ...
0
votes
0answers
21 views

Scrapy two-factor authentication with different crawlers

I am trying to scrape gitlab.com with two-factor authentication, which works fine when I have a single spider. I log in to GitLab, then take input for the OTP from the console and submit it. ...
0
votes
2answers
47 views

Data from page2 same as from page1 when scraping

I am trying to scrape all event links from https://www.tapology.com/fightcenter. I already have quite some experience in web scraping using R, but in this case I am stuck. I am able to scrape from page ...
1
vote
3answers
60 views

Some function gives wrong results instead of None

I'm trying to print only two fields from two functions. Both functions take the same URL but produce different results. The first function get_names() prints the names of different users. The ...
0
votes
2answers
39 views

Selenium click not working for a link on the Nasdaq site [duplicate]

So the issue is very simple and straightforward. In the link https://www.nasdaq.com/symbol/iff/revenue-eps I want to click the link "Previous 3 Years" using Selenium, but it just doesn't seem to work. ...
0
votes
0answers
30 views

How do I scrape a specific HTML tag?

I'm making a 3-letter username checker that checks against a web address where it's displayed in plaintext whether the username is taken or not. It is inside of a <pre> tag and since the website is ...
0
votes
1answer
31 views

Logging into a website that doesn't use a POST request - web-scraping with Python

I am trying to log into the website (using requests get/post) https://www.robertparker.com/sign-in but neither Chrome nor Mozilla can see these sessions. Please point me in the right direction. There are ...
0
votes
0answers
25 views

scrapy startproject tutorial command throwing errors

After I installed Scrapy using pip3 install scrapy in my Ubuntu shell on Windows 10 (using Windows Subsystem for Linux), when I try the command scrapy startproject tutorial I get this error ...