Which of these scrapers...? | American Association of Woodturners

23, Jun. 2025

 

Getting ready to drop some $$ on my first set of wood turning tools and I've got everything figured out except my scraper. I'm primarily interested in turning bowls and could use some advice on which one of these three I should get.

In order, the below pictures are:

1) Hurricane M2 Cryo 1" (x 3/8") Heavy Duty Bowl Finishing Scraper: $114
2) Hurricane M42 Cryo 3/4" Bowl Finishing Scraper: $82
3) Hurricane M2 Cryo 1" Round Nose Scraper: $76

At first I was thinking of getting the cheaper 1" round nose and grind / sharpen so that it has the profile of a bowl scraper. Is it ok to do this, or should I go for one of the other tools that are actual bowl scrapers? And if so, what's the advantage / functionality of the heavy duty scraper vs. the smaller 3/4" one?



Brad - everyone will have their opinion, but... I turn a lot of bowls and 95% of the time use a traditional 5/8" bowl gouge only. My bowl gouges have 55-60 degree bevels, which generally gets it done for me. I will occasionally use a negative rake scraper (pretty heavy duty, not a lightweight) for some interior tear-out related issues, but it's rare. On those somewhat rare occasions where I go deeper than the width of the bowl, I have a "bottom-feeder" bowl gouge ground to about 85 degrees from an old traditional-ground gouge someone gave me. On the exteriors of bowls, after shaping, I mostly shear-cut (tool handle down low and flute turned toward the bowl) with my bowl gouge, and use the same bowl gouge as a scraper as needed. So, my 2 cents: use a bowl gouge (mostly) and a negative-rake scraper if necessary. Al Hockenbery, and a few other experienced turners that frequent this forum, probably have good videos showing how to turn the whole bowl with a traditional-ground bowl gouge.

I invested in the Glenn Lucas negative rake scraper (woodturnerscatalog.com) a few months ago, and have not looked back. With a proper burr burnished on the edge, it is a joy to use and does a great and safe job of shaving off finishing cuts on the inside of bowls. I don't know if I've gotten much use out of my traditional (single-bevel) scrapers on bowls since I made the purchase. One exception is a homemade scraper I made from a concrete demolition tool I found on the side of the road several years ago. It is a one-inch steel rod about 30" long and probably weighs ten pounds. The end is flattened into a spoon shape and I ground a scraping edge on it. It's great for heavy stock removal from the inside of rough bowls.
Get a negative rake scraper. The huge ones, like 1-1/2", aren't much better than the 1" ones I find; actually I like the 1" better, less area to catch when doing a fine scrape to clean up marks.

I purchased a number of scrapers 20 years ago when I started turning. Scraped a few bowls back then... Now, as others have said, I seldom reach for a scraper. I do need to eventually invest in a negative rake one though. The scraper I use most often is a 1/4" x 1" Thompson, slightly rounded, that I clean up plates and platters with; otherwise most of them just gather dust for months on end until I use them for a few minutes on a particular issue. Of the couple dozen (or more) tools I have, the 5/8" bowl gouge easily sees the most use. A small skew sees use on every bowl to make the tenon for my chuck, and usually a 3/8" bowl gouge or a 3/8" detail gouge to clean off the bottom tenon when the bowl is completed. With that said, I wouldn't part with the others I have, and I continually think about getting more.

What I don't get is: why would one invest in a negative rake scraper if they already have a standard single-bevel (assuming the shape is how you want it to be, or could be ground to the shape you want)? All one would need to do, it seems, is take your single-bevel scraper and grind a new bevel on the top, then grind your bevel again to raise a burr (or even hone and raise a burr with a burnishing tool), and voila, you have a negative rake scraper? Myself, I recently got a pair of PSI bowl scrapers (1" and 1-1/2") and while I have little practice in with them as yet, I was considering that I could just grind a negative rake into one or both (but first I want to see if I can refine the shape I currently have on them, which does not "quite" fit the transitions I end up with - most of my difficulty now is on bowl bottoms just after the transition, where invariably I seem to get a "ridge" in that area...)
…why would one invest in a negative rake scraper, if they already have a standard single-bevel (assuming the shape is how you want it to be, or could be ground to the shape you want)? All one would need to do, it seems, is take your single-bevel scraper and grind a new bevel on the top, then grind your bevel again to raise a burr (or even hone and raise a burr with a burnishing tool)…

Because, of course, I don't want to sacrifice one of the 5 rarely used scrapers I already have… you can *always* use just one more tool!

If you're just starting out, I would not spend the extra money for cryo steel. Get a basic M2 scraper in the size you think you want and put the shape on it you think you want. Then, when you realize it's not quite right, you can change the shape until you get what actually works for you, and not worry about how much expensive steel you're grinding off in the process. Think you actually want a negative rake scraper? Fine, off you go to the grinder to turn your inexpensive scraper into a negative rake. After 5 or 8 years, when you're getting pretty good at this and your cheap scraper is getting kinda short, buy that fancy steel. At that point, you'll be able to appreciate the longer-lasting edge. Just my 2 cents.
You can pretty easily make a double bevel scraper (NRS) from a single bevel, but you probably don't want to go back and forth.

If you take a 70° single-bevel scraper and add a 10° top bevel you will get an NRS, but it will be blunter, with an included angle of 80°. To get back to a 70° included angle you'll have to grind the bottom bevel to 60°, losing a bit of metal in the process.

Now if you change your mind and want to go back to a single bevel, you not only have to return the bottom bevel to 70°, but you also have to grind away all of that shallow 10° top bevel, and that is a lot of metal.

Thanks so much everybody for your advice, I read every word. I'd like to address everyone individually, but I find I risk having my follow-up questions disappear in the fray when I do.

I've had the suggestion that if I'm considering buying the more expensive scraper, I might think about going with the Easy Wood Tools Finisher (carbide) instead. I understand the wisdom in the advice to get cheap tools first, then see which ones I like and use before investing in more expensive tools, but I'm a measure-twice-cut-once kind of guy, so I tend to ask a bunch of questions. Any opinions on the Finisher vs. the 3/8" NR scraper?

You can get a Thompson 3/4" scraper for $75 where the EWT is well over a hundred. With my CBN wheels I will never use up all of my Thompson 3/4" scraper (and I wouldn't use it all if I were still using stone wheels). In demos you hear the pros saying that the scraper needs the burr to cut well and that the burr is gone in a few seconds, but then you watch and he uses the scraper for minutes, and if you watch closely, when he needs it again he just picks it up and uses it without resharpening. I don't use my 3/4" Thompson very often, but I cannot remember the last time I sharpened it and it still does exactly what I need it to do. I have never felt the need to get an EWT tool and I would not take one if it were free, as my Thompson does the same job. The only carbide I use regularly are the Hunter carbides.

Well, as a self-proclaimed scraper psycho, I think the biggest appeal of the carbide scrapers, other than the Hunter tools which are cupped and designed to slice, is that they are small scrapers, which makes them easier to handle, especially for beginners. I do not like big, wide, thick scrapers. As I say in my Scary Scrapers video, I don't need anything over 1" wide and 5/16" thick.

As for cutting with and without burrs, this does present a puzzle. Most of the time, for scraping cuts, especially heavy ones, you want a burr. I was floored when I saw Nick Stagg do a demo in Salem, OR where he finish cut a hard maple lamp base that was side grain/bowl grain orientation, with a scraper. He honed off the burr and bevel. It left a very smooth surface. That type of edge does work in harder and more dense woods. It does not work in softer woods.

No idea how the carbide NRS works, or maybe I can't explain why it cuts. In theory, it shouldn't because the NRS needs a burr to cut. Apparently, it does work. A sharp edge does cut.

robo hippy

Everything to Know to Start Web Scraping in Python Today - Scrapfly

Web scraping with Python is a massive subject and this guide will introduce you to all main contemporary concepts and techniques. From how to web scrape basic HTML to scraping dynamic pages with headless browsers and AI — we'll cover it all!


We'll start from the very basics and cover every relevant major technique and feature with some real life examples. Finally, we'll wrap up with a full scraper breakdown and some tips on scraper scaling, scraper blocking and project deployment.

What is Web Scraping?

To start, let's clarify what web scraping is and the parts that make it up.

Web scraping is an automated web data retrieval and extraction process. It's an invaluable tool for collecting data from websites at scale to power applications, analytics, and research. Web scraping can be used to extract product prices, news articles, social media posts, and much more.

The web is full of public data, so why not take advantage of it? Using Python we can scrape any page given the right techniques and tools. To start, let's take a look at the main parts of web scraping.

Web Scraping Overview

Generally, the modern Python web scraping development process looks something like this:

  1. Retrieve web page data:
    1. Use one or multiple HTTP requests to reach the target page
    2. Or use a headless browser to render the page and capture the data
  2. Parse the data:
    1. Extract the desired data from the HTML
    2. Or use AI and LLMs to interpret scraped data
  3. Scale up and crawl:
    1. Fortify scrapers against blocking
    2. Distribute tasks concurrently or in parallel

There are a lot of small challenges and important details here that start to emerge as you dive deeper into web scraping. Before we dig into those, let's start with a basic example that'll give us a glimpse of the entire process.

Basic Web Scraping Example

The most basic web scraping example needs:

  • HTTP client library - httpx is a modern and fast HTTP client for Python.
  • Data parsing library - BeautifulSoup is the most popular HTML parsing library.

To start, install those packages using pip terminal command:

$ pip install httpx beautifulsoup4

For our Python scraper example let's use the https://web-scraping.dev/product/1 page, which is designed for testing and developing web scrapers. We'll scrape the page and extract the product name, description, price and features:

Which we can scrape with the following Python code:

import httpx
import bs4  # beautifulsoup

# scrape page HTML
url = "https://web-scraping.dev/product/1"
response = httpx.get(url)
assert response.status_code == 200, f"Failed to fetch {url}, got {response.status_code}"

# parse HTML
soup = bs4.BeautifulSoup(response.text, "html.parser")
product = {}
product['name'] = soup.select_one("h3.product-title").text
product['price'] = soup.select_one("span.product-price").text
product['description'] = soup.select_one("p.product-description").text
product['features'] = {}
feature_tables = soup.select(".product-features table")
for row in feature_tables[0].select("tbody tr"):
    key, value = row.select("td")
    product['features'][key.text] = value.text

# show results
from pprint import pprint
print("scraped product:")
pprint(product)
Example output:

{'description': 'Indulge your sweet tooth with our Box of Chocolate Candy. '
                'Each box contains an assortment of rich, flavorful chocolates '
                'with a smooth, creamy filling. Choose from a variety of '
                'flavors including zesty orange and sweet cherry. Whether '
                "you're looking for the perfect gift or just want to treat "
                'yourself, our Box of Chocolate Candy is sure to satisfy.',
 'features': {'brand': 'ChocoDelight',
              'care instructions': 'Store in a cool, dry place',
              'flavors': 'Available in Orange and Cherry flavors',
              'material': 'Premium quality chocolate',
              'purpose': 'Ideal for gifting or self-indulgence',
              'sizes': 'Available in small, medium, and large boxes'},
 'name': 'Box of Chocolate Candy',
 'price': '$9.99 '}

This basic scraper example uses the httpx HTTP client to retrieve the page HTML, then BeautifulSoup with CSS selectors to extract data, returning a Python dictionary with the product details.

This is a very basic single-page scraper example. To fully understand web scraping in Python, let's break each individual step down and dive deeper next.

Step 1: Scraping The Data

Generally there are two ways to retrieve web page data in web scraping:

  1. Use a fast HTTP client to retrieve the HTML
  2. Use a slow real web browser to render the page, capture the browser data and page HTML

HTTP clients are very fast and resource-efficient. However, many modern pages are composed not only of HTML but of javascript and complex loading operations, where an HTTP client is often not enough. In those cases, a real web browser is required to load the page fully.

Web browsers can render the page and capture the data as a human would see it, and the browser can be automated through Python. This is called "headless browser" automation (headless meaning no GUI). Headless browsers can be used to scrape data from pages that require javascript rendering or complex page interactions like clicking buttons, filling inputs, or scrolling.

Both of these methods are used in web scraping and the choice depends on the target page and the data you need. Here are some examples:

  • web-scraping.dev/product/1 page is a static HTML page and we can easily scrape it using an HTTP client.
  • web-scraping.dev/reviews page is a dynamic page that loads reviews using javascript. We need a headless browser to scrape this page.

If you're unfamiliar with web basics like URLs, HTML, and requests, take a brief look at this short Scrapfly Academy lesson before continuing:

Related Lesson: Basic Website Reverse Engineering

Let's take a look at each of these 2 web data retrieval methods in Python.

Scraping HTML Pages

Most web pages are pretty simple and contain static HTML that can be easily scraped using an HTTP client.

To confirm whether a page is static or dynamic, you can disable JavaScript in your browser and check whether the data still appears in the page display or page source. If the data is in the page source then it's static and can be scraped with an HTTP client.

To scrape HTML data, use an HTTP client like httpx (recommended), which sends HTTP requests to the web server; the server returns responses that either contain the page HTML or a failure indicator.

To start, there are several types of requests identified by HTTP method:

  • GET - Retrieve data. The most common method encountered in web scraping and the simplest one.
  • POST - Submit data. Used in web scraping for submitting forms, search queries, login requests etc.
  • HEAD - Retrieve page metadata (headers) only. Useful for checking if the page has changed since the last visit.
  • PUT, PATCH, DELETE - Update, modify, and delete data. Rarely used in web scraping.

Here are some examples how to perform each type of web scraping request in popular Python httpx client:

As a result, each response has some key indicators of whether our scrape was successful. To start, the status code is a numeric code where each range has a specific meaning:

  • 2xx - ✅ success!
  • 3xx - ↪️ redirect to a different location; the page has moved.
  • 4xx - ⏹️ client error; the request is faulty or being blocked.
  • 5xx - ⏸️ server error; the server is inaccessible or the client is being blocked.

For more on status codes encountered in web scraping see our full breakdown in each of our HTTP status code articles:

  • 403 - Forbidden / blocked
  • 405 - Method Not Allowed
  • 406 - Not Acceptable (Accept- headers)
  • 409 - Conflict
  • 412 - Precondition Failed (If- headers)
  • 413 - Payload Too Large
  • 415 - Unsupported Media Type (for POST requests)
  • 422 - Unprocessable Entity (mostly bad POST request data)
  • 429 - Too Many Requests (throttling)
  • 502 - Bad Gateway

For more on static HTML page scraping see our related lesson on Scrapfly Academy:

Related Lesson: Static HTML Scraping

Scraping page HTML alone is often not enough for many websites, so next let's take a look at how real web browsers can be used in web scraping to handle dynamic pages.

Scraping JavaScript Rendered HTML

Modern pages are complex and use javascript to generate some page sections on demand. You've probably encountered this when scrolling through a page and new content loads, or when you click on a button and a new section appears.

An example of such page would be web-scraping.dev/reviews page which loads reviews using javascript:

To scrape JS rendered pages, we can automate real web browsers like Chrome or Firefox to load the page and return the page contents for us. This is called headless browser scraping and headless simply means there is no visible GUI.

Modern browsers expose automation APIs, such as the Chrome DevTools Protocol (CDP) and WebDriver, which allow applications to connect to a browser and control it. Several popular Python libraries implement these APIs; click on each to see our full introduction guide:

  • Selenium is the most mature and well known library for Python. It has a giant community but can be a bit harder to use due to its age.
  • Playwright is the newest and most feature rich option in Python developed by Microsoft.

Here are a Python Selenium web scraping example and a Python Playwright web scraping example for easier comparison:
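As a rough sketch of the Playwright approach (assuming `pip install playwright` plus `playwright install chromium` have been run; the `.review` selector is an assumption about the example page's markup):

```python
def scrape_rendered(url: str, selector: str) -> str:
    """Render a javascript page in headless Chromium and return the final HTML."""
    # imported lazily so the function can be defined without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(selector)  # block until the JS-rendered content appears
        html = page.content()  # final DOM serialized back to HTML
        browser.close()
    return html

# usage:
# html = scrape_rendered("https://web-scraping.dev/reviews", ".review")
```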

This type of scraping is generally called dynamic page scraping, and while Selenium and Playwright are great solutions, there's much more to this topic, which we cover in our Scrapfly Academy lesson:

Related Lesson: Dynamic Page Scraping

Since headless browser scrapers run real web browsers, scaling them can be very difficult. For this, many scaling extensions are available, like Selenium Grid for Selenium. Alternatively, you can defer this expensive task to Scrapfly's cloud browsers.

Scrapfly Cloud Browsers

Scrapfly's Web Scraping API can manage a fleet of cloud headless browsers for you and simplify the entire browser automation process.

Here's an example of how to automate a web browser with Scrapfly in Python:

# pip install scrapfly-sdk[all]
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")

# We can use a browser to render the page, screenshot it and return final HTML
api_result = client.scrape(ScrapeConfig(
    "https://web-scraping.dev/reviews",
    # enable browser rendering
    render_js=True,
    # we can wait for specific part to load just like with Playwright:
    wait_for_selector=".reviews",
    # for targets that block scrapers we can enable block bypass:
    asp=True,
    # or fully control the browser through browser actions:
    js_scenario=[
            # input keyboard
            {"fill": {"value": "watercolor", "selector": 'input[autocomplete="twitch-nav-search"]'}},
            # click on buttons
            {"click": {"selector": 'button[aria-label="Search Button"]'}},
            # wait explicit amount of time
            {"wait_for_navigation": {"timeout": 5000}}  # example timeout in milliseconds
        ]
    ))
print(api_result.result)

Capturing Browser Data

There's more to scraping with web browsers than just capturing rendered HTML. Browsers also hold data in non-HTML features:

  • Local Storage, IndexDB database storage
  • Executing background requests for retrieving more data from the server (like comments, reviews, etc)
  • Storing data in Javascript, CSS and other non-html resources

When web scraping with a headless browser we can also capture all of this data which can be crucial for many more complex web scraping processes.

So, if you see your page missing some data in the HTML, maybe it wasn't there in the first place! Let's see how to capture non-HTML data in Python with the Playwright or Selenium libraries.

Capturing Network Requests

XHR are background requests that are made by the page to retrieve more data from the server. These requests are often used to load comments, reviews, graph data and other dynamic page elements.

To start, we can view which XHR requests are being made using the browser developer tools. See this quick video on how that is done in Chrome:

To capture XHR in Python we must use a headless browser set up with a background request listener. Here's an example of how to capture XHR requests in Python with the Playwright or Selenium libraries:
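A minimal Playwright sketch of such a listener (the URL in the usage note is this guide's example page; `"xhr"` and `"fetch"` are Playwright's resource-type labels for background requests):

```python
def capture_xhr(url: str) -> list:
    """Load a page headlessly and collect the URLs of its background requests."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    captured = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # register the listener BEFORE navigating so no request is missed
        page.on(
            "request",
            lambda request: captured.append(request.url)
            if request.resource_type in ("xhr", "fetch")
            else None,
        )
        page.goto(url)
        page.wait_for_load_state("networkidle")  # let background calls finish
        browser.close()
    return captured

# usage:
# for xhr_url in capture_xhr("https://web-scraping.dev/reviews"):
#     print(xhr_url)
```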

For more see our complete intro to background request scraping.

Capturing Local Storage

Local Storage is a browser feature that allows web pages to store key-value data in the browser itself. This data is not visible in the page HTML and can be used to store user preferences, shopping cart items, and other client-side data.

An example of local storage can be found on web-scraping.dev/product/1 product pages, where the page stores cart information in local storage:

To capture Local Storage data in Python we must execute javascript in the browser to retrieve the data. Here's an example of how to do so in Python with the Playwright or Selenium libraries:
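A Playwright sketch of the idea, reading localStorage by evaluating javascript in the page (a sketch under those assumptions, not a definitive implementation):

```python
def read_local_storage(url: str) -> dict:
    """Dump a page's localStorage key-value pairs via in-page javascript."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # localStorage is only reachable from javascript running in the page
        storage = page.evaluate("() => Object.assign({}, window.localStorage)")
        browser.close()
    return storage

# usage:
# print(read_local_storage("https://web-scraping.dev/product/1"))
```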

Capturing Screenshots

Screenshots are a great way to capture the entire page state at a specific moment in time. They can be used to verify the page state, debug the scraping process, or even as a part of the scraping process itself.

To capture screenshots in Python with the Playwright or Selenium libraries we can use the screenshot() method, which implements the entire screenshot process:
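For instance, with Playwright the call is a one-liner on the page object (a minimal sketch; Selenium's closest equivalent is `driver.save_screenshot()`):

```python
def capture_screenshot(url: str, path: str) -> None:
    """Render a page in headless Chromium and save a full-page PNG."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # full_page=True captures the whole scrollable page, not just the viewport
        page.screenshot(path=path, full_page=True)
        browser.close()

# usage:
# capture_screenshot("https://web-scraping.dev/reviews", "reviews.png")
```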

Screenshot capture can be surprisingly challenging with page pop-ups, javascript execution and blocks. This is where Scrapfly can assist you once again!

Scrapfly Screenshot API

Scrapfly's Screenshot API uses headless browsers optimized for screenshot capture to make it as simple and seamless as possible:

  • Automatically block ads and popups
  • Bypass CAPTCHAs and other bot detection mechanisms
  • Target specific areas and different resolutions
  • Scroll the page and control the entire capture flow

Here's an example of how easy it is to capture a screenshot of any page using the Screenshot API:

# pip install scrapfly-sdk[all]
from scrapfly import ScrapflyClient, ScreenshotConfig

client = ScrapflyClient(key="your scrapfly key")

api_result = client.screenshot(ScreenshotConfig(
    url="https://web-scraping.dev/reviews",
))
print(api_result.image)  # binary image
print(api_result.metadata)  # json metadata

No matter which method you use to scrape websites, you'll end up with the entire page dataset, which on its own is not very useful. Next, let's take a look at how to parse this data and extract the information we need.

Step 2: Parsing Scraped Data

Websites generally return data in one of two formats: HTML for page content and JSON for API responses (like background requests). Each of these formats requires a different approach to extract the data.

Parsing Scraped HTML

To parse HTML there are 3 standard options (click on each for our dedicated tutorial):

  • CSS Selectors - commonly encountered in web development in CSS style rules and DOM parsing, CSS selectors let you select HTML elements by attributes like element tag, class, id, or location in the document.
  • XPath - a more powerful and flexible option, XPath lets you select HTML elements by their attributes, exact location in the document, and even their text content.
  • Beautifulsoup - offers a Pythonic way to parse HTML using .find(), .find_all(), and .select() methods.

To create your parsing rules it's first important to understand the structure of the HTML document. Take a look at this short video on how to explore HTML using Chrome developer tools:

With that, here's an example of how to parse product HTML from web-scraping.dev/product/1 with each of these methods:
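As a self-contained sketch, the inline HTML below is a trimmed, illustrative stand-in for the real product page; the XPath variant would need lxml or parsel installed, so it is shown only as a comment:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# trimmed, illustrative stand-in for the product page HTML
html = """
<div class="product">
  <h3 class="product-title">Box of Chocolate Candy</h3>
  <span class="product-price">$9.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector approach
title_css = soup.select_one("h3.product-title").text

# BeautifulSoup's Pythonic find() approach
title_find = soup.find("h3", class_="product-title").text

print(title_css)                # Box of Chocolate Candy
print(title_find == title_css)  # True

# the equivalent XPath (via lxml or parsel) would be:
#   //h3[@class="product-title"]/text()
```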

Overall, CSS selectors are sufficiently powerful for most web scraping operations, though for more complex pages the more advanced features of XPath or BeautifulSoup can be very useful.


Here's a quick overview table comparing XPath and CSS selectors and BeautifulSoup:

Feature                                   CSS Selectors   XPath   BeautifulSoup
select by class                           ✅              ✅      ✅
select by id                              ✅              ✅      ✅
select by text                            ❌              ✅      ✅
select descendant elements (child etc.)   ✅              ✅      ✅
select ancestor elements (parent etc.)    ❌              ✅      ✅
select sibling elements                   limited         ✅      ✅

While both XPath and BeautifulSoup are more powerful than CSS selectors, CSS selectors are very popular in web development and really easy to use. Because of this, most developers tend to use CSS selectors where possible and switch to XPath or BeautifulSoup when CSS selectors are not enough.

Whichever you choose see our handy cheatsheets:

  • CSS Selectors Cheatsheet
  • Xpath Cheatsheet

Next, modern websites do not always store data in the HTML elements themselves. Often data is stored in variables or scripts and then unpacked into the page HTML by javascript on page load. We can scrape this as well; let's take a look at how.

Scraping Hidden Dataset

Often we might see HTML parsing techniques return empty results even though we are sure the data is there. This is often caused by hidden data.

Hidden web data is when pages store page data in JavaScript variables or hidden HTML elements and then unpack it into the page HTML using javascript on page load. This is a common web development technique that separates server-side and client-side logic, but it can be annoying to deal with in scraping.

Take a look at this short video for a quick overview of hidden web data:

To verify whether that's the case, it's best to find a unique identifier and browse the page source code to see if the data is there. Often it's hidden in:

  • <script> tags as JSON objects
  • data- attributes of HTML elements
  • style="display: none" elements that are hidden by CSS

For example, web-scraping.dev/product/1 stores the first set of product reviews in the page source and then, on page load, turns them into the HTML elements we see on the screen:

Here we can verify this by searching the page source for unique review text.

So here, instead of parsing HTML, we'd take the page source, extract the element, and load the hidden dataset. Here's an example of how to scrape hidden data in Python with its popular HTML parsing libraries:
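A stdlib-only sketch of the idea (the inline page source and the `reviews-data` script id are hypothetical stand-ins for the real page's markup):

```python
import json
import re

# hypothetical page source with a hidden JSON dataset in a <script> tag
page_source = """
<html><body>
<script id="reviews-data" type="application/json">
  {"reviews": [{"rating": 5, "text": "Delicious!"}]}
</script>
</body></html>
"""

# locate the hidden dataset in the raw page source and load it as JSON
match = re.search(
    r'<script id="reviews-data" type="application/json">(.*?)</script>',
    page_source,
    re.DOTALL,
)
hidden_data = json.loads(match.group(1))
print(hidden_data["reviews"][0]["rating"])  # 5
```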

Hidden web data can be placed in almost any part of the page's resources, so for more on that see our full intro to hidden web data scraping.

Parsing Scraped JSON

JSON is the second most popular data format encountered in web scraping and modern websites can return colossal datasets that can be difficult to parse for valuable information. These datasets are often complex, nested and even obfuscated to prevent scraping.

To parse JSON in Python there are a few standard ways:

  1. Convert JSON to Python dictionaries and use tools like nested-lookup to traverse the data and find the information you need.
  2. Use standard parsing path languages like JMESPath and JSONPath (similar to CSS selectors and XPath)

Here's an example of how to parse JSON data from web-scraping.dev/reviews with each of these methods:
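A sketch using plain dictionary traversal, with the equivalent JMESPath expression left as a comment (the inline JSON is an illustrative stand-in for the real reviews response, not its actual schema):

```python
import json

# illustrative stand-in for a reviews API response
raw = """
{"results": [
  {"review": {"text": "Great!", "rating": 5}},
  {"review": {"text": "Meh", "rating": 3}}
]}
"""
data = json.loads(raw)

# method 1: plain dictionary/list traversal
texts = [item["review"]["text"] for item in data["results"]]
print(texts)  # ['Great!', 'Meh']

# method 2: the same query as a JMESPath expression (pip install jmespath):
#   jmespath.search("results[].review.text", data)
```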

Overall, JSON parsing is becoming a bigger part of web scraping as more websites use client-side rendered data rather than server-side rendered HTML. As you can see, it's not very difficult to parse JSON data in Python given the right toolset!

For more see our full introductions to JMesPath and JsonPath JSON parsing languages.

Finally, modern Python has access to new tools like AI models and LLMs which can greatly assist with parsing or just perform the entire parsing task for you. Let's take a look at these next.

Scraping with AI and LLMs

AI and LLMs are becoming a bigger part of web scraping as they can greatly assist with parsing complex data or even perform the entire parsing task for you. Generally we can separate this process into two parts:

  • Machine learning AI models. These are specially trained models to understand HTML for data extraction.
  • Large Language models (LLMs). These are general purpose models that can be fine-tuned to understand HTML for data extraction.

While dedicated AI models are very difficult to build, LLMs are easily accessible in Python through services like OpenAI, Gemini and other LLM marketplaces. Let's take a look at some examples.

Extracting Data with LLMs

To extract data using LLMs from scraped HTML we can add our HTML to the LLM prompt and ask questions about it like:

Extract price data from this product page

or, more interestingly, we can ask LLMs to make intelligent assessments of the data, like:

Is this product in stock?
Summarize the review sentiment of this product and grade it on a scale of 1-10

Furthermore, we can use LLMs to assist us with traditional parsing technologies with prompts like:

For HTML: Generate CSS selector that select the price number in this product page
For JSON: Generate JmesPath selector that selects the review text in this product review JSON

These prompts can be very powerful and can save a lot of time and effort in web scraping. Here's an example of how to use LLMs to extract data from HTML in Python with the OpenAI API:

import httpx
import openai
import os

# Replace with your OpenAI API key
openai.api_key = os.environ['OPENAI_KEY']

response = httpx.get('https://web-scraping.dev/product/1')


llm_response = openai.chat.completions.create(
    model="gpt-4o",
    messages = [{
        "role": "user",
        "content": f"Extract price data from this product page:\n\n{response.text}"
    }]
)

# Extract the assistant's response
llm_result = llm_response.choices[0].message.content
print(llm_result)
'The price data extracted from the product page is as follows:\n\n- Sale Price: $9.99\n- Original Price: $12.99'

An important recent discovery is that LLMs perform better with data formats simpler than HTML. In some cases it can be beneficial to convert HTML to Markdown or plain text before feeding it to the LLM parser. Here's an example of how to convert HTML to Markdown using html2text in Python and then parse it with the OpenAI API:

import httpx  # pip install httpx
import openai  # pip install openai
import html2text  # pip install html2text
import os

# Replace with your OpenAI API key
openai.api_key = os.environ['OPENAI_KEY']

# Fetch the HTML content
response = httpx.get('https://web-scraping.dev/product/1')

# Convert HTML to Markdown
html_to_markdown_converter = html2text.HTML2Text()
html_to_markdown_converter.ignore_links = False  # Include links in the markdown output
markdown_text = html_to_markdown_converter.handle(response.text)

print("Converted Markdown:")
print(markdown_text)

# Send the Markdown content to OpenAI
llm_response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": f"Extract price data from this product page in markdown format:\n\n{markdown_text}"
        }
    ]
)

# Extract the assistant's response
llm_result = llm_response.choices[0].message.content
print(llm_result)
'The price data for the product "Box of Chocolate Candy" is $9.99, discounted from the original price of $12.99.'

Web scraping with LLMs is still a relatively new field and there are a lot of tools and tricks worth experimenting with. For more on that, see our web scraping with LLM and RAG guide.

Extracting Data with Extraction API

Scrapfly's Extraction API contains LLM and AI models optimized for web scraping and can save you a lot of time figuring out this new technology.

For example, here's how easy it is to run LLM queries against your documents:

from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig

client = ScrapflyClient(key="SCRAPFLY KEY")

# First retrieve your html or scrape it using web scraping API
html = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1")).content

# Then, extract data using extraction_prompt parameter:
api_result = client.extract(ExtractionConfig(
    body=html,
    content_type="text/html",
    extraction_prompt="extract main product price only",
))
print(api_result.result)
{
  "content_type": "text/html",
  "data": "9.99",
}

Or here's how to extract popular data objects like products, reviews, articles etc. without any user input at all:

from scrapfly import ScrapflyClient, ScrapeConfig, ExtractionConfig

client = ScrapflyClient(key="SCRAPFLY KEY")

# First retrieve your html or scrape it using web scraping API
html = client.scrape(ScrapeConfig(url="https://web-scraping.dev/product/1")).content
# Then, extract data using extraction_model parameter:
api_result = client.extract(ExtractionConfig(
    body=html,
    content_type="text/html",
    extraction_model="product",
))
print(api_result.result)
Example Output
{
    "data": {
        "aggregate_rating": null,
        "brand": "ChocoDelight",
        "breadcrumbs": null,
        "canonical_url": null,
        "color": null,
        "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.",
        "identifiers": {
            "ean13": null,
            "gtin14": null,
            "gtin8": null,
            "isbn10": null,
            "isbn13": null,
            "ismn": null,
            "issn": null,
            "mpn": null,
            "sku": null,
            "upc": null
        },
        "images": [
            "https://www.web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp",
            "https://www.web-scraping.dev/assets/products/orange-chocolate-box-small-2.webp",
            "https://www.web-scraping.dev/assets/products/orange-chocolate-box-small-3.webp",
            "https://www.web-scraping.dev/assets/products/orange-chocolate-box-small-4.webp"
        ],
        "main_category": "Products",
        "main_image": "https://www.web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp",
        "name": "Box of Chocolate Candy",
        "offers": [
            {
                "availability": "available",
                "currency": "$",
                "price": 9.99,
                "regular_price": 12.99
            }
        ],
        "related_products": [
            {
                "availability": "available",
                "description": null,
                "images": [
                    {
                        "url": "https://www.web-scraping.dev/assets/products/dragon-potion.webp"
                    }
                ],
                "link": "https://web-scraping.dev/product/6",
                "name": "Dragon Energy Potion",
                "price": {
                    "amount": 4.99,
                    "currency": "$",
                    "raw": "4.99"
                }
            },
            {
                "availability": "available",
                "description": null,
                "images": [
                    {
                        "url": "https://www.web-scraping.dev/assets/products/men-running-shoes.webp"
                    }
                ],
                "link": "https://web-scraping.dev/product/9",
                "name": "Running Shoes for Men",
                "price": {
                    "amount": 49.99,
                    "currency": "$",
                    "raw": "49.99"
                }
            },
            {
                "availability": "available",
                "description": null,
                "images": [
                    {
                        "url": "https://www.web-scraping.dev/assets/products/women-sandals-beige-1.webp"
                    }
                ],
                "link": "https://web-scraping.dev/product/20",
                "name": "Women's High Heel Sandals",
                "price": {
                    "amount": 59.99,
                    "currency": "$",
                    "raw": "59.99"
                }
            },
            {
                "availability": "available",
                "description": null,
                "images": [
                    {
                        "url": "https://www.web-scraping.dev/assets/products/cat-ear-beanie-grey.webp"
                    }
                ],
                "link": "https://web-scraping.dev/product/12",
                "name": "Cat-Ear Beanie",
                "price": {
                    "amount": 14.99,
                    "currency": "$",
                    "raw": "14.99"
                }
            }
        ],
        "secondary_category": null,
        "size": null,
        "specifications": [
            {
                "name": "material",
                "value": "Premium quality chocolate"
            },
            {
                "name": "flavors",
                "value": "Available in Orange and Cherry flavors"
            },
            {
                "name": "sizes",
                "value": "Available in small, medium, and large boxes"
            },
            {
                "name": "brand",
                "value": "ChocoDelight"
            },
            {
                "name": "care instructions",
                "value": "Store in a cool, dry place"
            },
            {
                "name": "purpose",
                "value": "Ideal for gifting or self-indulgence"
            }
        ],
        "style": null,
        "url": "https://web-scraping.dev/",
        "variants": [
            {
                "color": "orange",
                "offers": [
                    {
                        "availability": "available",
                        "price": {
                            "amount": null,
                            "currency": null,
                            "raw": null
                        }
                    }
                ],
                "sku": null,
                "url": "https://web-scraping.dev/product/1?variant=orange-small"
            },
            {
                "color": "orange",
                "offers": [
                    {
                        "availability": "available",
                        "price": {
                            "amount": null,
                            "currency": null,
                            "raw": null
                        }
                    }
                ],
                "sku": null,
                "url": "https://web-scraping.dev/product/1?variant=orange-medium"
            },
            {
                "color": "orange",
                "offers": [
                    {
                        "availability": "available",
                        "price": {
                            "amount": null,
                            "currency": null,
                            "raw": null
                        }
                    }
                ],
                "sku": null,
                "url": "https://web-scraping.dev/product/1?variant=orange-large"
            },
            {
                "color": "cherry",
                "offers": [
                    {
                        "availability": "available",
                        "price": {
                            "amount": null,
                            "currency": null,
                            "raw": null
                        }
                    }
                ],
                "sku": null,
                "url": "https://web-scraping.dev/product/1?variant=cherry-small"
            },
            {
                "color": "cherry",
                "offers": [
                    {
                        "availability": "available",
                        "price": {
                            "amount": null,
                            "currency": null,
                            "raw": null
                        }
                    }
                ],
                "sku": null,
                "url": "https://web-scraping.dev/product/1?variant=cherry-medium"
            },
            {
                "color": "cherry",
                "offers": [
                    {
                        "availability": "available",
                        "price": {
                            "amount": null,
                            "currency": null,
                            "raw": null
                        }
                    }
                ],
                "sku": null,
                "url": "https://web-scraping.dev/product/1?variant=cherry-large"
            }
        ]
    },
    "content_type": "application/json"
}

Step 3: Scaling Up

Finally, scaling web scraping is a major topic and a challenge of its own. We can break down scaling challenges into two parts:

  • Technical scaling challenges like concurrency, scheduling, deployment and resource use.
  • Scraper blocking challenges like proxies, user agents, rate limiting and fingerprint resistance.

Scaling and Solving Challenges

There are several important problems when it comes to scaling web scraping in Python, so let's take a look at the most common challenges and their solutions.

Speed of Scraping

The biggest bottleneck of Python web scrapers is IO wait. Every time a request is sent, it has to travel around the world and back, which is a physically slow process and there's nothing we can do about it. That's an issue for a single request, but with many requests we can actually do a lot!

Once we have multiple scrape requests planned, we can speed up our scraping significantly by sending the requests concurrently. This is called concurrent scraping and can be achieved in Python with libraries like asyncio and httpx, which support concurrent requests. Here's an httpx example:

import asyncio
import httpx
import time


async def run():
    urls = [
        "https://web-scraping.dev/product/1",
        "https://web-scraping.dev/product/2",
        "https://web-scraping.dev/product/3",
        "https://web-scraping.dev/product/4",
        "https://web-scraping.dev/product/5",
        "https://web-scraping.dev/product/6",
        "https://web-scraping.dev/product/7",
        "https://web-scraping.dev/product/8",
        "https://web-scraping.dev/product/9",
        "https://web-scraping.dev/product/10",
    ]
    async with httpx.AsyncClient() as client:
        # run them concurrently
        _start = time.time()
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        print(f"Scraped 10 pages concurrently in {time.time() - _start:.2f} seconds")
        # will print
        "Scraped 10 pages concurrently in 1.41 seconds"

    with httpx.Client() as client:
        # run them sequentially
        _start = time.time()
        for url in urls:
            response = client.get(url)
        print(f"Scraped 10 pages sequentially in {time.time() - _start:.2f} seconds")
        "Scraped 10 pages sequentially in 10.41 seconds"


asyncio.run(run())

Just by running 10 requests concurrently, we've reduced the scraping time from 10 seconds to 1.4 seconds, which is an incredible improvement! This can scale even higher depending on scraper limits in a single process.

For a detailed dive into web scraping speed, see our full guide below:

Retrying and Failure Handling

Another important part of scaling web scraping is handling failures and retries. Web scraping is a fragile process and many things can go wrong, like:

  • Network errors
  • Server errors
  • Page changes
  • Rate limiting
  • Captchas and scraper blocking

All of these issues are difficult to handle, but they can also be temporary and transient. For this reason, it's good practice to equip every Python scraper with retrying and failure-handling logic as a bare minimum to ensure it is robust and reliable.

Here's an example of how to retry httpx requests in Python using the tenacity general-purpose retrying library:

import httpx
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception, retry_if_result

# Retry policy when no exception is raised but response object indicates failure
def is_response_unsuccessful(response: httpx.Response) -> bool:
    return response.status_code != 200

# Retry policy when an exception is raised
def should_retry(exception: Exception) -> bool:
    return isinstance(exception, (httpx.RequestError, httpx.TimeoutException))

# apply tenacity retry decorator to scrape function
@retry(
    stop=stop_after_attempt(5),  # Retry up to 5 times
    wait=wait_fixed(2),  # Wait 2 seconds between retries
    retry=retry_if_exception(should_retry) | retry_if_result(is_response_unsuccessful),
)
def fetch_url(url: str, **kwargs) -> httpx.Response:
    with httpx.Client() as client:
        response = client.get(url, **kwargs)
        # Return the response as-is so that non-200 responses trigger a
        # retry through the retry_if_result policy above
        return response

# Example usage
try:
    url = "https://web-scraping.dev/product/1"
    response = fetch_url(url)
    print(f"Success! Status code: {response.status_code}")
    print(response.text[:200])  # Print first 200 characters of the response
except Exception as e:
    print(f"Failed to fetch the URL. Error: {e}")

Tenacity provides a lot of options for handling any web scraping target, though not all issues can be solved with retries. Scraper blocking is by far the biggest challenge facing the industry, so let's take a look at that next.

Scraper Blocking

While web scraping publicly available data is generally legal in most jurisdictions, it's not always welcomed by the websites being scraped. Many websites want to protect their public data from competitors and employ bot detection mechanisms to block scrapers.

There are many ways to identify scrapers, but they can all be summarized by the fact that scrapers appear different from regular users. Here's a list of key attributes:

  • Request IP address is being tracked or analyzed
  • Requests are performed over HTTP/1.1 rather than HTTP/2+, which is used by modern browsers. To address this, use http2=True in the httpx client.
  • Request headers are different from those of a regular browser. To address this, use the headers parameter in the httpx client to set what you see in your web browser.
  • Scraper's TLS fingerprint is different from that of a regular browser.
  • Scraper doesn't execute JavaScript.
  • Scraper's JavaScript fingerprint gives away that it's a scraper.
  • Scraper doesn't store cookies or local storage data.
  • Scraper doesn't execute background requests (XHR).

Basically, any lack of normal user technology or behavior can be used to identify a scraper and we've covered all of these in our full guide below:

With that being said there are a few open source options available that can give you a head start against scraper blocking:

  • Curl Impersonate - enhances libcurl which powers many HTTP clients to appear more like a web browser.
  • Undetected Chromedriver - enhances Selenium to appear more like a user web browser.
  • Rotate IP proxies - distribute traffic through multiple IP addresses to avoid IP blocking.

These open source solutions can get you quite far!

Scraper blocking is an incredibly diverse subject with new tools being developed every day, which can be difficult to keep track of. That's where Scrapfly can help you out.

Power Up with Scrapfly

Existing Scrapers

There are a lot of great open source tools for popular web scraping targets and often there's no reason to reinvent the wheel.

Here's a collection of web scraping guides and scrapers on GitHub for the most popular scraping targets covered by Scrapfly:

Deploying Python Scrapers

To deploy Python scrapers, Docker is the most popular option as it allows packing the entire scraper environment into a single container and running it anywhere.

For this, let's use Python's poetry package manager to create a simple web scraping project:

  1. Install Poetry using the official install instructions
  2. Create your Python project using poetry init
  3. Create a scraper.py file with your scraping logic
  4. Package everything with Dockerfile and docker build command
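For reference, the pyproject.toml created in step 2 might look something like this (the package names and versions here are illustrative):

```toml
[tool.poetry]
name = "my-scraper"
version = "0.1.0"
description = "Example web scraping project"
authors = ["you <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.12"
httpx = "^0.27"
parsel = "^1.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```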

Here's a simple Dockerfile for a modern Python web scraper using Python 3.12 and Poetry as the package manager:

# Use the official lightweight Python image
FROM python:3.12-slim AS base

# Set environment variables for Poetry
ENV POETRY_VERSION=1.7.0 \
    POETRY_VIRTUALENVS_IN_PROJECT=true \
    POETRY_NO_INTERACTION=1

# Set working directory
WORKDIR /app

# Install Poetry
RUN apt-get update && apt-get install -y curl && \
    curl -sSL https://install.python-poetry.org | python3 - && \
    ln -s /root/.local/bin/poetry /usr/local/bin/poetry && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Copy only Poetry files to leverage caching
COPY pyproject.toml poetry.lock ./
# Copy your project files
COPY scraper.py .
# Install only the main dependency group (skip dev dependencies and skip
# installing the project itself since scraper.py is a plain script)
RUN poetry install --only main --no-root

# Specify the command to run your Python script
CMD ["poetry", "run", "python", "scraper.py"]

This Dockerfile creates a lightweight Python image with your scraper and all dependencies installed. You can deploy it to your favorite platform capable of running Docker containers, like AWS, GCP, Azure or Heroku, and run your scraper at any point!

Scraping Frameworks

Lastly, there are a few popular web scraping frameworks that can help you with building and scaling your web scraping operations. By far the most popular is Scrapy, a feature-rich and powerful web scraping framework for Python.

Scrapy bundles a lot of the technologies we've covered in this guide into a single cohesive framework. Here's a quick breakdown:

Pros:

  • Powerful HTTP client capable of concurrent requests.
  • Crawling rules and crawling path generators for crawling websites.
  • XPath and CSS selectors for parsing HTML.
  • Auto retries and failure handling.

Cons:

  • High learning curve and complex architecture.
  • Does NOT support headless browser scraping.
  • No AI or LLM support.

So, Scrapy offers quite a bit out of the box and can be a great starting point for people who are familiar with web scraping but need a more robust solution. For more on Scrapy see our full introduction guide below:

Summary

In this extensive guide we took a deep look at contemporary web scraping techniques and tools in Python.

We've covered a lot so let's summarize the key points:

  • Scraping can be done using HTTP client or headless web browser.
  • HTTP clients are faster but can't scrape dynamic pages.
  • Headless browsers can scrape dynamic pages but are slower and harder to scale.
  • Headless browsers can capture screenshots, browser data and network requests.
  • HTML Parsing can be done using CSS selectors, XPath, BeautifulSoup or AI models.
  • Datasets can be hidden in HTML elements.
  • JSON Parsing can be done using JMESPath or JSONPath.
  • LLMs can be used to extract data from HTML and JSON.
  • Concurrency through Python asyncio can speed up scraping dozens to hundreds of times.