Scraping
Web scraping is the process of automatically extracting data from websites efficiently and reliably.
What is web scraping?
Scraping (or web scraping) allows you to automatically extract large amounts of data from websites systematically and efficiently, without having to manually navigate to (and copy and paste) bits of information from webpages. It helps speed up (by fully or partially automating) web-based research tasks, as well as eliminating transcription errors that can occur when copying information from one location to another by hand.
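To make the idea concrete, here is a minimal sketch (in Python, using only the standard library) of what "automatically extracting" data means: parsing structured values out of a page's HTML instead of copying them by hand. The HTML snippet, class names, and product data below are invented for illustration; in practice you'd fetch a real page over HTTP first.

```python
# A toy page, as a scraper might receive it after fetching a URL.
# The markup and class names here are made-up examples.
from html.parser import HTMLParser

PAGE = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">19.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""

class ProductParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> / <span class="price">."""
    def __init__(self):
        super().__init__()
        self.current = None   # which field we're currently inside, if any
        self.products = []    # completed (name, price) rows
        self.row = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.row[self.current] = data.strip()
            self.current = None
            if "name" in self.row and "price" in self.row:
                self.products.append((self.row["name"], float(self.row["price"])))
                self.row = {}

parser = ProductParser()
parser.feed(PAGE)
print(parser.products)  # [('Widget', 19.99), ('Gadget', 24.5)]
```

A script like this can process thousands of pages in the time it takes to copy one by hand, and it never mistypes a value, which is where the speed and accuracy benefits described above come from.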
Uses of web scraping
There are lots of personal and professional reasons you might want to scrape the web. For example:
- Price monitoring: automatically gather and track the prices and availability of products across different providers
- Market mapping: map out the important players in a space to build a market overview
- Job listings: track the job openings advertised by competitors, customers or suppliers to receive advance notice of changes in their strategies (or track new openings advertised by companies you'd potentially like to work at in the future!)
How to scrape the web
Using HASH, it's easy to scrape the web for the information you're looking for, on both a one-time and an automated, recurring basis.
- Goal-driven scraping: provide AI workers with a research goal, and they'll determine what websites to visit on your behalf.
- Manual scraping: with the HASH browser extension installed, you can scrape entities from any webpage you're on with a single click.
- Passive scraping: rather than manually clicking "analyze" in the browser extension when visiting a page, you can also toggle HASH's "auto-analyze" mode on. When you do, entities will automatically be extracted as structured data from the webpages you visit, and added to your HASH web. You can confine this activity to certain domains (protecting your privacy), and focus on scraping only certain types of entities (if you wish).
To ensure your web scraping yields useful information, make sure you're outputting:
- Structured, typed data: ensuring scraped entities conform to common schemas, known as "types", guarantees consistency and helps scrapers capture all available information about an entity more completely. This makes data more useful and reliable.
- Validated, clean, labeled data: as a real-world example, you want to ensure that "Celsius" and "Fahrenheit" values aren't confusingly mixed up when comparing "Temperature" data from across different websites. Many scraping tools fail to account for data types in the same way HASH does, and this can lead to inaccurate, unreliable data.
- Provenance data: alongside the information you scrape, you want to know where it came from (tracking its source and origins). Ideally, you also want the ability to inspect the original source (e.g. web page) as it existed at the specific point in time the information was scraped, to guarantee the data was faithfully and accurately captured in the first place.
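The three properties above can be sketched together in a few lines: a typed "Temperature" entity whose values are validated and normalized to a single canonical unit, carrying provenance alongside the value. The field names and the normalization rule are illustrative assumptions, not any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Temperature:
    value_celsius: float   # typed, always stored in one canonical unit
    source_url: str        # provenance: where the value came from
    scraped_at: datetime   # provenance: when it was scraped

def make_temperature(raw: str, unit: str, source_url: str) -> Temperature:
    """Validate and normalize a raw scraped value before storing it."""
    value = float(raw)  # raises ValueError on malformed input
    if unit == "F":
        value = (value - 32) * 5 / 9   # normalize Fahrenheit to Celsius
    elif unit != "C":
        raise ValueError(f"unknown unit: {unit}")
    return Temperature(round(value, 2), source_url, datetime.now(timezone.utc))

t = make_temperature("212", "F", "https://example.com/weather")
print(t.value_celsius)  # 100.0
```

Because the unit is resolved at ingestion time, a "212 °F" reading and a "100 °C" reading from different sites compare as equal, and the `source_url` and `scraped_at` fields let you trace any value back to its origin.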
Where appropriate, also consider:
- Regular scraping: if information is time-sensitive and changes over time, set up recurring scraping jobs to check for updates and keep the information in your datastore up-to-date.
- Cross-referencing: some information can be obtained directly from its source, while other data may only be accessible through third parties. Consider scraping data from multiple locations as a "check" on its accuracy.
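Cross-referencing can be as simple as comparing the same attribute scraped from several sources and flagging disagreements for review. A sketch, where the tolerance and source names are illustrative assumptions:

```python
def cross_reference(readings: dict[str, float], tolerance: float = 0.01):
    """Return (consensus, outliers): sources within `tolerance` (as a
    fraction of the median) agree; the rest are flagged for manual review."""
    values = sorted(readings.values())
    median = values[len(values) // 2]
    outliers = {
        source: value
        for source, value in readings.items()
        if abs(value - median) > tolerance * abs(median)
    }
    return median, outliers

median, flagged = cross_reference({
    "site-a.example": 19.99,
    "site-b.example": 19.99,
    "site-c.example": 24.50,  # disagrees with the other two: worth re-checking
})
print(median, flagged)  # 19.99 {'site-c.example': 24.5}
```

A disagreement doesn't tell you which source is wrong, only that at least one is stale or inaccurate, which is exactly the signal a recurring scraping job can act on.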
All information scraped by HASH is structured, typed, validated, labeled and cleaned. Provenance information is provided for every attribute of every entity scraped, so there's never any question where information came from or whether you can rely upon it.
Risks
While web scraping isn't inherently risky, there are a few things to consider.
- Legal risks: we're not lawyers and can't provide legal advice (nor can any AI!), so it's important that you familiarize yourself with the laws and regulations around web scraping that might apply to your use case. While scraping a website, you may be required to comply with its terms of service or other policies, and there may be restrictions on what you can do (at least commercially) with publicly available information you access.
- Ethical risks: even if something is legal, there's also an ethical component to consider. Weigh any relevant implications yourself and decide whether you're comfortable with the risk.
- Reputational risks: ask yourself... if what you were doing was front-page news, and your family or friends knew about it, how would you feel?
- Access risks: if you break a website's terms of use, its provider may seek to cut off your ability to access their service. Before scraping a website, ask yourself: in a worst-case scenario, if you were suddenly unable to access a given website, how would that impact you or your business? Certain sites, such as LinkedIn, may take measures to temporarily disable (and in extreme cases permanently ban) accounts that view too many profiles in too short a span of time, in a way that appears automated or unnatural.
HASH gives you the choice whether to scrape webpages from your computer (using your normal IP address), or using our network of computers in the cloud. There can be advantages and disadvantages to each approach, and you can experiment on a per-site basis, or ask HASH to automatically optimize your scraping for you (splitting scraping up between both, or defaulting to one and falling back to the other).
Legally scraping the web
Is web scraping legal? Generally speaking, if something is publicly accessible via the internet, you're allowed to view and use it. However, there are exceptions.
The US Court of Appeals has determined that scraping of public information is legal, even if a website's terms of service may prevent it. However, just because something may be accessed and scraped, doesn't mean you have an unlimited commercial right to do whatever you want with it. And in certain legal jurisdictions, you may require a legitimate reason to hold certain kinds of information in the first place - e.g. personally identifiable information (PII).
Dutch data regulators have declared that if the personal information of EU citizens is found amongst the data you're scraping, you may require their consent. They give examples of certain kinds of scraping which are "always prohibited" under the European General Data Protection Regulation (GDPR), including:
- scraping the internet to create profiles of people and then resell them;
- scraping information from protected social media accounts or private forums;
- scraping data from public social media profiles, with the aim of determining whether or not people qualify for a certain kind of insurance.
This guidance (and GDPR rules more broadly) does not apply to "domestic use", or in certain other "exceptional" cases. For example:
- a private individual can scrape information (including PII) for a personal purpose (e.g. a hobby project), provided they subsequently limit the collected information's distribution (e.g. to a few friends);
- a company can scrape personally identifiable information provided it has a "legitimate interest" and the information scraped is used in a very targeted manner (e.g. an organization may scrape the websites of news media to gain insight into relevant news about its own company). Under most EU countries' interpretations of GDPR, using PII to make money is not considered a "legitimate interest", while protecting against loss (via reputation monitoring or fraud prevention) generally is.
Obtaining consent to scrape PII. Under GDPR, an individual may only consent to the collection of their personal data if explicitly asked in advance.
Scraping non-PII. Many kinds of information on the web do not contain PII: for example, product listings and their prices on a marketplace like Amazon. In other cases, scraped information may be anonymized (e.g. customer names removed from product reviews that appear online).
Legally preventing scraping
As a user, you can attempt to assert your individual right to privacy, and any intellectual property rights you may hold (e.g. image rights, copyright). However, not all information you post online will necessarily be covered by these, and in any case it may be technically impossible to prevent others from scraping your publicly accessible information once it appears online. You can, however, try to control what information appears online in the first place.
Meanwhile, as a website operator, you may expect that you have the right to prevent others from scraping your website. This may often be true, but depending on the jurisdiction you're in, it may not be universally the case. For example, in the US, depending on the circumstances, it may constitute "malicious interference with a contract" if you prevent one party from scraping your website while at the same time not restricting the general availability of the information on it to others (see: HiQ v. LinkedIn). Current best practices website operators can adopt to guard against scraping therefore include:
- ensuring that information on your website which you wish to protect from scraping is only visible to authenticated users with their own accounts;
- requiring that new users agree to your website's terms of service upon sign-up;
- stipulating in your terms of service that each individual may hold only one account, and forbidding the use of automated scraping agents and software.
However, making your website's content inaccessible to unauthenticated users may also mean pages don't appear in search results, limiting your website's reach and deterring many genuine visitors from accessing your content in the first place. You should determine whether or not this trade-off makes sense for you.
The information on this page is provided for informational purposes only. HASH is not a law firm, and we cannot provide you with professional legal advice. Before making any decision related to any matter discussed by this page, you should obtain your own independent professional advice.