Web data collection: busting five myths that everyone should know about!

When analysing data collection, it’s important to understand what is being collected and how it is being processed. Ron Kol, CTO at Bright Data, speaks to Intelligent SME.tech about public web scraping and debunks five myths surrounding it.

Waiting for data is boring. Businesses of all kinds, especially consumer-facing brands that have to move quickly within their market and react to customer changes, are tired of waiting around for it. This is why public web data collection – perhaps better known as public web scraping – has become a ‘go to’ strategic move. It helps organisations remain competitive in a market that’s predominantly volatile, where one market action triggers another and another and so on. Real-time data can answer many of the questions that top-level executives have about their future profits. However, this vast and expansive field carries more than a few myths around how it works, why it’s important and whom it benefits.

Before we continue, it is important to note and keep in mind that web scraping is an essential real-time resource that contributes to the success of an organisation. Some people still think the industry has ambiguous borders, but given how rapidly it is developing and expanding, it is important to dispel some of the myths that have been connected to it recently.

Myth #1: Harvesting, collecting, scraping, it’s all illegal… wrong!

Let’s put this one to bed: public web scraping is not illegal, full stop. Collecting data from a website is within the bounds established by the law as long as that data is freely available and not behind a paywall or log-in portal. In fact, a recent judgement from a US Federal Court in the hiQ/LinkedIn case likens instances of public online scraping to window shopping.

Additionally, start-ups, SMEs and large corporations all take part in public online data collection to monitor the strategic choices and market trends of their rivals, as well as to conduct fresh market research on their own data. The main goal is to find new avenues for innovation and growth while making sure that the organisation doesn’t pass up any opportunities that could drive further success.

As with all processes, it is vital that businesses follow compliance regulations. If their public web scraping is outsourced, they must always work with their data collection provider to ensure that all operations are legal and ethical, and to understand exactly what can and can’t be collected.

Because there are few strict regulations in this area, anyone who collects data has a moral obligation to ensure that their collection is ethical and promotes the greater good. If it does not, they need to re-evaluate their plans; to do otherwise would be immoral and potentially illegal.

Myth #2: Web scraping hurts businesses and makes it more difficult for them to compete.

Another myth busted! This is not true; in fact, quite the opposite is the case. Public web data collection, or web scraping, provides anyone with the transparency needed when accessing the Internet. It allows all players in the market to compete openly by providing accurate market research information. For example, if Company A wishes to set its own pricing strategy in motion, it obviously needs to be aware of the special offers or pricing of its main competitors; let’s call one of them Company B.

In the old days, Company A would send out ‘mystery shoppers’ who would manually take note of Company B’s offerings and pricing and adjust their own accordingly to make them more attractive to consumers. Today, our shopping ecosystem has clearly gone digital and these ‘mystery shoppers’ have simply shifted into online data collection, which provides companies with the information they need to decide their pricing strategy or special offers. Online data collection ensures that companies can effectively compete and continue to attract their target consumer base.
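The digital ‘mystery shopper’ comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the competitor prices are taken to have been collected already from public pages, and the function name `suggest_prices`, the `undercut_by` margin and the `floor_ratio` limit are all hypothetical choices, not a description of how any particular company sets its prices.

```python
from dataclasses import dataclass


@dataclass
class PriceSuggestion:
    product: str
    our_price: float
    competitor_price: float
    suggested_price: float


def suggest_prices(our_prices, competitor_prices, undercut_by=0.01, floor_ratio=0.9):
    """Suggest new prices for products where a competitor is cheaper.

    undercut_by and floor_ratio are hypothetical parameters: the first is how far
    below the competitor we aim, the second stops us cutting our own price by
    more than 10% in one step.
    """
    suggestions = []
    for product, ours in our_prices.items():
        theirs = competitor_prices.get(product)
        # Skip products the competitor does not list or does not undercut.
        if theirs is None or theirs >= ours:
            continue
        target = round(theirs * (1 - undercut_by), 2)
        floor = round(ours * floor_ratio, 2)
        suggestions.append(PriceSuggestion(product, ours, theirs, max(target, floor)))
    return suggestions


# Hypothetical example data for Company A and Company B.
company_a = {"kettle": 30.00, "toaster": 25.00, "mug": 5.00}
company_b = {"kettle": 28.00, "toaster": 26.00}

for s in suggest_prices(company_a, company_b):
    print(f"{s.product}: {s.our_price} -> {s.suggested_price} (competitor: {s.competitor_price})")
```

Only the kettle is flagged here, because it is the only product that Company B lists at a lower price than Company A.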

The ability to compete openly benefits businesses, providing better price offerings, new products and an improved shopping experience that benefits consumer communities. Online data collection encourages information transparency and advances an openly competitive economy, and you can’t really argue otherwise.

Myth #3: Question marks surround web data collection’s ethical nature

Let’s start with the fact that all public domain data can be openly accessed; that’s a given. However, questions around ethics kick in when you select your web data collection provider. They have to be committed to accessing public web data only. The public web data discussed here must be treated with the utmost sensitivity, integrity and professionalism. If it is done right, which means following international regulations and clear, well-established ethical guidelines that preserve users’ data privacy, then your data collection is both legal and ethical.

Simply put, public web scraping gives you the same view of the Internet that a typical user has. To ensure that your data collection is carried out ethically, there are clear hazards you must avoid and important standards you must meet. These standards are an absolute requirement that all operators must follow, without exception. They are neither optional nor a ‘good to have’ addition to your company policy.

Myth #4: Information sources are mostly private

Totally false: the majority of web-based data is public. Internet growth statistics from Statista show that 4.66 billion people are using the Internet (as of January 2021). That’s close to 60% of the world’s population. Considering that most of the world’s data has been generated within the last two years alone, it is estimated that close to 70% of the data being generated is public (of which humans are responsible for close to 60%). Although these statistics only give us a rough indication, the trend is clear to see.

When it comes to web scraping, providers can only gather information that is open to the public. To simplify this further, that means anything that you or I could access using a standard browser on the Internet without logging in. If you have to log in, the data is off limits. Simple.
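The ‘no log-in’ rule of thumb above can be expressed as a simple check. The sketch below is a rough heuristic under stated assumptions, not a legal test: the function name `looks_public` and the list of log-in path segments are hypothetical, and it assumes you already know the HTTP status code and the final URL that a standard, logged-out browser request ended up on.

```python
from urllib.parse import urlparse

# Hypothetical path segments that suggest the request was sent to a log-in page.
LOGIN_SEGMENTS = {"login", "log-in", "signin", "sign-in"}


def looks_public(status_code, final_url):
    """Return True if a logged-out request appears to have reached a public page.

    status_code: HTTP status of the response (401 and 403 indicate restricted access).
    final_url:   URL after any redirects, to catch redirects to a log-in page.
    """
    # Anything other than a plain success is treated as not public.
    if status_code != 200:
        return False
    segments = {part for part in urlparse(final_url).path.lower().split("/") if part}
    return segments.isdisjoint(LOGIN_SEGMENTS)


print(looks_public(200, "https://example.com/products/kettle"))
print(looks_public(200, "https://example.com/login"))
print(looks_public(403, "https://example.com/members/report"))
```

A check like this errs on the side of caution: any response that is not a plain success, or that lands on a log-in page, is treated as off limits.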

Myth #5: Online data collection is only carried out by “dodgy” businesses

Wrong! Companies of all sizes, from Fortune 500 firms to start-ups and SMEs, gather and utilise public web data to inform their decision-making. The only difference is in the type of data they require and how frequently they need it. In today’s real-time economy, companies can’t thrive without being able to see the full market reality, and to do that they need access to the largest data source. When our reality is mostly led by digital innovation, it is no surprise that public web data has become the ‘no-brainer’ solution.

As I am the CTO of a market leader in the data collection domain, you might think it is a given that I am fighting this corner. However, for this industry to succeed, we must be our own harshest critics and ensure that we and others looking to collect data aren’t tempted to engage in illegal or unethical activities in the absence of strict regulations.

With any emerging technology, especially within the data space, there is always going to be scrutiny of its purpose and legality. However, public web data collection serves the greater good by allowing businesses to prosper from the latest publicly available online insights. When analysing data collection, it’s important to understand what is being collected and how it is being processed.

With so many leading brands dependent on data insights, this will become a fast-growth industry, and it’s up to everyone in this community to promote legal and ethical compliance. If anything, it’s our moral duty to do so.
