2025-11-05 16:00:20

In September 2025, podcaster Pablo Torre published an investigation alleging that the NBA’s Los Angeles Clippers may have used a side deal to skirt the league’s strict salary cap rules. His reporting, aired on multiple episodes of Pablo Torre Finds Out, focused on star forward Kawhi Leonard.
Leonard, one of the NBA’s most sought-after free agents, signed a four-year, US$176 million contract renewal with the Clippers during the 2021-22 off-season — the maximum allowed under league rules at the time. But Torre reported that in early 2022, Leonard’s LLC, KL2 Aspire, signed a cash and equity deal amounting to roughly $50 million through a brand sponsorship with Aspiration, a now-bankrupt financial technology startup that marketed itself as a climate-friendly bank.
Torre highlighted how the sponsorship coincided with major investments in Aspiration by Clippers owner Steve Ballmer and another team investor. The arrangement, Torre suggested, looked less like a conventional endorsement deal and more like a “no-show” side payment that could have helped the Clippers keep their star without technically violating the salary cap.
Leonard has denied that the partnership was improper, insisting he fulfilled his contractual obligations. The Clippers and Ballmer have also rejected claims of wrongdoing.
Torre’s reporting nevertheless had an immediate impact. Major outlets picked up the story, Aspiration’s bankruptcy filings drew renewed scrutiny, and the NBA announced it was investigating the matter.
In the wake of Pablo Torre’s revelations, many legacy media outlets highlighted his reporting.
At the University of Florida’s College of Journalism & Communications, part of my research involves unpacking the importance of decentralized networks of local outlets that cover stories from underrepresented areas of the country.
I see Torre’s work as a clear example of the growing need for this kind of bottom-up, citizen journalism — particularly given media industry trends.
Watchdog journalism is supposed to hold power to account.
This is sometimes referred to as the “fourth estate.” A term that dates back to the 17th century, it reflects the idea that an independent press is supposed to act as a fourth pillar of power, alongside the three traditional branches of modern democracies — legislative, executive, and judicial.
Proudly independent of political and financial influence, fourth estate news media has traditionally demonstrated a public service commitment to exposing corruption, encouraging debate, highlighting important issues, and forcing leaders to address them.
The need for watchdog journalism appears more urgent than ever.
In the Western world, with authoritarianism on the rise, the fourth estate is experiencing widespread threats. Reporters Without Borders’ latest World Press Freedom Index found that global press freedom reached an all-time low in 2025. For the first time, it classified the situation as “difficult.”
Meanwhile, market forces and profit motives have weakened the media’s role in upholding democratic checks and balances. Fierce competition for clicks, eyeballs, and ad revenue impacts the type of content and stories that commercial outlets tend to focus on.
There appears to be less and less financial incentive to put in the time, resources, and effort required for deep investigative reporting. For commercial outfits, the return rarely justifies the investment.
In the US, the Trump administration and media consolidations have further weakened the press’s ability to serve as a check on those in power.
Over the past year, two major TV networks — ABC and CBS — settled separate lawsuits brought by President Donald Trump over editorial choices in their broadcast programming. Both settlements set precedents that could prove consequential for journalistic integrity and independence.

Image: Screenshot, CNN
In July 2025, the GOP-led Congress stripped over US$1 billion from the Corporation for Public Broadcasting, dealing a blow to the public nonprofit outlets NPR, PBS, and their local affiliates.
More recently, Washington Post columnist Karen Attiah lost her job after speaking out against gun violence on social media in the wake of Charlie Kirk’s assassination.
From a structural standpoint, the US media ownership landscape has, for decades, been plagued by consolidation. Media channels have become merely one slice of the massive asset portfolios of the conglomerates that control them.
It’s probably fair to say that producing costly and burdensome watchdog journalism isn’t exactly a priority for busy executives at the top of these holding companies.
Independent local outlets are a dying breed, too.
Studies have shown that news deserts — areas with little or no local coverage — are multiplying across the US.
This has dire consequences for democratic governance: News deserts often correlate with lower civic engagement, reduced voter turnout, and less accountability for business and political leaders.
What’s more, fewer local journalists means less scrutiny of local governments, which undermines transparency and enables corruption.

Half of all counties in the US only have one newspaper — hundreds have none at all. Image: UNC Hussman School of Journalism and Media
For these reasons, more readers seem to be getting their news from social media and podcasts. In fact, according to a recent Pew Research Center report, roughly one in five Americans now gets news on TikTok. And in its 2025 Digital News Report, the Reuters Institute noted that “engagement with traditional media sources such as TV, print, and news websites continues to fall, while dependence on social media, video platforms, and online aggregators grows.”
With this in mind, the US government’s latest framework for a deal for TikTok’s parent company, ByteDance, to sell the social media platform’s stateside operations to a consortium of US-based investors takes on even more significance. Many of these investors are allies of Trump. They’ll get to control the algorithm — meaning they’ll be able to influence the content that users see.
At the same time, social media has also allowed independent journalists such as Torre to find an audience.
Granted, with past journalistic stints at both Sports Illustrated and ESPN, Torre is not exactly a pure outsider. Yet he’s far from a household name, with fewer than 200,000 podcast subscribers.
Luckily, he’s by no means the only independent journalist serving as a citizen watchdog.
In January 2025, freelance journalist Liz Pelly published her book “Mood Machine,” which details her investigation into Spotify’s dubious financial practices. Through her research and reporting, she alleges that the music technology company conspired to suppress legitimate royalty payments to artists.
Andrew Callaghan, known for Channel 5 News on YouTube, runs one of the largest crowdfunded independent newsrooms in the world. His exclusive interview with Hunter Biden in July 2025 gave him a level of access that established mainstream outlets couldn’t get.
In 2020, Canadian siblings Sukh Singh and Harleen Kaur founded Ground News, an online platform providing news aggregation, curation, and rigorous fact-checking. AllSides and Straight Arrow News are similar bottom-up projects designed to expose media bias and fight misinformation.
Meanwhile, the nonprofit media outlet ProPublica has published award-winning investigative journalism through a distributed network of local reporters. Their Life of the Mother series, which explored the deaths of mothers after abortion bans, earned them multiple awards while prompting policy changes at federal and state levels.
All have surfaced meaningful stories worth bringing to light. Historically, these types of stories were the purview of newspapers of record.
Today, underground sleuths might be among the last bulwarks against abuses of power.
The work isn’t easy. It certainly doesn’t pay well. But I think it’s important, and someone has to do it.
Editor’s Note: This article is republished from The Conversation under a Creative Commons license. Read the original article.
Alex Volonte is an interdisciplinary industry professional with broad experience in multimedia entertainment, having initially worked as a content producer for broadcasters and digital outlets across Central Europe. He is binational (Swiss/Italian) and fluent in all major Western languages. He holds an MSc in Media & Communications from the London School of Economics. He is currently pursuing a PhD in Journalism at the University of Florida, while serving as a graduate teaching and research assistant at the College of Journalism & Communications.
2025-11-04 16:00:14

From investigating a money scam sent over WhatsApp to exposing lucrative phishing accounts, alumni of GIJN’s four Digital Threats courses have produced a number of exposés of online scams and disinformation, from India to Kenya to the Philippines.
Presented by several cyber experts in a dozen remote, hands-on sessions, these courses — launched in 2023 — have trained 107 investigative journalists and researchers across four cohorts. A recent GIJN survey of 31 alumni from those cohorts found that individual alumni and their newsroom teams have subsequently published increasingly ambitious and sophisticated exposés into deeply veiled online deceptions. Follow-up interviews also revealed how the reporters behind these investigations leveraged the Digital Threats course’s new tools and techniques and bolstered their confidence to pursue seemingly impenetrable frauds. Just as follow-the-money reporters have gained much greater capacity in recent years to uncover shell company networks and hidden assets, cyber journalists are now increasingly empowered to expose digital camouflage and online misconduct.
Here are three detailed examples of recent, impactful investigations that were sharpened by GIJN’s Digital Threats training.
Earlier in 2025, a BoomLive story by Hera Rizwan, entitled It Wasn’t Just A WhatsApp Image That Stole Rs 2 Lakhs, revealed an online money scam in India. Published by BoomLive’s Decode section, Rizwan’s story explained to readers how this complicated fraud was triggered by the seemingly harmless step of clicking on an image in a WhatsApp chat, which ultimately allowed bad actors to use steganography and malicious APK binding to gain access to people’s devices — and then drain their bank accounts. Rizwan, an alumna of GIJN’s fourth Digital Threats course, says one of the less appreciated benefits of the training is that reporters learn how to handle malware-type evidence without the risk of being infected themselves, and develop the confidence to dig into all the relevant materials.

Hera Rizwan of BoomLive’s Decode section. Image: Courtesy of Rizwan
“Lessons from trainers like Craig (Silverman), Luis (Assardo), Jane (Lytvynenko), and (Etienne) Maynier — on verifying sources and connecting technical details to real-world impact — guided how I structured the investigation and narrative,” she explains. “The course taught how to translate complex technical concepts into accessible language for readers, a skill I applied when explaining steganography and APK binding in the story.”
Rizwan notes that Indian media coverage of cyber scams generally focuses on just the basic facts released by police, like reporting the incident and the amount of money lost. She says her deeper understanding of the subject matter allowed her to expose “the technical and psychological layers behind the attack — how a harmless-looking image could carry hidden malware, and how scammers exploit trust and familiarity to deceive victims.”
In another innovative digital investigation, Rizwan leveraged her new skills to reveal how one of India’s largest gig economy platforms used AI to alter the photos of service professionals without their consent, and with harmful consequences.
“Beyond data privacy concerns, the story highlights how algorithmic decisions and AI-driven automation can directly impact the livelihoods of vulnerable gig workers,” she explains.
In October 2024, an investigation by Nyekerario Omari, for Kenya’s Piga Firimbi, revealed how a network of Facebook accounts was impersonating airports and airlines around the globe to “sell” unclaimed passenger luggage that does not exist. Her story Fake Bags for Sale exposed 112 phishing accounts, and — similar to Rizwan’s WhatsApp story — also alerted readers about how they can recognize and avoid this type of scam.
An alumna of the third Digital Threats cohort, Omari says her first challenge for the project was to establish relationships between the Facebook accounts without an exhaustive network analysis process. “I relied on the common tactics and techniques used by these accounts,” she explains, “such as language which created urgency about the lost luggage; claims that the listed airports were selling off valuable unclaimed bags at an affordable price; the use of common manipulated/edited images, and fake testimonials across posts.”

Nyekerario Omari is a graduate of GIJN’s third Digital Threats training cohort. Image: Courtesy of Omari
Omari says her Digital Threats training was particularly helpful with her next challenge: to deconstruct the mechanics of how the scam worked.
“Through the GIJN Digital Threats course, [I] was able to highlight how these phishing campaigns work, by digging into Facebook campaigns which rely on clickable impersonated websites with a short shelf life to collect the victim’s personal data,” she says. “The techniques used for this relied on analyzing domain data and Facebook page transparency using tools like WhoIs, DNS Checker, Big Domain Data, and Reverse image searches to establish any manipulation.”
She adds: “The tools and techniques introduced during the second week of training by Craig Silverman were new for me, along with the fundamentals of investigating digital threats.”
She also points out that the training provided her valuable “backup tools” to have on hand for this and other investigations — alternate methodologies that she can try when a familiar tool fails to produce results.
“Most of the tools shared complemented each other,” she notes. “For example, DNSLytics, DNS Checker, and Big Domain data complemented WhoIs [search]. Additionally, InVid WeVerify and Photo Forensics proved valuable when a page’s transparency offered limited information.”
In August of this year, Philippine Center for Investigative Journalism (PCIJ) reporter Regine Cabato uncovered a “cyborg” propaganda operation, in which a mix of humans and AI-driven bots were peddling coordinated misinformation across the Internet in support of that country’s Duterte family dynasty.

PCIJ reporter Regine Cabato. Image: Courtesy of Cabato
An alumna of GIJN’s 2024 Digital Threats training, Cabato says one impact of the story was to help inoculate readers against this wave of disinformation. Its reach was also impressive: a social media video version of the story received 130,000 views across Facebook, Instagram, and TikTok. The exposé also triggered a retaliatory harassment campaign against PCIJ from extreme pro-Duterte supporters, later described in a story by Rappler.
“My co-author Giano Libot and I have kept abreast of partisan influencers, then started comparing notes on the trends we noticed,” says Cabato. “One of the things we noticed was how thousands of suspicious accounts would use “scripts” — recurring arguments with repetitive keywords — in the comment sections of mainstream newsrooms’ posts about Duterte-related news, usually within an hour of posting. I bookmarked one of these posts when I first noticed the trend.”
“For the comment section analysis, Giano used an Apify tool for initial scraping of a sample size of some 2,000 comments. I learned about the tool via the GIJN training,” Cabato says, adding that Who Posted What? was another helpful tool she learned about in the Digital Threats course.
Rowan Philp is GIJN’s global reporter and impact editor. A former chief reporter for South Africa’s Sunday Times, he has reported on news, politics, corruption, and conflict from more than two dozen countries around the world, and has also served as an assignments editor for newsrooms in the UK, US, and Africa.
2025-11-03 19:00:06

GIJN’s member organizations re-elected four current board members whose terms expired in 2025, and also voted in three new board members.
The GIJN community elected four at-large board representatives, including new members Jeff Kelly Lowenstein and Mercedes Bluske Moscoso and returning members Margo Smit and Khadija Sharife. Smit continues as vice chair.
Three regional representatives were also elected. Incumbents Oleg Khomenok (Europe) and Anton Harber (Sub-Saharan Africa) were re-elected for another term, while Yasuomi Sawa (Asia/Pacific) was elected to the board for the first time.
Yasuomi Sawa won the Asia/Pacific board seat through a coin toss, per GIJN by-laws, prevailing in the tiebreaker over Wahyu Dhyatmika.
In all, 18 candidates vied for seven seats on the 14-member board, and 139 votes were tallied. The number of candidates and votes were both the highest in GIJN history. Elections for half the board are held every year. Board members serve for two years.
The top vote-getter overall was Oleg Khomenok. Full voting information is available at GIJN Board Election – 2025.
The 2025 class of GIJN Board members comprises the seven newly elected and re-elected representatives named above.
Voting took place by electronic ballot from October 14 to 30. For election background and rules, see our post on GIJN’s 2025 Board Election.
The new board will officially start after the 14th Global Investigative Journalism Conference in Malaysia, and will have its first board meeting in December 2025.
GIJN welcomes its new board members, and thanks all the candidates for their participation. GIJN also thanks outgoing board members: Syed Nazakat (Asia/Pacific), Nina Selbo Torset (Europe), and Zikri Kamarulzaman (Conference representative) for their significant contributions to the organization and investigative journalism around the world.
2025-11-03 16:00:14

Editor’s Note: Ahead of the 14th Global Investigative Journalism Conference in Malaysia, GIJN is publishing a series of short interviews with a globally representative sample of conference speakers. These are among the more than 300 leading journalists and editors who will be sharing practical investigative tools and insights at the event.
Helena Bengtsson is the data journalism editor at Gota Media and Bonnier News Local in Sweden, and the former data projects editor of the Guardian in the UK. She was a pioneer of computer-assisted reporting in Sweden.
In addition to her investigative contributions to international collaborative projects such as The Panama Papers, Bengtsson has emerged as a leading global advocate for the illuminating power of data journalism, and has guided numerous reporters away from their needless dread of numbers and spreadsheets.
As she told GIJN in a prior interview: “If your editor asks you to do a story about the pension system, you just have to find the information… To be honest, data journalism is a lot easier to understand than the pension system. Every journalist should have basic knowledge of how to sort and filter a spreadsheet and do simple calculations.”
She also trains journalists in building databases with advanced digital tools, and champions collaboration. Twice a recipient of Sweden’s Stora Journalistpriset (Great Journalism Award), Bengtsson has delivered scoops with direct accountability impact, including an exposé about the impunity enjoyed by teachers implicated in sexual harassment, which triggered a new Swedish law to protect children.
In addition to sharing key data techniques in a practical GIJC25 session on Using Data for Local Investigations, Bengtsson will also lead a workshop in Kuala Lumpur called The Coding Mindset, which is designed to teach attendees how to think strategically about programming, and how to turn raw data into compelling stories.
“There are a lot of people today wanting to get into code, and they say, ‘I have to learn Python,’ and I say, ‘But do you know spreadsheets?’— and they often say ‘no,’” she explains. “We have to talk about structured data before the next steps. So that [Coding Mindset] session will be a way to give people the knowledge and the thought process, and also the thesaurus of how you should express yourself in these projects. Because even if you’re going to ask Claude or ChatGPT to program for you, you have to know when to ask for a loop or a variable.”
GIJN: Of all the investigations you or your team have worked on, which has been your favorite, and why?
Helena Bengtsson: It’s almost impossible to choose, as my current project is almost always my favorite. If I have to pick, I’ll mention two. One older project involved convincing the Swedish Statistical Agency to cross-match the database of teachers with another one of people convicted in court. We only received statistical data, of course, but by combining those numbers with case studies from around Sweden, we could show that many teachers who had been convicted of sexual harassment or even abuse were still working in schools. The story led to a change in Swedish law, allowing schools to conduct background checks before hiring.
A more recent project involved processing a huge amount of data from the Swedish Roads and Vehicle Agency, analyzing roads all over Sweden. The project identified almost 16,000 dangerously constructed curves, and we told stories from across the country about hazardous roads and accidents that had occurred there.
GIJN: What are the biggest challenges for investigative reporting in your country?
HB: Some might say that working as a journalist in Sweden is easy — we have one of the best open records laws in the world. However, this also means that many Swedish journalists are not skilled at cultivating sources; it’s not part of our tradition. Open records only get you so far. To do truly great investigations, you need to uncover the hidden information, and that usually comes from sources.
GIJN: What reporting tools, databases, or techniques have you found surprisingly useful in your investigations?
HB: My most important tool is the spreadsheet: I can’t really do anything without Excel or Google Sheets. But something that has surprised me is how helpful it can be to write down your methodology during a project, not just at the end. Describing and thinking through your process as you go allows you to discover gaps in your research and forces you to consider steps and structure. For me, this has saved me from making mistakes and helped me find things I had forgotten or overlooked.
GIJN: What’s the best advice you’ve received from a peer or journalism conference — and what words of advice would you give an aspiring investigative journalist?
HB: Many years ago, I received a grant to do a fellowship abroad. I chose to work with data journalism at the Center for Public Integrity in Washington, DC. I learned so much there, but one thing that stayed with me was the importance of tackling large datasets. They went through millions of records on contributions and lobbying, and I learned not to shy away from huge amounts of data. Nowadays, we have much better tools, of course, but one piece of advice I would give to a young investigative or data journalist is not to be afraid of large volumes of information. There are ways to process and analyze them — and if you keep your focus on finding stories, you’ll be fine.
GIJN: What topic blindspots or undercovered areas do you see in your region? And which of these are ripe for new investigation?
HB: One downside of having great access to public information is that we’re not as skilled at investigating corporations and other entities where there is no public access. I would love to do more stories about how corporations might exploit their employees or the environment. As a data journalist, I also see opportunities to use various AI tools to analyze unstructured data in a more sophisticated way. It won’t be easy, and I’m still struggling to find the best tools for this, but I believe that in a few years we will be telling different kinds of stories than we do today.
GIJN: Can you share a notable mistake you’ve made in an investigation, or a regret, and share what lessons you took away?
HB: It’s always hard to share mistakes, but many years ago we did an investigation into a charity that collected a lot of money. Among other things, we found that the founders had transferred some of the money into accounts in Switzerland. One of the founders lived abroad, and there were no images of him available. Since this was for television, we needed footage. He had a very unusual name, and after a lot of research I found some old video in my network’s own archive. Because the name was so unusual and the person in the old video had the same occupation as our subject, we used that footage. But it wasn’t him — the man in the old video still lived in Sweden, and he was understandably very upset that we had used his image. He received a public apology and some compensation. I was young and new as a journalist when this happened, but I still feel awful when I think about it. This taught me to check, double-check, and then check again — and never, ever assume anything.
GIJN: Can you share what you are looking forward to at GIJC25 in Malaysia, whether in terms of networking or learning about an emerging reporting challenge or approach?
HB: As always, I look forward to being humbled. There are so many amazing journalists from countries all over the world and meeting them always leaves me feeling inspired. The circumstances they work under are very different from the comfortable journalistic life I lead. I have no death threats, my phone is not tapped, I can request a lot of public information, and officials are usually (though not always) available for interviews.
Rowan Philp is GIJN’s global reporter and impact editor. A former chief reporter for South Africa’s Sunday Times, he has reported on news, politics, corruption, and conflict from more than two dozen countries around the world, and has also served as an assignments editor for newsrooms in the UK, US, and Africa.
2025-10-31 15:00:32

Data is a crucial part of investigative journalism: It helps journalists verify hypotheses, reveal hidden insights, follow the money, scale investigations, and add credibility to stories. The Pulitzer Center’s Data and Research team has supported major investigations, including shark meat procurements by the Brazilian government, financial instruments funding environmental violations in the Amazon, and the “black box” algorithm of a popular ride-hailing app in Southeast Asia.
Before embarking on a data-driven investigation, we usually ask three questions to gauge feasibility:
Even if the first two answers are a clear yes, we still can’t celebrate, because the last question is often the most challenging and time-consuming step in the entire process.
Accessing data published online can be as simple as a few clicks to copy-paste tables or download a data file. But we often need to extract large amounts of data from online databases or websites and move it to another platform for analysis. If the site doesn’t offer a download function, we first reach out to the people behind it to request access. In some countries, if the data belongs to the government, you may be able to request access through a Freedom of Information (FOI) or Right to Information (RTI) request. However, this process often comes with its own set of challenges, which we won’t go into here.
If those options are unavailable, we may have to manually extract the data from the website. That could mean repeatedly clicking “next page” and copy-pasting hundreds of times, opening hundreds of URLs to download files, or running many searches with different variables and saving all results. Repetitions can reach the hundreds of thousands, depending on the dataset size and site design. In some cases this is possible if we have the time or budget, but for many journalists and newsrooms, time and money are a luxury.
This is when we consider web scraping, the technique of using a program to automate extraction of specific data from online sources and organize it into a format users choose (e.g., a spreadsheet or CSV). The tool that does this is called a scraper.
There are many off-the-shelf scrapers that require no coding. They may be Chrome extensions such as Instant Data Scraper, Table Capture, Scraper, Web Scraper, and Data Miner. Most use a subscription model, though some offer free access to nonprofits, journalists, and researchers (e.g., Oxylabs).
Most commercial scrapers are built to target popular sites: e-commerce, social media, news portals, search results, and hotel platforms. However, websites are built in many ways, and journalists often face clunky, unfriendly government databases. When commercial tools can’t handle these, we build custom scrapers. Building our own can be cheaper and faster, since we don’t need the extra features offered by commercial scrapers.
Building scrapers used to require solid coding skills. In our recent investigations, however, Large Language Models (LLMs) like ChatGPT, Google Gemini, or Claude helped us build scrapers for complex databases much faster and without advanced coding skills. With basic web knowledge, guidance, and examples, even non-coding journalists can build a scraper with LLMs. Here’s our recipe:
Before asking an LLM to write a script, you need a basic sense of how the target page is built, especially how the data you want to scrape is loaded. This brings us to the difference between a static and a dynamic web page.
A static web page is like a printed newspaper: The content doesn’t change once created unless someone edits the page. What you see is exactly what’s stored on the server. If you refresh, you’ll get the same content. An example is a Wikipedia page with a list of countries.
You can imagine a dynamic web page like a social media feed, where content changes automatically depending on who’s visiting, what they click, or what new data is available. When you load or refresh, your browser typically runs scripts and talks to a database or API (Application Programming Interface) to fetch the content. A dynamic page we scraped for the shark meat procurements investigation was São Paulo’s public procurements database, which requires you to fill in the search box to view data.
Static pages are usually easier to scrape because the data is present in the page’s HTML source, whereas dynamic pages often require extra steps because the data is hidden behind scripts or loads only when you interact with the page. That calls for a more advanced scraper, but don’t worry, we’ll show you how to build it.
However, you can’t always tell by just looking. Some pages that appear static are actually dynamic. Do a quick test: Open the page, right-click, and select “View Page Source” (in Chrome) to see the HTML. Use Find (Ctrl+F/Cmd+F) to search for the data you want to scrape. If the data is in the code, it’s likely static. If you don’t see it, the page is probably dynamic.
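If you prefer to script that check, the short sketch below is our own addition, not part of the original workflow: it fetches the raw HTML with the Requests library and looks for a value you can already see in the browser. The URL and search string are placeholders to replace.

# Quick static-vs-dynamic check: fetch the raw HTML (no JavaScript runs here)
# and see whether a value visible in the browser is present in the source.
import requests

URL = "https://example.com/some-page"        # placeholder: the page you want to scrape
NEEDLE = "a value you can see on the page"   # placeholder: a piece of the target data

html = requests.get(URL, timeout=30).text
if NEEDLE in html:
    print("Found in the HTML source: the page is likely static.")
else:
    print("Not in the HTML source: the data is probably loaded dynamically.")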

Alternatively, you can ask an LLM to check for you. Suggested prompt:
For LLMs without internet browsing, copy and paste the full HTML source code and replace “URL” in the prompt with “HTML source code.” If the source code exceeds the prompt limit, you need to pick the part containing your target data. Step 3 will explain how to do that.
To make this easier, here’s a real-world example from one of our investigations. We asked ChatGPT (GPT-5 model) to look at The Metals Company’s press releases page. We scraped all the press releases to analyze the company’s public messaging on deep-sea mining. ChatGPT identified it as a static web page.

If the web page is static, you can ask an LLM to write a Python scraping script (Python is a common language for scraping). Include the following in your prompt:
Here’s the prompt we gave to ChatGPT:
The press releases are listed across multiple URLs, accessible via the pagination bar at the bottom of the page. Even though we only provided ChatGPT the first-page URL, it inferred the rest by appending “?page=N” to the base URL.
ChatGPT will return a script with a brief explanation. Copy and paste it into a code editor (we use Visual Studio Code), save it as a Python file (.py), and run it in Terminal (macOS) or PowerShell (Windows).
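To give a sense of what such a script looks like, here is a minimal Requests and BeautifulSoup sketch for a paginated list that follows the “?page=N” pattern described above. The base URL, the CSS selector, the page count, and the output columns are illustrative assumptions, not the actual markup of The Metals Company’s site.

# Minimal sketch of a static-page scraper with "?page=N" pagination.
# BASE_URL, MAX_PAGES, and the "article h3 a" selector are placeholders:
# inspect the real page and adjust them before running.
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/press-releases"
MAX_PAGES = 5

rows = []
for page in range(1, MAX_PAGES + 1):
    response = requests.get(f"{BASE_URL}?page={page}", timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("article h3 a"):          # hypothetical selector
        rows.append({"title": link.get_text(strip=True), "url": link.get("href")})
    print(f"Page {page} done, {len(rows)} items collected so far")

with open("press_releases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)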
If you’ve never run a Python script on your computer, you’ll need a quick setup. LLMs can generate step-by-step instructions with the prompt below. Setup usually includes installing Python (and its package manager Pip, or Homebrew on macOS) plus common scraping libraries like Requests, Beautifulsoup4, and optionally Selenium.
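Whichever way you set things up, a quick check (our addition, assuming the usual pip package names) is to import the libraries and print their versions; if this runs without errors, your environment is ready.

# Environment check after installing Python and the scraping libraries
# (typically: pip install requests beautifulsoup4 selenium).
import requests
import bs4
import selenium

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("selenium", selenium.__version__)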
Pro tip: Keep prompts in the same LLM conversation so it retains context. If the script doesn’t work or setup errors appear, paste the full error messages or logs and ask it to troubleshoot. You may need a few iterations to get everything working.
Scraping a dynamic web page requires a few more steps and some extra web development know-how, but LLMs can still do the heavy lifting.
If the page requires you to interact with it to access data, you need to tell the LLM the exact actions to perform and where to find the data you need, because unlike a static page, that information isn’t in the plain HTML source. The LLM will show you how to install Python libraries like Selenium or Playwright, which can open browsers in headed or headless mode (more on that later) and interact with the web page like a human.
We’ll use São Paulo’s procurements database as an example. You need to select or fill in some of the 12 fields in the search box and click the “Buscar” (search) button to view the table.

In this case, a scraper works like a robot: It imitates your actions, waits for the data to load, and then scrapes it. To tell the scraper which fields to fill, options to select, buttons to click, and tables or lists to extract, you need the “names” of those elements on the page. This is where some basic HTML knowledge helps.
HTML is the language that structures a web page. Each piece of content is inside an HTML element. Think of them as boxes that store content. Common elements include <h1> for a header, <a> for a link, <table> for a data table, and <p> for a paragraph. Many elements also have attributes (their “labels”), such as class and id, which help you identify them.
We need to know the elements we must interact with, the ones that hold the data we want, and their attributes (class or id) so we can tell the LLM what to do.
For example, below is the HTML element for the first field, Área, in São Paulo’s search box. It’s a <select> element with the id “content_content_content_Negocio_cboArea” and class “form-control.” You can copy and paste this into your LLM prompt to help it build the scraper.
<select name="ctl00$ctl00$ctl00$content$content$content$Negocio$cboArea"
id="content_content_content_Negocio_cboArea"
class="form-control"
onchange="javascript:CarregaSubareasNegocios();">
<option value=""></option>
<option value="8">Atividade</option>
<option value="6">Imóveis</option>
<option value="3">Materiais e Equipamentos</option>
<option value="2">Obras</option>
<option value="7">Projeto</option>
<option value="4">Recursos Humanos</option>
<option value="1">Serviços Comuns</option>
<option value="5">Serviços de Engenharia</option>
</select>
An HTML element can be nested within another element, creating layers of hierarchical structure. For example, a <table> usually contains multiple <tr> elements, which are the rows of the table, and each <tr> contains multiple <td> elements, which are the cells in that row. In some cases you may need to spell this hierarchy out for the LLM to reduce errors.
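To make that nesting concrete, here is a small BeautifulSoup sketch of our own that walks a table row by row; the table id and the sample rows are invented for illustration.

# Walking a nested HTML table: <table> contains <tr> rows, each <tr> contains <td> cells.
# The id "results-table" and the example content are placeholders.
from bs4 import BeautifulSoup

html = """
<table id="results-table">
  <tr><th>Tender</th><th>Status</th></tr>
  <tr><td>Tender A</td><td>Open</td></tr>
  <tr><td>Tender B</td><td>Closed</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="results-table")
for tr in table.find_all("tr")[1:]:                  # skip the header row
    print([td.get_text(strip=True) for td in tr.find_all("td")])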
To identify an element and its attributes, right-click it and choose Inspect (in Chrome). This opens DevTools, showing the page’s HTML and attributes. The selected element is highlighted; hovering in DevTools highlights the matching content on the page. To copy it, right click the element in the Elements panel and choose Copy → Copy element (gets the full HTML, including nested children). To target it in code, choose Copy → Copy selector (or Copy XPath) to grab a unique selector that helps you locate the element.

If the web page loads slowly, tell the LLM to include waiting time after each action.
To build the scraper for São Paulo’s procurements database, below is our prompt for ChatGPT (GPT-5 model). We want to search for closed tenders (“ENCERRADA”) under “Materiais e Equipamentos” and “Generos Alimenticios” between January 1, 2024, and December 31, 2024 (see the previous video).
I would like to write a Python script that I will run from my computer to scrape data from an online database and store it in a CSV file. It is a dynamic web page. Below are the steps required to view the data. I’ve specified the HTML elements that the scraper should interact with. Print messages at different steps to show progress and help debug any errors. Let me know if you need any more information from me.
1. Go to the search page:
https://www.imprensaoficial.com.br/ENegocios/BuscaENegocios_14_1.aspx
2. Fill in the search criteria in nine dropdown menus:
Select “Materiais e Equipamentos” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Negocio$cboArea" id="content_content_content_Negocio_cboArea" class="form-control" onchange="javascript:CarregaSubareasNegocios();">
Select “Generos Alimenticios” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Negocio$cboSubArea" id="content_content_content_Negocio_cboSubArea" class="form-control">
Select “ENCERRADA” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboStatus" id="content_content_content_Status_cboStatus" class="form-control" onchange="fncAjustaCampos();">
Select “1” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoInicioDia" id="content_content_content_Status_cboAberturaSecaoInicioDia" class="form-control">
Select “1” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoInicioMes" id="content_content_content_Status_cboAberturaSecaoInicioMes" class="form-control">
Select “2024” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoInicioAno" id="content_content_content_Status_cboAberturaSecaoInicioAno" class="form-control">
Select “31” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoFimDia" id="content_content_content_Status_cboAberturaSecaoFimDia" class="form-control">
Select “12” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoFimMes" id="content_content_content_Status_cboAberturaSecaoFimMes" class="form-control">
Select “2024” for:
<select name="ctl00$ctl00$ctl00$content$content$content$Status$cboAberturaSecaoFimAno" id="content_content_content_Status_cboAberturaSecaoFimAno" class="form-control">
3. Use Javascript to click this button:
<input type="submit" name="ctl00$ctl00$ctl00$content$content$content$btnBuscar" value="Buscar" onclick="return verify();" id="content_content_content_btnBuscar" class="btn btn-primary">
4. Wait for the result page to load; wait for the complete loading of the result table:
<table class="table table-bordered table-sm table-striped table-hover" cellspacing="0" rules="all" border="1" id="content_content_content_ResultadoBusca_dtgResultadoBusca" style="border-collapse:collapse;"></table>
5. Scrape the text content in the whole table. The first <tr> is the header. There are another 10 <tr> in the table after the header. In each <tr> there are 6 <td>. In the output CSV, create another column (eighth column) to store the <href> tag of the last <td> (the 7th <td>).
6. Go to the next page of the result table and repeat the scraping of the table until the last page. The number of pages appears in:
<span id="content_content_content_ResultadoBusca_PaginadorCima_QuantidadedePaginas">5659</span>
Use javascript to click this button to go to the next page:
<a id="content_content_content_ResultadoBusca_PaginadorCima_btnProxima" class="btn btn-link xx-small" href="javascript:__doPostBack('ctl00$ctl00$ctl00$content$content$content$ResultadoBusca$PaginadorCima$btnProxima','')">próxima >></a>
Remember to wait for the table to finish loading before scraping.
7. Data from all result pages should be appended to the same CSV file.
There are several things to note in the prompt. We explicitly asked the scraper to print messages during the run; this is useful for troubleshooting if errors occur. We also instructed the scraper to click the button using JavaScript, which imitates a human click. If you don’t specify this, the button might be triggered by other methods that won’t work on some pages. Our instructions were very specific about the table content, but such detail may not be necessary, as the LLM can often infer it.
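To show the kind of script a prompt like this tends to produce, below is a heavily trimmed Selenium sketch of the same pattern: select dropdowns by their ids, click the “Buscar” button via JavaScript, wait for the results table, then read its rows. The element ids match the ones quoted above, but treat this as an illustrative outline rather than the exact script we ran; the full version also fills the remaining dropdowns and pages through the results.

# Trimmed Selenium sketch for the dynamic search page: fill dropdowns,
# click "Buscar" via JavaScript, wait for the results table, scrape its rows.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.imprensaoficial.com.br/ENegocios/BuscaENegocios_14_1.aspx")
wait = WebDriverWait(driver, 30)

# Two of the nine dropdowns, selected by id; the others follow the same pattern.
Select(wait.until(EC.presence_of_element_located(
    (By.ID, "content_content_content_Negocio_cboArea")))).select_by_visible_text("Materiais e Equipamentos")
Select(wait.until(EC.presence_of_element_located(
    (By.ID, "content_content_content_Status_cboStatus")))).select_by_visible_text("ENCERRADA")

# Click the search button through JavaScript, as specified in the prompt.
button = driver.find_element(By.ID, "content_content_content_btnBuscar")
driver.execute_script("arguments[0].click();", button)

# Wait for the results table to load, then read every row and cell.
table = wait.until(EC.presence_of_element_located(
    (By.ID, "content_content_content_ResultadoBusca_dtgResultadoBusca")))
rows = []
for tr in table.find_elements(By.TAG_NAME, "tr")[1:]:    # first <tr> is the header
    rows.append([td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")])
    print(f"Scraped {len(rows)} rows so far")

with open("tenders.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
driver.quit()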
Scraping a dynamic page usually requires additional Python libraries. If you see an error message: “ModuleNotFoundError,” paste the message into the LLM; it will provide the install command. Think of libraries as toolkits the scraper needs to perform certain functions.
If you want to test the scraper and inspect the downloaded data before running the full job (which may take a long time), add an instruction like: “Scrape only the first two pages for testing.”
Most online databases organize data across multiple layers of pages. In São Paulo’s procurements database, after getting the list of tenders, you click the objeto (object) of each tender to open a new page with the tender’s contents. On each tender page, there may be one or more evento (event). You then click detalhes for each evento to view its contents. Hence, you may need two more scrapers: one to scrape each tender page and another to scrape each evento page. It’s possible to combine all three in one script, but that can get complicated. For beginners, it’s better to break the operation into smaller tasks.
In our case, we built a first scraper to extract the search results, including the URL of each tender. We then fed that URL list to a second scraper, which visited each tender page to extract the tender’s contents and the URLs of its eventos. Finally, we built a third scraper that visited each evento page, extracted the contents, and searched for keywords like “cação,” “pescado,” and “peixe” to determine whether the tender involved shark meat.
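The follow-up scrapers are simpler than they sound: each one reads the URL list produced by the previous stage, visits every page, and appends what it finds. Here is a hedged outline of that pattern; the filenames, the column name, and the text extraction are placeholders rather than our production code.

# Outline of a second-stage scraper: read URLs from the first stage's CSV,
# visit each page, and flag keyword matches. Assumes a "url" column exists.
# (For dynamic pages, swap Requests for the Selenium pattern shown earlier.)
import csv
import time
import requests
from bs4 import BeautifulSoup

KEYWORDS = ("cação", "pescado", "peixe")

with open("tender_urls.csv", encoding="utf-8") as f:          # placeholder filename
    urls = [row["url"] for row in csv.DictReader(f)]

with open("tender_details.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "keyword_match", "excerpt"])
    for url in urls:
        page = requests.get(url, timeout=30)
        text = BeautifulSoup(page.text, "html.parser").get_text(" ", strip=True)
        match = any(k in text.lower() for k in KEYWORDS)
        writer.writerow([url, match, text[:200]])
        print(f"{url} -> keyword match: {match}")
        time.sleep(2)                                         # be gentle with the server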

Not all websites are built the same, so the prompts shown here may not work verbatim for your targets. The advantage of using LLMs is their ability to troubleshoot errors, suggest fixes, and fold those changes into your scripts. We often go back and forth with ChatGPT to handle more complex pages.
By now, you know what you need to build your first scrapers and start systematically collecting information from the web. Sooner or later, though, when you try a new site or run the same scraper frequently, you’ll likely hit a wall: Your code is correct, you’ve identified the right HTML elements, everything should work … but the server returns an error.
If the error says “Forbidden,” “Unauthorized,” “Too Many Requests,” or similar, your scraper may be blocked. You can ask the LLM what the specific error means. Generally, websites aren’t designed to be scraped, and developers implement techniques to prevent it.
Below are three common types of blocking and strategies to try. This isn’t exhaustive. You’ll often need a tailored solution, but it’s a practical guide.
Some sites restrict access by region or country. For example, a government site may only accept connections from local IP (Internet Protocol) addresses. To work around this, use a VPN (e.g., Proton VPN or NordVPN) to obtain a local IP. You can, for instance, appear to connect from Argentina while you’re in France.
Some sites detect VPN traffic or block known VPN ranges, so the VPN may not do the trick. In that case, you can try a residential proxy provider (e.g., Oxylabs), which routes requests through real household IPs and heavily reduces detection.
If your scraper makes many rapid requests, the site may flag that behavior as automated, since humans don’t browse that fast. Add random delays in your code, e.g., using the Python function time.sleep(seconds), and consider simulating mouse movement or scrolling to appear more human. An LLM can suggest the best approach for your stack.
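A minimal version of that delay, offered as our own example rather than a prescribed recipe:

# Pause a random 2 to 6 seconds between requests so the traffic looks less automated.
import random
import time

def polite_pause(min_s=2, max_s=6):
    time.sleep(random.uniform(min_s, max_s))

# In a Selenium scraper, an occasional scroll also looks more human:
# driver.execute_script("window.scrollBy(0, 600);")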
Even with delays, a site may block your IP after a threshold of requests. You can rotate IPs with a code-driven VPN setup (e.g., ProtonVPN + Tunnelblick + tunblkctl) or, more simply, use a residential proxy that automates IP rotation across a large pool.
When you visit a website, your browser quietly sends a few behind-the-scenes notes like “I’m Chrome on a Mac,” “my language is English,” and small files called cookies. These notes are called HTTP headers. A bare-bones scraper often skips or fakes them, which looks suspicious. To blend in, make your scraper act like a normal browser: Set realistic HTTP headers (e.g., a believable User-Agent and Accept-Language), use a typical screen size (viewport), and include valid cookies when needed.
A starter prompt for the LLM to set this could be:
Below is the script for a Python scraper collecting information from the web page xxx. Provide a basic configuration for HTTP headers and cookies to minimize detection or blocking, and show where to add them in my code.
[paste the full scraper script]
The LLM will return parameters to place near the start of your script.
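The exact values the LLM suggests will vary; for a Requests-based scraper, the configuration usually looks something like the sketch below, where the User-Agent string, referer, and cookie are illustrative placeholders.

# Illustrative header and cookie setup for a Requests-based scraper.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",               # placeholder
})
session.cookies.set("session_id", "PLACEHOLDER")     # only if the site requires it

response = session.get("https://example.com/data", timeout=30)
print(response.status_code)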
You can toggle a parameter to run a scraper in headless mode (no visible window) or headed mode (with a window). It’s genuinely fun to watch Chrome click around in headed mode, and it’s easier to debug because you can see clicks, errors, and pop-ups, but it’s slower and uses more CPU/RAM. Headless is more efficient, though it can be easier for sites to detect. A good practice is to develop and test in headed mode, then switch to headless for full runs.
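With Selenium and Chrome, that toggle is a single option; here is a minimal sketch (the flag name can differ across Chrome versions).

# Toggle between headed (visible browser) and headless runs.
from selenium import webdriver

HEADLESS = False                                 # develop and debug with a visible window
options = webdriver.ChromeOptions()
if HEADLESS:
    options.add_argument("--headless=new")       # switch to headless for long, full runs
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()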
These strategies are just a starting point. If your scraper is still blocked, copy and paste the error into the LLM to get more clues on how to fix it.
Sometimes you’ll want to run a scraper on a schedule — for example, when a page frequently adds or removes information and you need to track changes. You could run it manually every day, but that’s cumbersome and error-prone.
A better approach is a tool that runs at set intervals. GitHub Actions is a very practical (and free) platform for executing workflows in the cloud. And it brings a key advantage: your computer doesn’t need to stay on, because everything runs remotely on a virtual machine.
If you’re familiar with GitHub, a site for storing code and collaborating on software, you can upload your scraper to your repository, create a .yml file, and use YAML to configure the virtual machine and specify the run steps.
Again, you can get both the steps and the YAML script from an LLM. You might start with a prompt like:
I have a Python scraper in xxxx.py inside the xxxx repository of my GitHub account. I want to use GitHub Actions to run it once a day at 1 PM UTC. Tell me the steps and the YAML code I need to configure it. This is the source code for my scraper:
[paste the full scraper script]
With the steps, examples, and prompts above, we hope this guide helps non-coding journalists leverage the growing power of LLMs to work faster and smarter. If you have any questions or feedback, please get in touch with us at [email protected] and [email protected].
Editor’s Note: This story was originally published by the Pulitzer Center and is reposted here with permission.
Kuek Ser Kuang Keng is a digital journalist, data journalism trainer, and media consultant based in Kuala Lumpur, Malaysia. He is the founder of Data-N, a training program that helps newsrooms integrate data journalism into daily reporting. He has more than 15 years of experience in digital journalism. He started as a reporter at Malaysiakini, the leading independent online media outlet in Malaysia, where he covered high-profile corruption cases and electoral fraud during an eight-year stint. He has also worked in several US newsrooms, including NBC, Foreign Policy, and PRI.org, mostly on data-driven reporting. He holds a master’s in journalism from New York University. He is a Fulbright scholarship recipient, a Google Journalism Fellow, and a Tow-Knight Fellow.
Federico Acosta Rainis is a data specialist at the Pulitzer Center’s Environment Investigations Unit. After a decade as an IT consultant, he started working as a journalist in several independent media in Argentina. In 2017 he joined La Nación, Argentina’s leading newspaper, where he reported on education, health, human rights, inequality, and poverty, and he did extensive on-the-ground coverage of the COVID-19 pandemic in Buenos Aires. He has participated in investigations that won national and international awards for digital innovation and investigative journalism from ADEPA/Google and Grupo de Diarios América (GDA). He holds a master’s degree in data journalism from Birmingham City University. In 2021 he was awarded the Chevening Scholarship, and in 2022 he joined The Guardian’s Visuals Team as a Google News Initiative fellow.