
Rss preview of Blog of The Practical Developer

Create And Configure Azure Firewall

2026-03-14 09:30:33

An Azure Firewall is a cloud-based network security service in Microsoft Azure that helps protect your virtual network resources by filtering and controlling traffic between your Azure resources and the internet or other networks.

Scenario:

Your organization requires centralized network security for the application virtual network. As the application usage increases, more granular application-level filtering and advanced threat protection will be needed. Also, it is expected the application will need continuous updates from Azure DevOps pipelines. You identify these requirements.

  • Azure Firewall is required for additional security in the app-vnet.

  • A firewall policy should be configured to help manage access to the application.

  • A firewall policy application rule is required. This rule will allow the application access to Azure DevOps so the application code can be updated.

  • A firewall policy network rule is required. This rule will allow DNS resolution.

Skilling Tasks:

  • Create an Azure Firewall.

  • Create and configure a firewall policy.

  • Create an application rule collection.

  • Create a network rule collection.

Architecture diagram: (image not included in this export)

Step 1: Create an Azure Firewall subnet in the existing virtual network.

i. In the search box at the top of the portal, enter Virtual networks. Select Virtual networks in the search results.

ii. Select app-vnet.

iii. Select Subnets.

iv. Select + Subnet and configure the subnet.

v. Save the changes.

Note: Leave all other settings as default.

Step 2: Create an Azure Firewall.

i. In the search box at the top of the portal, enter Firewall. Select Firewall in the search results.

ii. Select + Create.

iii. Use the values provided in your deployment guide to create the firewall.

iv. Select Review + create and then select Create.

Step 3: Update the firewall policy.

i. In the portal, search for and select Firewall Policies.

ii. Select fw-policy.

Step 4: Add an application rule.

i. In the Rules blade, select Application rules and then Add a rule collection.

ii. Configure the application rule collection and then select Add.

Note: The AllowAzurePipelines rule allows the web application to access Azure Pipelines. The rule allows the web application to access the Azure DevOps service and the Azure website.

Step 5: Add a network rule.

i. In the Rules blade, select Network rules and then Add a rule collection.

ii. Configure the network rule and then select Add.

I Was Tired of Parsing XBRL, So I Built a SEC EDGAR API

2026-03-14 09:24:22

If you've ever tried to pull financial data from the SEC's EDGAR system, you probably know where this is going.

I wanted to build a stock screener. Simple idea — just show me revenue trends for a few companies. Should take an afternoon, right?

Nope. First you need a company's CIK number (Apple is 0000320193 — don't ask me why it's zero-padded to 10 digits). Then you download their XBRL filings, which are nested XML with namespaces like us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax. Then you realize different companies use slightly different tags for the same concept. Then you start questioning your life choices.
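The zero-padding is mechanical; if you ever need to build EDGAR URLs yourself, a one-liner covers it:

```python
# SEC EDGAR expects CIK numbers zero-padded to 10 digits,
# e.g. in URLs like https://data.sec.gov/submissions/CIK##########.json
def pad_cik(cik: int) -> str:
    """Left-pad a raw CIK to the 10-digit form EDGAR URLs expect."""
    return str(cik).zfill(10)

print(pad_cik(320193))  # Apple -> 0000320193
```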

I spent way too long on this, so I wrapped the whole thing in a REST API. Pass a ticker, get JSON back. Done.

Show me the data

Company info:

GET /v1/company/AAPL
{
  "cik": "320193",
  "ticker": "AAPL",
  "name": "Apple Inc.",
  "sic": "3571",
  "sic_description": "Electronic Computers",
  "fiscal_year_end": "0926",
  "state": "CA"
}

No CIK lookup. Just the ticker.

Financial statements:

GET /v1/financials/AAPL?statement=income_statement&limit=3
{
  "ticker": "AAPL",
  "name": "Apple Inc.",
  "statements": {
    "income_statement": [
      {
        "concept": "revenue",
        "label": "Revenue from Contract with Customer",
        "unit": "USD",
        "data": [
          {
            "fiscal_year": 2025,
            "fiscal_period": "FY",
            "value": 383285000000.0,
            "filed": "2025-10-31",
            "form": "10-K"
          }
        ]
      }
    ]
  }
}

Revenue, net income, EPS, operating expenses — all the stuff that took me hours to extract from XBRL, now in one call.

Getting Apple's revenue in Python

import requests

url = "https://sec-edgar-data-api.p.rapidapi.com/v1/financials/AAPL"
headers = {
    "x-rapidapi-key": "YOUR_API_KEY",
    "x-rapidapi-host": "sec-edgar-data-api.p.rapidapi.com"
}
params = {"statement": "income_statement", "limit": 5}

response = requests.get(url, headers=headers, params=params)
data = response.json()

for item in data["statements"]["income_statement"]:
    if item["concept"] == "revenue":
        for year in item["data"]:
            print(f"FY{year['fiscal_year']}: ${year['value']:,.0f}")
FY2025: $383,285,000,000
FY2024: $394,328,000,000
FY2023: $365,817,000,000

That stock screener I mentioned? Finally built it.

Other endpoints

Search companies (fuzzy matching, so typos are fine):

GET /v1/company?q=tesla
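The fuzzy matching the endpoint advertises can be illustrated with Python's standard difflib (this is only a sketch of the idea, not the API's actual implementation; the company list is made up):

```python
import difflib

# Hypothetical company list for illustration
companies = ["Tesla Inc.", "Textron Inc.", "Teladoc Health Inc."]

def fuzzy_search(query: str, names: list[str], cutoff: float = 0.4) -> list[str]:
    """Return names roughly matching the query, tolerating typos."""
    return difflib.get_close_matches(
        query.lower(), [n.lower() for n in names], n=3, cutoff=cutoff
    )

# A typo like "tesal" still finds Tesla
print(fuzzy_search("tesal", companies))
```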

Filing history — 10-K, 10-Q, 8-K, whatever:

GET /v1/filings/TSLA?form=10-K&limit=5

Balance sheet and cash flow:

GET /v1/financials/MSFT?statement=balance_sheet
GET /v1/financials/MSFT?statement=cash_flow

Covers all 10,000+ SEC-registered public companies. The data comes directly from SEC EDGAR (data.sec.gov), so it updates as companies file.

Why not just use [existing API]?

Fair question. I looked at the alternatives — most are either $50+/month for basic access, or they give you 47 endpoints when you just need 4. I wanted something I could hand to a junior dev and they'd figure it out in 5 minutes. That's basically the design principle: if it needs documentation longer than a README, it's too complicated.

Try it out

Free tier on RapidAPI — 100 requests/month, no credit card:

SEC EDGAR Data API on RapidAPI

I'm also running a bot at @SECEdgarBot that tweets when notable filings drop and flags interesting financial signals. Still early, but it's been fun to build.

All data sourced from SEC EDGAR (data.sec.gov) — publicly available US government data.

Using Python to Load Google Docs into AI — Drive API Minimal Permission Setup

2026-03-14 09:18:33

Introduction: The Challenge of AI Not Being Able to Directly Read Google Documents

"Please analyze this document"

Have you ever encountered a situation where, when you submitted a Google Document URL to the latest AI models like Gemini 3.1 Pro or Claude Opus 4.6, you received a response saying, "I cannot directly access the URL and read its contents"?

While AI is a powerful tool for text generation and summarization, it cannot retrieve content from authenticated internal document URLs as-is. This results in inefficient manual tasks such as copying and pasting lengthy meeting minutes or proposals, or downloading files as PDF/DOCX formats before uploading them.

This article explains how to solve the issue of AI not being able to read Google Documents using Google Drive API and OAuth 2.0. With proper configuration and a few lines of Python code, you can securely obtain document content as input for AI.

Trap 1: Time Loss and Layout Issues from Manual Work

Because AI cannot read the URL directly, users resort to manually copying and pasting content or downloading and uploading files. For lengthy documents, this is time-consuming and risks copy errors or layout issues leading to missing information.

Trap 2: Security Risks from Uncontrolled Sharing Settings

One might think setting the document to "public on the web" would allow AI to read it directly. However, this is highly dangerous from a security perspective. Exposing confidential internal documents publicly can lead to severe data leaks.

Trap 3: '403 Error' Due to Lack of API Permission Knowledge

Even when attempting to use Google Drive API, incorrect permission settings can result in a googleapiclient.errors.HttpError: <HttpError 403: Insufficient Permission> error. Without understanding the necessary scopes, navigating Google Cloud Console can lead to getting stuck.

The Decisive Solution: OAuth 2.0 Flow and Minimal Privilege Scopes

The core solution to resolve errors and balance security with convenience is "secure permission delegation via OAuth 2.0" and "scope configuration based on the principle of least privilege."

OAuth 2.0 is a mechanism that allows an application (Python script) to securely access Google accounts without knowing the user's password, using temporary "access tokens" issued by Google.

The "scope" defines which operations the access token permits. For retrieving document text, set the following scope:

https://www.googleapis.com/auth/drive.readonly

drive.readonly is the minimal and safest permission, allowing only read access to Google Drive files. This eliminates the risk of accidental deletion or modification.

Practical Guide: Google Drive API Integration Steps (Python Version)

Step 1: Google Cloud Platform (GCP) Project Setup

  1. Log in to Google Cloud Console (console.cloud.google.com).
  2. From the project selection dropdown, create a new project.
  3. In the left menu, select "APIs & Services" → "Library".
  4. Search for "Google Drive API" and click "Enable".

Step 2: OAuth Consent Screen and Credential Setup

  1. In the left menu, select "APIs & Services" → "OAuth consent screen".
  2. Choose "External" (or "Internal" as appropriate) for User Type and click "Create".
  3. Fill in required fields like app name and support email.
  4. In the "Scopes" settings, click "Add or remove" and check .../auth/drive.readonly, then save.
  5. In the "Test users" settings, add your own Google account (Gmail address) and save. Forgetting this causes 403 errors.

Step 3: Download Credentials

  1. In the left menu, select "Credentials".
  2. Click "Create credentials" → "OAuth client ID".
  3. Choose "Desktop app" as the application type and click "Create".
  4. Click "Download JSON" and rename the file to credentials.json, placing it in the working directory.

Step 4: Python Environment Setup and Code Execution

Install required libraries. Here's an example using the fast package manager uv (regular pip works too).

uv pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib

Save the following Python code as main.py.

import os.path
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Scope settings (read-only)
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]

def get_google_doc_content(doc_id):
    """Function to retrieve Google Document content as text"""
    creds = None

    # Load existing token if available
    if os.path.exists("token.json"):
        creds = Credentials.from_authorized_user_file("token.json", SCOPES)

    # Re-authenticate if token is missing or invalid
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                "credentials.json", SCOPES
            )
            creds = flow.run_local_server(port=0)

        # Save token for future use
        with open("token.json", "w") as token:
            token.write(creds.to_json())

    try:
        service = build("drive", "v3", credentials=creds)

        # Google Documents are not binary files, so use the export_media method
        # Retrieve as plain text (MIME type text/plain)
        request = service.files().export_media(
            fileId=doc_id,
            mimeType="text/plain"
        )

        # Download and decode content
        response = request.execute()
        text_content = response.decode('utf-8')

        print(f"--- Content retrieved successfully for document ID: {doc_id} ---")
        return text_content

    except HttpError as err:
        print(f"Error occurred: {err}")
        return None

if __name__ == "__main__":
    # Target Google Document ID to process
    # For URL https://docs.google.com/document/d/abc123xyz.../edit, the ID is abc123xyz...
    TARGET_DOC_ID = "Your document ID here"

    content = get_google_doc_content(TARGET_DOC_ID)

    if content:
        print("\n=== Document content ===\n")
        # Show the first part of the document
        print(content[:500] + "...\n(Truncated)")

        # Additional processing with the retrieved content (e.g., sending to AI API)



Code Explanation and Execution Behavior

When the script is executed for the first time, a browser window opens displaying the Google login screen. Select your account and grant the requested permissions to complete authentication. Simultaneously, token.json is generated, enabling subsequent runs without browser authentication.

The key point of the retrieval logic is the use of the service.files().export_media method. While regular files (images, PDFs, etc.) are downloaded using get_media, Google Document format does not have an actual file, so export_media must be used to convert it to the specified format (in this case, text/plain).
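The rule for choosing between the two methods comes down to the file's MIME type; a small helper (my own illustration, not part of the Google client library) makes it explicit:

```python
# Google-native formats (Docs, Sheets, Slides, ...) share this MIME prefix
# and have no binary body, so they must be exported; anything else can be
# fetched directly with get_media.
GOOGLE_NATIVE_PREFIX = "application/vnd.google-apps."

def pick_download_method(mime_type: str) -> str:
    """Return which Drive API method applies to a given MIME type."""
    if mime_type.startswith(GOOGLE_NATIVE_PREFIX):
        return "export_media"
    return "get_media"

print(pick_download_method("application/vnd.google-apps.document"))  # export_media
print(pick_download_method("application/pdf"))                       # get_media
```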

Note: Content exported via export_media is limited to 10MB, so take care when handling very large documents.

AI Integration: Utilizing Retrieved Text

Once the text is retrieved via API, it can be directly passed to services like Gemini API or Claude API.

Additionally, if you have a local PC environment equipped with a high-end GPU such as the RTX 5090 (32GB VRAM), you can load the retrieved text into local LLMs like Gemma 3 or NVIDIA Nemotron. This enables a completely offline environment for document retrieval, allowing secure data analysis of sensitive information.
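As a minimal sketch, the retrieved text can be wrapped into a prompt before being handed to an LLM API; the model name and client usage in the comments below are illustrative assumptions, not a verified setup:

```python
def build_prompt(doc_text: str, max_chars: int = 8000) -> str:
    """Wrap retrieved document text in an analysis prompt,
    truncating to keep the request within a safe size."""
    body = doc_text[:max_chars]
    return f"Please summarize the key points of this document:\n\n{body}"

prompt = build_prompt("Meeting minutes: ...")

# Sending the prompt to an LLM API (model name is a placeholder):
# import google.generativeai as genai
# genai.configure(api_key="YOUR_API_KEY")
# model = genai.GenerativeModel("gemini-1.5-flash")
# print(model.generate_content(prompt).text)
```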

Summary: Automating Document Processing via API

Following the steps introduced here provides the following benefits:

  • Elimination of manual work: No more copy-pasting or file conversion.
  • Enhanced security: OAuth 2.0 and minimal permission scopes ensure safe access.
  • Scalability: Programmatic automation enables handling large volumes of documents.

Leveraging APIs to build an environment where AI can operate efficiently is crucial for future automation workflows. Be sure to obtain credentials.json and try streamlining your document processing.

Hardware Selection for Local LLMs: Overcoming the VRAM Wall with Practical GPU, CPU, and Memory Configurations

2026-03-14 09:18:30

Introduction: Gemini Flash Equivalent Locally? The Despair of a Slow Development Environment

If you, like me, were thrilled by the explosive responsiveness of Google Gemini 2.5 Flash and dreamed of running it locally without privacy concerns, this article is for you. As a lawyer and auditor, I work daily with vast XBRL data and PDF documents, building a self-evolving AI system. My goal is clear: to construct a local LLM system that surpasses, or at least matches, Gemini 2.5 Flash in reasoning capability and speed, enabling it to achieve 80% accuracy on the bar exam multiple-choice section and flawless case handling in essays.

However, reality was harsh. The PC I used—a high-performance ASUS gaming rig with an RTX 5070 Ti and 16GB VRAM—was purchased with the assumption it could handle 32B-class models. Yet, when attempting to run such models, inference speed became unbearably slow, like a turtle. Even 7B models were sluggish, and 32B models caused main memory overflow, requiring data offloading to system RAM. This resulted in token generation taking minutes, with a spinning hourglass during web loading—a feeling of despair that eroded my development motivation. I felt trapped by the "intelligence wall," contemplating giving up.

Yet, through dialogue with Gemini, I found a breakthrough. This article details how I escaped three specific traps to reach the conclusion of "RTX 5090 + 32GB VRAM" as the optimal configuration, with step-by-step instructions for replicating a Gemini Flash-equivalent local LLM environment in 5 minutes, tailored for intermediate engineers.

What You'll Gain from This Article

  • The importance of VRAM and memory bandwidth in local LLM environment setup
  • The technical truth: professional vs. consumer GPUs for local LLMs
  • Rules for selecting PC parts to build the ultimate local LLM environment
  • Specific Python code and setup steps to run Gemini Flash-equivalent models on RTX 5090
  • Library selection (vLLM, bitsandbytes) to dramatically improve development efficiency

No more suffering from slow inference speeds. Your local LLM environment can transform into a "knowledge lab" today.

The Delays and Regrets I Fell Into

My project goal was to run a model with Gemini 2.5 Flash-level "intelligence" locally. I believed a minimum of 32B (32 billion parameters) was necessary. However, with my RTX 5070 Ti (16GB VRAM), this goal was physically impossible.

Tragedy with RTX 5070 Ti (16GB VRAM)

7B models ran, but complex queries or long text generation caused delays of seconds to tens of seconds. Attempting 32B models like DeepSeek-R1-Distill-Qwen-32B caused VRAM overflow, offloading parts to system RAM. This resulted in inference speeds over 10x slower. The bottleneck was PCIe bus bandwidth (max ~64GB/s for Gen4 x16) versus GPU internal memory bandwidth (hundreds of GB/s to 1TB/s). The overhead of data transfer between layers caused questions to take minutes to answer, breaking my thought cycle.

Breaking the 'VRAM Wall': 32GB VRAM, RTX 5090 Is the Key to "Knowledge Liberation"

After failures and trials, I reached a clear conclusion: "VRAM abundance is justice," and "RTX 5090 (32GB VRAM) is the only choice for local Gemini 2.5 Flash-equivalent performance."

Breaking the 'VRAM Wall': The Joy of Loading 32B Models Entirely on GPU

The biggest bottleneck was 32B models exceeding VRAM and spilling into main memory. To solve this fundamentally, 32B models must fit entirely in VRAM. RTX 5090's 32GB VRAM was the decisive solution.

Loading a 32B model in FP16 (16-bit floating point) requires ~64GB VRAM—insufficient even for RTX 5090. However, 4-bit quantization (AWQ, GPTQ, GGUF) is standard, reducing model size to ~1/4. A 32B model becomes ~18-20GB. Adding KV cache for context length, RTX 4090's 24GB VRAM leaves only ~4GB after loading, causing OOM with long contexts. RTX 5090's 32GB VRAM provides >10GB headroom, handling thousands of tokens smoothly and enabling RAG tasks comfortably.
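The sizing above can be sanity-checked with a quick back-of-envelope script; the ~10% overhead factor for quantization scales is my own assumption, and runtime buffers push the 4-bit figure toward the 18-20GB quoted above:

```python
def model_vram_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Rough VRAM estimate for model weights alone (no KV cache):
    parameters * bits/8 bytes, plus ~10% for quantization scales etc."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1024**3

print(f"32B @ FP16 : {model_vram_gb(32, 16):.1f} GB")  # too big even for 32GB VRAM
print(f"32B @ 4-bit: {model_vram_gb(32, 4):.1f} GB")   # weights fit with headroom
```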

The Optimal PC Configuration: The Shock of PC Kobo LEVEL-R789-LC285K-XK1X

I chose PC Kobo's LEVEL-R789-LC285K-XK1X model. The decisive factors were:

Complete Implementation Steps (Copy-Paste Ready): Running Gemini Flash Equivalent on RTX 5090 in 5 Minutes

Step 0: Installing WSL2 (Ubuntu) and Initial Setup

Open Windows PowerShell with administrator privileges and run the following command to install WSL2 and Ubuntu:

wsl --install -d Ubuntu-24.04

After installation, open the WSL2 terminal and keep the system up to date:

sudo apt update && sudo apt upgrade -y
sudo apt install build-essential git curl wget -y

Step 1: Installing CUDA Toolkit 13.1

In WSL2, installing NVIDIA drivers on the Windows side is sufficient for GPU recognition. No driver installation is needed on the WSL2 side. Install only the CUDA Toolkit (version 13.1):

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-1

Verify GPU recognition:

nvidia-smi

Success is confirmed when "NVIDIA GeForce RTX 5090" and "32768MiB" (32GB VRAM) are displayed.

Step 2: Building Python Environment (uv)

Use the fast uv package manager to create a clean Python virtual environment:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv llm-env --python 3.11
source llm-env/bin/activate



Step 3: Installing Inference Libraries

Install the core libraries for LLM inference. Follow PyTorch's official guide for your CUDA version, then add:

uv pip install transformers accelerate bitsandbytes sentencepiece protobuf scipy
uv pip install vllm  # High-speed inference engine



Step 4: Running LLM Model (Transformers Version)

Use Hugging Face's transformers library to load and run the model. Save the following code as run_llm.py. This configuration leverages 4-bit quantization to efficiently utilize 32GB VRAM:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model ID (example: DeepSeek-R1 32B distilled model)
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

# 4-bit quantization settings
# Utilizes RTX 5090's power to save VRAM and handle long contexts
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16  # RTX 50 series natively supports bfloat16
)

print(f"Loading model: {model_id}...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically assigns to GPU
    trust_remote_code=True
)
print("Model loaded successfully!")
print(f"Current VRAM usage: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n--- Start Chat (type 'exit' to quit) ---")
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break

    prompt = f"User: {user_input}\nAssistant:"

    response = generate_text(prompt)
    print(f"AI: {response.split('Assistant:')[-1].strip()}")

Run it with:

python run_llm.py

Step 5: Building a High-Speed Inference Server with vLLM (Advanced)

While transformers offers ease of use, vLLM delivers superior performance for production-level speed. By combining the RTX 5090's expansive VRAM and the PagedAttention algorithm, throughput can be increased severalfold.

# Load 32B model with 4-bit quantization (AWQ) and start server
# ※ Model must be provided in AWQ format.
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/deepseek-r1-distill-qwen-32b-awq \
    --quantization awq \
    --dtype half \
    --gpu-memory-utilization 0.9 \
    --port 8000
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "casperhansen/deepseek-r1-distill-qwen-32b-awq",
        "messages": [{"role": "user", "content": "Hello"}],
        "temperature": 0.7
    }'

This response speed will let you experience the true power of the RTX 5090. The sight of tokens flowing like a waterfall is breathtaking.

Common Pitfalls and Avoidance Strategies: Checklist for Smooth AI Development

Even with the best hardware, software issues can stall development. I've compiled common pitfalls I've encountered or heard about, along with avoidance strategies.

1. NVIDIA Driver and CUDA Toolkit Version Mismatch

When new GPUs are released, older drivers or CUDA versions may not work properly. PyTorch in particular is closely tied to specific CUDA versions. The RTX 5090 (Blackwell generation) may not function with older CUDA (e.g., the 11.x series).

Avoidance Strategy: Always install the latest stable NVIDIA drivers and CUDA Toolkit. Check the PyTorch official site for the CUDA version compatible with your PyTorch version and install accordingly. As in this article, use CUDA 13.1 and develop the habit of checking driver version with nvidia-smi and CUDA version with nvcc -V.

2. Overloading VRAM (OOM) When Loading Large Models

Attempting to load a model that 'should fit' in VRAM can cause CUDA Out Of Memory errors, crashing the process. Alternatively, offloading to shared GPU memory (main RAM) can cause extreme slowdowns.

While the RTX 5090 (32GB VRAM) provides ample space for 32B models, 70B-class models require optimization.
  • Use quantization: 4-bit (AWQ, GPTQ) or trending EXL2 formats.
  • Limit context length: Infinite conversations cause KV cache bloat. Use max_model_len to restrict.
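The KV-cache growth behind this advice can be estimated with simple arithmetic; the layer/head figures below are assumed, Qwen2-32B-like values, not measurements:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
    * sequence length * bytes per element (FP16 = 2)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Assumed 32B-class shape: 64 layers, 8 KV heads (GQA), head_dim 128
print(f"{kv_cache_gb(64, 8, 128, 32768):.1f} GB for a 32k-token context")
```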

3. Insufficient Power Supply and Cooling

The RTX 5090, while highly performant, can exceed 450W-500W power consumption. Insufficient power supply can cause sudden shutdowns under high load.

Select a PC with a minimum 1000W, preferably 1200W+ 80PLUS PLATINUM certified power supply. Ensure secure 12VHPWR cable connections and regularly check that connectors are fully seated.

What I Gained from Interacting with Shogi AI: The Path to 1st Place in Floodgate and My Approach to Distilled Models

2026-03-14 09:18:27

Introduction

As a practical testing ground for verifying reasoning optimization and model handling, I first touched an OSS shogi software in January 2026.

As a result, I reached rank 1 by playing over 200 games with a rating exceeding 4500 on Floodgate (an online shogi server for computer shogi). Since I started programming in December 2025, this was achieved in approximately two months after touching the OSS.

This article is not a how-to guide on implementation, but rather discusses what was learned through shogi AI and how it can be applied to LLM research from the perspective of an LLM/RAG researcher.

Why Shogi AI?

In LLM research, one frequently encounters challenges such as reasoning optimization and model selection. However, LLM evaluation can be ambiguous. "Is the answer good?" often involves subjectivity. In contrast, shogi AI has clear wins/losses and ratings, allowing immediate numerical verification of strategy effectiveness.

Additionally, skill sets such as CUDA/TensorRT build and batch processing optimization are completely common between LLM and shogi AI. Shogi AI serves as an ideal experimental ground for verifying these technologies through a strict win/loss feedback loop.

Overall Architecture: 3-Layer Hybrid

The constructed system has a 3-layer architecture.

Phase 1: Book (Opening Database) — Immediate move via Python dictionary lookup. No C++ engine startup, zero GPU/CPU load.
Phase 2: MCTS + DL Model — Inference of a large 40-block ResNet using TensorRT. Quantized to fit within RTX 5090's 32GB VRAM.
Phase 3: α-β + NNUE — Fast position evaluation via CPU search. Handles endgame reading victories.

A Python wrapper manages phase switching and protocol communication, selecting engines based on position characteristics. This design philosophy of "winning with the entire architecture rather than a single model" is fundamentally the same as RAG system composition (search → ranking → generation multi-stage pipeline).
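The three-phase dispatch described above can be sketched as follows; every name, threshold, and move here is a hypothetical placeholder, not the author's actual wrapper:

```python
# Phase 1 data: position key -> precomputed best move (instant dict lookup)
OPENING_BOOK = {"startpos": "7g7f"}

def query_dl_engine(position: str) -> str:
    """Stand-in for the TensorRT-backed MCTS + DL engine (Phase 2)."""
    return "dl_move"

def query_nnue_engine(position: str) -> str:
    """Stand-in for the CPU alpha-beta + NNUE engine (Phase 3)."""
    return "nnue_move"

def pick_move(position: str, move_number: int) -> str:
    if position in OPENING_BOOK:   # Phase 1: book hit, zero GPU/CPU load
        return OPENING_BOOK[position]
    if move_number < 80:           # Phase 2: middlegame (threshold assumed)
        return query_dl_engine(position)
    return query_nnue_engine(position)  # Phase 3: endgame reading

print(pick_move("startpos", 1))  # book hit, no engine started
```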

OSS Modification: The Value of Cutting

I forked two OSS engines (DL and NNUE) and removed unnecessary features at the source level.

In the DL engine, features such as multi-GPU support, multiple backend branching, and various mate search were removed to specialize for RTX 5090 × TensorRT. USI options were reduced from 63 to 43 (-32%).

In the NNUE engine, test commands, book generation commands, and learning-related code were compiled out, reducing binary size from 916KB to 514KB (44% reduction).

This "cutting" work directly applies to LLM operations. Instead of adding functionality via LoRA or Fine-Tuning to distilled models, reduce unnecessary branches and control via prompts — a policy fully aligned with the article "An Era Without LoRA or FT: How to Approach Distilled Models."

Real-Time Book Rewriting: RAG-Inspired Approach

I manage a database of approximately 7 million book positions on the Python side, with book loading optimized for fast startup.

A notable feature is the real-time rewriting of the book during matches. After a loss, the early-game branching points are identified, and the book is modified to select different moves in the next match. The book is continuously refined as matches accumulate.

This "updating the database from experience and reflecting it in subsequent reasoning" cycle is identical to the feedback loop in RAG. The structure is the same as improving search result quality from dialogue logs.

LLM Utilization and Limitations

During development, I used Claude Opus as a coding partner. For niche specialized tools like dlshogi and YaneuraOu, LLM hallucinations frequently occur. Blindly trusting confidently generated code can lead to incorrect modifications that not only don't work but also lower shogi strength.

The lesson here is that "LLM is translation, not reasoning." The correct usage is to perform calculations with specialized engines (e.g., search engines for shogi AI, domain-specific logic for business) and use LLM for natural language translation of inputs/outputs. This aligns with RAG design principles: "Don't give LLM knowledge, but generate based on facts obtained from external sources."

Conclusion: Research is Cyclical

After organizing insights from two months of shogi AI development:

  • Additional learning on distilled models is ineffective or leads to overfitting → Prompt control is the correct approach
  • Winning with the entire architecture, not a single model → RAG pipeline design philosophy
  • Updating the database from experience and reflecting it in subsequent reasoning → RAG feedback loop
  • LLM is translation, not reasoning → Domain logic should be handled by specialized engines

This shogi AI experience has been returned to LLM research, and LLM research insights have been applied to shogi AI architecture design. This cycle is the greatest value of venturing into different fields. Currently, I'm back to researching local LLMs (building systems using NVIDIA's Nemotron models), but I'll participate again when the GPU is free. It was very enjoyable.

Hardware Used:

  • GPU: NVIDIA RTX 5090 (32GB GDDR7)
  • CPU: Intel Core Ultra 9 285K
  • RAM: 64GB
  • OS: Linux (WSL2)

Turn Conversation Data into Assets with Gemini API: History Export, RAG, and Streamlit

2026-03-14 09:18:24

Introduction: Taking Back Control of the AI "Brain"

For modern engineers, LLMs (Large Language Models) like Gemini and ChatGPT are more than mere tools; they are a "second brain." From daily coding, debugging, and architectural considerations to career advice, we entrust a massive amount of our thought processes to AI. However, we face a critical issue here: "Is this valuable dialogue data truly ours?"

When buried in browser histories and practically unsearchable, past insights cannot be utilized. Moreover, if standard features like Google Takeout fail to work as expected, our intellectual assets are at risk of disappearing. Furthermore, even if you acquire powerful hardware like the latest RTX 5090 (32GB VRAM), you cannot maximize its performance without the appropriate data and workflows.

This article is a practical guide for engineers who extensively use Gemini, covering everything from techniques to export easily scattered conversation histories, to building a knowledge base using Google Workspace, and developing applications combining local LLMs and RAG (Retrieval-Augmented Generation).

From gritty hacks to automation scripts, all code is designed to work. Use this as a reference to shift your AI utilization from "consumption" to "assetization."

Chapter 1: Export Techniques for Gemini Conversation History

Dialogues with Gemini reflect your thoughts. Let's start by securing this data locally. However, there are some points to note here.

Challenge: Google Takeout Export Issues

Google has a data export feature called "Google Takeout," but its behavior can be unstable regarding Gemini history. When attempting to export hundreds of chat histories, I once had a perplexing experience where the downloaded Zip file was "less than 1MB" and completely empty inside.

Particularly when using Google Workspace (enterprise) accounts, or when API usage is mixed in, chat histories from the web UI may not be archived correctly. Even when the export succeeds, the data arrives in a deeply nested JSON structure that is not human-readable as-is.

Solution A: JSON Formatting via Python Script (If Takeout Succeeds)

If you successfully obtain GeminiChat.json via Google Takeout, you need to convert it into highly readable Markdown or CSV. The following Python script parses the nested JSON structure, formats the dates and titles, and outputs them.

import json
import os
import csv
from datetime import datetime

def format_timestamp(ts_str):
    """Format ISO 8601 timestamps"""
    try:
        # fromisoformat() cannot parse a trailing 'Z' before Python 3.11
        if ts_str.endswith('Z'):
            ts_str = ts_str[:-1] + '+00:00'
        dt_object = datetime.fromisoformat(ts_str)
        return dt_object.strftime("%Y-%m-%d %H:%M:%S")
    except ValueError:
        return ts_str

def process_gemini_json(json_file_path, output_dir="exported_gemini_chats"):
    """Read GeminiChat.json and output Markdown and CSV"""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    all_chat_data = []

    try:
        with open(json_file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    except Exception as e:
        print(f"Error: {e}")
        return

    # Normalize data structure (ensure it's a list)
    if isinstance(data, dict):
        data = [data]

    print(f"Processing {len(data)} chat entries...")

    for i, chat_entry in enumerate(data):
        title = chat_entry.get('title', f"Untitled_Chat_{i+1}")
        # Remove characters unusable in filenames
        safe_title = "".join(c for c in title if c.isalnum() or c in (' ', '-', '_')).strip()
        if not safe_title: safe_title = f"chat_{i+1}"

        created_at = format_timestamp(chat_entry.get('create_time', 'Unknown'))

        # Generate Markdown
        md_filename = os.path.join(output_dir, f"{safe_title}.md")
        full_text = ""

        with open(md_filename, 'w', encoding='utf-8') as md_f:
            md_f.write(f"# {title}\n\n")
            md_f.write(f"Date: {created_at}\n\n")

            conversations = chat_entry.get('conversations', [])
            if not conversations and 'content' in chat_entry:
                # Fallback for different structures
                conversations = [{"speaker": "AI", "text": chat_entry.get('content')}]

            for convo in conversations:
                speaker = convo.get('speaker', 'Unknown')
                text = convo.get('text', '')
                md_f.write(f"## {speaker}\n{text}\n\n")
                full_text += f"{speaker}: {text}\n"

        all_chat_data.append({
            'title': title,
            'created_at': created_at,
            'summary': full_text.replace('\n', ' ')[:100] + '...',
            'file': md_filename
        })

    # CSV Output
    csv_file = os.path.join(output_dir, "summary.csv")
    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'created_at', 'summary', 'file'])
        writer.writeheader()
        writer.writerows(all_chat_data)

    print(f"Done: Saved to {output_dir}.")

if __name__ == "__main__":
    # Specify the JSON file path here
    # json_path = "takeout/Gemini/GeminiChat.json" 
    # process_gemini_json(json_path)
    print("Please specify a JSON path to run")
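Before running the converter on a real export, it can help to inspect the file first, especially given the "empty Zip" problem described above. The following sketch assumes the same structure the converter above expects (a list of objects with `title`, `create_time`, and `conversations`); adjust it if your Takeout export differs.

```python
import json
import os
import tempfile

def inspect_takeout_json(path):
    """Report basic stats for a GeminiChat.json before converting it.
    The expected structure (list of {title, create_time, conversations})
    is an assumption based on the converter script above."""
    size_kb = os.path.getsize(path) / 1024
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    entries = data if isinstance(data, list) else [data]
    titles = [e.get('title', '(untitled)') for e in entries]
    return {'size_kb': round(size_kb, 1), 'chats': len(entries), 'titles': titles}

# Quick check with a synthetic file standing in for a real export
sample = [{'title': 'Debugging session', 'create_time': '2025-01-01T09:00:00Z',
           'conversations': [{'speaker': 'User', 'text': 'Hi'}]}]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'GeminiChat.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    report = inspect_takeout_json(path)
    print(report['chats'], report['titles'][0])  # → 1 Debugging session
```

If the report shows zero chats or a suspiciously small file size, you are likely hitting the Takeout issue above and should fall back to Solution B.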

Solution B: Exporting via Chrome Extension (If Takeout Fails)

If Google Takeout does not work, or if the "Gemini Apps" item itself does not appear, an approach to directly save the information displayed in the browser is effective. By using Chrome extensions such as "ChatExporter for Gemini," you can extract text directly from the DOM and save it in Markdown format.

This method does not rely on server-side issues and can reliably save what is currently visible, making it highly effective as a backup. Even if there is a massive amount of history, it is crucial to select a tool that sequentially retrieves data in conjunction with browser scrolling.

Chapter 2: Building a Knowledge Base in the Google Ecosystem (GAS × AppSheet)

It would be a waste to leave the exported data as is. Next, we will rebuild this as a searchable "knowledge base" within Google Workspace. By combining Google Apps Script (GAS) and AppSheet, you can create a secure AI assistant that requires no server management.

Advantages of a Loosely Coupled Architecture

By keeping "Google Sheets (database)," "GAS (logic)," and "AppSheet (UI)" loosely coupled, this system achieves high maintainability.

  • Google Sheets: Saves conversation logs as structured data. Becomes the search target for RAG.
  • GAS: Calls the Gemini API and reads/writes to the sheet.
  • AppSheet: Provides an intuitive UI accessible from smartphones and PCs.

Implementation: GAS Code to Auto-Save Gemini History

The following code is a GAS function that receives a request from AppSheet, calls the Gemini 3.1 Pro API (the latest inference model), generates an answer, and appends the result to a spreadsheet.

const GEMINI_API_KEY = PropertiesService.getScriptProperties().getProperty('GEMINI_API_KEY');
const SHEET_ID = 'YOUR_SPREADSHEET_ID';

function callGemini(prompt, historyContext) {
  // Use Gemini 3.1 Pro Preview
  const url = `https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro-preview:generateContent?key=${GEMINI_API_KEY}`;

  // Append past context to prompt if available (Simple RAG)
  const finalPrompt = historyContext 
    ? `Please answer based on the following past context.\nContext: ${historyContext}\n\nQuestion: ${prompt}`
    : prompt;

  const payload = {
    "contents": [
      {
        "parts": [{"text": finalPrompt}]
      }
    ]
  };

  const options = {
    "method": "post",
    "contentType": "application/json",
    "payload": JSON.stringify(payload),
    "muteHttpExceptions": true
  };

  try {
    const response = UrlFetchApp.fetch(url, options);
    const json = JSON.parse(response.getContentText());
    if (json.candidates && json.candidates[0].content) {
      return json.candidates[0].content.parts[0].text;
    } else {
      return "Error: Could not generate a response.";
    }
  } catch (e) {
    return `Communication Error: ${e.toString()}`;
  }
}

// Function executed via Webhook or Trigger from AppSheet
function onUserQuery(e) {
  // * In reality, assumes receiving arguments from AppSheet Automation, etc.
  const userPrompt = e ? e.prompt : "Test Question"; 
  const sheet = SpreadsheetApp.openById(SHEET_ID).getSheetByName('Log');

  // Logic to search related information from past logs (Simplified)
  const lastRow = sheet.getLastRow();
  let context = "";
  if (lastRow > 1) {
    context = sheet.getRange(lastRow, 3).getValue(); // Use previous answer as context
  }

  const aiResponse = callGemini(userPrompt, context);

  // Save timestamp, prompt, and response
  sheet.appendRow([new Date(), userPrompt, aiResponse]);

  return aiResponse;
}

This script works by creating it as a GAS project from Google Sheets extensions and setting the API key in script properties. This allows all your conversations to accumulate in a spreadsheet, making it a searchable database.
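As a local complement to the spreadsheet, the same "searchable database" idea works on the summary.csv produced by the Chapter 1 script. This is a minimal sketch; the column names (`title`, `created_at`, `summary`, `file`) simply mirror what that script writes.

```python
import csv
import io

def search_chat_log(csv_text, keyword):
    """Return rows whose title or summary contains the keyword
    (case-insensitive). Columns follow the summary.csv from Chapter 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kw = keyword.lower()
    return [row for row in reader
            if kw in row['title'].lower() or kw in row['summary'].lower()]

# Tiny in-memory example instead of a real file
csv_text = (
    "title,created_at,summary,file\n"
    "RAG design,2025-01-02 10:00:00,Discussed chunking strategies...,chats/RAG design.md\n"
    "Career advice,2025-01-03 11:00:00,Talked about promotions...,chats/Career advice.md\n"
)
hits = search_chat_log(csv_text, "chunking")
print(len(hits), hits[0]['title'])  # → 1 RAG design
```

For a real file, replace the in-memory string with `open('exported_gemini_chats/summary.csv').read()`.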

Chapter 3: Automated Slide Generation with RTX 5090 and RAG

Once the knowledge base is ready, the next step is output automation. Here, we leverage the power of the latest high-end GPU "RTX 5090" to achieve content generation specialized in professional fields like law and technology, and automated output to Google Slides.

Local LLM Utilizing RTX 5090 (32GB VRAM)

The most significant feature of the RTX 5090 is its vast 32GB of VRAM and high inference performance thanks to the Blackwell architecture. This makes it possible to load mid-to-large LLMs such as Gemma 3 (27B) and Qwen2.5-32B almost entirely into VRAM with only light quantization.

In specialized tasks involving complex logical reasoning, general API-based models tend to hallucinate. In such cases, it is effective to fine-tune a local LLM on the RTX 5090 using the Unsloth library.

from unsloth import FastLanguageModel
import torch

def train_local_model():
    # Load model leveraging RTX 5090's 32GB VRAM
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gemma-3-27b-it", # Use Gemma 3 27B model
        max_seq_length = 4096,
        dtype = None, # Auto setup
        load_in_4bit = True, # 4-bit quantization to run 27B model comfortably
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_alpha = 16,
        lora_dropout = 0,
        bias = "none",
    )

    # Dataset loading and Trainer configuration go here
    print("Training started: RTX 5090 VRAM 32GB Environment")
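Whether a 27B model actually fits is easy to sanity-check: at 4-bit quantization each parameter takes about half a byte, plus an allowance for the KV cache, activations, and CUDA context. The flat 4GB overhead below is a ballpark assumption, not a measurement.

```python
def estimate_vram_gb(n_params_billion, bits_per_weight, overhead_gb=4.0):
    """Back-of-the-envelope VRAM estimate: quantized weights plus a flat
    allowance for KV cache, activations, and CUDA context (overhead_gb
    is a rough guess, not a measured value)."""
    weights_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

est = estimate_vram_gb(27, 4)  # Gemma 3 27B at 4-bit: 13.5GB weights + overhead
print(round(est, 1))  # → 17.5
print(est < 32)       # fits comfortably in the RTX 5090's 32GB → True
```

The same arithmetic shows why 8-bit (about 27GB of weights alone) would leave little headroom, which is why the training sketch above sets load_in_4bit=True.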

Automatic Conversion from Content to Slides

Highly accurate answers and explanations generated by local LLMs or the Gemini API are output structured in JSON format. This is read by Google Apps Script to automatically generate slides.

function generateSlidesFromData() {
  // Slide data to generate (Assuming received in JSON from Python)
  const slidesData = [
    {
      title: "Results",
      body: "Achieved accuracy comparable to Gemini 3.1 Pro in specialized interpretation.",
      points: ["Faster inference", "Lower cost"]
    }
  ];

  const presentation = SlidesApp.create("AI Generated Report_RTX5090 Verification");
  const slides = presentation.getSlides();
  if (slides.length > 0) slides[0].remove();

  slidesData.forEach(data => {
    const slide = presentation.appendSlide(SlidesApp.PredefinedLayout.TITLE_AND_BODY);

    const titleShape = slide.getShapes().find(s => s.getPlaceholderType() === SlidesApp.PlaceholderType.TITLE);
    if (titleShape) titleShape.getText().setText(data.title);

    const bodyShape = slide.getShapes().find(s => s.getPlaceholderType() === SlidesApp.PlaceholderType.BODY);
    if (bodyShape) {
      let textContent = data.body + "\n\n";
      data.points.forEach(p => textContent += `- ${p}\n`);
      bodyShape.getText().setText(textContent);
    }
  });

  Logger.log("Slide generation complete: " + presentation.getUrl());
}

This automation significantly reduces document creation time. Humans can focus on structure and content review, freed from the simple task of layout adjustments.
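On the Python side, producing the JSON that the GAS script consumes is straightforward. The schema below (`title`, `body`, `points`) is simply inferred from what generateSlidesFromData reads; adjust it if your version differs.

```python
import json

def build_slides_payload(sections):
    """Convert (title, body, bullet_points) tuples into the JSON shape
    the GAS slide generator reads: [{title, body, points}, ...].
    The schema is an assumption based on that script."""
    return json.dumps(
        [{"title": t, "body": b, "points": pts} for t, b, pts in sections],
        ensure_ascii=False, indent=2
    )

payload = build_slides_payload([
    ("Results", "Summary of the local LLM verification run.",
     ["Faster inference", "Lower cost"]),
])
data = json.loads(payload)
print(data[0]["title"], len(data[0]["points"]))  # → Results 2
```

In practice you would POST this payload to a GAS web app endpoint (or drop it into Drive) and have the GAS side parse it before building the slides.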

Chapter 4: Real-time Dashboard Construction (Streamlit + Gemini)

Lastly, we'll build a real-time dashboard useful for daily data analysis and monitoring. With the Python framework Streamlit, you can create web apps in a few lines of code, no HTML or CSS knowledge required. Integrate the Gemini API, and the result is a dashboard where AI interprets the meaning of your data in real time.

Implementing the Streamlit Application

The following code takes user input, performs sentiment analysis using Gemini 2.0 Flash (characterized by fast responses), and graphs the results in real-time.

import streamlit as st
import google.generativeai as genai
import os
import pandas as pd
import altair as alt
import json

# Set API Key (Recommended to read from environment variables)
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

if not GOOGLE_API_KEY:
    st.error("API key is not set.")
    st.stop()

genai.configure(api_key=GOOGLE_API_KEY)
# Use a fast model suitable for real-time processing
model = genai.GenerativeModel('gemini-2.0-flash')

def main():
    st.set_page_config(layout="wide", page_title="Gemini AI Dashboard")
    st.title("Gemini Real-time Analysis Dashboard")

    # Initialize session state
    if "messages" not in st.session_state:
        st.session_state.messages = []

    # Sidebar: Display history
    with st.sidebar:
        st.header("Analysis History")
        for msg in reversed(st.session_state.messages):
            st.text(f"{msg['role']}: {msg['content'][:20]}...")

    # Main Area: Input and Display
    user_input = st.text_input("Enter text to analyze:")

    if st.button("Start Analysis") and user_input:
        st.session_state.messages.append({"role": "User", "content": user_input})

        with st.spinner("Gemini is thinking..."):
            try:
                # Prompt forcing JSON output
                prompt = f"""
                Perform sentiment analysis on the following text and output the percentages of positive, negative, and neutral in JSON format.
                Set the key to 'sentiment' and return a list format with {{"category": "...", "percentage": ...}}.
                Text: {user_input}
                """
                response = model.generate_content(prompt)
                response_text = response.text

                # Extract JSON part (Simple processing)
                json_str = response_text.replace("```json", "").replace("```", "").strip()
                data = json.loads(json_str)

                # Draw graph
                if "sentiment" in data:
                    df = pd.DataFrame(data["sentiment"])
                    chart = alt.Chart(df).mark_arc().encode(
                        theta=alt.Theta(field="percentage", type="quantitative"),
                        color=alt.Color(field="category", type="nominal"),
                        tooltip=['category', 'percentage']
                    ).properties(title="Sentiment Analysis Results")

                    st.altair_chart(chart, use_container_width=True)
                    st.success("Analysis Complete")

                    st.session_state.messages.append({"role": "AI", "content": str(data)})

            except Exception as e:
                st.error(f"An error occurred: {e}")

if __name__ == "__main__":
    main()
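The fence-stripping in the app above is deliberately simple and breaks if Gemini adds prose around the JSON. A slightly more robust variant (a sketch, not the only approach) pulls the first JSON object out of the response with a regex:

```python
import json
import re

def extract_json(text):
    """Pull the first {...} block out of an LLM response, tolerating
    Markdown fences and surrounding prose. Returns None if nothing parses."""
    cleaned = re.sub(r"```(?:json)?", "", text)
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

reply = ('Sure! Here is the result:\n```json\n'
         '{"sentiment": [{"category": "positive", "percentage": 80}]}\n```')
data = extract_json(reply)
print(data["sentiment"][0]["category"])  # → positive
```

Swapping this helper into the Streamlit app makes the parsing step survive chatty model replies instead of raising a JSONDecodeError.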

This dashboard can be applied to various uses, such as analyzing customer feedback or detecting the mood in internal chats. Gemini's fast response and Streamlit's convenience accelerate prototyping speeds.

Conclusion: Building a Co-creative Relationship with AI

In this article, we introduced four phases to utilize the Gemini API.

  • Storage: Reliably export dialogue data using Chrome extensions or scripts.
  • Management: Build your own knowledge base using GAS and Google Sheets.
  • Generation: Leverage the power of RTX 5090 (32GB) to generate content via local LLMs and document it via GAS.
  • Visualization: Analyze data in real-time with Streamlit to support decision-making.

These technologies show their true value when combined rather than used alone. A powerful hardware environment like the RTX 5090 serves as a foundation for freely manipulating AI locally.

Start by exporting your history. Within it, you should find useful insights that trace your own thought processes.