I created this project as my entry to the Gemini Live Agent Challenge hackathon by Google/Devpost. #GeminiLiveAgentChallenge
Point your camera at the Colosseum. Within seconds, an AI voice tells you it’s 80 AD, 50,000 Romans are packed into the stands, and a photorealistic image of it at its peak appears on your screen.
That’s Chrono Lens — and here’s exactly how I built it.
The Idea
I wanted to build something that made history visceral. Not a chatbot that answers questions about landmarks — a companion that sees what you see, speaks to you about it, and shows you what it actually looked like centuries ago.
The moment I saw Gemini’s Live API — native audio, real-time vision, natural conversation — I knew that was the missing piece. This wasn’t going to be another text box. This was going to feel like magic.
What Chrono Lens Does
Point your camera at any architectural landmark anywhere in the world:
- 🔍 Identifies the landmark using Gemini 2.5 Flash vision
- 🎨 Reconstructs it in its most historically significant era using Imagen 4
- 🎙️ Narrates its history through Gemini’s native Puck voice
- 💬 Answers questions — tap the mic and ask anything
- 🏛️ Archives every discovery to the Chronos Vault
The output is genuinely interleaved — audio narration, generated historical image, contextual fact cards, and ambient music all arrive together as one cohesive experience.
The Architecture
Chrono Lens runs on a hybrid multimodal pipeline hosted entirely on Google Cloud:
- Next.js 14 frontend — captures camera frames, streams over WebSocket, plays back PCM audio via Web Audio API
- FastAPI on Google Cloud Run — orchestrates the entire pipeline
- Gemini 2.5 Flash — vision analysis and function calling for landmark detection
- Gemini Live API (gemini-2.5-flash-native-audio-preview) — native audio narration and voice Q&A
- Imagen 4 (imagen-4.0-generate-001) — photorealistic historical reconstructions
- Google Secret Manager — secure API key storage
- Google GenAI Python SDK — connects everything together
The Biggest Technical Challenge
The native audio model has a significant undocumented limitation: it only responds to the first send_client_content turn per session. Every subsequent message is silently ignored.
I tested every approach — send_realtime_input, manual VAD with ActivityStart/ActivityEnd, the turns=[] list syntax. I even wrote a standalone test script to confirm it wasn’t my application code causing the issue. It wasn’t. The model simply doesn’t support multi-turn text conversations in a single session.
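For reference, the standalone probe looked roughly like this: open one Live session, send two text turns, and count the audio bytes received per turn. The exact SDK calls mirror the google-genai Python SDK as used elsewhere in this post; treat the config and model string as a sketch rather than a canonical repro.

```python
import asyncio

AUDio = None  # placeholder removed below
AUDIO_MODEL = "gemini-2.5-flash-native-audio-preview"

async def probe_second_turn(api_key: str) -> list[int]:
    """Return the number of audio bytes received for each of two turns."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    config = {"response_modalities": ["AUDIO"]}
    received = []
    async with client.aio.live.connect(model=AUDIO_MODEL, config=config) as session:
        for prompt in ("Say hello.", "Now say goodbye."):
            await session.send_client_content(
                turns=[types.Content(role="user", parts=[types.Part(text=prompt)])],
                turn_complete=True,
            )
            total = 0
            async for response in session.receive():
                if response.data:
                    total += len(response.data)
                if getattr(response.server_content, "turn_complete", False):
                    break
            received.append(total)
    # On the preview model this came back as [nonzero, 0]:
    # the second turn was silently ignored.
    return received

if __name__ == "__main__":
    import os
    key = os.environ.get("GEMINI_API_KEY")
    if key:
        print(asyncio.run(probe_second_turn(key)))
```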
The solution: open a fresh Live session for each voice question, injecting the full conversation history and landmark context into the prompt:
async def answer_question_fresh_session(
    user_question: str,
    current_landmark: dict,
    conversation_history: list,
    websocket: WebSocket
):
    # Inject landmark context and the last three Q&A pairs, since the
    # fresh session has no memory of its own.
    context = f"We are discussing {current_landmark['location']} "
    context += f"from {current_landmark['era']}. "
    for entry in conversation_history[-3:]:
        context += f"User: {entry['q']} You said: {entry['a']} "
    context += f"User's question: {user_question}"

    # One session per question: the first (and only) turn always works.
    async with client.aio.live.connect(model=AUDIO_MODEL, config=config) as session:
        await session.send_client_content(
            turns=[types.Content(
                role="user",
                parts=[types.Part(text=context)]
            )],
            turn_complete=True
        )
        # Stream PCM audio chunks to the browser as they arrive.
        async for response in session.receive():
            if response.data:
                await websocket.send_bytes(response.data)
            if getattr(response.server_content, 'turn_complete', False):
                break
Each session opens in ~300ms. The user never notices the seam.
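The caller-side bookkeeping is simple. A small helper like the one below (hypothetical, but matching the [-3:] slice the handler relies on) records each Q&A pair and keeps only the most recent three:

```python
def remember(history: list, question: str, answer: str, keep: int = 3) -> list:
    """Append one Q&A pair and cap the history at the last `keep` entries."""
    history.append({"q": question, "a": answer})
    return history[-keep:]

history = []
history = remember(history, "When was it built?", "Construction began in 72 AD.")
history = remember(history, "How many seats?", "Roughly 50,000 spectators.")
```

Capping the history keeps the injected context short, which matters when every question pays the cost of a brand-new session.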
The Interleaved Multimodal Output
When a landmark is identified, four things happen simultaneously:
- Imagen 4 generates the era-accurate historical reconstruction
- Gemini Live narrates the history in Puck’s warm voice
- Gemini 2.5 Flash generates 3 curated fact cards + a poetic 6-word tagline
- Ambient music tag triggers location-appropriate audio on the frontend
All four outputs arrive together — this is what “interleaved multimodal output” actually looks like in practice. Not sequential. Not one-at-a-time. Simultaneous.
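Concurrency here is plain asyncio. The sketch below shows the shape of it with asyncio.gather; the four coroutine names are illustrative stand-ins for the actual Imagen 4, Live narration, fact-card, and music-tag calls:

```python
import asyncio

# Stubs standing in for the real API calls.
async def generate_image(landmark): return f"imagen:{landmark}"
async def narrate(landmark): return f"audio:{landmark}"
async def fact_cards(landmark): return [f"fact {i} about {landmark}" for i in range(3)]
async def music_tag(landmark): return "ambient-roman"

async def interleaved(landmark: str):
    # All four tasks start at once; gather returns when the slowest finishes,
    # but each real task can stream to the client as soon as it has data.
    return await asyncio.gather(
        generate_image(landmark),
        narrate(landmark),
        fact_cards(landmark),
        music_tag(landmark),
    )

results = asyncio.run(interleaved("Colosseum"))
```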
Deploying on Google Cloud Run
WebSockets on Cloud Run require specific configuration. The two most important flags:
gcloud run deploy chrono-lens-backend \
--memory 2Gi \
--timeout 3600 \
--min-instances 1 \
--update-secrets GEMINI_API_KEY=GEMINI_API_KEY:latest
--timeout 3600 is critical — without it Cloud Run terminates WebSocket connections after 60 seconds, killing the session mid-narration.
--min-instances 1 keeps one instance warm so there’s no cold start latency.
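With --update-secrets, Cloud Run exposes the secret to the container as an environment variable, so the backend just reads it at startup. A minimal sketch, with a defensive strip() for stray whitespace (see the `echo -n` trap in the lessons below):

```python
import os

def load_api_key() -> str:
    """Read the Gemini key injected by Cloud Run's --update-secrets flag."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; check --update-secrets")
    return key.strip()
```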
What I Learned
Work with the streaming model, not against it. The Gemini Live API is designed for continuous audio streaming. Forcing structured request-response patterns onto it causes silent failures. Embrace the streaming nature — fresh sessions, injected context, parallel async tasks.
Gemini 2.5 Flash’s function calling is remarkably reliable. It identified the Fairmont Hairpin at Monaco, the Motherland Calls statue in Volgograd, and Diskit Monastery in Ladakh without hesitation — landmarks that would stump most people.
Imagen 4 is genuinely photorealistic. The prompt engineering matters enormously. Specifying film grain, lighting conditions, and era-appropriate crowds makes the difference between a generic render and something that looks like a recovered photograph.
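A hypothetical prompt builder capturing what worked for me; the parameter names are my own, but the ingredients (film grain, lighting, era-appropriate crowds) are the ones that moved the output from render to recovered photograph:

```python
def build_imagen_prompt(landmark: str, era: str, crowd: str, light: str) -> str:
    """Assemble an Imagen 4 prompt that reads like a period photograph."""
    return (
        f"Photorealistic reconstruction of {landmark} in {era}. "
        f"{crowd}. {light}. "
        "Shot on 35mm film, visible film grain, natural color fading, "
        "slightly soft focus, as if a recovered period photograph."
    )

prompt = build_imagen_prompt(
    "the Colosseum", "80 AD",
    "50,000 Roman spectators filling the stands",
    "harsh midday Mediterranean sunlight",
)
```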
Windows CMD is a trap. The echo -n flag doesn’t exist on Windows — it stores -n as a literal prefix in Secret Manager. Cost me an hour of invalid API key errors on Cloud Run.
Try It
The app is live and open source:
🌐 Live app: https://chronos-lens.vercel.app/
💻 GitHub: https://github.com/suyogs1/ChronosLens
Built For
This project was built for the Gemini Live Agent Challenge by Google/Devpost — Creative Storyteller category.
