Emerald

In-browser speech recognition with word-level timestamps, powered by Transformers.js and ONNX Runtime.

Emerald is a speech recognition tool that runs entirely in your browser. It uses Transformers.js and ONNX Runtime to transcribe audio with word-level timestamps, with no server-side processing.

Key Features

🎤 In-browser speech recognition - No server calls, all processing happens locally
⏱️ Word-level timestamps - Navigate to specific parts of your audio/video by clicking on words
🌐 Multi-language support - Transcribe content in various languages
🖥️ WebGPU acceleration - Uses your GPU for faster processing when available
🔒 Privacy-focused - Your audio never leaves your device
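
The "when available" part of WebGPU acceleration implies some kind of capability check with a fallback. The following is a minimal sketch of one way to do that, not Emerald's actual code: `pickBackend`, `hasWebGpuApi`, and `NavigatorLike` are hypothetical names, and falling back to the "wasm" backend is an assumption.

```typescript
// Backend names as used by the Transformers.js `device` option.
type Backend = "webgpu" | "wasm";

// Only the part of `navigator` this sketch needs (hypothetical helper type).
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

// Cheap synchronous check: does the browser expose the WebGPU API at all?
function hasWebGpuApi(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined;
}

// Stricter check: the API can exist while no usable adapter is granted,
// so request one and fall back to WebAssembly if that fails.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (nav.gpu === undefined) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a browser this could be called as `pickBackend(navigator as NavigatorLike)`, with the result passed on as the device setting when the model is loaded.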

How It Works

Emerald uses a lightweight version of the Whisper speech recognition model, converted to ONNX format. The model runs directly in your browser on ONNX Runtime Web and uses WebGPU for acceleration when available. The application processes audio in chunks and returns both the transcribed text and start and end times for each spoken word.
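
To illustrate the chunked processing described above, here is a rough sketch of splitting audio into overlapping windows and mapping chunk-relative word times back to the full recording. It is an illustration under stated assumptions, not Emerald's implementation: the 30-second window and 5-second overlap are assumed defaults, `splitIntoChunks` and `toAbsoluteTime` are hypothetical helpers, and in practice the library may handle chunking internally.

```typescript
// Whisper models expect 16 kHz mono audio.
const SAMPLE_RATE = 16000;

interface Chunk {
  samples: Float32Array;  // view into the original audio, not a copy
  offsetSeconds: number;  // where this chunk starts in the full recording
}

interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Split audio into fixed-size windows that overlap, so a word cut off at
// one chunk boundary appears whole in the next chunk.
function splitIntoChunks(
  audio: Float32Array,
  chunkSeconds = 30,  // assumed window length
  overlapSeconds = 5, // assumed overlap
): Chunk[] {
  const size = chunkSeconds * SAMPLE_RATE;
  const step = (chunkSeconds - overlapSeconds) * SAMPLE_RATE;
  if (step <= 0) throw new RangeError("overlap must be shorter than the chunk");

  const chunks: Chunk[] = [];
  for (let start = 0; start < audio.length; start += step) {
    chunks.push({
      samples: audio.subarray(start, start + size),
      offsetSeconds: start / SAMPLE_RATE,
    });
    if (start + size >= audio.length) break; // this window reached the end
  }
  return chunks;
}

// Word times from the model are relative to the chunk; shift them so they
// refer to positions in the full recording.
function toAbsoluteTime(words: Word[], offsetSeconds: number): Word[] {
  return words.map((w) => ({
    ...w,
    start: w.start + offsetSeconds,
    end: w.end + offsetSeconds,
  }));
}
```

This sketch does not deduplicate words that fall inside the overlap between two chunks; a complete pipeline would need to merge those.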

Usage

Load the model - Click the "Load Model" button when you first open the application
Select your media - Upload an audio/video file or record directly from your microphone
Choose language - Select the language of the audio content
Transcribe - Click the "Transcribe Audio" button to start processing
Navigate - Click on any word in the transcript to jump to that timestamp in the audio/video
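
The click-to-navigate step can be sketched as two small helpers: one that seeks the player to a clicked word, and one that finds the word being spoken at the current playback time (useful for highlighting). The names `TimedWord`, `Seekable`, `seekToWord`, and `activeWordIndex` are hypothetical, and the word list is assumed to be sorted by start time.

```typescript
interface TimedWord {
  text: string;
  start: number; // seconds from the beginning of the media
  end: number;
}

// The only media-element property this sketch touches; an <audio> or
// <video> element satisfies it through its `currentTime` property.
interface Seekable {
  currentTime: number;
}

// Jump playback to the start of the clicked word. Out-of-range indices
// are ignored.
function seekToWord(media: Seekable, words: TimedWord[], index: number): void {
  const word = words[index];
  if (word !== undefined) media.currentTime = word.start;
}

// Binary search for the last word whose start time is <= `time`.
// Returns -1 when playback is before the first word.
function activeWordIndex(words: TimedWord[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const word = words[mid];
    if (word !== undefined && word.start <= time) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

In a UI, `seekToWord` would be wired to each word's click handler, and `activeWordIndex` could be called from the media element's `timeupdate` event.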

Development Journey

This project was inspired by the growing need for accessible, private speech recognition tools that don't require sending sensitive audio data to external servers. Building on recent advances in browser-based ML frameworks, Emerald runs speech recognition directly on the user's device.

Use Cases

Content creators adding captions to videos
Students transcribing lectures and navigating to specific topics
Journalists transcribing sensitive interviews without uploading them to a third party
Accessibility enhancement for audio/video content
Language learners practicing pronunciation and comprehension

Technologies

Transformers.js
ONNX Runtime
WebGPU
React
Web Audio API