
Emerald
In-browser speech recognition with word-level timestamps, powered by Transformers.js and ONNX Runtime.
Emerald is a speech recognition tool that runs entirely in your browser. Built on Transformers.js and ONNX Runtime, it produces transcripts with word-level timestamps and requires no server-side processing.
Key Features
- 🎤 In-browser speech recognition - No server-side transcription, all processing happens locally
- ⏱️ Word-level timestamps - Navigate to specific parts of your audio/video by clicking on words
- 🌐 Multi-language support - Transcribe content in various languages
- 🖥️ WebGPU acceleration - Uses your GPU for faster processing when available
- 🔒 Privacy-focused - Your audio never leaves your device
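The "when available" part of WebGPU acceleration implies a runtime check with a fallback. The following is a minimal sketch of one way to make that decision, not Emerald's actual code: `pickDevice`, `NavigatorLike`, and `GpuLike` are hypothetical names, and returning the strings `"webgpu"` and `"wasm"` assumes they are passed on as the `device` option of Transformers.js.

```typescript
// Hypothetical structural types so the check does not depend on WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Returns "webgpu" only if the browser exposes WebGPU AND hands out an adapter.
// Any failure falls back to "wasm" (CPU execution).
async function pickDevice(nav: NavigatorLike): Promise<"webgpu" | "wasm"> {
  if (!nav.gpu) return "wasm"; // API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null adapter: no usable GPU
  } catch {
    return "wasm"; // treat errors as "not available"
  }
}
```

In a browser this would be called as `pickDevice(navigator as NavigatorLike)`. Checking for an adapter, rather than only for the presence of `navigator.gpu`, covers browsers that expose the API but have no usable GPU.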
How It Works
Emerald uses a lightweight version of the Whisper speech recognition model, converted to ONNX format. The model runs directly in your browser on ONNX Runtime Web and uses WebGPU for acceleration when available. The application processes audio in chunks and returns both the transcribed text and start and end times for each word spoken.
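To illustrate what chunked processing involves, here is a minimal sketch of splitting audio into overlapping windows and mapping chunk-relative word times back onto the full recording. It is an illustration under stated assumptions, not Emerald's implementation: `planChunks`, `toAbsolute`, `Chunk`, and `Word` are hypothetical names, and the 5-second overlap is an arbitrary choice. The 16 kHz sample rate and 30-second window match Whisper's input format. Transformers.js can also handle chunking internally, so Emerald may not do this by hand.

```typescript
// Hypothetical types; not taken from Emerald's source.
interface Chunk {
  startSample: number;   // inclusive
  endSample: number;     // exclusive
  offsetSeconds: number; // where this chunk starts in the full recording
}
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Split `totalSamples` of audio into fixed-length windows that overlap,
// so a word cut off at one chunk boundary appears whole in the next chunk.
function planChunks(
  totalSamples: number,
  sampleRate = 16000,
  chunkSeconds = 30,
  overlapSeconds = 5 // assumption: chosen for illustration only
): Chunk[] {
  if (overlapSeconds >= chunkSeconds) {
    throw new Error("overlap must be shorter than the chunk");
  }
  const chunkLen = Math.round(chunkSeconds * sampleRate);
  const step = chunkLen - Math.round(overlapSeconds * sampleRate);
  const chunks: Chunk[] = [];
  for (let start = 0; start < totalSamples; start += step) {
    const end = Math.min(start + chunkLen, totalSamples);
    chunks.push({ startSample: start, endSample: end, offsetSeconds: start / sampleRate });
    if (end === totalSamples) break; // last window reached the end of the audio
  }
  return chunks;
}

// Convert chunk-relative word times into times on the full recording.
function toAbsolute(words: Word[], offsetSeconds: number): Word[] {
  return words.map((w) => ({
    ...w,
    start: w.start + offsetSeconds,
    end: w.end + offsetSeconds,
  }));
}
```

A real pipeline would also need to deduplicate words that appear in the overlap of two neighbouring chunks; that step is omitted here.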
Usage
- Load the model - Click the "Load Model" button when you first open the application
- Select your media - Upload an audio/video file or record directly from your microphone
- Choose language - Select the language of the audio content
- Transcribe - Click the "Transcribe Audio" button to start processing
- Navigate - Click on any word in the transcript to jump to that timestamp in the audio/video
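The navigation step relies on each word carrying its own start and end time. As a minimal sketch of how click-to-seek and the reverse lookup (which word is being spoken at the current playback time) could work, assuming hypothetical names `TimedWord`, `Seekable`, `seekToWord`, and `findWordIndexAt` that do not come from Emerald's source:

```typescript
// Hypothetical shape of one transcribed word, with times in seconds.
interface TimedWord {
  text: string;
  start: number;
  end: number;
}

// Minimal structural type; HTMLAudioElement and HTMLVideoElement both satisfy it.
interface Seekable {
  currentTime: number;
}

// Clicking a word moves playback to the moment that word begins.
function seekToWord(media: Seekable, word: TimedWord): void {
  media.currentTime = word.start;
}

// Binary search for the last word that has started at `time`.
// Assumes `words` is sorted by `start`. Returns -1 before the first word.
function findWordIndexAt(words: TimedWord[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (words[mid].start <= time) {
      found = mid; // candidate; a later word may also have started
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

In a page, `seekToWord` would typically run in a click handler attached to each word's element, and `findWordIndexAt` in a `timeupdate` listener to highlight the current word during playback.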
Development Journey
This project was inspired by the growing need for accessible, private speech recognition tools that don't send sensitive audio data to external servers. Building on recent advances in browser-based ML frameworks, Emerald runs speech recognition directly on the user's device.
Use Cases
- Content creators adding captions to videos
- Students transcribing lectures and navigating to specific topics
- Journalists transcribing sensitive interviews without uploading the audio
- Accessibility enhancement for audio/video content
- Language learners practicing pronunciation and comprehension