AI-powered transcription in 99 languages with speaker identification. Pay only for what you use: no subscriptions, no waste.
Get started in seconds • Pay only for what you use
Get started today • No credit card required upfront
Powered by cutting-edge enterprise-grade AI technology
Transcribe audio in 99 languages, including English, Spanish, French, German, Chinese, Arabic, and more.
Automatically identify and separate different speakers in your audio files with timestamps.
Get automatic titles, summaries, and media type detection using advanced AI models.
Translate transcripts to your preferred language automatically with high accuracy.
Watch your transcription progress live with instant notifications and status updates.
Export your transcripts to JSON, TXT, or SRT subtitle format for maximum flexibility.
No subscriptions, no commitments. Pay only for the transcription you actually use.
Industry-leading accuracy with enterprise-grade AI and rapid processing times.
No subscriptions. No commitments. Pay only for what you use.
per hour
Billed per second - no minimum charge
99 languages supported
Speaker diarization included
Pay per second of audio
per hour of content
Based on content length
AI-powered translation
Context-aware accuracy
Multiple output languages
No subscription required
Pay exactly for what you use
Low minimum to start
Balance never expires
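To make per-second billing concrete, here is a minimal sketch that prorates an hourly rate over the exact length of a file. The rate is a placeholder rather than Scriby's actual price, and the function name and the rounding to the nearest cent are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

SECONDS_PER_HOUR = 3600


def prorated_cost(duration_seconds: int, hourly_rate: Decimal) -> Decimal:
    """Cost of `duration_seconds` of audio at `hourly_rate`, billed per second."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    cost = hourly_rate * Decimal(duration_seconds) / Decimal(SECONDS_PER_HOUR)
    # Assumption: the final amount is rounded to the nearest cent.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Placeholder rate for illustration only; not a real price.
EXAMPLE_HOURLY_RATE = Decimal("1.00")

# A 12 minute 34 second recording is 754 seconds.
print(prorated_cost(754, EXAMPLE_HOURLY_RATE))
```

With no minimum charge, a zero-length file costs nothing and a full hour costs exactly the hourly rate.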
Get accurate transcripts in four simple steps
Drop your audio or video file. We support most common formats.
AI processes your file with speaker identification and timestamps.
Get AI-generated summaries, translations, and metadata automatically.
Download in your preferred format: JSON, TXT, or SRT subtitles.
Enterprise Speech-to-Text
State-of-the-art transcription engine
99 Languages Supported
Including major world languages and dialects
Accurate Speaker ID
Precise diarization with speaker timestamps
Everything you need to know about our transcription service
Speech-to-text (STT) is a technology that converts spoken language into written text using automatic speech recognition (ASR). It analyzes audio signals, identifies speech patterns, and transcribes them into text with high accuracy.
Our AI-powered speech-to-text platform is designed to transcribe audio and video content with precision, making it ideal for professional transcription, content creation, and accessibility needs.
Speech-to-text technology is used for:
Podcasts, meetings, and interview transcription
Automated captions and subtitles for video
Accessibility tools for people who are deaf or hard of hearing
Content repurposing and searchable archives
Scriby offers fast, reliable, and highly accurate speech-to-text conversion for 99 languages with automatic speaker identification.
Scriby provides video transcription to convert spoken dialogue into text format, making it easy to create subtitles, captions, and searchable transcripts from your video content.
Steps to transcribe video to text:
Upload your video file – We support MP4, AVI, MKV, MOV, and most common video formats
AI processes the audio – Our speech recognition engine extracts and analyzes the audio track
Get your transcript – A complete transcript is generated automatically with speaker labels and timestamps
Download and export – Get your transcript as JSON, TXT, or SRT subtitle files for editing or publishing
This AI-powered video transcription helps content creators, businesses, and educators quickly convert video speech into accurate text for accessibility, SEO, and content repurposing.
Speaker diarization is an AI process that identifies and separates different speakers in an audio or video recording. It answers the question "who spoke when?" by analyzing voice characteristics and assigning labels to each speaker's segments.
Instead of getting one continuous transcript, you receive a structured conversation where each line is labeled with the speaker (Speaker 1, Speaker 2, etc.) and precise timestamps showing when they spoke.
Why diarization matters:
Clear attribution – Know exactly who said what in meetings, interviews, or podcasts
Better readability – Structured transcripts are easier to follow than a wall of text
Searchable conversations – Find specific speakers' contributions quickly
Professional output – Well suited to legal depositions, research interviews, and documentation
Scriby automatically performs speaker diarization on all transcriptions, identifying up to 32 different speakers with precise timestamps for each segment.
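The structured, speaker-labeled output described above can be pictured with a short sketch. The `Segment` fields and helper names below are hypothetical and do not reflect Scriby's actual export schema; the example only shows how diarized segments might be merged into speaker turns and rendered as a readable conversation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical fields; the real export schema may differ.
    speaker: str   # e.g. "Speaker 1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions are truncated)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


def render(segments: list[Segment]) -> str:
    """One line per speaker turn, prefixed with its start time."""
    return "\n".join(
        f"[{format_timestamp(turn.start)}] {turn.speaker}: {turn.text}"
        for turn in merge_turns(segments)
    )


segments = [
    Segment("Speaker 1", 0.0, 2.5, "Welcome to the show."),
    Segment("Speaker 1", 2.5, 4.0, "Thanks for joining."),
    Segment("Speaker 2", 4.2, 6.0, "Glad to be here."),
]
print(render(segments))
```

Merging adjacent segments by speaker is a presentation choice: it keeps each turn on one line while preserving the time at which the turn began.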
We support 99 languages with enterprise-grade speech recognition. Of these, 31 achieve excellent accuracy (≤5% word error rate).
Excellent (≤5% WER) – 31 languages:
Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Indonesian, Italian, Japanese, Kannada, Malay, Malayalam, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese
High Accuracy (>5% to ≤10% WER) – 19 languages:
Bengali, Belarusian, Bosnian, Cantonese, Estonian, Filipino, Gujarati, Hungarian, Kazakh, Latvian, Lithuanian, Mandarin, Marathi, Nepali, Odia, Persian, Slovenian, Tamil, Telugu
Good (>10% to ≤25% WER) – 30 languages:
Afrikaans, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Burmese, Cebuano, Croatian, Georgian, Hausa, Hebrew, Icelandic, Javanese, Kabuverdianu, Korean, Kyrgyz, Lingala, Maltese, Mongolian, Māori, Occitan, Punjabi, Sindhi, Swahili, Tajik, Thai, Urdu, Uzbek, Welsh
Moderate (>25% to ≤50% WER) – 19 languages:
Amharic, Chichewa, Fulah, Ganda, Igbo, Irish, Khmer, Kurdish, Lao, Luxembourgish, Luo, Northern Sotho, Pashto, Shona, Somali, Umbundu, Wolof, Xhosa, Zulu
WER = Word Error Rate. Lower is better.
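For readers unfamiliar with the metric, the sketch below computes WER in its standard textbook form: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. It then maps the result onto the tiers listed above. This is not necessarily how the figures on this page were measured, and the "Unrated" label for values above 50% is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER (as a fraction, e.g. 0.05 for 5%) to the tiers listed above."""
    if wer <= 0.05:
        return "Excellent"
    if wer <= 0.10:
        return "High Accuracy"
    if wer <= 0.25:
        return "Good"
    if wer <= 0.50:
        return "Moderate"
    return "Unrated"  # Assumption: the page lists no tier above 50% WER.


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.1%}", accuracy_tier(wer))
```

Here one of six reference words is substituted, so the WER is one sixth and the sample falls in the Good tier.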
We support all major audio and video formats for upload.
Audio formats:
MP3, AAC, WAV, FLAC, OGG, Opus, AIFF, M4A, WebM
Video formats:
MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GP
Export formats:
Download your transcripts as JSON, TXT, or SRT subtitle files
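To show how the export formats relate to one another, here is a minimal sketch that converts a JSON transcript into SRT subtitles. The JSON field names (`segments`, `speaker`, `start`, `end`, `text`) are assumptions for illustration and may not match the actual export; the SRT layout itself (a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, then a blank line) is the standard one.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[dict]) -> str:
    """Build SRT text with one numbered cue per transcript segment."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hypothetical JSON export; the real schema may differ.
export = json.loads("""
{
  "segments": [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.5, "text": "Welcome to the show."},
    {"speaker": "Speaker 2", "start": 2.75, "end": 5.0, "text": "Glad to be here."}
  ]
}
""")

srt_text = to_srt(export["segments"])
print(srt_text)
```

Prefixing each cue with the speaker label is optional; drop it if the subtitles should show only the spoken text.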
Our AI can automatically identify and separate up to 32 different speakers in your audio with precise timestamps for each speaker's segments.
We achieve up to 97% accuracy for English and over 95% accuracy for 25+ languages. Our AI can also detect non-speech sounds like laughter and applause.
Unlike basic word-for-word translators, we use advanced AI language models that understand context, nuance, and meaning. This means you get translations that actually make sense – preserving idioms, cultural references, and the natural flow of conversation. The AI analyzes the entire transcript to deliver professional-quality translations that read like they were written by a native speaker.