Scriby

Turn audio into accurate transcripts

AI-powered transcription in 99 languages with speaker identification. Pay only for what you use: no subscriptions, no waste.

Get started in seconds • Pay only for what you use

Get started today • No credit card required upfront

Everything you need for accurate transcription

99 Languages

Transcribe audio in 99 languages, including English, Spanish, French, German, Chinese, Arabic, and more.

Speaker Diarization

Automatically identify and separate different speakers in your audio files with timestamps.

AI-Powered Metadata

Get automatic titles, summaries, and media type detection using advanced AI models.

Auto-Translation

Translate transcripts to your preferred language automatically with high accuracy.

Real-Time Updates

Watch your transcription progress live with instant notifications and status updates.

Multiple Formats

Export your transcripts to JSON, TXT, or SRT subtitle format for maximum flexibility.

Pay-As-You-Go

No subscriptions, no commitments. Pay only for the transcription you actually use.

Fast & Accurate

Up to 97% accuracy for English, with enterprise-grade AI and rapid processing times.

Simple, transparent pricing

No subscriptions. No commitments. Pay only for what you use.

Transcription

$0.55

per hour

Billed per second - no minimum charge

99 languages supported

Speaker diarization included

Pay per second of audio
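
As a rough illustration of per-second billing, the sketch below converts the listed $0.55 hourly rate into a cost for a file of a given length. The function name is hypothetical, and the result is left unrounded because this page does not say how fractions of a cent are handled.

```python
from decimal import Decimal

RATE_PER_HOUR = Decimal("0.55")  # listed transcription price, USD per hour of audio
SECONDS_PER_HOUR = Decimal(3600)


def transcription_cost(duration_seconds: int) -> Decimal:
    """Return the unrounded cost in USD for audio of the given length.

    Hypothetical helper: actual invoices may round differently.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return RATE_PER_HOUR * Decimal(duration_seconds) / SECONDS_PER_HOUR


# A 90-minute recording is 5,400 seconds of audio.
print(transcription_cost(5400))
```

Under these assumptions a 45-second voice memo costs a fraction of a cent, since there is no minimum charge.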

Translation

~$0.09

per hour of content

Based on content length

AI-powered translation

Context-aware accuracy

Multiple output languages

Why pay-as-you-go?

No subscription required

Pay exactly for what you use

Low minimum to start

Balance never expires

Get Started

How it works

Get accurate transcripts in four simple steps

Upload

Drop your audio or video file. We support most common formats.

Transcribe

AI processes your file with speaker identification and timestamps.

Review

Get AI-generated summaries, translations, and metadata automatically.

Export

Download in your preferred format: JSON, TXT, or SRT subtitles.

Powered by industry-leading AI

Enterprise Speech-to-Text

State-of-the-art transcription engine

99 Languages Supported

Including major world languages and dialects

Accurate Speaker ID

Precise diarization with speaker timestamps

Frequently Asked Questions

Everything you need to know about our transcription service

What is speech-to-text and how does it work?

Speech-to-text (STT) is a technology that converts spoken language into written text using automatic speech recognition (ASR). It analyzes audio signals, identifies speech patterns, and transcribes them into text with high accuracy.

Our AI-powered speech-to-text platform is designed to transcribe audio and video content with precision, making it ideal for professional transcription, content creation, and accessibility needs.

Speech-to-text technology is used for:

Podcasts, meetings, and interview transcription

Automated captions and subtitles for video

Accessibility tools for people who are deaf or hard of hearing

Content repurposing and searchable archives

Scriby offers fast, reliable, and highly accurate speech-to-text conversion for 99 languages with automatic speaker identification.

How do I transcribe video to text?

Scriby provides video transcription to convert spoken dialogue into text format, making it easy to create subtitles, captions, and searchable transcripts from your video content.

Steps to transcribe video to text:

Upload your video file – We support MP4, AVI, MKV, MOV, and most common video formats

AI processes the audio – Our speech recognition engine extracts and analyzes the audio track

Get your transcript – A complete transcript is generated automatically with speaker labels and timestamps

Download and export – Get your transcript as JSON, TXT, or SRT subtitle files for editing or publishing

This AI-powered video transcription helps content creators, businesses, and educators quickly convert video speech into accurate text for accessibility, SEO, and content repurposing.

What is speaker diarization?

Speaker diarization is an AI process that identifies and separates different speakers in an audio or video recording. It answers the question "who spoke when?" by analyzing voice characteristics and assigning labels to each speaker's segments.

Instead of getting one continuous transcript, you receive a structured conversation where each line is labeled with the speaker (Speaker 1, Speaker 2, etc.) and precise timestamps showing when they spoke.

Why diarization matters:

Clear attribution – Know exactly who said what in meetings, interviews, or podcasts

Better readability – Structured transcripts are easier to follow than a wall of text

Searchable conversations – Find specific speakers' contributions quickly

Professional output – Perfect for legal depositions, research interviews, and documentation

Scriby automatically performs speaker diarization on all transcriptions, identifying up to 32 different speakers with precise timestamps for each segment.
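
To make the idea of a speaker-labeled transcript concrete, here is a minimal sketch that merges consecutive segments from the same speaker and prints one line per turn. The Segment fields (speaker, start, end, text) are assumptions for illustration; the actual export schema may use different names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape; the real export schema may differ.
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Join consecutive segments from the same speaker into a single turn."""
    turns: list[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


def clock(seconds: float) -> str:
    """Format seconds as MM:SS for display (fractions are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


segments = [
    Segment("Speaker 1", 0.0, 3.2, "Thanks for joining."),
    Segment("Speaker 1", 3.2, 6.0, "Let's get started."),
    Segment("Speaker 2", 6.4, 9.1, "Happy to be here."),
]

for turn in merge_turns(segments):
    print(f"[{clock(turn.start)}-{clock(turn.end)}] {turn.speaker}: {turn.text}")
```

Merging adjacent segments is a presentation choice: it keeps each speaker's turn on one line while preserving the start and end times of the whole turn.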

What languages do you support?

We support 99 languages with enterprise-grade speech recognition, with 31 languages achieving excellent accuracy (≤5% word error rate).

All languages by accuracy:

Excellent (≤5% WER) – 31 languages:

Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Indonesian, Italian, Japanese, Kannada, Macedonian, Malay, Malayalam, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese

High Accuracy (>5% to ≤10% WER) – 19 languages:

Belarusian, Bengali, Bosnian, Cantonese, Estonian, Filipino, Gujarati, Hungarian, Kazakh, Latvian, Lithuanian, Mandarin, Marathi, Nepali, Odia, Persian, Slovenian, Tamil, Telugu

Good (>10% to ≤25% WER) – 30 languages:

Afrikaans, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Burmese, Cebuano, Croatian, Georgian, Hausa, Hebrew, Icelandic, Javanese, Kabuverdianu, Korean, Kyrgyz, Lingala, Maltese, Māori, Mongolian, Occitan, Punjabi, Sindhi, Swahili, Tajik, Thai, Urdu, Uzbek, Welsh

Moderate (>25% to ≤50% WER) – 19 languages:

Amharic, Chichewa, Fulah, Ganda, Igbo, Irish, Khmer, Kurdish, Lao, Luxembourgish, Luo, Northern Sotho, Pashto, Shona, Somali, Umbundu, Wolof, Xhosa, Zulu

WER = Word Error Rate. Lower is better.
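
For readers who want to see how WER is calculated, the sketch below uses the standard definition: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The lowercasing step and the absence of punctuation handling are simplifying assumptions, and this is not necessarily how the figures above were measured. The accuracy_tier helper simply applies the thresholds listed above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER (as a fraction, e.g. 0.05 for 5%) to the tiers listed above."""
    if wer <= 0.05:
        return "Excellent"
    if wer <= 0.10:
        return "High Accuracy"
    if wer <= 0.25:
        return "Good"
    if wer <= 0.50:
        return "Moderate"
    return "Outside listed tiers"


wer = word_error_rate("the quick brown fox jumps", "the quick brown dog jumps")
print(f"WER: {wer:.0%}, tier: {accuracy_tier(wer)}")
```

One substituted word out of five reference words gives a WER of 20%, which falls in the Good tier.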

What file formats can I upload?

We support all major audio and video formats for upload.

Audio formats:

MP3, AAC, WAV, FLAC, OGG, Opus, AIFF, M4A, WebM

Video formats:

MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GP

Export formats:

Download your transcripts as JSON, TXT, or SRT subtitle files
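
For context on the SRT option, the sketch below shows how timestamped segments map onto the SRT subtitle layout: a running index, a start and end time in HH:MM:SS,mmm form, the text, and a blank line between cues. The segment keys (start, end, speaker, text) and the choice to prefix each cue with the speaker label are assumptions for illustration, not a description of Scriby's actual export.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"  # speaker prefix is an assumption
        )
    return "\n".join(cues)


# Hypothetical segments, shaped like a simplified JSON export.
example = [
    {"start": 0.0, "end": 3.2, "speaker": "Speaker 1", "text": "Thanks for joining."},
    {"start": 6.4, "end": 9.1, "speaker": "Speaker 2", "text": "Happy to be here."},
]

print(to_srt(example))
```

The comma before the milliseconds is part of the SRT convention, so a timestamp such as 00:00:03,200 is intentional rather than a typo.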

How many speakers can you detect?

Our AI can automatically identify and separate up to 32 different speakers in your audio with precise timestamps for each speaker's segments.

How accurate is the transcription?

We achieve up to 97% accuracy for English and 95% or better accuracy (a word error rate of 5% or less) for 31 languages. Our AI can also detect non-speech sounds like laughter and applause.

How does your translation feature work?

Unlike basic word-for-word translators, we use advanced AI language models that understand context, nuance, and meaning. This means you get translations that actually make sense – preserving idioms, cultural references, and the natural flow of conversation. The AI analyzes the entire transcript to deliver professional-quality translations that read like they were written by a native speaker.