SourceFiles.org - Use the Source, Luke

ccextractor, v0.01

Author: cfsmp3@gmail.com. Most of the credit goes to other people, though: McPoodle (author of the original SCC_RIP), Neuron2, and others (see the source code).

License

GPL 2.0.

Description

ccextractor is mostly a mildly optimized C port of McPoodle's excellent but painfully slow Perl script SCC_RIP. It lets you rip the raw closed caption (read: subtitle) data from a number of sources, such as DVDs or ReplayTV recordings.

As an added bonus compared to the original SCC_RIP, ccextractor can extract subtitles from the HDTV transport streams that are becoming more common.

At this point ccextractor extracts the line 21 captions (which must legally be present for a number of years until the transition to digital is complete). Note that most .ts files you can find carry subtitle data for both analog (EIA-608) and digital (EIA-708) decoders. AFAIK there are no freely available EIA-708 rippers.

Anyway, since line 21 captions will be available for some time, we have time to build a decent 708 ripper.

Basic Usage

For details on CC, please go to McPoodle's page:

http://www.geocities.com/mcpoodle43/SCC_TOOLS/DOCS/SCC_TOOLS.HTML

You will need his tools to use ccextractor's output.

The basic idea is that you get the raw closed caption dump from ccextractor. Then you need other tools (which vary depending on what you want to do) to continue processing.

To get a transcript in .srt format from a .ts file (I assume this will be the most common use), do this:

ccextractor -12 input_file

-12 means "extract both subtitle tracks" (the technical name is fields, but tracks is easier to understand). Field 1 is almost always English. Field 2 is Spanish on HBO (at least in the few samples I've seen) but could be anything. Just extract both of them and check.

Example: ccextractor -12 house315.ts

ccextractor will create two files, called house315_1.bin and house315_2.bin.

Then use McPoodle's RAW2SCC to create a temporary SCC file (SCC stands for Scenarist Closed Caption, originally the native format of the Scenarist authoring software; the details are not important here).

raw2scc house315_1.bin

This creates house315_1.scc

From this .scc file, you can get the final .srt by using McPoodle's CCASDI:

ccasdi -s house315_1.srt

The resulting .srt file looks like this (just three sample entries shown):

514
00:24:07,400 --> 00:24:09,300
They've got another trial
going on at Duke.

515
00:24:09,367 --> 00:24:12,567
15% extend their lives
beyond five years.

516
00:24:12,634 --> 00:24:13,701
If you're positive
for protein PHF--
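
The three steps above can also be chained from a script. The following is only a sketch in Python: build_commands and run_pipeline are hypothetical helper names, the output file naming is inferred from the house315 example, and each command line is copied as written above rather than checked against the tools' own documentation.

```python
import subprocess
from pathlib import Path


def build_commands(ts_file, field=1):
    """Return the three commands from the walkthrough for one caption field."""
    # Assumption: outputs are named after the input's base name,
    # e.g. "house315.ts" -> "house315_1.bin", as in the example above.
    stem = Path(ts_file).stem
    base = f"{stem}_{field}"
    return [
        ["ccextractor", "-12", ts_file],   # writes <stem>_1.bin and <stem>_2.bin
        ["raw2scc", f"{base}.bin"],        # writes <base>.scc
        ["ccasdi", "-s", f"{base}.srt"],   # invocation copied verbatim from the text
    ]


def run_pipeline(ts_file, field=1):
    """Run each step in order; check=True stops at the first failing tool."""
    for cmd in build_commands(ts_file, field):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("house315.ts") would run the same three commands shown above for field 1, provided all three tools are on the PATH. Pass field=2 to process the second field instead.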

Known issues

Sometimes the timing sucks. Text extraction seems to work consistently, but the timing usually gets screwed up. No idea why yet, and no idea whether I will try to fix it or focus on EIA-708 extraction, which is needed anyway.
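
One way to get a rough feel for how bad the timing is in a given .srt is to parse the cue times and look for entries that are internally inconsistent. The sketch below (Python, with hypothetical helper names parse_srt and timing_problems) only flags cues that end before they start or that overlap the previous cue. It cannot tell whether the times match the video, and it says nothing about the cause of the problem.

```python
import re

TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp to milliseconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Return a list of (index, start_ms, end_ms, caption_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        start, end = (to_ms(t.strip()) for t in lines[1].split("-->"))
        cues.append((int(lines[0]), start, end, "\n".join(lines[2:])))
    return cues


def timing_problems(cues):
    """Flag cues with non-positive duration or that overlap the previous cue."""
    problems = []
    prev_end = None
    for index, start, end, _ in cues:
        if end <= start:
            problems.append((index, "non-positive duration"))
        if prev_end is not None and start < prev_end:
            problems.append((index, "overlaps previous cue"))
        prev_end = end
    return problems


# The three sample entries from the Basic Usage section.
SAMPLE = """514
00:24:07,400 --> 00:24:09,300
They've got another trial
going on at Duke.

515
00:24:09,367 --> 00:24:12,567
15% extend their lives
beyond five years.

516
00:24:12,634 --> 00:24:13,701
If you're positive
for protein PHF--
"""
```

Running timing_problems(parse_srt(SAMPLE)) on the sample entries reports nothing, since those three cues are in order and do not overlap. A file with broken timing may still pass these checks if its cues are merely shifted.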


Copyright 2006 Sourcefiles.org All rights reserved.