TalkNet-ASD
This is a heavily modified version of the TalkNet model from this repository. All credit goes to the author.
Pricing
We price per minute of video, in three resolution buckets: standard definition (≤720p), high definition (>720p, up to 1080p), and 4k (>1080p). If you supply face boxes, the price is multiplied by 0.7, since we can skip the expensive face detection step. If a debug visual is generated, the price is multiplied by 1.5 to cover the rendering time. The pay-as-you-go rates are listed below.
| Resolution | Price / Minute | Price / Minute with Debug Visualization | Price / Minute with Face Boxes | Price / Minute with Debug Visualization and Face Boxes |
|---|---|---|---|---|
| > 1080p (4k) | $0.13 | $0.195 | $0.091 | $0.1365 |
| > 720p (up to 1080p) | $0.065 | $0.0975 | $0.0455 | $0.06825 |
| ≤ 720p | $0.052 | $0.078 | $0.0364 | $0.0546 |
Note: If your video is poorly encoded, we re-encode it for you, since it would otherwise make the pipeline prohibitively slow. Re-encoding is charged at $0.01 per compute minute.
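The pricing rules above can be sketched as a small estimator. This is only an illustration: the function and constant names are hypothetical, it assumes the resolution bucket is determined by the video's height in pixels (the text does not say how resolution is measured), and it leaves out the re-encoding charge, which depends on compute time rather than video length.

```python
from decimal import Decimal

# Base pay-as-you-go rates per minute of video, copied from the table above.
BASE_RATE_PER_MINUTE = {
    "4k": Decimal("0.13"),   # > 1080p
    "hd": Decimal("0.065"),  # > 720p, up to 1080p
    "sd": Decimal("0.052"),  # <= 720p
}
FACE_BOXES_MULTIPLIER = Decimal("0.7")    # face detection step is skipped
DEBUG_VISUAL_MULTIPLIER = Decimal("1.5")  # extra rendering time


def resolution_bucket(height: int) -> str:
    """Map a video height in pixels to a pricing bucket (assumed rule)."""
    if height <= 720:
        return "sd"
    if height <= 1080:
        return "hd"
    return "4k"


def estimate_price(minutes, height, face_boxes_supplied=False, debug_visual=False):
    """Estimate the cost in dollars, excluding any re-encoding charge."""
    rate = BASE_RATE_PER_MINUTE[resolution_bucket(height)]
    if face_boxes_supplied:
        rate *= FACE_BOXES_MULTIPLIER
    if debug_visual:
        rate *= DEBUG_VISUAL_MULTIPLIER
    # Convert via str so float inputs such as 2.5 do not add binary noise.
    return rate * Decimal(str(minutes))
```

For example, `estimate_price(1, 2160, face_boxes_supplied=True, debug_visual=True)` reproduces the bottom-right figure of the 4k row, since both multipliers apply to the same base rate.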
Notes
Face Boxes
If you have already detected face bounding boxes and only want to perform speaker detection, you can skip the S3FD face detection step by supplying your own bounding boxes as a string, one box per line, in the format `frame_1,x0,y0,x1,y1,confidence`. Here is an example of a valid input:
10,767.00,219.00,1060.00,654.00,0.9
11,753.00,218.00,1064.00,651.00,0.9
...
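If your boxes come from your own detector, a string in this format can be built along the following lines. This is a minimal sketch: `format_face_boxes` is a hypothetical helper, and the two-decimal coordinate formatting simply mirrors the example above; whether that precision is required is not stated here.

```python
def format_face_boxes(boxes):
    """Serialize (frame, x0, y0, x1, y1, confidence) tuples, one box per line."""
    lines = []
    for frame, x0, y0, x1, y1, confidence in boxes:
        # The frame number is written as an integer; coordinates use two
        # decimals to match the example input above.
        lines.append(
            f"{int(frame)},{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f},{confidence}"
        )
    return "\n".join(lines)


# Hypothetical detections for two consecutive frames.
detections = [
    (10, 767.0, 219.0, 1060.0, 654.0, 0.9),
    (11, 753.0, 218.0, 1064.0, 651.0, 0.9),
]
face_boxes = format_face_boxes(detections)
```

The resulting `face_boxes` string matches the first two lines of the example input and can be passed wherever the face boxes are supplied.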
In Memory Threshold
For the `in_memory_threshold` param, we recommend a value of 3000 or less, as anything higher will cause memory overload in the response. Keeping frames in memory is a great way to make your request process faster. The default is 3000, so there shouldn't be a need to change this value.