Visual QA

Sieve’s Visual-QA app lets developers analyze and extract insights from images, videos, and audio files. It accepts custom prompts and can return structured responses in JSON format for detailed, machine-readable analysis. The app also supports cost-effective processing through parameters like backend selection and fps, so you can tune it for your use case.

Key Features

  • Image, Video, and Audio Analysis: The app can process different types of media files to answer questions.
  • Structured Responses: Supports plain-text responses as well as structured JSON responses via the function_json parameter. For details, check out the notes section below.
  • Customizable Prompts: Allows users to specify prompts for tailored use-cases.
  • Cost-Effective Parameters: Lets developers tune parameters such as backend, fps, start_time, and end_time to optimize for their use case, as sketched below.
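
Here's a minimal sketch of a call that uses these parameters, assuming the Sieve Python client's sieve.function.get(...).run(...) pattern; the app reference "sieve/visual-qa", the sample URL, and the exact argument handling are illustrative assumptions, so check the app page for the authoritative signature.

import sieve

# Illustrative: the app reference and file URL below are assumptions
video = sieve.File(url="https://example.com/clip.mp4")
visual_qa = sieve.function.get("sieve/visual-qa")

answer = visual_qa.run(
    video,
    prompt="What happens in this clip?",
    backend="gemini-1.5-flash",  # the cheaper backend from the pricing table
    fps=1,                       # sample one frame per second
    start_time=0,                # analyze only the first 30 seconds
    end_time=30,
    audio_context=False,         # skip audio billing for a visual-only question
)
print(answer)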

Use Cases

Visual-QA can be used for a variety of tasks, such as:

  • Video Summarization: Summarize videos to power recommendation, categorization, and tagging systems.
  • Multimodal Sentiment Analysis: Analyze the sentiment expressed in images, audio, and video.
  • Visual Search: Build visual search systems that let users search for objects in a video or image.
  • Optical Character Recognition (OCR): Extract text from images and videos for digitizing documents, automating data entry, or enabling searchable media archives.
  • Educational Tools: Explain graphs and diagrams, and provide explanations for videos and lectures.
  • Product Information: Help users find products by answering questions about images of items, such as identifying details or comparing products.

Pricing

The pricing varies based on the selected backend and the amount of content processed.

| Backend | Media Type | < 128k tokens | > 128k tokens |
| --- | --- | --- | --- |
| gemini-1.5-flash | Image | $0.00002/image | $0.00004/image |
| gemini-1.5-flash | Video | $0.00002/second | $0.00004/second |
| gemini-1.5-flash | Audio | $0.000002/second | $0.000004/second |
| gemini-1.5-flash | Text Input | $0.00001875/1k chars | $0.0000375/1k chars |
| gemini-1.5-flash | Text Output | $0.000075/1k chars | $0.00015/1k chars |
| gemini-1.5-pro | Image | $0.00032875/image | $0.0006575/image |
| gemini-1.5-pro | Video | $0.00032875/second | $0.0006575/second |
| gemini-1.5-pro | Audio | $0.00003125/second | $0.0000625/second |
| gemini-1.5-pro | Text Input | $0.0003125/1k chars | $0.000625/1k chars |
| gemini-1.5-pro | Text Output | $0.00125/1k chars | $0.0025/1k chars |

Note: You are charged for the number of frames processed plus the characters in the prompt, with each modality billed at its own rate. For videos, the number of frames billed is determined by the fps parameter. If audio_context is set to True, the audio track is also processed and billed. A small processing fee of $0.40/hour of compute is also charged.

Example: Rick Roll

Let's analyze Rick Astley's Never Gonna Give You Up music video, which is 3 minutes 33 seconds (213 seconds) long. We'll set the fps parameter to 1, since the frames don't change rapidly, set audio_context to true, and ask Visual-QA for a summary of the video, using gemini-1.5-flash as the backend. Here's how much it will cost us:

# Assumptions: 100-char prompt, 250-char output summary, 50 s of processing
duration_s = 213                 # 3:33 video
fps = 1
total_frames = fps * duration_s  # 213 frames
audio_duration_s = duration_s    # audio_context=True, so 213 s of audio

prompt_chars = 100
output_chars = 250

image_cost = total_frames * 0.00002                   # $0.00426
audio_cost = audio_duration_s * 0.000002              # $0.000426
text_input_cost = prompt_chars * 0.00001875 / 1000    # $0.000001875
text_output_cost = output_chars * 0.000075 / 1000     # $0.00001875
processing_cost = 50 * 0.40 / 3600                    # $0.00556 for 50 s

total_cost = (image_cost + audio_cost + text_input_cost
              + text_output_cost + processing_cost)   # ~$0.0103

Thus a 3:33 video with audio costs us only about $0.010. For image- or video-only tasks we don't need audio context, which reduces the cost even further! Check out the video summary here.
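
To reuse this arithmetic, here is a small illustrative helper that estimates gemini-1.5-flash costs under the same assumptions (per-frame image rate, per-second audio rate, and the $0.40/hour compute fee). It simply mirrors the worked example above; it is not an official billing API.

def estimate_flash_cost(duration_s, fps=1, audio=True,
                        prompt_chars=0, output_chars=0, processing_s=0):
    # gemini-1.5-flash rates (< 128k tokens) from the pricing table above
    frame_cost = fps * duration_s * 0.00002
    audio_cost = duration_s * 0.000002 if audio else 0.0
    text_cost = (prompt_chars * 0.00001875 + output_chars * 0.000075) / 1000
    compute_cost = processing_s * 0.40 / 3600
    return frame_cost + audio_cost + text_cost + compute_cost

# Reproduces the Rick Roll estimate: ~$0.0103
print(estimate_flash_cost(213, prompt_chars=100, output_chars=250, processing_s=50))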

Notes

Parameter Usage

Function JSON

The function_json parameter allows users to specify a JSON schema to get a structured response. The schema must be provided in a format that's compatible with the OpenAPI schema specification. If the function_json parameter is not provided, the response is plain text.

For example, suppose we have images from an online shopping store and we want to categorize them into pants, shoes, bags, shirts, etc., with brief descriptions. Here's the schema we would pass:

{
    "type": "object",
    "properties": {
        "product_type": {
            "type": "string",
            "enum": ["shirt", "bag", "pants", "belt", "shoes"],
            "description": "The type of the product."
        },
        "color": {
            "type": "string",
            "description": "The color of the product."
        },
        "description": {
            "type": "string",
            "description": "A brief description of the product."
        }
    },
    "required": ["product_type", "color", "description"]
}

If we pass an image of a brown bag, here's what the output would look like:

{
  "product_type": "bag",
  "color": "brown",
  "description": "A brown leather tote bag."
}

Just like this, we can easily classify millions of unlabeled images, as sketched below. Check out the example here!
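
As a rough sketch of that batch workflow, again assuming the Sieve Python client's sieve.function.get(...).run(...) pattern, with an illustrative app reference and hypothetical image URLs:

import sieve

# The schema from the example above
product_schema = {
    "type": "object",
    "properties": {
        "product_type": {
            "type": "string",
            "enum": ["shirt", "bag", "pants", "belt", "shoes"],
            "description": "The type of the product.",
        },
        "color": {"type": "string", "description": "The color of the product."},
        "description": {"type": "string", "description": "A brief description of the product."},
    },
    "required": ["product_type", "color", "description"],
}

visual_qa = sieve.function.get("sieve/visual-qa")  # illustrative app reference

catalog = ["https://example.com/item1.jpg",  # hypothetical catalog images
           "https://example.com/item2.jpg"]

for url in catalog:
    result = visual_qa.run(
        sieve.File(url=url),
        prompt="Categorize this product.",
        backend="gemini-1.5-flash",
        function_json=product_schema,  # request a structured JSON response
    )
    print(result)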