
javascript - How Can I Send Files to Google's Gemini Models via API Call? - Stack Overflow


Overview

Currently, I use the GoogleGenerativeAI library to handle generative AI prompt generation requests in my application. Gemini promises to be a multi-modal AI model, and I'd like to enable my users to send files (e.g. PDFs, images, .xls files) in line with their AI prompts.

I was using the following workflow to enable people to upload a file and use it in a prompt:

  • Enable file selection from their local machine (e.g. PDFs, .doc, .xls formatted files).
  • Upload the file to Google Cloud Storage, get an accessible link to the newly-uploaded file.
  • Send the request to Gemini with the link to the file included in the prompt (where appropriate).
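The workflow above amounted to something like the following sketch. The upload step (commented out so the snippet stays self-contained) would use `@google-cloud/storage`; the bucket name and helper names are illustrative, not from the original post:

```javascript
// Steps 1-2 of the workflow (sketch, assumes @google-cloud/storage):
//
// import { Storage } from "@google-cloud/storage";
// const storage = new Storage();
// const [file] = await storage.bucket("user-uploads").upload(localPath);
// const fileUrl = file.publicUrl();

// Step 3: splice the GCS link into a plain-text prompt. This is the step
// Gemini now refuses, since the model does not fetch external URLs.
function buildPromptWithLink(prompt, fileUrl) {
  return `${prompt}\n\nAttached file: ${fileUrl}`;
}

// e.g. buildPromptWithLink("Summarize this document",
//   "https://storage.googleapis.com/user-uploads/report.pdf")
```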

However, I'm now finding that this solution no longer works. Instead, I'm seeing responses like this:

I lack the ability to access external websites or specific files from given URLs, including the one you provided from Google Cloud Storage. Therefore, I'm unable to summarize the content of the file.

What I've Considered

  • Using multiple libraries to handle document types client-side to convert them into text (e.g. pdf-parser for PDFs) and using Gemini's image-handling model when there's an image involved. However, this involves lots of libraries, and it seems that Gemini is promising to handle this for me / my users.
  • Pre-processing the uploaded files server-side (for example, sending them to Google's Document AI), turning their document into some type of consistently-structured data, then using that data with the GoogleGenerativeAI library. Document AI calls are expensive though and it seems that Gemini is meant to handle this kind of thing.

My App's Stack (In Case it Matters)

  • Firebase / Google Cloud Functions
  • Vercel
  • Next.js

Can you suggest an approach that would let users include files in the requests they make (via the web) to Gemini?

Thanks in advance!


Asked Jan 4, 2024 at 12:09 by Davis Jones; edited Jan 4, 2024 at 14:24.
  • 1 Did you try to send the image in b64 encoded format? – guillaume blaquiere Commented Jan 4, 2024 at 13:00
  • No, when we initially used this library, we were able to use the links inline with the prompt, but we can’t do that anymore. So, I’m developing this approach. Do you think this would work for all the different file types? – Davis Jones Commented Jan 4, 2024 at 14:08
  • Yes, a standard way to submit images to an API is base64 encoding. That's the case for several Google Cloud APIs where a native connection with GCS is not implemented. – guillaume blaquiere Commented Jan 4, 2024 at 15:04
  • related? cloud.google.com/vertex-ai/generative-ai/docs/model-reference/… – starball Commented Oct 29, 2024 at 9:14

3 Answers

The documentation on generating text from text-and-image input (multimodal) has an example of how to include image data in a request.

As Guillaume commented, this requires that you include your image data as a base64-encoded part in your request. While I haven't tested the JavaScript bindings myself yet, this matches my experience with the Dart bindings, where I also included images as base64-encoded parts.
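A minimal sketch of such a request payload, assuming the v1beta REST field names (`inlineData` / `mimeType`). With the official `@google/generative-ai` SDK, a parts array like this can be passed straight to `model.generateContent`; the default MIME type here is just an illustrative assumption:

```javascript
// Build the `parts` array for a text + inline-image request.
// `base64Data` is the raw file contents, base64-encoded -- no URLs involved,
// so the model never has to fetch anything.
function buildInlineImageParts(prompt, base64Data, mimeType = "image/png") {
  return [
    { text: prompt },
    { inlineData: { data: base64Data, mimeType } },
  ];
}

// With the SDK (sketch, requires an API key):
// const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });
// const result = await model.generateContent(buildInlineImageParts(p, data));
```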

The only problem I've run into so far is video upload; images work fine:

import { google } from "googleapis";
import dotenv from "dotenv";
import stream from "stream";
dotenv.config({ override: true });

const model = "gemini-1.5-pro-latest";
const GENAI_DISCOVERY_URL = `https://generativelanguage.googleapis.com/$discovery/rest?version=v1beta&key=${process.env.GEMINI_KEY}`;

export async function getTextGemini(prompt, temperature, imageBase64, fileType) {
    const genaiService = await google.discoverAPI({ url: GENAI_DISCOVERY_URL });
    const auth = new google.auth.GoogleAuth().fromAPIKey(process.env.GEMINI_KEY);

    let file_data;
    if (imageBase64) {
        const bufferStream = new stream.PassThrough();
        bufferStream.end(Buffer.from(imageBase64, "base64"));
        const media = {
            mimeType: fileType === "mp4" ? "video/mp4" : "image/png",
            body: bufferStream,
        };
        console.log(media);
        let body = { file: { displayName: "Uploaded Image" } };
        const createFileResponse = await genaiService.media.upload({
            media: media,
            auth: auth,
            requestBody: body,
        });
        const file = createFileResponse.data.file;
        console.log(file);
        file_data = { file_uri: file.uri, mime_type: file.mimeType };
    }

    const contents = {
        contents: [
            {
                role: "user",
                // Only include the file part when a file was actually uploaded,
                // so no undefined entry ends up in the parts array.
                parts: [{ text: prompt }, ...(file_data ? [{ file_data }] : [])],
            },
        ],
        generation_config: {
            maxOutputTokens: 4096,
            temperature: temperature || 0.5,
            topP: 0.8,
        },
    };

    const generateContentResponse = await genaiService.models.generateContent({
        model: `models/${model}`,
        requestBody: contents,
        auth: auth,
    });

    return generateContentResponse?.data?.candidates?.[0]?.content?.parts?.[0]?.text;
}
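For completeness, the `imageBase64` argument above can be produced from a local file with Node's standard library. This is a sketch; the extension-based `fileType` tag just mirrors the crude `"mp4"` check in `getTextGemini`, and the file path in the usage comment is hypothetical:

```javascript
import fs from "node:fs";
import path from "node:path";

// Read a local file and return the base64 string expected by
// getTextGemini's `imageBase64` parameter, plus an extension-based
// type tag matching that function's `fileType` check.
function fileToBase64(filePath) {
  const base64 = fs.readFileSync(filePath).toString("base64");
  const fileType = path.extname(filePath).slice(1).toLowerCase(); // e.g. "png", "mp4"
  return { base64, fileType };
}

// Usage (hypothetical file):
// const { base64, fileType } = fileToBase64("./upload.png");
// const answer = await getTextGemini("Describe this image", 0.5, base64, fileType);
```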

Gemini promises to be a multi-modal AI model

The multi-modal capabilities of Gemini are currently limited, and they are slightly different if you are using the Google AI Studio version of the library or the Google Cloud Vertex AI version of the library.

  • The Google AI Studio version only supports text, plus images in jpeg, png, heic, heif, or webp format. These can only be inline. See https://ai.google.dev/api/rest/v1/Content#part
  • The Google Cloud Vertex AI version also supports these, but has a couple of additions:
    • URL references to documents in Google Cloud Storage are allowed
    • While inline data is still permitted, you're only allowed one inline image
    • Only png and jpeg image files are supported
    • Video files (mov, mpeg, mp4, mpg, avi, wmv, mpegps, and flv) up to two minutes in length are supported
    • See the field definitions at Generate content with the Gemini Enterprise API > Parameter list > Request body for more details

Currently, neither library supports other modalities, including PDFs, doc files, spreadsheets, etc. While these may be available in the future, they're not available today.
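Given those constraints, a Vertex AI request that references a file already sitting in Cloud Storage would look roughly like this. The field names follow the Vertex AI `generateContent` request body (`fileData` / `fileUri`), and the bucket path in the usage comment is illustrative:

```javascript
// Build a Vertex AI generateContent request body that points at an object
// in Google Cloud Storage via a gs:// URI (a Vertex-only capability --
// the Google AI Studio endpoint does not accept file URIs).
function buildGcsFileRequest(prompt, gcsUri, mimeType) {
  return {
    contents: [
      {
        role: "user",
        parts: [
          { text: prompt },
          { fileData: { fileUri: gcsUri, mimeType } },
        ],
      },
    ],
  };
}

// e.g. buildGcsFileRequest("Describe this image",
//   "gs://my-bucket/photo.png", "image/png")
```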
