Be aware that your content will be transformed. Whatever you provide to the assistant always passes through a source extractor that converts it into plain text; how the conversion works depends on the type of source (file, URL, etc.).
The text is extracted from the document. Currently, no page or formatting information is taken into account.
The text form is an automatically generated description of the image.
This will be converted into text.
For supported formats (e.g., mp4), we try to transcribe the audio, but we do not extract any information from the video itself.
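The per-type behavior described above can be pictured as a simple dispatch on the source type. The following is only a minimal sketch under stated assumptions, not the actual extractor: the names `SourceKind`, `Source`, and `extract_plain_text` are hypothetical, and the `description` and `transcript` fields stand in for the output of the image-description and audio-transcription steps, which are not implemented here.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(Enum):
    """Hypothetical source types, mirroring the cases described above."""
    DOCUMENT = "document"
    IMAGE = "image"
    VIDEO = "video"


@dataclass
class Source:
    """Hypothetical container for a piece of provided content."""
    kind: SourceKind
    text: str = ""         # raw text body of a document
    description: str = ""  # stand-in for an automatically generated image description
    transcript: str = ""   # stand-in for the transcribed audio track of a video


def extract_plain_text(source: Source) -> str:
    """Return the plain-text form of a source (illustrative only)."""
    if source.kind is SourceKind.DOCUMENT:
        # Assumption: dropping page and formatting information is modeled here
        # by collapsing all whitespace, including page breaks, into single spaces.
        return " ".join(source.text.split())
    if source.kind is SourceKind.IMAGE:
        # The text form of an image is its generated description.
        return source.description
    if source.kind is SourceKind.VIDEO:
        # Only the audio transcript is kept; nothing comes from the video itself.
        return source.transcript
    raise ValueError(f"unsupported source kind: {source.kind}")
```

In this sketch, a document containing headings and page breaks comes out as one flat string, and a video contributes nothing beyond its transcript. That matches the practical consequence of the note above: anything conveyed only through layout or visuals does not reach the assistant.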