Paper - Pytorch
Video Vit
Open-source implementation of a vision transformer for video understanding, built on MaxViT. MaxViT serves as the backbone: the video tensor is packed into a 4D tensor, which is then passed to the MaxViT model as input. I implemented this because the new McVit came out and I wanted more practice. The model is ready to train, and I expect it to perform well, though no benchmark results are reported here.
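To make the packing step concrete, here is a minimal sketch of one way a 5D video tensor can be turned into a 4D tensor that an image backbone accepts. It assumes the frames are folded into the batch axis; this README does not say which axes the package actually merges, so treat `pack_video` and `unpack_video` as hypothetical helpers and not as the library's API.

```python
import torch


def pack_video(video: torch.Tensor) -> torch.Tensor:
    """Fold frames into the batch axis: (b, c, f, h, w) -> (b * f, c, h, w).

    Assumption: this is one plausible packing, not necessarily the one
    used inside video_vit.
    """
    b, c, f, h, w = video.shape
    # Move frames next to batch, then merge the two leading axes.
    return video.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)


def unpack_video(frames: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Inverse of pack_video: (b * f, c, h, w) -> (b, c, f, h, w)."""
    bf, c, h, w = frames.shape
    f = bf // batch_size
    return frames.reshape(batch_size, f, c, h, w).permute(0, 2, 1, 3, 4)


video = torch.randn(2, 3, 10, 224, 224)
packed = pack_video(video)
restored = unpack_video(packed, batch_size=2)
print(packed.shape)    # 4D tensor: one image per (clip, frame) pair
print(restored.shape)  # back to the original 5D layout
```

With this layout every frame is processed as an independent image, so any temporal mixing has to happen after the backbone, for example by pooling over the frame axis once the tensor is unpacked.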
Installation
$ pip install video-vit
Usage
import torch
from video_vit.main import VideoViT

# Instantiate the VideoViT model with the specified parameters
model = VideoViT(
    num_classes=10,              # Number of output classes
    dim=64,                      # Dimension of the token embeddings
    depth=(2, 2, 2),             # Depth of each stage in the model
    dim_head=32,                 # Dimension of the attention head
    window_size=7,               # Size of the attention window
    mbconv_expansion_rate=4,     # Expansion rate of the Mobile Inverted Bottleneck block
    mbconv_shrinkage_rate=0.25,  # Shrinkage rate of the Mobile Inverted Bottleneck block
    dropout=0.1,                 # Dropout rate
    channels=3,                  # Number of input channels
)

# Create a random tensor with shape (batch_size, channels, frames, height, width)
x = torch.randn(1, 3, 10, 224, 224)

# Perform a forward pass through the model
output = model(x)

# Print the shape of the output tensor
print(output.shape)
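Since the model is described as ready to train, a single training step might look like the sketch below. To keep it runnable without the package, it uses `TinyVideoClassifier`, a hypothetical stand-in that takes the same `(batch_size, channels, frames, height, width)` input; `train_step` is also an illustrative helper, not part of `video_vit`. Swapping in `VideoViT` should work under the assumption that it returns logits of shape `(batch_size, num_classes)`, which this README does not state explicitly.

```python
import torch
from torch import nn


class TinyVideoClassifier(nn.Module):
    """Hypothetical stand-in for VideoViT, used only to keep the sketch runnable."""

    def __init__(self, channels: int = 3, num_classes: int = 10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # average over frames, height, width
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (b, c, f, h, w) -> (b, c, 1, 1, 1) -> (b, c) -> (b, num_classes)
        return self.head(self.pool(x).flatten(1))


def train_step(model, optimizer, videos, labels) -> float:
    """Run one optimization step and return the scalar loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(videos)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = TinyVideoClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Random clips and labels stand in for a real dataset.
videos = torch.randn(4, 3, 10, 32, 32)
labels = torch.randint(0, 10, (4,))

loss = train_step(model, optimizer, videos, labels)
print(f"loss: {loss:.4f}")
```

The learning rate, optimizer, and clip size here are placeholders chosen for a quick smoke test, not tuned recommendations.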
License
MIT