---
license: mit
---

**ViT-LSTM Action Recognition**

**Overview**

This project implements an action recognition model using a ViT-LSTM architecture. It takes a short video as input and predicts the action performed in the video. The model extracts frame-wise ViT features and processes them with an LSTM to capture temporal dependencies.

**Model Details**

- Base Model: ViT-Base-Patch16-224
- Architecture: ViT (feature extractor) + LSTM (temporal modeling)
- Number of Classes: 5
- Dataset: custom dataset with the following action categories:
  - BaseballPitch
  - Basketball
  - BenchPress
  - Biking
  - Billiards

**Working**

1. Extract Frames – The model extracts up to 16 frames from the uploaded video (a sampling sketch is shown after this list).
2. Feature Extraction – Each frame is passed through ViT to obtain a feature vector.
3. Temporal Processing – The LSTM processes these frame features to capture motion information.
4. Prediction – The final output is classified into one of the 5 action categories.
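
The frame-extraction step is not spelled out in this README beyond the 16-frame limit, so the following is a minimal sketch of one plausible approach: uniformly sampling at most 16 evenly spaced frames with OpenCV. The helper name `sample_frames` and the uniform-sampling strategy are assumptions for illustration, not the released preprocessing code.

````python
import cv2
import numpy as np
from PIL import Image

def sample_frames(video_path, num_frames=16):
    """Hypothetical helper: uniformly sample up to `num_frames` RGB frames as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices; short clips simply yield fewer frames.
    indices = np.linspace(0, max(total - 1, 0), num=min(num_frames, max(total, 1)), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if not ret:
            break
        # OpenCV returns BGR; convert to RGB before wrapping in a PIL image.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
````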

**Model Training Details**

- Feature Dimension: 768
- LSTM Hidden Dimension: 512
- Number of LSTM Layers: 2 (bidirectional)
- Dropout: 0.3
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
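
The model class itself is not included in this README, so here is a minimal sketch of a ViT-LSTM module consistent with the hyperparameters above (768-dimensional features, 512 hidden units, 2 bidirectional layers, dropout 0.3, 5 classes). The class name `ViTLSTM`, the last-time-step pooling, and the single linear classification head are assumptions, not the released implementation.

````python
import torch.nn as nn

class ViTLSTM(nn.Module):
    """Hypothetical ViT-LSTM head matching the listed hyperparameters."""

    def __init__(self, feature_dim=768, hidden_dim=512, num_layers=2,
                 num_classes=5, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout,
        )
        # A bidirectional LSTM doubles the output size; the plain linear head is an assumption.
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, num_frames, feature_dim) frame-wise ViT features
        out, _ = self.lstm(x)
        # Classify from the representation of the last time step.
        return self.classifier(out[:, -1, :])
````

Training such a module would pair it with the Adam optimizer and `nn.CrossEntropyLoss`, as listed above.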

**Example Usage (Code Snippet)**

If you want to use this model locally:

````python
import torch
import cv2
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Load the pretrained ViT feature extractor
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit_model.eval()

# Load the custom ViT-LSTM model
model = torch.load("Vit-LSTM.pth", map_location="cpu")
model.eval()

# Read frames from an example video
video_path = "example.mp4"
cap = cv2.VideoCapture(video_path)
frames = []

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Convert BGR (OpenCV) to RGB before wrapping in a PIL image
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frames.append(Image.fromarray(frame))

cap.release()

# Keep at most 16 evenly spaced frames, matching the 16-frame limit described above
if len(frames) > 16:
    step = len(frames) / 16
    frames = [frames[int(i * step)] for i in range(16)]

# Extract frame-wise ViT features (mean-pooled over patch tokens)
inputs = vit_processor(images=frames, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    features = vit_model(pixel_values=inputs).last_hidden_state.mean(dim=1)
    # Add a batch dimension -> (1, num_frames, 768) and classify
    output = model(features.unsqueeze(0))

predicted_class = torch.argmax(output, dim=1).item()

LABELS = ["BaseballPitch", "Basketball", "BenchPress", "Biking", "Billiards"]
print("Predicted Action:", LABELS[predicted_class])
````
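
Note that `torch.load` above unpickles a full model object, so the model class (for example, something like the `ViTLSTM` sketch earlier) must be importable at load time, and recent PyTorch versions may additionally require `weights_only=False`.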

**Contributors**

- Saurav Dhiani – Model Development & Deployment
- ViT & LSTM – Core ML Architecture