Pose Estimation in 7 Minutes - 30fps on CPU

by Augmented Startups

So many years ago, I was quite intrigued when Microsoft released the first version of the Kinect sensor for the Xbox, a motion-sensing input device.

When the Xbox 360 launched, I hurried to buy it bundled with a Kinect sensor. I don't know exactly what intrigued me about it so much; I guess I was super excited about the possibilities of using my body to control software, or in this case, games. That excitement was brought about by watching movies like Minority Report, and by seeing input devices evolve from mouse and keyboard to touch screens, and now to body gestures.

The Game

So one of the first games I played with my wife was a Harry Potter game, where you wave your hands in particular ways to cast spells. I can tell you now, it was a lot of fun, especially with my wife and me accidentally hitting each other in the small room we were playing in.

The Kinect

So, the Kinect performed pose estimation to approximate where people were in the room and to segment their body joints, using a 3D depth sensor combined with an RGB camera. But how can we do this with just a single camera, rather than specialized hardware or a multi-camera setup?

Pose Estimation

This brings us to pose estimation, which is the topic of this Instructable. We're going to explore:

  • What it is
  • Why and where you'd use it
  • How it works using a single camera, and
  • How to implement 30 frames-per-second pose estimation on a CPU

All in 7 minutes.

Computer Vision

So, one of the most sought-after goals of computer vision has been to understand human appearance in images and videos.

Pose estimation refers to a computer vision technique that detects human figures from a camera so that body posture and gestures can be recognized.
Pose Estimation technology enables the following applications:

  1. Assisted living, as in fall detection and yoga pose identification
  2. Character animation, as well as
  3. Drone control, like I've implemented in my Autonomous Drone video.

Pose Framework

In a nutshell, it works by detecting key body joints, which can be achieved using a variety of methods, including:

  1. OpenPose
  2. AlphaPose
  3. TF Pose Estimation, and
  4. DensePose, among many others.

I have a link to a blog that compares the Best Human-Pose Estimation Projects.

BlazePose

For this implementation, however, we'll be using BlazePose, a lightweight CNN architecture for human pose estimation tailored for real-time inference on mobile platforms. What's really cool is that during inference, the network produces 33 body key points for a single person and runs at over 30 frames per second. Crazy, right? The authors' approach uses both heatmaps and regression to acquire the key-point coordinates, which makes it particularly suited to real-time use cases like fitness tracking and gesture recognition.

We'll be implementing BlazePose via the MediaPipe framework, mainly because it's fast, lightweight, accurate, and super simple to implement, as you'll see in a bit. I'll have a link down below to articles that go into a bit more detail on how BlazePose works at a deeper level.

COCO Topology

So the COCO topology is the standard for human body pose. It consists of 17 key points spread across the face, torso, arms, and legs. However, those key points stop at the ankles and wrists, so the hands and feet aren't covered, which is quite limiting. BlazePose, however, extends the body key-point set to 33 points, and this lets us predict the body semantics from pose predictions alone. Cool.
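
As a quick illustration, here's a minimal sketch of how those 33 points are exposed in MediaPipe's Python API (installation is covered in the steps below):

```python
import mediapipe as mp

# BlazePose's 33-landmark topology is exposed as the PoseLandmark enum.
for landmark in mp.solutions.pose.PoseLandmark:
    print(landmark.value, landmark.name)
# Prints 0 NOSE ... 15 LEFT_WRIST ... 32 RIGHT_FOOT_INDEX,
# including the hand and foot points missing from COCO's 17.
```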

Vision Store

To implement this, head over to the Vision Store and click on Pose Estimation. Ensure that you are logged in; otherwise, you can just sign up.

Requirements

Ensure that you have all of the requirements: any PC or laptop, a webcam, and Python installed along with the PyCharm Community Edition IDE.

You can also download all of the files right here.

PyCharm Community

So, step 1: once PyCharm Community Edition has been installed, create a new project and call it BlazePose. You can delete the main.py file, as you won't be using it.

Dependencies

You'll only need to install two dependencies: OpenCV and MediaPipe.

It’s highly recommended to use the exact same versions mentioned here for compatibility with the code.
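
As a quick sanity check, here's a minimal sketch to confirm both libraries are installed and importable (the pip command below is the standard one; swap in the exact pinned versions from the Vision Store page if they differ):

```python
# From a terminal, first run: pip install opencv-python mediapipe
# (pin the exact versions from the Vision Store page if listed)
import cv2
import mediapipe

print("OpenCV version:", cv2.__version__)
print("MediaPipe version:", mediapipe.__version__)
```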

The Code

Then lastly, you can either import the downloaded code or copy and paste it from the Vision Store model page.

Run the Code

To run the code, either on a single image or on a video that came with the downloaded files, you just need to click the play button.

Run on Webcam

To run this on a webcam, you simply need to change the video file name to 0, and you'll be able to detect poses using your webcam.
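
Assuming the script opens its source with OpenCV's VideoCapture (a safe bet given the dependencies), the change looks like this:

```python
import cv2

# cap = cv2.VideoCapture('video.mp4')  # inference on the bundled video file
cap = cv2.VideoCapture(0)              # 0 selects the default webcam instead
```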

So, this code is broken up into two sections: the first is for inference on single images, and the second is for inference on video.

Static Image Mode

So for static image mode, we set static_image_mode to True. This tells our model whether to treat the input as a batch of static, possibly unrelated images or as a video stream. model_complexity sets the complexity of the pose landmark model; we'll set this to 2. Next, we set our detector sensitivity to 50%, or 0.5. Then we just read in our image, let the model process it and predict potential human poses, and draw the landmark coordinates over the image.
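
Putting those settings together, here's a minimal sketch of the static-image section, assuming MediaPipe's standard Pose API ('image.jpg' is just a placeholder for one of the downloaded images, not the actual file name):

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

# static_image_mode=True treats inputs as unrelated still images;
# model_complexity=2 picks the heaviest, most accurate landmark model;
# min_detection_confidence=0.5 is the 50% detector sensitivity.
with mp_pose.Pose(static_image_mode=True,
                  model_complexity=2,
                  min_detection_confidence=0.5) as pose:
    image = cv2.imread('image.jpg')
    # MediaPipe expects RGB input, while OpenCV loads images as BGR.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        # Draw the 33 landmarks and their connections onto the image.
        mp_drawing.draw_landmarks(image, results.pose_landmarks,
                                  mp_pose.POSE_CONNECTIONS)
    cv2.imshow('BlazePose', image)
    cv2.waitKey(0)
```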

Video Mode

The same happens in the second section, but with video. Each frame's predictions are stored in a container called results, which we use to draw the human pose landmarks onto the video as it plays. We also overlay the frame rate, calculated from the inference time between frames.
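
And here's a minimal sketch of the video section under the same assumptions, with the frame rate computed from the time between consecutive frames ('video.mp4' is again a placeholder):

```python
import time
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture('video.mp4')  # or 0 for the webcam
prev_time = 0.0

# static_image_mode=False lets the model track landmarks across frames.
with mp_pose.Pose(static_image_mode=False,
                  min_detection_confidence=0.5,
                  min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                      mp_pose.POSE_CONNECTIONS)
        # FPS = 1 / inference time between consecutive frames.
        now = time.time()
        fps = 1.0 / (now - prev_time) if prev_time else 0.0
        prev_time = now
        cv2.putText(frame, f'FPS: {int(fps)}', (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow('BlazePose', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()
```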

Conclusion

For now, if you enjoyed this 7-minute computer vision tutorial, then check out our other models on the Augmented Startups Vision Store, only on AugmentedStartups.com.

This Vision Store is where you can download and implement projects without wasting too much time; it's meant for rapid prototyping and a quicker time to market.

Otherwise check out our comprehensive courses in Pose Estimation, Computer Vision, AI, and Robotics.

Also, be sure to like this Instructable and subscribe. Comment on what you'd like us to cover next.

Thank you for reading.