Extracting Links From PDF

Here’s a walk-through in which I extract the video links from a PDF and determine each video’s length, YouTube address, and title.

The 291-page Voron Assembly manual contains links to helpful videos. One such link, https://voron.link/onjwmcd, appears on page 10 in a form intended to be read by a phone’s scanner. The links are short links to content hosted on YouTube, so “https://voron.link/onjwmcd” points to “https://www.youtube.com/watch?t=466&v=2dvbn0rWA60&feature=youtu.be”, a 10:23 video titled “Blind Joint Basics”.
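As a small illustration of how the resolved address is put together, the sketch below splits its query string into parameters. It assumes that `v` carries the video ID and `t` the start offset in seconds, which is how YouTube watch URLs are normally read; the variable names are my own.

```shell
# The resolved address quoted above.
final='https://www.youtube.com/watch?t=466&v=2dvbn0rWA60&feature=youtu.be'

# Drop everything up to the first "?", then put each parameter on its own line.
params=$(printf '%s\n' "${final#*\?}" | tr '&' '\n')

# Pick parameters by name (assumption: v = video ID, t = start offset in seconds).
video_id=$(printf '%s\n' "$params" | sed -n 's/^v=//p')
start=$(printf '%s\n' "$params" | sed -n 's/^t=//p')

# Show the offset as minutes:seconds as well.
start_mmss=$(printf '%d:%02d' $((start / 60)) $((start % 60)))
printf 'id=%s start=%ss (%s)\n' "$video_id" "$start" "$start_mmss"
```

Under that reading, this particular link opens the video partway through rather than at the beginning.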

I wanted to know how many links there are in the PDF, where they point, and the length and title of each target video. Working with ChatGPT, I developed the process below. The PDF file is Assembly_Manual_Trident_Nov_28_2025.pdf.

1) command to mine the links:

# -n prefixes each match with its page number; -o prints only the matched URL
pdfgrep -n -o 'https?://[^[:space:])"]+' \
  Assembly_Manual_Trident_Nov_28_2025.pdf \
  | sort -u \
  > voron_manual_links_page_url.tsv
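To see what the pattern matches without a PDF at hand, the same expression can be tried with grep -oE on a line of text. This is only a sketch: the sample sentence and the example.com address are made up, and grep stands in for pdfgrep here, so there is no page-number prefix.

```shell
# Hypothetical sample text standing in for one line of text from the PDF.
sample='Watch (https://voron.link/onjwmcd) or see "https://example.com/docs" for more.'

# -o prints only the matched part; -E enables extended regex (needed for "?").
# The bracket expression stops the match at whitespace, ")" or a double quote.
urls=$(printf '%s\n' "$sample" | grep -oE 'https?://[^[:space:])"]+')
printf '%s\n' "$urls"
```

Excluding ")" and the double quote keeps trailing punctuation from being glued onto the URL when a link is wrapped in parentheses or quotes.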

2) command to sort by page number:

sort -t: -k1,1n -k2,2 voron_manual_links_page_url.tsv
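A quick way to check the sort keys is to run them on a few made-up page:url lines in the same shape as the file from step 1. The sample lines are hypothetical, and LC_ALL=C is my addition to make the ordering predictable across systems.

```shell
# Made-up lines in page:url form.
sample='10:https://voron.link/onjwmcd
2:https://example.com/b
10:https://example.com/a'

# -t: splits fields on colons; -k1,1n compares the page number numerically,
# so page 2 sorts before page 10 (a plain text sort would put "10" first).
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -t: -k1,1n -k2,2)
printf '%s\n' "$sorted"
```

Because the URLs contain colons themselves, the second key only covers the scheme (“https”); lines on the same page end up ordered by sort's fallback comparison of the whole line.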

Then I decided I only wanted the links that contained “voron.link”, so:

3) command to isolate only voron.links:

# keep only voron.link lines and strip the leading "page:" prefix
awk '/voron\.link/ { sub(/^[0-9]+:/,""); print }' voron_manual_links_page_url.tsv \
  | sort -u \
  > voron_shortlinks.txt
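The filter can be tried on a few made-up lines, and counting the result answers the original question of how many distinct short links there are. This is a sketch with hypothetical sample data; on the real file, `wc -l < voron_shortlinks.txt` should give the count.

```shell
# Made-up page:url lines; the same short link appears on two pages.
sample='10:https://voron.link/onjwmcd
12:https://example.com/page
44:https://voron.link/onjwmcd'

# Keep voron.link lines, strip the "page:" prefix, then de-duplicate.
shortlinks=$(printf '%s\n' "$sample" \
  | awk '/voron\.link/ { sub(/^[0-9]+:/,""); print }' \
  | sort -u)

# Count the unique links (tr removes the padding some wc versions add).
count=$(printf '%s\n' "$shortlinks" | wc -l | tr -d ' ')
printf '%s unique short link(s)\n' "$count"
```

Stripping the page number before `sort -u` is what lets a link repeated on several pages collapse into one entry.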

4) Determine the length and title of each video:

command:

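# 1) unique voron.link URLs (same as step 3)
awk '/voron\.link/ { sub(/^[0-9]+:/,""); print }' voron_manual_links_page_url.tsv \
  | sort -u > voron_shortlinks.txt

# 2) enrich: resolve each short link, then ask yt-dlp for duration and title
while read -r short; do
  # follow redirects and keep only the final URL
  final=$(curl -sL -o /dev/null -w '%{url_effective}' "$short")
  # capture yt-dlp's output first so that a yt-dlp failure reaches the fallback
  # (in a pipeline, only the exit status of the last command is tested)
  if meta=$(yt-dlp --no-warnings --skip-download \
      --print '%(duration_string)s\t%(title)s' \
      "$final" 2>/dev/null); then
    printf '%s\t%s\t%s\n' "$short" "$final" "$meta"
  else
    printf '%s\t%s\t-\t(METADATA_FAIL)\n' "$short" "$final"
  fi
done < voron_shortlinks.txt \
  > voron_videos_manifest.tsv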

In the resulting manifest, a duration with no colon is in seconds only.
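If the mixed duration formats are a nuisance, the seconds-only values can be rewritten as minutes:seconds. The sketch below assumes the four tab-separated columns written by step 4 (short link, final URL, duration, title); the second sample row is entirely made up.

```shell
# Two sample manifest rows; the second (hypothetical) one has a seconds-only duration.
manifest=$(printf '%s\t%s\t%s\t%s\n' \
  'https://voron.link/onjwmcd' 'https://www.youtube.com/watch?t=466&v=2dvbn0rWA60&feature=youtu.be' '10:23' 'Blind Joint Basics' \
  'https://voron.link/example' 'https://www.youtube.com/watch?v=EXAMPLE' '45' 'Hypothetical Short Clip')

# Rewrite column 3 only when it is a bare number of seconds.
normalized=$(printf '%s\n' "$manifest" \
  | awk 'BEGIN { FS = OFS = "\t" }
         $3 ~ /^[0-9]+$/ { $3 = sprintf("%d:%02d", int($3 / 60), $3 % 60) }
         { print }')
printf '%s\n' "$normalized"
```

Rows that already contain a colon, and failure rows with “-” in the duration column, pass through unchanged.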

