{"id":523,"date":"2026-02-26T08:14:42","date_gmt":"2026-02-26T16:14:42","guid":{"rendered":"https:\/\/salemdata.net\/johnpress\/?p=523"},"modified":"2026-02-26T08:14:42","modified_gmt":"2026-02-26T16:14:42","slug":"extracting-links-from-pdf","status":"publish","type":"post","link":"https:\/\/salemdata.net\/johnpress\/?p=523","title":{"rendered":"Extracting Links From PDF"},"content":{"rendered":"<p>Here&#8217;s a walk-through where I extract the video links from a PDF and determine each video&#8217;s length, YouTube address, and title.<\/p>\n<p>The 291-page Voron Assembly manual has links to helpful videos within it.\u00a0 Here&#8217;s a picture of such a link, https:\/\/voron.link\/onjwmcd, from page 10 (intended to be read by a phone&#8217;s scanner).\u00a0 The links are short links to content staged on YouTube, so &#8220;https:\/\/voron.link\/onjwmcd&#8221; points to &#8220;https:\/\/www.youtube.com\/watch?t=466&amp;v=2dvbn0rWA60&amp;feature=youtu.be&#8221;, which is a 10:23 video entitled &#8220;Blind Joint Basics&#8221;.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-524\" src=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260225_233535_Wed.png\" alt=\"\" width=\"1571\" height=\"725\" srcset=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260225_233535_Wed.png 1571w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260225_233535_Wed-300x138.png 300w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260225_233535_Wed-768x354.png 768w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260225_233535_Wed-1536x709.png 1536w\" sizes=\"auto, (max-width: 1571px) 100vw, 1571px\" \/><\/p>\n<p>I wanted to know how many links there are in the PDF, where they point, and the length and title of each target video.\u00a0 Working with ChatGPT, I developed this process.\u00a0 The PDF file is 
<strong>Assembly_Manual_Trident_Nov_28_2025.pdf<\/strong><\/p>\n<p>1) <strong>command<\/strong> to mine the links:<\/p>\n<pre>pdfgrep -n -o 'https?:\/\/[^[:space:])\"]+' \\\r\nAssembly_Manual_Trident_Nov_28_2025.pdf \\\r\n| sort -u \\\r\n&gt; voron_manual_links_page_url.tsv<\/pre>\n<p><strong>Example:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-525\" src=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080128_Thu.png\" alt=\"\" width=\"782\" height=\"546\" srcset=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080128_Thu.png 782w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080128_Thu-300x209.png 300w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080128_Thu-768x536.png 768w\" sizes=\"auto, (max-width: 782px) 100vw, 782px\" \/><\/p>\n<p>2) <strong>command<\/strong> to sort by page number:<\/p>\n<pre>sort -t: -k1,1n -k2,2 voron_manual_links_page_url.tsv<\/pre>\n<p><strong>Example:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-526\" src=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080514_Thu.png\" alt=\"\" width=\"790\" height=\"487\" srcset=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080514_Thu.png 790w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080514_Thu-300x185.png 300w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080514_Thu-768x473.png 768w\" sizes=\"auto, (max-width: 790px) 100vw, 790px\" \/><\/p>\n<p>Then I decided I only wanted the links that contained &#8220;voron.link&#8221;, so:<\/p>\n<p>3) <strong>command<\/strong> to isolate only the voron.link URLs:<\/p>\n<pre>awk '\/voron\\.link\/ { sub(\/^[0-9]+:\/,\"\"); print }' voron_manual_links_page_url.tsv \\\r\n| sort -u \\\r\n&gt; 
voron_shortlinks.txt<\/pre>\n<p><strong>Example:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-527\" src=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080747_Thu.png\" alt=\"\" width=\"1016\" height=\"308\" srcset=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080747_Thu.png 1016w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080747_Thu-300x91.png 300w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080747_Thu-768x233.png 768w\" sizes=\"auto, (max-width: 1016px) 100vw, 1016px\" \/><\/p>\n<p>4) Determine the length and title of each video:<\/p>\n<p><strong>command:<\/strong><\/p>\n<pre># 1) unique voron.link URLs\r\nawk '\/voron\\.link\/ { sub(\/^[0-9]+:\/,\"\"); print }' voron_manual_links_page_url.tsv \\\r\n| sort -u &gt; voron_shortlinks.txt\r\n\r\n# 2) enrich: resolve each short link, then ask yt-dlp for duration and title\r\nwhile read -r short; do\r\n  final=$(curl -sL -o \/dev\/null -w '%{url_effective}' \"$short\")\r\n  # Capture the yt-dlp output first so that a failed lookup reaches the fallback;\r\n  # with a plain pipeline, || would only test the exit status of awk.\r\n  if meta=$(yt-dlp --no-warnings --skip-download \\\r\n      --print '%(duration_string)s\\t%(title)s' \\\r\n      \"$final\" 2&gt;\/dev\/null); then\r\n    printf '%s\\n' \"$meta\" \\\r\n    | awk -v s=\"$short\" -v f=\"$final\" 'BEGIN{OFS=\"\\t\"} {print s,f,$0}'\r\n  else\r\n    printf \"%s\\t%s\\t-\\t(METADATA_FAIL)\\n\" \"$short\" \"$final\"\r\n  fi\r\ndone &lt; voron_shortlinks.txt \\\r\n&gt; voron_videos_manifest.tsv<\/pre>\n<p><strong>Example:<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-528\" src=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080854_Thu.png\" alt=\"\" width=\"1089\" height=\"428\" srcset=\"https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080854_Thu.png 1089w, https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080854_Thu-300x118.png 300w, 
https:\/\/salemdata.net\/johnpress\/wp-content\/uploads\/2026\/02\/20260226_080854_Thu-768x302.png 768w\" sizes=\"auto, (max-width: 1089px) 100vw, 1089px\" \/><\/p>\n<p>Durations shown without a colon are in seconds only.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s a walk-through where I extract the video links from a PDF and determine each video&#8217;s length, YouTube address, and title. The 291-page Voron Assembly manual has links to helpful videos within it.\u00a0 Here&#8217;s a picture of such a link, https:\/\/voron.link\/onjwmcd, from page 10 (intended to be read by a phone&#8217;s [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[102,103],"class_list":["post-523","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-linux-tip","tag-pdf"],"_links":{"self":[{"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=\/wp\/v2\/posts\/523","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=523"}],"version-history":[{"count":1,"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=\/wp\/v2\/posts\/523\/revisions"}],"predecessor-version":[{"id":529,"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=\/wp\/v2\/posts\/523\/revisions\/529"}],"wp:attachment":[{"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=523"}],"wp:term":[{"taxonomy":"category","
embeddable":true,"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salemdata.net\/johnpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}