
WordPress Media Cleaning

FIRST, DO A BACKUP! Be careful with everything below, because delete means delete.

If you’ve been managing a website for a while, you might be wondering how to get rid of unnecessary images in WordPress. The pictures are kept in the Media Library even after a post or page is deleted.

As a result, many outdated images may be stored on older websites.

To make matters worse, your website’s file system likely contains three to six resized versions of each image you see in the Media Library.

Storing and backing up numerous unnecessary files wastes a significant amount of resources, and each file also uses an inode on your web server. For this reason, many WordPress users regularly remove any unused images.

This article will help you define useless media correctly and get rid of it.

So let’s not beat around the bush, and look at how we can automatically remove unused images from the WordPress Media Library.

To run these functions from a Cron Job, I will hook each one to its own action. I would suggest using a plugin called WP Crontrol to add/run the events, but you could also do this:

if( ! wp_next_scheduled( 'your_cron_job_here' ) ) {
    // To remove the event later, use wp_unschedule_event() or wp_clear_scheduled_hook()
    wp_schedule_event( time(), 'twicedaily', 'your_cron_job_here' );
}

If you wish to disable WP-Cron from running automatically on every page load, you can trigger it directly with a web request to the wp-cron.php file. First add this to wp-config.php:

define('DISABLE_WP_CRON', true);

Log in to cPanel, locate the Cron Jobs option, add a new cron job, and insert this command:

wget -q -O - "https://your-website-url.com/wp-cron.php?doing_wp_cron" >/dev/null 2>&1

If for some reason you don’t want to run these through a Cron Job, you can fire them from action hooks like init or wp_head, or create a shortcode bound to the function and run it. You weirdo.

Delete unattached media (orphaned images that are not attached to any WordPress post, page or custom post)

Images that were added to content by URL are not considered attached! Only do this if you are sure there are no images like that.

PS: you can also see these images by going to Media -> All media items -> Unattached

add_action( 'weszty_web_delete_unattached_attachments', 'weszty_web_delete_unattached_attachments' );
function weszty_web_delete_unattached_attachments() {

    set_time_limit( 0 ); // this may slow down your site

    $attachments = get_posts( array(
        'post_type' => 'attachment',
        'numberposts' => -1, // probably better with 1000
        'fields' => 'ids',
        'post_parent' => 0,
    ));

    if ( $attachments ) {
        foreach ( $attachments as $attachmentID ) {
            $attachment_path = get_attached_file( $attachmentID );
            // Force delete (skip the trash): removes the database rows and the files on disk
            $delete_attachment = wp_delete_attachment( $attachmentID, true );
            // Safety net: remove the original file if it is still on disk
            $delete_file = wp_delete_file( $attachment_path ); // or unlink()
        }
    }
}
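
The comment next to numberposts hints that batches are safer than -1. Here is a minimal sketch of that idea in plain PHP. Both callbacks are hypothetical stand-ins I made up for illustration: $fetch_batch( $limit ) returns up to $limit IDs that still match (in WordPress it could wrap the get_posts() call above with 'numberposts' => $limit), and $delete_one( $id ) removes one item (for example with wp_delete_attachment( $id, true )). Because deleted items drop out of the query, each round asks for the first page again instead of advancing an offset.

```php
<?php
/**
 * Delete matching items in fixed-size batches.
 * Both callbacks are hypothetical stand-ins, not WordPress APIs.
 */
function delete_in_batches( callable $fetch_batch, callable $delete_one, int $batch_size = 1000, int $max_rounds = 100 ): int {
    $deleted = 0;
    // $max_rounds guards against an endless loop if a delete silently fails.
    for ( $round = 0; $round < $max_rounds; $round++ ) {
        // Always ask for the first page: earlier deletions removed the previous one.
        $ids = $fetch_batch( $batch_size );
        if ( empty( $ids ) ) {
            break;
        }
        foreach ( $ids as $id ) {
            $delete_one( $id );
            $deleted++;
        }
    }
    return $deleted;
}

// Usage with an in-memory list standing in for the database.
$store = range( 1, 25 );
$fetch = function ( int $limit ) use ( &$store ): array {
    return array_slice( $store, 0, $limit );
};
$delete = function ( int $id ) use ( &$store ): void {
    $store = array_values( array_diff( $store, array( $id ) ) );
};
$total = delete_in_batches( $fetch, $delete, 10 );
```

With a batch size of 10 the 25 sample items are removed in three rounds, and the fourth round finds nothing and stops.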

Delete all attachments related to a specific custom post type

add_action( 'weszty_web_delete_cpt_attachments', 'weszty_web_delete_cpt_attachments' );
function weszty_web_delete_cpt_attachments() {

    set_time_limit( 0 ); // this may slow down your site
	
    $delete_from_cpt = 'your custom post type name';
    $attachments = get_posts( array(
        'post_type' => 'attachment',
        'numberposts' => 2000,
    ));

    if ( $attachments ) {
        foreach ( $attachments as $attachment ) {
            $parent_id = $attachment->post_parent;
            if ( $delete_from_cpt === get_post_type( $parent_id ) ) {
                $attachmentID = $attachment->ID;
                $attachment_path = get_attached_file( $attachmentID );
                // Force delete (skip the trash): removes the database rows and the files on disk
                $delete_attachment = wp_delete_attachment( $attachmentID, true );
                // Safety net: remove the original file if it is still on disk
                $delete_file = wp_delete_file( $attachment_path ); // or unlink()
            }
        }
    }
}

Delete all attachments whose files are no longer present on the web server and return a 404 error

add_action( 'weszty_web_delete_404_attachments', 'weszty_web_delete_404_attachments' );
function weszty_web_delete_404_attachments() {

    set_time_limit( 0 ); // this may slow down your site

    $attachments = get_posts( array(
        'post_type' => 'attachment',
        'numberposts' => 500,
        'fields' => 'ids'
    ));

    // Caution: resource greedy
    if ( $attachments ) {
        foreach ( $attachments as $attachmentID ) {
            $file_url = wp_get_attachment_url( $attachmentID );
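            $file_headers = @get_headers( $file_url );
            // get_headers() returns false on failure; match the status code, not the whole status line
            if ( $file_headers && strpos( $file_headers[0], ' 404' ) !== false ) {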
                $deleted = wp_delete_attachment( $attachmentID, true );
            }
        }
    }
}

PS: this will not delete links to those images in posts content, sorry. I will post that here as soon as I get back on that task.
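
The HTTP check above costs one request per attachment. When the files live on the same server, a cheaper first pass is to test the path on disk. This is only a sketch, assuming a hypothetical $paths map of attachment ID to absolute file path (in WordPress the paths could come from get_attached_file()). It complements the 404 check rather than replacing it, since a file can exist on disk and still be unreachable over HTTP.

```php
<?php
/**
 * Return the IDs whose file is missing on disk.
 *
 * @param array $paths Hypothetical map: attachment ID => absolute path (or false).
 * @return int[] IDs with no regular file behind them.
 */
function find_missing_files( array $paths ): array {
    $missing = array();
    foreach ( $paths as $id => $path ) {
        // An empty path or a path that is not a regular file counts as missing.
        if ( ! $path || ! is_file( $path ) ) {
            $missing[] = $id;
        }
    }
    return $missing;
}

// Usage: one real temporary file, one path that does not exist, one empty path.
$existing = tempnam( sys_get_temp_dir(), 'media' );
$missing  = find_missing_files( array(
    10 => $existing,
    11 => $existing . '-does-not-exist.jpg',
    12 => false,
) );
unlink( $existing );
```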

How do we know how to use it correctly?

Let’s see some use cases.

Case 1: All images on your site are stored as WP attachments

Good media: all WordPress attachments

Bad media: anything stored in the uploads directory that is not a WP attachment

This function will free up disk space on your hosting account. It scans your wp-content/uploads directory recursively and checks whether each file it finds is a WP attachment.

All files that are not WP attachments will be deleted. Keep in mind that attachment_url_to_postid() matches only the file registered for the attachment itself, so generated sizes (for example photo-150x150.jpg) count as non-attachments here. If you have a rather large uploads folder (say 5 GB), you might want to split the process into parts by folder.

Delete orphaned files in wp-content/uploads that are not WP attachments

add_action( 'weszty_web_clean_uploads_from_nonattachments', 'weszty_web_clean_uploads_from_nonattachments' );
function weszty_web_clean_uploads_from_nonattachments() {

    set_time_limit( 0 ); // this may slow down your site

    $uploads_dir = wp_upload_dir();
    $search = $uploads_dir['basedir'];
    $replace = $uploads_dir['baseurl'];
    // You may want to take it by folder if your uploads is rather large (over 5 gb for example)
    // $uploads_dir = ( $uploads_dir['basedir'] . '/2023/' );
    $uploads_dir = ( $uploads_dir['basedir'] );
    $root = $uploads_dir;
    // Going through directory recursively
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $root, RecursiveDirectoryIterator::SKIP_DOTS ),
        RecursiveIteratorIterator::SELF_FIRST,
        RecursiveIteratorIterator::CATCH_GET_CHILD // Ignore "Permission denied"
    );

    foreach ( $iterator as $fileinfo ) {
        // get files only
        if ( $fileinfo->isFile() ) {
            $image = $fileinfo->getPathname();
            $image_url = str_replace( $search, $replace, $image );
            // Core WP function to retrieve attachment ID by URL
            $attachment_id = attachment_url_to_postid( $image_url );
            // Not found - then delete file
            if ( !$attachment_id ) {
                unlink( $image ); // or wp_delete_file()
            } else {
                // List of found attachments
                echo $attachment_id.': '.$image;
            }
        }
    }
}
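
Since the lookup in the function above resolves only the original file of an attachment, one way to protect generated sizes is to normalize the file name before the lookup. A minimal sketch follows; strip_size_suffix() is a hypothetical helper, and it assumes your originals never end in "-{width}x{height}" themselves.

```php
<?php
/**
 * Map a generated size such as "photo-150x150.jpg" back to "photo.jpg".
 * Assumption: originals never end in "-{width}x{height}" themselves.
 */
function strip_size_suffix( string $path ): string {
    // Remove "-123x456" only when it sits directly before the file extension.
    return preg_replace( '/-\d+x\d+(?=\.[A-Za-z0-9]+$)/', '', $path );
}

$examples = array(
    '2023/05/photo-150x150.jpg',
    '2023/05/photo.jpg',
    '2023/05/banner-1024x768.png',
);
$originals = array_map( 'strip_size_suffix', $examples );
```

In the loop above, the result could be passed to attachment_url_to_postid() instead of the raw URL, so a thumbnail is kept whenever its original is a known attachment.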

Case 2: images are not stored as WP attachments at all but are still used on your website

Good media: all images saved in specified custom fields

Bad media: anything else in uploads directory

First, we need to get the combined list of all “good” images that are used on the website, so later we can delete all files that are not used.

Here is a small example. I recently had a client with courses, podcasts and celebrities, where the images for every post type were stored in a separate custom field (2M images in total).

Images were imported from external sources like YT, the Forbes API, the Simplecast API, etc. To make it worse, there were no checks for whether an image already existed, so the Cron duplicated a lot of them, and posts sharing the same image each got their own copy on the server… hmm… so anyway, back to the cleaning.

Step 1: Define all good images

We’ll obtain the list of images used for each CPT and combine them together to get a full list of good images.

function get_good_images( $cpt, $meta_field ) {
    $args = array( 
        'post_type' => $cpt, 
        'posts_per_page' => -1, 
        'post_status' => 'any', 
        'fields' =>'ids'
    );
    $myposts = get_posts( $args );
    
    $all_posts = array();
    foreach ( $myposts as $mypostid ) {
        // get_field() comes from Advanced Custom Fields; the field should return the image URL
        $post_field = get_field( $meta_field, $mypostid );
        $all_posts[] = $post_field;
    }

    return $all_posts;
}

$all_courses_posts = get_good_images( 'courses', 'courses_image' );
$all_podcast_posts = get_good_images( 'podcast', 'podcast_image' );
$all_celebrity_posts = get_good_images( 'celebrity', 'celebrity_image' );

$all_good_pictures = array_merge( $all_podcast_posts, $all_celebrity_posts, $all_courses_posts );
$all_good_pictures = array_filter( $all_good_pictures );

Step 2: Delete images not used in WordPress

So we’ve got the complete list of good images that are used on the website in the variable $all_good_pictures. Now we’ll go through the wp-content/uploads directory and check whether each file is on the list. If it’s not listed, we’ll just delete it.

add_action( 'weszty_web_clean_uploads_from_bad_images', 'weszty_web_clean_uploads_from_bad_images' );
function weszty_web_clean_uploads_from_bad_images() {

    set_time_limit( 0 ); // this may slow down your site

    // TODO: this can be stored as a transient to not do the query over and over :)
    $all_courses_posts = get_good_images( 'courses', 'courses_image' );
    $all_podcast_posts = get_good_images( 'podcast', 'podcast_image' );
    $all_celebrity_posts = get_good_images( 'celebrity', 'celebrity_image' );

    $all_good_pictures = array_merge( $all_podcast_posts, $all_celebrity_posts, $all_courses_posts );
    $all_good_pictures = array_filter( $all_good_pictures );

    $uploads_dir = wp_upload_dir();
    $search = $uploads_dir['basedir'];
    $replace = $uploads_dir['baseurl'];
    // You may want to take it by folder if your uploads is rather large (over 5 gb for example)
    // $uploads_dir = ( $uploads_dir['basedir'] . '/2023/' );
    $uploads_dir = ( $uploads_dir['basedir'] );
    $root = $uploads_dir;
    // Going through directory recursively
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $root, RecursiveDirectoryIterator::SKIP_DOTS ),
        RecursiveIteratorIterator::SELF_FIRST,
        RecursiveIteratorIterator::CATCH_GET_CHILD // Ignore "Permission denied"
    );

    foreach ( $iterator as $fileinfo ) {
        if ( $fileinfo->isFile() ) {
            $image = $fileinfo->getPathname();
            $image_url = str_replace( $search, $replace, $image );
            // Delete if file is not found in list of good images
            if ( !in_array( $image_url, $all_good_pictures ) ) {
                wp_delete_file( $image ); // or unlink( $image );
            }
        }
    }
}
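
With a list as large as the 2M images mentioned earlier, in_array() rescans the whole list for every file. A minimal sketch of a faster lookup in plain PHP, assuming the good images are plain URL strings; split_by_usage() is a hypothetical helper name.

```php
<?php
/**
 * Split files into keep/delete lists using a hash lookup instead of in_array().
 * Assumes every entry in $good_urls is a string (array_flip() needs strings or integers).
 */
function split_by_usage( array $file_urls, array $good_urls ): array {
    // array_flip() turns values into keys, so isset() is a constant-time check.
    $good   = array_flip( $good_urls );
    $result = array( 'keep' => array(), 'delete' => array() );
    foreach ( $file_urls as $url ) {
        $result[ isset( $good[ $url ] ) ? 'keep' : 'delete' ][] = $url;
    }
    return $result;
}

$result = split_by_usage(
    array( 'https://example.com/a.jpg', 'https://example.com/b.jpg', 'https://example.com/c.jpg' ),
    array( 'https://example.com/a.jpg', 'https://example.com/c.jpg' )
);
```

Building the 'delete' list first also gives you a dry run: you can log it and review it before any file is removed.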

Step 3: The “Nuclear Option”!

Here we will check the entire site from scratch and clean all broken images from content. We will cover all you need to know using SQL and PHP.

Locate the wp_posts table in your WordPress database. This table stores the content of your posts and pages. Run the following SQL query to find all image attachments that are not associated with any existing post or page:

SELECT * FROM wp_posts WHERE post_type = 'attachment' AND post_parent NOT IN (SELECT ID FROM wp_posts);

This query selects all rows from the wp_posts table where the post_type is 'attachment' (media items, mostly images) and the post_parent (the ID of the associated post or page) is not present in the wp_posts table. That includes unattached items, whose post_parent is 0. These are likely to be orphaned image attachments. (pls don’t take this for granted)

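Once you are confident about deleting the images, turn the query into a delete. MySQL refuses a DELETE whose subquery reads from the same table (error 1093), so the same condition is written here as a self-join:

DELETE a FROM wp_posts a LEFT JOIN wp_posts p ON p.ID = a.post_parent WHERE a.post_type = 'attachment' AND p.ID IS NULL;

This query deletes the rows representing the images from the wp_posts table. It is faster than the PHP functions above, but it only touches the database: the files stay on disk and the related wp_postmeta rows are left behind.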

How can we know how many images we have in the content to remove them?

Unfortunately, a single query will not tell you enough. In WordPress there are several ways images can end up in content: image tags, hyperlink tags, gallery shortcodes, background images, and other shortcodes from plugins. Let’s keep it simple and make a guess using this query (you can alter it to search for the other kinds too):

SELECT * FROM wp_posts WHERE post_content LIKE '%src="%';

So let’s discuss the process a bit. We want to remove image and link tags that return 404, and fix the shortcodes: delete a gallery shortcode when every image ID in it is broken, and otherwise remove only the broken image IDs from it.

First let’s see how we can check if a URL is broken:

Regular expressions can help in identifying potential broken URLs based on patterns or common issues, but they do not provide a foolproof method for determining whether an attachment URL is genuinely broken.

To accurately determine if an attachment URL is broken or not, you typically need to make an HTTP request to the URL and analyze the response. If the response status is in the 4xx or 5xx range (e.g., 404 Not Found, 500 Internal Server Error), it indicates that the image URL is broken.

function is_url_broken( $url ) {
    $headers = @get_headers( $url );
    return !$headers || strpos( $headers[0], '404' ) !== false;
}
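
One detail worth knowing: get_headers() follows redirects by default and returns the headers of every response in one array, so the first status line can be a 301 while the final answer is a 404. A minimal sketch that reads the last status line instead; final_status_code() is a hypothetical helper, not part of the function above.

```php
<?php
/**
 * Extract the last HTTP status code from a get_headers()-style array.
 * Returns null when no status line is present.
 */
function final_status_code( array $headers ): ?int {
    $code = null;
    foreach ( $headers as $line ) {
        // Status lines look like "HTTP/1.1 404 Not Found" or "HTTP/2 200".
        if ( is_string( $line ) && preg_match( '#^HTTP/\S+\s+(\d{3})#', $line, $m ) ) {
            $code = (int) $m[1]; // Later status lines overwrite earlier redirects.
        }
    }
    return $code;
}

// Sample header array shaped like the output of get_headers() after one redirect.
$redirected = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: https://example.com/new.jpg',
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html',
);
$status = final_status_code( $redirected );
```

A URL could then be treated as broken when the code is null or falls in the 4xx/5xx range, depending on how strict you want to be.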

We will use regular expressions to match patterns in the text. If an image URL is found, the function makes an HTTP request and removes the tag if the URL is broken. If a gallery shortcode is found, it checks each image ID and removes only the broken IDs, preserving the rest of the gallery shortcode in the post content. We apply the same principle to captions (remove the caption shortcode when it contains a link but the link is 404).

function remove_broken_links_from_content( $post_content ) {
    // Pattern to match images and gallery/caption shortcodes
    $pattern = '/<(img|a)[^>]+(src|href)=[\'"]([^\'"]+)[\'"][^>]*>|\[gallery ids=[\'"]([^\'"]+)[\'"]\]|\[caption[^>]*\](.*?)\[\/caption\]/i';

    // Find all images
    preg_match_all( $pattern, $post_content, $matches );

    // Iterate through each matched items
    foreach ( $matches[0] as $key => $shortcode ) {
        $tagType = $matches[1][$key];
        $attribute = $matches[2][$key];
        $url = $matches[3][$key];
        $galleryIds = $matches[4][$key];
        $captionContent = $matches[5][$key];

        if ( $tagType === "img" && $attribute === "src" ) {
            if ( is_url_broken( $url ) ) {
                // URL is invalid or returns a 404 status code
                $post_content = str_replace( $shortcode, '', $post_content );
            }
        } elseif ( $tagType === "a" && $attribute === "href" ) {
            if ( is_url_broken( $url ) ) {
                // URL is invalid or returns a 404 status code
                $post_content = str_replace( $shortcode, '', $post_content );
            }
        } elseif ( strpos( $shortcode, '[gallery' ) === 0 ) {
            // Gallery shortcode
            $imageIds = explode( ',', $galleryIds );
            $validImageIds = [];

            foreach ( $imageIds as $imageId ) {
                $image = wp_get_attachment_image_src( $imageId, 'full' );

                // Keep the ID only if the attachment exists and its URL actually responds
                if ( $image && ! is_url_broken( $image[0] ) ) {
                    $validImageIds[] = $imageId;
                }
            }

            if ( count( $validImageIds ) === 0 ) {
                // All images in the gallery are broken, remove the entire gallery shortcode
                $post_content = str_replace( $shortcode, '', $post_content );
            } else {
                // Some images in the gallery are valid, construct a new gallery shortcode with only the valid image IDs
                $newGalleryIds = implode( ',', $validImageIds );
                $newShortcode = str_replace( $galleryIds, $newGalleryIds, $shortcode );
                $post_content = str_replace( $shortcode, $newShortcode, $post_content );
            }
        } elseif ( strpos( $shortcode, '[caption' ) === 0 ) {
            // Caption shortcode
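            if ( preg_match( '/<a[^>]*href=[\'"](.*?)[\'"][^>]*>/', $captionContent, $captionLink ) ) {
                // $url is empty for caption matches, so test the link captured from the caption itself
                if ( is_url_broken( $captionLink[1] ) ) {
                    // URL is invalid or returns a 404 status code
                    $post_content = str_replace( $shortcode, '', $post_content );
                }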
            } else {
                // Caption does not contain a link, remove the entire caption shortcode
                $post_content = str_replace( $shortcode, '', $post_content );
            }
        }
    }

    return $post_content;
}
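
The gallery branch is the easiest part to test in isolation, without any HTTP requests. A minimal sketch in plain PHP: filter_gallery_shortcode() and the $is_valid callback are hypothetical names, and in WordPress the callback could wrap the attachment check used above.

```php
<?php
/**
 * Rebuild a [gallery ids="..."] shortcode, keeping only IDs accepted by $is_valid.
 * Returns an empty string when no ID survives.
 * $is_valid is a hypothetical callback standing in for the real attachment check.
 */
function filter_gallery_shortcode( string $shortcode, callable $is_valid ): string {
    if ( ! preg_match( '/\[gallery ids=[\'"]([^\'"]+)[\'"]\]/i', $shortcode, $m ) ) {
        return $shortcode; // Not a plain ids-only gallery: leave it untouched.
    }
    $ids  = array_map( 'trim', explode( ',', $m[1] ) );
    $kept = array_values( array_filter( $ids, $is_valid ) );
    if ( count( $kept ) === 0 ) {
        return '';
    }
    return '[gallery ids="' . implode( ',', $kept ) . '"]';
}

// Usage: pretend attachment 12 is broken.
$is_valid = function ( string $id ): bool {
    return $id !== '12';
};
$cleaned = filter_gallery_shortcode( '[gallery ids="11,12,13"]', $is_valid );
$emptied = filter_gallery_shortcode( '[gallery ids="12"]', $is_valid );
```

Here $cleaned keeps IDs 11 and 13, and $emptied becomes an empty string because its only ID is rejected.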

I’m testing on a database with 70k+ images and over 7k posts. Processing a large number of posts with multiple images in the content can be resource-intensive and may take a significant amount of time.

First, it is important to test on a single post to ensure it works as expected before applying it to your entire database. I created a second editor field and save the cleaned data there on the save/update event.

function clean_data( $post_id ) {
    // Bail if we're doing an auto save
    if( defined( 'DOING_AUTOSAVE' ) && DOING_AUTOSAVE ) return;

    // Retrieve the post object
    $post = get_post( $post_id );

    // Get the post content
    $post_content = $post->post_content;

    // Remove broken links from the post content
    $updated_content = remove_broken_links_from_content( $post_content );

    update_post_meta( $post_id, 'second_editor_data', $updated_content );
}
// save_post fires on both creation and update, so one hook covers both events
add_action( 'save_post', 'clean_data', 10 );

Here you can test all cases you want before running on the full site. Safety first!

To optimize the process we will use batch processing: instead of processing every post in a single execution, you break the work down into smaller batches. This helps prevent timeouts and reduces memory usage. You can use WordPress functions like get_posts() or WP_Query to retrieve a batch of posts at a time, process them, and move on to the next batch.

Implement proper logging and error handling mechanisms to track the progress and handle any errors that may occur during the processing. This will help you identify and resolve any issues more efficiently.

function remove_broken_links_cron_job() {
    error_log( 'Cron job started.' ); // Log entry

    $to = 'your email'; // Replace with your email address
    $subject = 'Cron job started'; // Email subject
    $message = 'The cron job has started successfully.'; // Email message
    $headers = array( 'Content-Type: text/html; charset=UTF-8' );
    wp_mail( $to, $subject, $message, $headers );

    // Query posts in batches
    $batch_size = 5; // Number of posts to process in each batch
    $current_offset = 0; // Starting offset

    // Loop through batches until all posts are processed
    while ( true ) {
        // Retrieve a batch of posts
        $args = array(
            'post_type'      => 'post',
            'post_status'    => 'any',
            'posts_per_page' => $batch_size,
            'offset'         => $current_offset,
            // Add any additional query parameters as needed
        );
        $posts = get_posts( $args );

        // Break the loop if no more posts are found
        if ( empty( $posts ) ) {
            break;
        }

        // Process each post in the batch
        foreach ( $posts as $post ) {
            // Get the post content
            $post_content = $post->post_content;

            // Remove broken links and images from the post content
            $processed_content = remove_broken_links_from_content( $post_content );

            // Update the post content if changes were made
            if ( $processed_content !== $post_content ) {
                $post->post_content = $processed_content;
                wp_update_post( $post );
            }
        }

        // Increment the offset to retrieve the next batch
        $current_offset += $batch_size;
        error_log( 'Cron job posts: '. $current_offset ); // Log entry
    }

    error_log( 'Cron job completed.' ); // Log entry

    $to = 'your email'; // Replace with your email address
    $subject = 'Cron job completed'; // Email subject
    $message = 'The cron job has completed successfully.'; // Email message
    $headers = array( 'Content-Type: text/html; charset=UTF-8' );
    wp_mail( $to, $subject, $message, $headers );

}
add_action( 'remove_broken_links_cron_job', 'remove_broken_links_cron_job' );

I let it run for 1h. Since the number of requests per post is unknown, I use chunks of 5. Here are the logs:

[12-Jun-2023 15:06:40 UTC] Cron job started.
[12-Jun-2023 15:07:41 UTC] Cron job posts: 5
[12-Jun-2023 15:08:30 UTC] Cron job posts: 10
[12-Jun-2023 15:09:17 UTC] Cron job posts: 15
[12-Jun-2023 15:10:08 UTC] Cron job posts: 20
[12-Jun-2023 15:11:16 UTC] Cron job posts: 25
[12-Jun-2023 15:12:06 UTC] Cron job posts: 30
[12-Jun-2023 15:13:04 UTC] Cron job posts: 35
[12-Jun-2023 15:14:00 UTC] Cron job posts: 40
[12-Jun-2023 15:15:08 UTC] Cron job posts: 45
[12-Jun-2023 15:15:57 UTC] Cron job posts: 50
[12-Jun-2023 15:16:15 UTC] Cron job posts: 55
[12-Jun-2023 15:16:34 UTC] Cron job posts: 60
[12-Jun-2023 15:16:53 UTC] Cron job posts: 65
[12-Jun-2023 15:17:13 UTC] Cron job posts: 70
[12-Jun-2023 15:17:32 UTC] Cron job posts: 75
[12-Jun-2023 15:17:50 UTC] Cron job posts: 80
[12-Jun-2023 15:18:08 UTC] Cron job posts: 85
[12-Jun-2023 15:18:27 UTC] Cron job posts: 90
[12-Jun-2023 15:18:46 UTC] Cron job posts: 95
[12-Jun-2023 15:20:02 UTC] Cron job posts: 100
[12-Jun-2023 15:21:32 UTC] Cron job posts: 105
[12-Jun-2023 15:21:59 UTC] Cron job posts: 110
[12-Jun-2023 15:26:18 UTC] Cron job posts: 115
[12-Jun-2023 15:27:06 UTC] Cron job posts: 120
[12-Jun-2023 15:28:39 UTC] Cron job posts: 125
[12-Jun-2023 15:30:13 UTC] Cron job posts: 130
[12-Jun-2023 15:31:21 UTC] Cron job posts: 135
[12-Jun-2023 15:32:08 UTC] Cron job posts: 140
[12-Jun-2023 15:33:07 UTC] Cron job posts: 145
[12-Jun-2023 15:36:42 UTC] Cron job posts: 150
[12-Jun-2023 15:37:53 UTC] Cron job posts: 155
[12-Jun-2023 15:38:32 UTC] Cron job posts: 160
[12-Jun-2023 15:39:20 UTC] Cron job posts: 165
[12-Jun-2023 15:41:50 UTC] Cron job posts: 170
[12-Jun-2023 15:42:40 UTC] Cron job posts: 175
[12-Jun-2023 15:43:41 UTC] Cron job posts: 180
[12-Jun-2023 15:44:51 UTC] Cron job posts: 185
[12-Jun-2023 15:49:28 UTC] Cron job posts: 190
[12-Jun-2023 15:51:40 UTC] Cron job posts: 195
[12-Jun-2023 15:54:40 UTC] Cron job posts: 200
[12-Jun-2023 15:55:34 UTC] Cron job posts: 205
[12-Jun-2023 15:56:38 UTC] Cron job posts: 210
[12-Jun-2023 15:57:49 UTC] Cron job posts: 215
[12-Jun-2023 15:58:52 UTC] Cron job posts: 220
[12-Jun-2023 15:59:44 UTC] Cron job posts: 225
[12-Jun-2023 16:00:41 UTC] Cron job posts: 230
[12-Jun-2023 16:01:31 UTC] Cron job posts: 235
[12-Jun-2023 16:03:45 UTC] Cron job posts: 240
[12-Jun-2023 16:05:00 UTC] Cron job posts: 245
[12-Jun-2023 16:06:13 UTC] Cron job posts: 250

Based on this report, we can estimate how long a full cleanup will take for any number of posts. For these kinds of features, I typically create a non-repeating cron event that I can follow using a plugin like WP Crontrol.
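
To make that estimate concrete, here is a rough extrapolation in plain PHP. It uses the first and last timestamps from the log and assumes the pace stays roughly constant; the 7000 total is my assumption based on the "over 7k posts" mentioned earlier, and estimate_runtime_hours() is a hypothetical helper.

```php
<?php
/**
 * Extrapolate total runtime (in hours) from a sample run.
 */
function estimate_runtime_hours( string $started, string $last_entry, int $posts_done, int $posts_total ): float {
    $elapsed  = strtotime( $last_entry ) - strtotime( $started ); // seconds
    $per_post = $elapsed / $posts_done;
    return round( $per_post * $posts_total / 3600, 1 );
}

// Timestamps taken from the first and last log lines; 7000 is an assumed total.
$hours = estimate_runtime_hours(
    '12-Jun-2023 15:06:40 UTC',
    '12-Jun-2023 16:06:13 UTC',
    250,
    7000
);
```

The sample run handled 250 posts in 3573 seconds, a bit over 14 seconds per post, which puts a full pass over 7000 posts at about 27.8 hours.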

Following these steps, we can organize and clean up our WordPress installation, minimize data usage and CPU load, and even switch to a less expensive hosting plan.

Don't be weird.
