Nobody here but us chickens! #6

Closed
opened 2024-04-19 01:06:24 +00:00 by david · 5 comments

I get this error:

Nobody here but us chickens!

And also the scraper error:

This scraper returned an error:
Google returned an unsupported page format (will fix)

This happens maybe 5-10% of the time I search. If I click search again (or a few times), it eventually returns results. I know the scraper error is due to Google changes, but the "chickens" error seems to happen because no results are returned.

Do you think that is also scraper-related, or could there be something else going on that could cause that error?

Also, a feature request (or idea) for consideration: an option to display the proxy being used somewhere in the search results. If there's an issue, it's helpful to know which proxy is being used, and it's interesting info to have. I added this to backend.php and frontend.php for myself, but I know it's not the best way to do it.

Added to the assign_proxy function in lib/backend.php, before switch($type):

```
// Test to output proxy information
switch ($port) {
    case "8888":
        $GLOBALS["proxy_display"] = "Proxy: Home";
        break;
    case "8310":
        $GLOBALS["proxy_display"] = "Proxy: Tokyo";
        break;
    case "8320":
        $GLOBALS["proxy_display"] = "Proxy: Singapore";
        break;
    // ...
    case "":
        $GLOBALS["proxy_display"] = "Proxy: Direct";
        break;
    default:
        $GLOBALS["proxy_display"] = "Proxy: Port $port";
}
```

Added to lib/frontend.php to display proxy:

```
// Test to display proxy
$replacements["timetaken"] = '<div class="timetaken">Took ' . number_format(microtime(true) - $replacements["timetaken"], 2) . 's ' . $GLOBALS["proxy_display"] . '</div>';
```

Owner

Yes.. The Google scraper is due for a rewrite. Let me explain the errors you've been getting:

chicken error

This is caused by Google returning some miscellaneous error page. I don't detect it, so I attempt to get all the search results from the body even though the page doesn't have any.
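Roughly, the fix would be to check for Google's error/interstitial page before trying to parse results. A minimal sketch of that idea is below; the function name and marker strings are just examples for illustration, not what the scraper actually checks for:

```
// Illustrative only: detect Google's error/captcha interstitial before parsing.
// The markers below are example strings, not the scraper's real checks.
function looks_like_google_error(string $html): bool{

	// the "sorry" page and rate-limit notices are the usual culprits
	if(stripos($html, "unusual traffic") !== false){ return true; }
	if(stripos($html, "/sorry/") !== false){ return true; }

	return false;
}

// usage sketch inside the scraper:
// if(looks_like_google_error($html)){
//     throw new Exception("Google returned an error page");
// }
```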

unsupported page

Google has started doing A/B testing and sometimes returns a newer interface, which has not been parsed yet. My scraper works by getting nodes by their CSS style attributes, so it is very prone to breakage. I do this because most of the node names on the Google page are random strings that change on every page load. I also opted to scrape the mobile version of Google because it returns more sublinks (despite not returning sublink descriptions), and it also shows results faster. Historically, the mobile page for old browsers hadn't changed in almost a decade, but it seems the old interface is slowly getting replaced.
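For reference, the style-attribute approach boils down to something like this rough sketch; the selector string here is made up, not the one the scraper actually uses:

```
// Rough sketch of style-attribute scraping. Class names are randomized on
// every page load, so nodes are matched on their inline style instead.
// The "padding:12px" selector is purely illustrative.
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from Google's messy markup

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[contains(@style, "padding:12px")]');

foreach($nodes as $node){
	echo trim($node->textContent) . "\n";
}
```

The obvious downside is that any change to those inline styles breaks the selectors, which is exactly what keeps happening.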

SearxNG uses some fucky API that returns different results in my testing; their method also doesn't let me scrape word definitions, so I won't be using that.

In the next update, I will be spoofing the user agent of a newer Android tablet running Android 4.2, since that gets served the new layout and even returns (albeit small) video descriptions.
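Something along these lines, with curl; the UA string below is just an example Android 4.2 tablet (Nexus 7) string, not necessarily the exact one that will ship:

```
// Example of sending a spoofed tablet user agent with curl. The UA string is
// an example Android 4.2 string, not necessarily the final one.
$ua = "Mozilla/5.0 (Linux; Android 4.2.2; Nexus 7 Build/JDQ39) " .
	"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.0.0 Safari/537.36";

$ch = curl_init("https://www.google.com/search?q=test");
curl_setopt($ch, CURLOPT_USERAGENT, $ua);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$html = curl_exec($ch);
curl_close($ch);
```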

Proxy stuff you mentioned

Yes... I also had to implement my own methods to debug some of my proxies. A complete rewrite of the user interface is coming. An admin interface to check proxies will also be made at some point; it's just been really hard to find time to work on this stuff lately, as I'm pretty burnt out from my dead-end job xoxo

Don't expect the Google scraper to be fixed this month, although I might get to it in May. I also have a week off in August, so expect some movement then.

Thank you for your time.

Author

No worries. Thanks for the info.

Owner

Hey, just wanted to keep you updated. I'm working on a new version of the scraper that scrapes the desktop version. Their webpage is a clusterfuck, so it will take time, but expect some movement next weekend.

Owner

Sorry for the wait; the update is here. Please let me know of any issues you encounter!

Author

I updated my instance and I've been using it with Google yesterday and today, with several different proxies. I've only gotten the "chickens" error once. I do get other scraper errors sometimes, though, like:

Failed to grep result div
Failed to get HTML

It is a good improvement, though. Thanks!
