still missing things on google scraper

pull/1/head
lolcat 2023-07-22 14:41:14 -04:00
commit bca265aea6
90 changed files with 17559 additions and 0 deletions

72
README.md 100644

@ -0,0 +1,72 @@
# 4get
4get is a metasearch engine that doesn't suck (they live in our walls!)
## About 4get
https://4get.ca/about
## Try it out
https://4get.ca
# Setup
Login as root.
```sh
apt install apache2 certbot php-dom php-imagick imagemagick php-curl curl php-apcu git libapache2-mod-php python3-certbot-apache
service apache2 start
a2enmod rewrite
```
For all of the files in `/etc/apache2/sites-enabled/`, you must apply the following changes:
- Uncomment `ServerName` directive, put your domain name there
- Change `ServerAdmin` to your email
- Change `DocumentRoot` to `/var/www/html/4get`
- Change `ErrorLog` and `CustomLog` directives to log stuff out to `/dev/null`
Now open `/etc/apache2/apache2.conf` and change `ErrorLog` and `CustomLog` directives to have `/dev/null` as a value
This *should* disable logging completely, but I'm not 100% sure since I sort of had to troubleshoot a lot of shit while writing this. So after we're done check if `/var/log/apache2/*` contains any personal info, and if it does, tell me I'm an idiot over email.
Blindly run the following shit
```sh
cd /var/www/html
git clone https://git.lolcat.ca/lolcat/4get
cd 4get
mkdir icons
chmod 777 -R icons/
```
Restart the service for good measure... `service apache2 restart`
## Setup encryption
I'm schizoid (as you should be) so I'm gonna set up 4096-bit key encryption. To complete this step, you need a domain or subdomain in your possession. Make sure that the DNS shit for your domain has propagated properly before continuing, because certbot is a piece of shit that will error out the ass once you reach 5 attempts under an hour.
```sh
certbot --apache --rsa-key-size 4096 -d www.yourdomain.com -d yourdomain.com
```
When it asks to choose a vhost, choose the option with "HTTPS" listed. Don't set up HTTPS for tor, we don't need it (it doesn't even work anyways with Let's Encrypt)
Edit `000-default-le-ssl.conf`
Add this at the end:
```xml
<Directory /var/www/html/4get>
RewriteEngine On
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.*) $1.php [L]
Options Indexes FollowSymLinks
AllowOverride All
Require all granted
</Directory>
```
Now since this file is located in `/etc/apache2/sites-enabled/`, you must change all of the logging shit as to make it not log anything, like we did earlier.
Restart again
```sh
service apache2 restart
```
You'll probably want to set up a tor address at this point, but I'm too lazy to put instructions here.
Ok bye!!!

130
about.php 100644

@ -0,0 +1,130 @@
<?php
include "lib/frontend.php";
$frontend = new frontend();
echo
'<!DOCTYPE html>' .
'<html lang="en">' .
'<head>' .
'<meta http-equiv="Content-Type" content="text/html;charset=utf-8">' .
'<title>About</title>' .
'<link rel="stylesheet" href="/static/style.css">' .
'<meta name="viewport" content="width=device-width,initial-scale=1">' .
'<meta name="robots" content="index,follow">' .
'<link rel="icon" type="image/x-icon" href="/favicon.ico">' .
'<meta name="description" content="4get.ca: About">' .
'<link rel="search" type="application/opensearchdescription+xml" title="4get" href="/opensearch.xml">' .
'</head>' .
'<body class="' . $frontend->getthemeclass(false) . 'about">';
$left =
'<a href="/" class="link">&lt; Go back</a>
<h1>Set as default search engine</h1>
<a href="#firefox"><h2 id="firefox">On Firefox and other Gecko based browsers</h2></a>
To set this as your default search engine on Firefox, right click the URL bar and select <div class="code-inline">Add "4get"</div>. Then, visit <a href="about:preferences#search" target="_BLANK" class="link">about:preferences#search</a> and select <div class="code-inline">4get</div> in the dropdown menu.
<a href="#chrome"><h2 id="chrome">On Chromium and Blink based browsers</h2></a>
Right click the URL bar and click <div class="code-inline">Manage search engines and site search</div>, or visit <a href="chrome://settings/searchEngines" target="_BLANK" class="link">chrome://settings/searchEngines</a>. Then, create a new entry under <div class="code-inline">Search engines</div> and fill in the following details:
<table>
<tr>
<td><b>Field</b></td>
<td><b>Value</b></td>
</tr>
<tr>
<td>Search engine</td>
<td>4get</td>
</tr>
<tr>
<td>Shortcut</td>
<td>4get.ca</td>
</tr>
<tr>
<td>URL with %s in place of query</td>
<td>https://4get.ca/web?q=%s</td>
</tr>
</table>
Once that\'s done, click <div class="code-inline">Save</div>. Then, on the right-hand side of the newly created entry, open the dropdown menu and select <div class="code-inline">Make default</div>.
<a href="#other-browsers"><h2 id="other-browsers">Other browsers</h2></a>
Get a real browser.
<h1>Frequently asked questions</h1>
<a href="#what-is-this"><h2 id="what-is-this">What is this?</h2></a>
This is a metasearch engine that gets results from other engines, and strips away all of the tracking parameters and Microsoft/globohomo bullshit they add. Most of the other alternatives to Google jack themselves off about being ""privacy respecting"" or whatever the fuck but it always turns out to be a total lie, and I just got fed up with their shit honestly. Alternatives like Searx or YaCy all fucking sucks so I made my own thing.
<a href="#goal"><h2 id="goal">My goal</h2></a>
Provide users with a privacy oriented, extremely lightweight, ad free, free as in freedom (and free beer!) way to search for documents around the internet, with minimal, optional javascript code. My long term goal would be to build my own index (that doesn\'t suck) and provide users with an unbiased search engine, with no political inclinations.
<a href="#logs"><h2 id="logs">Do you keep logs?</h2></a>
I store data temporarily to get the next page of results. This might include search queries, tokens and other parameters. These parameters are encrypted using <div class="code-inline">aes-256-gcm</div> on the serber, for which I give you a key (also known internally as <div class="code-inline">npt</div> token). When you make a request to get the next page, you supply the token, the data is decrypted and the request is fulfilled. This encrypted data is deleted after 7 minutes, or after it\'s used, whichever comes first.<br><br>
I <b>don\'t</b> log IP addresses, user agents, or anything else. The <div class="code-inline">npt</div> tokens are the only things that are stored (in RAM, mind you), temporarily, encrypted.
<a href="#information-sharing"><h2 id="information-sharing">Do you share information with third parties?</h2></a>
Your search queries and supplied filters are shared with the scraper you chose (so I can get the search results, duh). I don\'t share anything else (that means I don\'t share your IP address, location, or anything of this kind). There is no way that site can know you\'re the one searching for something, <u>unless you send out a search query that de-anonymises you.</u> For example, a search query like "hello my full legal name is jonathan gallindo and i want pictures of cloacas" would definitively blow your cover. 4get doesn\'t contain ads or any third party javascript applets or trackers. I don\'t profile you, and quite frankly, I don\'t give a shit about what you search on there.<br><br>
TL;DR assume those websites can see what you search for, but can\'t see who you are (unless you\'re really dumb).
<a href="#hosting"><h2 id="hosting">Where is this website hosted?</h2></a>
This website is hosted on a Contabo shitbox in the United States.
<a href="#keyboard-shortcuts"><h2 id="keyboard-shortcuts">Keyboard shortcuts?</h2></a>
Use <div class="code-inline">/</div> to focus the search box.<br><br>
When the image viewer is open, you can use the following keybinds:<br>
<div class="code-inline">Up</div>, <div class="code-inline">Down</div>, <div class="code-inline">Left</div>, <div class="code-inline">Right</div> to rotate the image.<br>
<div class="code-inline">CTRL+Up</div>, <div class="code-inline">CTRL+Down</div>, <div class="code-inline">CTRL+Left</div>, <div class="code-inline">CTRL+Right</div> to mirror the image.<br>
<div class="code-inline">Escape</div> to exit the image viewer.
<a href="#instances"><h2 id="instances">Instances</h2></a>
4get is open source, anyone can create their own 4get instance! If you wish to add your website to this list, please <a href="https://lolcat.ca/contact">contact me</a>.
<table>
<tr>
<td>Name</td>
<td>Address</td>
</tr>
<tr>
<td>4get</td>
<td><a href="https://4get.ca">4get.ca</a><a href="http://4getwebfrq5zr4sxugk6htxvawqehxtdgjrbcn2oslllcol2vepa23yd.onion/">(tor)</a></td>
</tr>
</table>
<a href="#schizo"><h2 id="schizo">How can I trust you?</h2></a>
You just sort of have to take my word for it right now. If you\'d rather trust yourself instead of me (I believe in you!!), all of the code on this website is available through my <a href="https://git.lolcat.ca/lolcat" class="link">git page</a> for you to host on your own machines. Just a reminder: if you\'re the sole user of your instance, it doesn\'t take immense brain power for Microshit to figure out you basically just switched IP addresses. Invite your friends to use your instance!
<a href="#contact"><h2 id="contact">I want to report abuse or have erotic roleplay through email</h2></a>
I don\'t know about that second part but if you want to talk to me, just drop me an email...<br><br>
<b>Message to all DMCA enforcers:</b> I don\'t host any of the content. Everything you see here is <u>proxied</u> through my shitbox with no moderation. Please reach out to the people hosting the infringing content instead.<br><br>
<a href="https://lolcat.ca/contact" rel="dofollow" class="link">Click here to contact me!</a><br><br>
<a href="https://validator.w3.org/nu/?doc=https%3A%2F%2F4get.ca" title="W3 Valid!">
<img src="/static/icon/w3html.png" alt="Valid W3C HTML 4.01" width="88" height="31">
</a>';
// trim out whitespace
$left = explode("\n", $left);
$out = "";
foreach($left as $line){
$out .= trim($line);
}
echo
$frontend->load(
"search.html",
[
"class" => "",
"right-left" => "",
"right-right" => "",
"left" => $out
]
);

289
api.txt 100644

@ -0,0 +1,289 @@
__ __ __
/ // / ____ ____ / /_
/ // /_/ __ `/ _ \/ __/
/__ __/ /_/ / __/ /_
/_/ \__, /\___/\__/
/____/
+ Welcome to the 4get API documentation +
+ Terms of use
Do NOT misuse the API. Misuses can include... ::
1. Serp SEO scanning
2. Intensive scraping
3. Any other activity that isn't triggered by a human
4. Illegal activities in Canada
5. Constant "test" queries while developing your program
(please cache the API responses!)
Examples of good uses of the API ::
1. A chatroom bot that presents users with search results
2. Personal use
3. Any other activity that is initiated by a human
If you wish to engage in the activities listed under "misuses", feel
free to download the source code of the project and run 4get
under your own terms. Please respect the terms of use listed here so
that this website may be available to all in the far future.
Get your instance running here ::
https://git.lolcat.ca/lolcat/4get
Thanks!
+ Decode the data
All payloads returned by the API are encoded in the JSON format. If
you don't know how to tackle the problem, maybe programming is not
for you.
All of the endpoints use the GET method.
+ Check if an API call was successful
All API responses come with an array index named "status". If the
status is anything other than the string "ok", something went wrong.
The HTTP code will always be 200 as to not cause issues with CORS.
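A client-side check for the rule above could look like this minimal Python sketch. The names `ApiError` and `check_status` are illustrative and not part of 4get; only the "status" field and the "ok" value come from this document.

```python
import json


class ApiError(Exception):
    """Raised when a 4get API payload reports a status other than "ok"."""


def check_status(raw_body: str) -> dict:
    # The HTTP code is not a reliable error signal according to the text
    # above, so the decoded "status" field is inspected instead.
    payload = json.loads(raw_body)
    status = payload.get("status")
    if status != "ok":
        raise ApiError(f"4get API error: {status}")
    return payload
```

Calling `check_status()` on every response body before touching any other field keeps the error handling in one place.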
+ Get the next page of results
All API responses come with an array index named "npt". To get
the next page of results, you must make another API call with &npt.
Example ::
+ First API call
/api/v1/web?s=higurashi
+ Second API call
/api/v1/web?npt=ddg1._rJ2hWmYSjpI2hsXWmYajJx < ... >
Don't specify the search term; the &npt parameter alone
suffices.
The first part of the token before the dot (ddg1) refers to an
array position on the serber's memory. The second part is an
encryption key used to decode the data at that position. This way,
it is impossible to supply invalid pagination data and it is
impossible for a 4get operator to peek at the private data of the
user after a request has been made.
The tokens will expire as soon as they are used or after 7 minutes
of inactivity, whichever comes first.
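As a rough illustration of the pagination flow described above, the following Python sketch builds the follow-up request and splits a token into its two parts. The function names and the `base_url` argument are assumptions made for the example; the token layout (position before the dot, key after it) is taken from the text.

```python
from urllib.parse import urlencode


def next_page_url(base_url: str, endpoint: str, npt: str) -> str:
    # Only the npt parameter is sent; the search term is not repeated.
    return f"{base_url}/api/v1/{endpoint}?{urlencode({'npt': npt})}"


def split_token(npt: str) -> tuple:
    # The part before the first dot names the storage position,
    # everything after it is the decryption key.
    position, _, key = npt.partition(".")
    return position, key
```

Since a token expires once it is used, a client would request each page at most once and cache the response if it needs it again.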
+ Beware of null values!
Most fields in the API responses can return "null". You don't need
to worry about unset values.
+ API Parameters
To construct a valid request, craft it in the 4get web interface,
then replace "/web" with "/api/v1/web".
+ "date" and "time" parameters
"date" always refers to a calendar date.
"time" always refers to the duration of some media.
Both are integers expressed in seconds. The "date"
parameter specifies the number of seconds that have passed since
January 1st 1970.
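The two units can be handled as in the Python sketch below. It assumes "date" values are counted from midnight UTC, which this document does not state explicitly, and both helper names are made up for the example.

```python
from datetime import datetime, timezone


def to_api_date(year: int, month: int, day: int) -> int:
    # Seconds elapsed since January 1st 1970 (assumed to be UTC midnight).
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def format_duration(seconds: int) -> str:
    # Turn a "time" value (media duration in seconds) into H:MM:SS.
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:d}:{minutes:02d}:{secs:02d}"
```

For instance, `format_duration(3725)` gives `1:02:05`.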
______ __ _ __
/ ____/___ ____/ /___ ____ (_)___ / /______
/ __/ / __ \/ __ / __ \/ __ \/ / __ \/ __/ ___/
/ /___/ / / / /_/ / /_/ / /_/ / / / / / /_(__ )
/_____/_/ /_/\__,_/ .___/\____/_/_/ /_/\__/____/
/_/
+ /api/v1/web
+ &extendedsearch
When using the ddg(DuckDuckGo) scraper, you may make use of the
&extendedsearch parameter. If you need rich answer data from
additional sources like StackOverflow, music lyrics sites, etc.,
you need to specify the value of (string)"true".
The default value is "false" for API calls.
+ Parse the "spelling"
The array index named "spelling" contains 3 indexes ::
spelling:
type: "including"
using: "4chan"
correction: '"4cha"'
The "type" may be any of these 3 values. When rendering the
autocorrect text inside your application, it should look like
what follows right after the parameter value ::
no_correction <Empty>
including Including results for %using%. Did you mean
%correction%?
not_many Not many results for %using%. Did you mean
%correction%?
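The rendering table above maps directly onto a small lookup. This Python sketch is one possible way to do it; `render_spelling` is a hypothetical helper, and unknown "type" values are treated like "no_correction" as an assumption.

```python
def render_spelling(spelling: dict) -> str:
    # One template per "type" value listed in the table above.
    templates = {
        "no_correction": "",
        "including": "Including results for {using}. Did you mean {correction}?",
        "not_many": "Not many results for {using}. Did you mean {correction}?",
    }
    # Unknown types fall back to an empty string (assumption).
    template = templates.get(spelling.get("type"), "")
    return template.format(
        using=spelling.get("using"),
        correction=spelling.get("correction"),
    )
```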
As of right now, the "spelling" is only available on
"/api/v1/web".
+ Parse the "answer"
The array index named "answer" may contain a list of multiple
answers. The array index "description" contains a linear list of
nodes that can help you construct rich formatted data inside of
your application. The structure is similar to the one below:
answer:
0:
title: "Higurashi"
description:
0:
type: "text"
value: "Higurashi is a great show!"
1:
type: "quote"
value: "Source: my ass"
Each "description" node contains an array index named "type".
Here is a list of them:
text
+ title
italic
+ quote
+ code
inline_code
link
+ image
+ audio
Each node type marked with a "+" should be preceded by
a newline when constructing the rendered description object.
There are some nodes that differ from the type-value format.
Please parse them accordingly ::
+ link
type: "link"
url: "https://lolcat.ca"
value: "Visit my website!"
+ image
type: "image"
url: "https://lolcat.ca/static/pixels.png"
+ audio
type: "audio"
url: "https://lolcat.ca/static/whatever.mp3"
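To show how the node list could be flattened, here is a minimal Python sketch that renders a "description" array to plain text. The output format for links, images and audio is an arbitrary choice for the example and `render_description` is a hypothetical name; the newline rule and the node fields follow the description above.

```python
# Node types marked with "+" in the list above start on a new line.
BLOCK_TYPES = {"title", "quote", "code", "image", "audio"}


def render_description(nodes: list) -> str:
    out = []
    for node in nodes:
        kind = node.get("type")
        if kind == "link":
            # Links carry both a label ("value") and a target ("url").
            label = node.get("value") or node.get("url")
            text = f"{label} <{node.get('url')}>"
        elif kind in ("image", "audio"):
            # Media nodes only carry a "url".
            text = f"[{kind}: {node.get('url')}]"
        else:
            # Plain type/value nodes; "value" may be null.
            text = node.get("value") or ""
        if kind in BLOCK_TYPES:
            text = "\n" + text
        out.append(text)
    return "".join(out)
```

A real application would map each type to its own widget or markup instead of plain text, but the traversal stays the same.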
The array index named "table" is an associative array. You can
loop over the data using this PHP code, for example ::
foreach($table as $website_name => $url){ // ...
The rest of the JSON is pretty self explanatory.
+ /api/v1/images
All images are contained within "image". The structure looks like
below ::
image:
0:
title: "My awesome Higurashi image"
source:
0:
url: "https://lolcat.ca/static/profile_pix.png"
width: 400
height: 400
1:
url: "https://lolcat.ca/static/pixels.png"
width: 640
height: 640
2:
url: "https://tse1.mm.bing.net/th?id=OIP.VBM3BQg
euf0-xScO1bl1UgHaGG"
width: 194
height: 160
The last image of the "source" array is always the thumbnail, and is
a good fallback to use when other sources fail to load. There can be
more than 1 source; this is especially true when using the Yandex
scraper, but beware of captcha rate limits.
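Picking a display image and its fallback from "source" might look like the Python sketch below. That the last entry is the thumbnail comes from the text above; treating the first entry as the preferred full-size image is an assumption based on how the web frontend links results, and `pick_sources` is a hypothetical helper.

```python
def pick_sources(image: dict) -> tuple:
    # Returns (preferred_url, thumbnail_url); both are None without sources.
    sources = image.get("source") or []
    if not sources:
        return None, None
    # First entry: assumed preferred image. Last entry: thumbnail fallback.
    return sources[0]["url"], sources[-1]["url"]
```

When there is only one source, both values are the same URL, so the fallback logic needs no special case.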
+ /api/v1/videos
The "time" parameter for videos may be set to "_LIVE". For live
streams, the amount of people currently watching is passed in
"views".
+ /api/v1/news
Just make a request to "/api/v1/news?s=elon+musk". The payload
has nothing special about it and is very self explanatory, just like
the endpoint above.
+ /favicon
Get the favicon for a website. The only parameter is "s", and must
include the protocol.
Example ::
/favicon?s=https://lolcat.ca
If we had to revert to using Google's favicon cache, it will throw
an error in the X-Error header field. If Google's favicon cache
also failed to return an image, or if you're too retarded to specify
a valid domain name, a default placeholder image will be returned
alongside the "404" HTTP error code.
+ /proxy
Get a proxied image. Useful if you don't want to leak your user's IP
address. The parameters are "i" for the image link and "s" for the
size.
Acceptable "s" parameters:
portrait 90x160
landscape 160x90
square 90x90
thumb 236x180
cover 207x270
original <Original resolution>
You can also omit the "s" parameter if you wish to view the
original image. When an error occurs, an "X-Error" header field
is set.
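A proxied image URL can be assembled as in this Python sketch. The `base_url` argument and the `proxy_url` name are assumptions for the example; the parameter names and accepted sizes are the ones listed above.

```python
from urllib.parse import urlencode

# Size names accepted by the "s" parameter, as listed above.
PROXY_SIZES = {"portrait", "landscape", "square", "thumb", "cover", "original"}


def proxy_url(base_url: str, image_url: str, size: str = None) -> str:
    params = {"i": image_url}
    if size is not None:
        if size not in PROXY_SIZES:
            raise ValueError(f"unknown proxy size: {size}")
        params["s"] = size
    # urlencode percent-encodes the image link so it survives as one value.
    return f"{base_url}/proxy?{urlencode(params)}"
```

Leaving `size` unset omits the "s" parameter, which requests the original resolution.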
+ /audio
Get a proxied audio file. Does not support "Range" headers, as it's
only used to proxy small files.
The parameter is "s" for the audio link.
+ Appendix
If you have any questions or need clarifications, please send an
email my way to will at lolcat.ca

10
api/index.php 100644

@ -0,0 +1,10 @@
<?php
header("Content-Type: application/json");
http_response_code(404);
echo json_encode(
[
"status" => "Unknown endpoint"
]
);

25
api/v1/images.php 100644

@ -0,0 +1,25 @@
<?php
header("Content-Type: application/json");
chdir("../../");
include "lib/frontend.php";
$frontend = new frontend();
[$scraper, $filters] = $frontend->getscraperfilters(
"images",
isset($_GET["scraper"]) ? $_GET["scraper"] : null
);
$get = $frontend->parsegetfilters($_GET, $filters);
try{
echo json_encode(
$scraper->image($get)
);
}catch(Exception $e){
echo json_encode(["status" => $e->getMessage()]);
}

10
api/v1/index.php 100644

@ -0,0 +1,10 @@
<?php
header("Content-Type: application/json");
http_response_code(404);
echo json_encode(
[
"status" => "Unknown endpoint"
]
);

25
api/v1/news.php 100644

@ -0,0 +1,25 @@
<?php
header("Content-Type: application/json");
chdir("../../");
include "lib/frontend.php";
$frontend = new frontend();
[$scraper, $filters] = $frontend->getscraperfilters(
"news",
isset($_GET["scraper"]) ? $_GET["scraper"] : null
);
$get = $frontend->parsegetfilters($_GET, $filters);
try{
echo json_encode(
$scraper->news($get)
);
}catch(Exception $e){
echo json_encode(["status" => $e->getMessage()]);
}

25
api/v1/videos.php 100644

@ -0,0 +1,25 @@
<?php
header("Content-Type: application/json");
chdir("../../");
include "lib/frontend.php";
$frontend = new frontend();
[$scraper, $filters] = $frontend->getscraperfilters(
"videos",
isset($_GET["scraper"]) ? $_GET["scraper"] : null
);
$get = $frontend->parsegetfilters($_GET, $filters);
try{
echo json_encode(
$scraper->video($get)
);
}catch(Exception $e){
echo json_encode(["status" => $e->getMessage()]);
}

30
api/v1/web.php 100644

@ -0,0 +1,30 @@
<?php
header("Content-Type: application/json");
chdir("../../");
include "lib/frontend.php";
$frontend = new frontend();
[$scraper, $filters] = $frontend->getscraperfilters(
"web",
isset($_GET["scraper"]) ? $_GET["scraper"] : null
);
$get = $frontend->parsegetfilters($_GET, $filters);
if(!isset($_GET["extendedsearch"])){
$get["extendedsearch"] = "no";
}
try{
echo json_encode(
$scraper->web($get)
);
}catch(Exception $e){
echo json_encode(["status" => $e->getMessage()]);
}

19
audio.php 100644

@ -0,0 +1,19 @@
<?php
if(!isset($_GET["s"])){
http_response_code(404);
header("X-Error: No SOUND(s) provided!");
die();
}
include "lib/curlproxy.php";
$proxy = new proxy();
try{
$proxy->stream_linear_audio($_GET["s"]);
}catch(Exception $error){
header("X-Error: " . $error->getMessage());
}

BIN
banner/aves.png 100644
Binary file not shown. Size: 84 KiB

BIN
banner/aves_2.png 100644
Binary file not shown. Size: 51 KiB

Binary file not shown. Size: 42 KiB

Binary file not shown. Size: 9.4 KiB

Binary file not shown. Size: 6.5 KiB

Binary file not shown. Size: 6.8 KiB

Binary file not shown. Size: 6.1 KiB

BIN
banner/deek.png 100644
Binary file not shown. Size: 1.9 KiB

BIN
banner/deekchat.gif 100644
Binary file not shown. Size: 3.6 KiB

BIN
banner/eagle.png 100644
Binary file not shown. Size: 17 KiB

BIN
banner/eagle2.png 100644
Binary file not shown. Size: 5.1 KiB

BIN
banner/eagle3.jpg 100644
Binary file not shown. Size: 11 KiB

BIN
banner/eddd_1.png 100644
Binary file not shown. Size: 70 KiB

BIN
banner/eddd_2.png 100644
Binary file not shown. Size: 33 KiB

BIN
banner/eddd_3.png 100644
Binary file not shown. Size: 51 KiB

BIN
banner/gnuwu.png 100644
Binary file not shown. Size: 22 KiB

BIN
banner/gnuwu_2.png 100644
Binary file not shown. Size: 46 KiB

BIN
banner/horse.png 100644
Binary file not shown. Size: 68 KiB

BIN
banner/linucks.jpg 100644
Binary file not shown. Size: 62 KiB

Binary file not shown. Size: 66 KiB

BIN
banner/sec.png 100644
Binary file not shown. Size: 59 KiB

Binary file not shown. Size: 34 KiB

BIN
favicon.ico 100644
Binary file not shown. Size: 393 B

362
favicon.php 100644

@ -0,0 +1,362 @@
<?php
if(!isset($_GET["s"])){
header("X-Error: Missing parameter (s)ite");
die();
}
new favicon($_GET["s"]);
class favicon{
public function __construct($url){
header("Content-Type: image/png");
if(substr_count($url, "/") !== 2){
header("X-Error: Only provide the protocol and domain");
$this->defaulticon();
}
$filename = str_replace(["https://", "http://"], "", $url);
header("Content-Disposition: inline; filename=\"{$filename}.png\"");
include "lib/curlproxy.php";
$this->proxy = new proxy(false);
$this->filename = parse_url($url, PHP_URL_HOST);
/*
Check if we have the favicon stored locally
*/
if(file_exists("icons/" . $filename . ".png")){
$handle = fopen("icons/" . $filename . ".png", "r");
echo fread($handle, filesize("icons/" . $filename . ".png"));
fclose($handle);
return;
}
/*
Scrape html
*/
try{
$payload = $this->proxy->get($url, $this->proxy::req_web, true);
}catch(Exception $error){
header("X-Error: Could not fetch HTML (" . $error->getMessage() . ")");
$this->favicon404();
}
//$payload["body"] = '<link rel="manifest" id="MANIFEST_LINK" href="/data/manifest/" crossorigin="use-credentials" />';
// get link tags
preg_match_all(
'/< *link +(.*)[\/]?>/Uixs',
$payload["body"],
$linktags
);
/*
Get relevant tags
*/
$linktags = $linktags[1];
$attributes = [];
/*
header("Content-Type: text/plain");
print_r($linktags);
print_r($payload);
die();*/
for($i=0; $i<count($linktags); $i++){
// get attributes
preg_match_all(
'/([A-Za-z0-9]+) *= *("[^"]*"|[^" ]+)/s',
$linktags[$i],
$tags
);
for($k=0; $k<count($tags[1]); $k++){
$attributes[$i][] = [
"name" => $tags[1][$k],
"value" => trim($tags[2][$k], "\" \n\r\t\v\x00")
];
}
}
unset($payload);
unset($linktags);
$href = [];
// filter out the tags we want
foreach($attributes as &$group){
$tmp_href = null;
$tmp_rel = null;
$badtype = false;
foreach($group as &$attribute){
switch($attribute["name"]){
case "rel":
$attribute["value"] = strtolower($attribute["value"]);
if(
(
$attribute["value"] == "icon" ||
$attribute["value"] == "manifest" ||
$attribute["value"] == "shortcut icon" ||
$attribute["value"] == "apple-touch-icon" ||
$attribute["value"] == "mask-icon"
) === false
){
break;
}
$tmp_rel = $attribute["value"];
break;
case "type":
$attribute["value"] = explode("/", $attribute["value"], 2);
if(strtolower($attribute["value"][0]) != "image"){
$badtype = true;
break;
}
break;
case "href":
// must not contain invalid characters
// must not be empty
if(
filter_var($attribute["value"], FILTER_SANITIZE_URL) == $attribute["value"] &&
strlen($attribute["value"]) > 0
){
$tmp_href = $attribute["value"];
break;
}
break;
}
}
if(
$badtype === false &&
$tmp_rel !== null &&
$tmp_href !== null
){
$href[$tmp_rel] = $tmp_href;
}
}
/*
Priority list
*/
/*
header("Content-Type: text/plain");
print_r($href);
die();*/
if(isset($href["icon"])){ $href = $href["icon"]; }
elseif(isset($href["apple-touch-icon"])){ $href = $href["apple-touch-icon"]; }
elseif(isset($href["manifest"])){
// attempt to parse manifest, but fallback to []
$href = $this->parsemanifest($href["manifest"], $url);
}
if(is_array($href)){
if(isset($href["mask-icon"])){ $href = $href["mask-icon"]; }
elseif(isset($href["shortcut icon"])){ $href = $href["shortcut icon"]; }
else{
$href = "/favicon.ico";
}
}
$href = $this->proxy->getabsoluteurl($href, $url);
/*
header("Content-type: text/plain");
echo $href;
die();*/
/*
Download the favicon
*/
//$href = "https://git.lolcat.ca/assets/img/logo.svg";
try{
$payload =
$this->proxy->get(
$href,
$this->proxy::req_image,
true,
$url
);
}catch(Exception $error){
header("X-Error: Could not fetch the favicon (" . $error->getMessage() . ")");
$this->favicon404();
}
/*
Parse the file format
*/
$image = null;
$format = $this->proxy->getimageformat($payload, $image);
/*
Convert the image
*/
try{
/*
@todo: fix issues with avif+transparency
maybe using GD as fallback?
*/
if($format !== false){
$image->setFormat($format);
}
$image->setBackgroundColor(new ImagickPixel("transparent"));
$image->readImageBlob($payload["body"]);
$image->resizeImage(16, 16, imagick::FILTER_LANCZOS, 1);
$image->setFormat("png");
$image = $image->getImageBlob();
// save favicon
$handle = fopen("icons/" . $this->filename . ".png", "w");
fwrite($handle, $image, strlen($image));
fclose($handle);
echo $image;
}catch(ImagickException $error){
header("X-Error: Could not convert the favicon: (" . $error->getMessage() . ")");
$this->favicon404();
}
return;
}
private function parsemanifest($href, $url){
if(
// check if base64-encoded JSON manifest
preg_match(
'/^data:application\/json;base64,([A-Za-z0-9+\/=]*)$/', // base64 alphabet also includes "+" and "/"
$href,
$json
)
){
$json = base64_decode($json[1]);
if($json === false){
// could not decode the manifest regex
return [];
}
}else{
try{
$json =
$this->proxy->get(
$this->proxy->getabsoluteurl($href, $url),
$this->proxy::req_web,
false,
$url
);
$json = $json["body"];
}catch(Exception $error){
// could not fetch the manifest
return [];
}
}
$json = json_decode($json, true);
if($json === null){
// manifest did not return valid json
return [];
}
if(
isset($json["start_url"]) &&
$this->proxy->validateurl($json["start_url"])
){
$url = $json["start_url"];
}
if(!isset($json["icons"][0]["src"])){
// manifest does not contain a path to the favicon
return [];
}
// hooray, return the favicon path
return $json["icons"][0]["src"];
}
private function favicon404(){
// fallback to google favicons
// ... probably blocked by cuckflare
try{
$image =
$this->proxy->get(
"https://t0.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=http://{$this->filename}&size=16",
$this->proxy::req_image
);
}catch(Exception $error){
$this->defaulticon();
}
// write favicon from google
$handle = fopen("icons/" . $this->filename . ".png", "w");
fwrite($handle, $image["body"], strlen($image["body"]));
fclose($handle);
echo $image["body"];
die();
}
private function defaulticon(){
// give 404 and fuck off
http_response_code(404);
$handle = fopen("lib/favicon404.png", "r");
echo fread($handle, filesize("lib/favicon404.png"));
fclose($handle);
die();
}
}

BIN
icons/lolcat.ca.png 100644
Binary file not shown. Size: 1.4 KiB

99
images.php 100644

@ -0,0 +1,99 @@
<?php
/*
Initialize random shit
*/
include "lib/frontend.php";
$frontend = new frontend();
[$scraper, $filters] = $frontend->getscraperfilters("images");
$get = $frontend->parsegetfilters($_GET, $filters);
$frontend->loadheader(
$get,
$filters,
"images"
);
$payload = [
"images" => "",
"nextpage" => ""
];
try{
$results = $scraper->image($get);
}catch(Exception $error){
echo
$frontend->drawerror(
"Shit",
'This scraper returned an error:' .
'<div class="code">' . htmlspecialchars($error->getMessage()) . '</div>' .
'Things you can try:' .
'<ul>' .
'<li>Use a different scraper</li>' .
'<li>Remove keywords that could cause errors</li>' .
'<li>Use another 4get instance</li>' .
'</ul><br>' .
'If the error persists, please <a href="/about">contact the administrator</a>.'
);
die();
}
if(count($results["image"]) === 0){
$payload["images"] =
'<div class="infobox">' .
"<h1>Nobody here but us chickens!</h1>" .
'Have you tried:' .
'<ul>' .
'<li>Using a different scraper</li>' .
'<li>Using fewer keywords</li>' .
'<li>Defining broader filters (Is NSFW turned off?)</li>' .
'</ul>' .
'</div>';
}
foreach($results["image"] as $image){
$domain = htmlspecialchars(parse_url($image["url"], PHP_URL_HOST));
$c = count($image["source"]) - 1;
if(
preg_match(
'/^data:/',
$image["source"][$c]["url"]
)
){
$src = htmlspecialchars($image["source"][$c]["url"]);
}else{
$src = "/proxy?i=" . urlencode($image["source"][$c]["url"]) . "&s=thumb";
}
$payload["images"] .=
'<div class="image-wrapper" title="' . htmlspecialchars($image["title"]) .'" data-json="' . htmlspecialchars(json_encode($image["source"])) . '">' .
'<div class="image">' .
'<a href="' . htmlspecialchars($image["source"][0]["url"]) . '" rel="noreferrer nofollow" class="thumb">' .
'<img src="' . $src . '" alt="thumbnail">' .
'<div class="duration">' . $image["source"][0]["width"] . 'x' . $image["source"][0]["height"] . '</div>' .
'</a>' .
'<a href="' . htmlspecialchars($image["url"]) . '" rel="noreferrer nofollow">' .
'<div class="title">' . htmlspecialchars($domain) . '</div>' .
'<div class="description">' . $frontend->highlighttext($get["s"], $image["title"]) . '</div>' .
'</a>' .
'</div>' .
'</div>';
}
if($results["npt"] !== null){
$payload["nextpage"] =
'<a href="' . $frontend->htmlnextpage($get, $results["npt"], "images") . '" class="nextpage img">Next page &gt;</a>';
}
echo $frontend->load("images.html", $payload);

14
index.php 100644

@ -0,0 +1,14 @@
<?php
include "lib/frontend.php";
$frontend = new frontend();
$images = glob("banner/*");
echo $frontend->load(
"home.html",
[
"body_class" => $frontend->getthemeclass(false),
"banner" => $images[rand(0, count($images) - 1)]
]
);


@ -0,0 +1,144 @@
<?php
// https://www.bing.com/search?q=url%3Ahttps%3A%2F%2Flolcat.ca
// https://cc.bingj.com/cache.aspx?q=url%3ahttps%3a%2f%2flolcat.ca&d=4769685974291356&mkt=en-CA&setlang=en-US&w=tEsWuE7HW3Z5AIPQMVkDH4WaotS4LrK-
// <div class="b_attribution" u="0N|5119|4769685974291356|tEsWuE7HW3Z5AIPQMVkDH4WaotS4LrK-" tabindex="0">
new bingcache();
class bingcache{
public function __construct(){
if(
!isset($_GET["s"]) ||
$this->validate_url($_GET["s"]) === false
){
$this->do404("Please provide a valid URL.");
}
$url = $_GET["s"];
$curlproc = curl_init();
curl_setopt(
$curlproc,
CURLOPT_URL,
"https://www.bing.com/search?q=url%3A" .
urlencode($url)
);
curl_setopt($curlproc, CURLOPT_ENCODING, ""); // default encoding
curl_setopt(
$curlproc,
CURLOPT_HTTPHEADER,
["User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0",
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language: en-US,en;q=0.5",
"Accept-Encoding: gzip",
"DNT: 1",
"Connection: keep-alive",
"Upgrade-Insecure-Requests: 1",
"Sec-Fetch-Dest: document",
"Sec-Fetch-Mode: navigate",
"Sec-Fetch-Site: none",
"Sec-Fetch-User: ?1"]
);
curl_setopt($curlproc, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curlproc, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($curlproc, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($curlproc, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($curlproc);
if(curl_errno($curlproc)){
$this->do404("Failed to connect to bing servers. Please try again later.");
}
curl_close($curlproc);
preg_match(
'/<div class="b_attribution" u="(.*)" tabindex="0">/',
$data,
$keys
);
print_r($keys);
if(count($keys) === 0){
$this->do404("Bing has not archived this URL.");
}
$keys = explode("|", $keys[1]);
$count = count($keys);
//header("Location: https://cc.bingj.com/cache.aspx?d=" . $keys[$count - 2] . "&w=" . $keys[$count - 1]);
echo("Location: https://cc.bingj.com/cache.aspx?d=" . $keys[$count - 2] . "&w=" . $keys[$count - 1]);
}
public function do404($text){
include "lib/frontend.php";
$frontend = new frontend();
echo
$frontend->load(
"error.html",
[
"title" => "Shit",
"text" => $text
]
);
die();
}
public function validate_url($url){
$url_parts = parse_url($url);
// check if required parts are there
if(
!isset($url_parts["scheme"]) ||
!(
$url_parts["scheme"] == "http" ||
$url_parts["scheme"] == "https"
) ||
!isset($url_parts["host"])
){
return false;
}
if(
// if its not an RFC-valid URL
!filter_var($url, FILTER_VALIDATE_URL)
){
return false;
}
$ip =
str_replace(
["[", "]"], // handle ipv6
"",
$url_parts["host"]
);
// if its not an IP
if(!filter_var($ip, FILTER_VALIDATE_IP)){
// resolve domain's IP
$ip = gethostbyname($url_parts["host"] . ".");
}
// check if its localhost
return filter_var(
$ip,
FILTER_VALIDATE_IP, FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
);
}
}

BIN
lib/classic.png 100644
Binary file not shown. Size: 7.4 KiB

652
lib/curlproxy.php 100644

@ -0,0 +1,652 @@
<?php
class proxy{
public const req_web = 0;
public const req_image = 1;
public function __construct($cache = true){
$this->cache = $cache;
}
public function do404(){
http_response_code(404);
header("Content-Type: image/png");
$handle = fopen("lib/img404.png", "r");
echo fread($handle, filesize("lib/img404.png"));
fclose($handle);
die();
return;
}
public function getabsoluteurl($path, $relative){
if($this->validateurl($path)){
return $path;
}
if(substr($path, 0, 2) == "//"){
return "https:" . $path;
}
$url = null;
$relative = parse_url($relative);
$url = $relative["scheme"] . "://";
if(
isset($relative["user"]) &&
isset($relative["pass"])
){
$url .= $relative["user"] . ":" . $relative["pass"] . "@";
}
$url .= $relative["host"];
if(isset($relative["path"])){
$relative["path"] = explode(
"/",
$relative["path"]
);
unset($relative["path"][count($relative["path"]) - 1]);
$relative["path"] = implode("/", $relative["path"]);
$url .= $relative["path"];
}
if(