Search Sphider


DaGoN
12-06-2005, 01:30 PM
Hi to all,
after three days of work, I've implemented the Sphider engine v1.2.7c in Subdreamer.
Sphider is a freeware, professional website search engine like Google (I've styled it like Google :) ).

More info here: http://www.cs.ioc.ee/~ando/sphider/
Please read the Sphider manual before using it.

If you discover any bugs, please post them here.

Download (http://www.subdreamer.org/forum/index.php?act=Attach&type=post&id=224)


ps: Thanx to Simplicity for the idea.


Features

Spidering

* Can index both static and dynamic pages.
* Finds links in <a href=...>, <frame ...>, <area ...> and <meta ...> tags, and can also follow links given in javascript as strings via window.location and window.open.
* Respects robots.txt protocol.
* Follows server side redirections.
* Allows spidering to be limited by depth (ie maximum number of clicks from the starting page), by (sub)domain or by directory.
* Supports indexing of pdf and doc files (using external binaries for file conversion).
* Allows resuming paused spidering.

Indexing

* Full text indexing.
* Possibility to exclude common words from being indexed.
* Option to define your custom page ranking function, which can depend on the number of times a given word occurs in the webpage, whether the word occurs in the domain name, path, or title of the document, and also the relative "deepness" of the url (so that the same page in www.domain.com/ is ranked higher than in www.domain.com/dir1/dir2/foo.html).
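
A custom ranking function of the kind described above could look roughly like this. It is only a sketch: the function name and all weights are made up for illustration and are not taken from Sphider.

```php
<?php
// Hypothetical scoring sketch; the name rank_page() and the weights are
// illustrative assumptions, not values taken from Sphider.
function rank_page(string $word, string $url, string $title, string $body): float
{
    $word  = strtolower($word);
    $parts = parse_url($url);
    $host  = strtolower($parts['host'] ?? '');
    $path  = strtolower($parts['path'] ?? '/');

    // Number of times the word occurs in the page text (substring count).
    $score = (float) substr_count(strtolower($body), $word);

    // Bonuses when the word appears in the domain name, path or title.
    if (strpos($host, $word) !== false) { $score += 10.0; }
    if (strpos($path, $word) !== false) { $score += 5.0; }
    if (strpos(strtolower($title), $word) !== false) { $score += 8.0; }

    // "Deepness" = number of non-empty path segments; deeper pages rank lower.
    $depth = count(array_filter(explode('/', $path), 'strlen'));

    return $score / (1 + $depth);
}
```

With these weights, the same content served from www.domain.com/ scores higher than from www.domain.com/dir1/dir2/foo.html, because the score is divided by the path depth.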

Searching

* Uses the AND operator by default: if more than one query word is used, it finds pages that include all the query words.
* Supports phrase searching.
* Supports excluding words (by putting a '-' in front of a word, any page including the word will be omitted from the results).
* Option to add and group sites into categories.
* Possibility to limit searching to a given category and its subcategories.
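
To make the query rules above concrete, here is a minimal sketch of how AND matching, quoted phrases and '-' exclusions might be handled. The function names and the returned array layout are assumptions for this example, and it matches plain substrings in a text rather than looking words up in Sphider's index.

```php
<?php
// Illustrative parser: splits a query into required words, quoted phrases
// and excluded words. Names and return shape are assumptions for this sketch.
function parse_query(string $query): array
{
    $result = ['include' => [], 'phrases' => [], 'exclude' => []];

    // Pull out "quoted phrases" first and blank them out of the query.
    $rest = preg_replace_callback('/"([^"]+)"/', function ($m) use (&$result) {
        $result['phrases'][] = strtolower(trim($m[1]));
        return ' ';
    }, $query);

    // Remaining tokens: a leading '-' marks an excluded word.
    foreach (preg_split('/\s+/', trim($rest), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if ($token[0] === '-' && strlen($token) > 1) {
            $result['exclude'][] = strtolower(substr($token, 1));
        } else {
            $result['include'][] = strtolower($token);
        }
    }
    return $result;
}

// A page matches when it contains every word and phrase (AND) and no excluded word.
// Note: this uses substring matching, which is cruder than real word matching.
function page_matches(string $text, array $q): bool
{
    $text = strtolower($text);
    foreach (array_merge($q['include'], $q['phrases']) as $needle) {
        if (strpos($text, $needle) === false) { return false; }
    }
    foreach ($q['exclude'] as $needle) {
        if (strpos($text, $needle) !== false) { return false; }
    }
    return true;
}
```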

Size and speed

Sphider uses regular expressions to extract links from webpages, so indexing is not particularly fast. Searching is quite fast if the database size is reasonable. Sphider itself is very small: its source code is around 100kb in size, probably making it the smallest search engine with such functionality out there.
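
As a rough idea of what regex-based link extraction looks like, here is a small sketch. The pattern is a simplification written for this example (it only handles href attributes in a and area tags, not frame or meta tags) and is not Sphider's actual expression.

```php
<?php
// Minimal regex-based link extraction; the pattern is a simplification
// written for this sketch and is not Sphider's actual expression.
function extract_links(string $html): array
{
    // href value may be double-quoted, single-quoted or unquoted.
    $pattern = '/<(?:a|area)\b[^>]*\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Exactly one of the three capture groups holds the URL.
        foreach ([1, 2, 3] as $i) {
            if (isset($m[$i]) && $m[$i] !== '') {
                $links[] = $m[$i];
                break;
            }
        }
    }
    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($links));
}
```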


Byez,
DaGoN

abcohen
12-06-2005, 03:16 PM
This looks great... what a good tool this will be for subdreamer...
When you think its ready please submit it to the download manager.

IGGY
12-06-2005, 06:53 PM
Wow, I was just reading about this just the other day. I was going to try and implement this into my site since the google crawling thing was not working.

It's awesome that you decided to make this a plugin. Very cool.

jblackburn
12-07-2005, 02:51 PM
Nice, but doesn't work on my site (a private extranet). I don't see anything in the manual that sticks out. I did allow it in my robots.txt file as normally the extranet would be excluded from public spiders.

The spider kicks off but freezes with "spidering." I have a lot of content on the site, including a vb3.5 forum. I'll try excluding the forum and see if that helps.

DaGoN
12-07-2005, 08:48 PM
Click "Clean tables" and run everything: Clean keywords, Clean links, Clear temp tables, Clear search log.
Remove all sites in the list.
Now, if you want to index your Subdreamer site, add your URL like this:
www.yoursite.com/index.php?categoryid=1
and NOT
www.yoursite.com

I've tested it with a big site and it works fine... Try disabling .htaccess files (if you have any) while spidering your site.


ps: I've fixed some design bugs and a few minor bugs.

jblackburn, please report whether it works this way...

Byez,
DaGoN

jblackburn
12-08-2005, 03:22 AM
Hi
DaGoN,

It still did not work. I tried with and w/o disabling .htaccess, and with your suggestion of www.yoursite.com/index.php?categoryid=1 as well as www.yoursite.com/Home.

Do you know the actual name of the spider? In robots.txt I allowed the following, which should allow it to spider (if the agent name is correct):

User-Agent: Sphider
Disallow:

Jim

DaGoN
12-08-2005, 07:33 PM
Hi,
I've updated the spider package to 1.3b.

Remove the old version (uninstall it from the Subdreamer admin panel) and install this new version (overwrite all files).

Byez,
DaGoN

jblackburn
12-08-2005, 11:15 PM
Hi DaGoN,

First, don't forget to bump your version number, it still shows 1.0 :)

Well, it is not crashing now, but it is stopping after "1 site and 2 links." Do you think the .htaccess file still causes problems? I can try disabling that temporarily.

DaGoN
12-09-2005, 07:48 AM
@First, don't forget to bump your version number, it still shows 1.0
Yes, my plugin version is still v1.0; the Sphider engine has been updated to v1.3b. :)

@Do you think the .htaccess file still causes problems? I can try disabling that temporarily.
Perhaps... run that test, and if it doesn't work, try installing Sphider v1.3b from here (http://www.cs.ioc.ee/~ando/sphider/download.php) and try again with Sphider alone.
I would like to know whether this problem comes from my integration or from Sphider itself.

Thanks,
DaGoN

jblackburn
12-10-2005, 03:02 AM
Hi DaGoN,

First I want to say that it is VERY cool of you to be working on this for the SD community. I've actually purchased an external program, Wrensoft's Zoom for our site, but this program seems to also have a lot of nice features.

Here's my preliminary report on loading the program directly as a test:

I think most problems are related to permissions on the server. It basically worked from loading the program directly (but.... I'll elaborate below).

The tmp and log directories need to be writeable; not sure if these need to be chmod 777 or ???

conf.php needs to be at least 666

Even with all this, it did not spider all directories. It also gave an error:

SELECT * FROM categories WHERE parent_num=0 ORDER BY category
Unknown column 'parent_num' in 'where clause'

...the very first time hitting search.php. A second click removed this error.

Finally, which is my problem, IT WILL NOT work on a password protected site. This was on their forum:

"Currently the script does not support indexing password protected sites.
It is considered for the future, but its not on the top of the priorities list."

Sorry for the long-winded report!

Jim

Terminator1138
12-10-2005, 06:52 AM
does one need to install this first or is it all contained within the plugin files...

jblackburn
12-10-2005, 04:28 PM
Everything you need is in the plugin files.

Terminator1138
12-12-2005, 02:12 PM
No database to install either? I thought I read on the original Sphider site that they mention creating a db?

Thanks so far, thinking this will be very good for large sites etc. Getting ready to test soon.

DaGoN
12-12-2005, 09:03 PM
@jblackburn
First I want to say that it is VERY cool of you to be working on this for the SD community.
- I like Subdreamer... it's a very good CMS ;).

Thanks so much for your time, jblackburn.

I've implemented Sphider 1.3 RC2, but I'll wait a few hours before releasing it (I'm testing it).

Byez,
DaGoN

DaGoN
12-13-2005, 08:50 AM
I've updated my plugin to Sphider v1.3 RC2

Changes from author:
Indexing words with more than 30 characters does not produce a "duplicate entry" warning any more
Multiple searches with an apostrophe in keywords now possible
Bug with highlighting words with a '+' in front of them fixed

My changes:
Minor design bugs fixed.


Please report any bugs here.

ps: remove the plugin from the admin panel and reinstall it.


Byez,
DaGoN

72dpi
12-13-2005, 10:50 AM
Hi Dagon,

Awesome work mate.

One problem I am having (I haven't found this in the settings):
I have SEFs turned on. The links cause a "Page Not Found".

I am getting URLs that don't work in the "pagination", like:
http://mywebsite.com/About_Us&query=image&...arch=1&results= (http://mywebsite.com/About_Us&query=image&start=9&search=1&results=)

Any thoughts on how to fix this?

DaGoN
12-13-2005, 01:50 PM
mmm... strange.
What is "SEF's"?
Please send me your url...

Byez,
DaGoN

72dpi
12-13-2005, 03:37 PM
hey Mate,

just PM'd ya.
SEF (Search Engine Friendly) URLs use the rewrite to change the query string to a search-engine-friendly version.
Great for Google & spiders.
I will look into the mod (although I am sure u will beat me to it =)

also, not sure if it is just me, but the plugin is killing any other plugins below it in that table....

I have plugins in the nav table to the right, but they are fine. Strange.

Fully love this tho mate. Is very nice =)

DaGoN
12-13-2005, 06:34 PM
@SEF (Search Engine friendly) URL's use the rewrite to change the query string, to a Search engine friendly version.
Oops :unsure: . I think that it is a Sphider problem, but if u disable SEF it works fine... Sphider also doesn't support password-protected web sites.

@I have plugins in the nav table to the right, but they are fine. Strange.
Yes, I've fixed it. :)

Please download the newest version, and thanks for testing.

Byez,
DaGoN

72dpi
12-13-2005, 10:51 PM
Good work mate (U are fast!).

Is a shame about the SEF url rewrite, but I see your point. I guess it would be insanely hard to apply the rewrite, especially if you are pulling up thousands of searches.

Cheers!

Terminator1138
12-14-2005, 12:18 AM
hmm why does it not support the rewrite? Hmm, too bad, might have been a good use if was able to use SEF....perhaps I will turn them off. :)

DaGoN
12-14-2005, 07:39 AM
@72dpi, Terminator1138
I've tested SEF on my localhost and IT WORKS! Spidering is okay.
The problem is that some links don't work, like categories.

Can anyone test SEF and post the results here?

Thanx so much,
DaGoN

72dpi
12-14-2005, 08:19 AM
hey Mate,

yeah, links are perfect, but the problem is on the rewrite for the:

"Result page: 1 2 3 4 5 6 7 8 9 10 Next "

When using pagination, the URLs need to be rewritten similar to the other mods. I just recently updated a plugin which had a similar issue. I believe this can be done. Hopefully abcohen or Subduck or someone may find the time to ponder this.

I will try & look at the code, but am currently working on my style changer (which will be a tutorial here asap)

Great work mate, I Like your dedication =)

DaGoN
12-14-2005, 10:31 AM
Hi 72dpi,
Finally I've understood the problem :D, now the rewritten links work perfectly!

I have a bit of time to dedicate to Subdreamer, my job allowing... :)

Byez,
DaGoN

Terminator1138
12-14-2005, 01:34 PM
Nice! Great work there DaGoN.....

Will give it a full test.....

I think this is a wonder for large sites that have tons of content....

72dpi
12-14-2005, 02:45 PM
U champ,

Well done!

Push this up to subdreamer.com for sure!

72dpi
12-14-2005, 11:15 PM
I was having trouble with my layout being broken by the text string "$url2" being too long.

This is found in: includes > searchfuncs.php approx line 444

Replace:
$url2 = $url;
while (@eregi("[^\>](".$change.")[^\<]", $url2, $regs)) {
$url2 = eregi_replace($regs[1], "<b>".$regs[1]."</b>", $url2);

with:
// limit string output: shorten any space-separated part longer than 50 characters
$newstring = $url;
$message_parts = explode(' ', $newstring);
$new_part = array();
foreach ($message_parts as $k => $part) {
    if (strlen($part) > 50) {
        $part = substr($part, 0, 50).'...';
    }
    $new_part[$k] = $part;
}
$newstring = implode(' ', $new_part);
$url2 = $newstring;
while (@eregi("[^\>](".$change.")[^\<]", $url2, $regs)) {
$url2 = eregi_replace($regs[1], "<b>".$regs[1]."</b>", $url2);

This will limit the text to 50 characters & add "..." after it:
substr($part,0,50).'...';

I would love this to be included in the settings, but can't figure out how to do it (same as u can limit the main text).
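
One way the limit could be made configurable is to wrap the trimming in a small function that takes the maximum length as a parameter. This is just a sketch: the function name and the $max_length parameter (standing in for a plugin setting) are hypothetical.

```php
<?php
// Sketch of a configurable truncation helper; $max_length stands in for a
// hypothetical plugin setting and is not an existing option.
function trim_long_parts(string $text, int $max_length = 50): string
{
    $parts = explode(' ', $text);
    foreach ($parts as $k => $part) {
        // Only parts longer than the limit are cut and marked with "...".
        if (strlen($part) > $max_length) {
            $parts[$k] = substr($part, 0, $max_length) . '...';
        }
    }
    return implode(' ', $parts);
}
```

The displayed URL would then be produced with something like trim_long_parts($url, $limit), where $limit comes from wherever the setting is stored.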

Any chance of adding this?

I also changed mine to have a new style class added to the keywords found:
Includes > searchfuncs.php
Line 442:
Change:
$fulltxt = eregi_replace($regs[1], "<b>".$regs[1]."</b>", $fulltxt);
To:
$fulltxt = eregi_replace($regs[1], "<b><span class=\"keyword\">".$regs[1]."</span></b>", $fulltxt);

And add the following in:
css > search.css
add:

.keyword {
background-color:#F71700;
}

this adds a nice red background to the found text (I personally like it, but it's not necessary)



Seriously, thanks for your hard work on this one. it is FANTASTIC!

72dpi
12-15-2005, 01:13 AM
Hate to do this,
but there is a problem with multiple word searches when using SEFs

When viewing:
Result page: 1 2 3 4 5 6 7 8 9 10 Next

This works perfect:
http://www.mywebsite.com/About_Us/query/student/start/4/search/1/results/

This breaks it:
http://www.mywebsite.com/About_Us/query/student+photo/start/5/search/1/results/


You get:
The search "Student+photo" did not match any documents
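
The error message suggests the "+" is reaching the search as a literal character instead of a space. A sketch of decoding SEF path segments is below. The key/value pairing is inferred from the URLs quoted above and is an assumption, not the plugin's real code.

```php
<?php
// Sketch: turn a SEF path such as /About_Us/query/student+photo/start/5/search/1
// back into parameters. The pairing scheme is inferred from the URLs in this
// post and is an assumption, not the plugin's real code.
function parse_sef_path(string $path): array
{
    $segments = explode('/', trim($path, '/'));
    $params   = ['page' => array_shift($segments)];

    // The remaining segments are read as key/value pairs.
    for ($i = 0; $i < count($segments); $i += 2) {
        $key = $segments[$i];
        // urldecode() turns '+' and %20 into spaces; a missing value becomes ''.
        $params[$key] = urldecode($segments[$i + 1] ?? '');
    }
    return $params;
}
```

Note that rawurldecode() would leave the "+" in place, so the choice of decoding function matters for multi-word queries.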


Played around with admin settings, but it didn't change anything.

hmm, sooo close! Hope this helps U debug (although you are probably well over this with all your hard work already..... :(

DaGoN
12-15-2005, 02:39 PM
Hi 72dpi,
you have discovered some interesting bugs... :) nice work!

I've discovered a bug in the phrase search option; it's now been fixed.

New fixes and changes:

1) I've added a new parameter in the settings panel (Link Weight: trim link if it's too long). Thx to 72dpi for the idea
2) I've added keyword to the css style. Again thx to 72dpi for the idea
3) I've fixed a problem in SEFs
4) I've fixed a problem in the "phrase" search option... now it works fine
5) I've added the "Advanced search" phrase below the search form

Thanks again for testing the plugin,
DaGoN

72dpi
12-16-2005, 12:03 AM
DaGoN,
Well done,

thank you so much for implementing the changes.
you are a champ!

The search functionality works perfectly. Thanks for your hard work =)

Oraos
12-16-2005, 05:22 AM
Thanks for the great search engine!

I have one question/problem:

I've installed the search engine with special interest in indexing .pdf and .doc files through Subdreamer. I've successfully installed Sphider outside of Subdreamer and it indexed those files with no problem (I installed the required .exe binaries to support .pdf and .doc translation). Doing so through Subdreamer, however, seems to present a problem.

I add pdf and doc files using the "download manager" plug-in through Subdreamer. I cannot find a way to index .pdf or .doc files through the links that are created by the "download manager" when a file is uploaded in that manner. The links are references to the MySQL database (it doesn't make a difference if the files are set to be stored in the database or in the file system). The links look like this:

http://127.0.0.1/sha/plugins/p13_download_manager/getfile.php?categoryid=11&p13_sectionid=1&p13_fileid=10

When I add a link with the full path to the .pdf or .doc file using Subdreamer's link directory plug-in, Sphider is able to index their contents. Those links, of course, look like this:

http://127.0.0.1/sha/downloads/cpt.pdf

Does anyone have an idea how to index .pdf or .doc files using the download manager?

Thanks

DaGoN
12-16-2005, 09:35 AM
@Oraos
Does anyone have an idea how to index .pdf or .doc files using the download manager?
I've momentarily removed the possibility to index pdf, doc, xls and ppt.

I don't have any of these installed on my system, so please test whether my changes work.

New changes:
1) Added the possibility to index pdf, doc, xls and ppt (it only works with files stored in the file system; it doesn't work with files stored in MySQL)


Byez,
DaGoN

Oraos
12-16-2005, 06:27 PM
Thanks for the quick changes!

Now I can't seem to get Sphider to recognize the pdf or doc files, even through standard links which don't reference the database.

I see the changes you made where you can now input the path to the pdftotext.exe binary in the plug-in settings. The default values there assume a unix-type path, so I changed it to my Windows path of

C:\pdftotext\pdftotext.exe

I tried adding this value with no quotes, single quotes, and double quotes - none seem to work. What worked before in the Sphider settings was to have the following set in p107_settings.php and p107_dgn_search.php

//executable path to pdf converter
$pdftotext_path = 'C:\\pdftotext\\pdftotext.exe';

//executable path to doc converter
$catdoc_path = 'C:\\catdoc\\catdoc.exe';

//executable path to xls converter
$xls2csv_path = 'C:\\xlhtml\\xlhtml.exe';

//executable path to ppt converter
$catppt_path = 'C:\\ppthtml\\ppthtml.exe';


I'm running Subdreamer on Windows XP Pro with Apache installed.

What do you think can be done to fix this?

Thanks!

DaGoN
12-16-2005, 10:33 PM
@Oraos
Strange, it works

Download latest version of xpdf from here:

http://www.foolabs.com/xpdf/

My subdreamer root: d:\subdreamer
My pdf converter path: d:\pdf\pdftotext.exe

In the plugin settings, set 'Index pdf' to yes and in 'Executable path to pdf converter:' write: d:/pdf/pdftotext.exe

Then I copied a pdf file to d:\subdreamer\test.pdf

Go to 'add site', write http://127.0.0.1/test.pdf and press the Start Indexing button.
It works fine.

Byez,
DaGoN

Oraos
12-17-2005, 05:09 AM
I did what you said and when I index the pdf file, it says this

1. Retrieving: http://127.0.0.1/cpt.pdf at 01:05:21.
Size of page: 178.73kb. Starting indexing at 01:05:21. Page contains less than words
Links found: 0. New links: 0

The path to the pdftotext.exe file is C:\pdftotext\pdftotext.exe, which is set in the new section you added in the plug-in settings.

It worked before when I manually set the path to the pdftotext.exe program in the p107_dgn_search.php and p107_settings.php files.

//executable path to pdf converter
$pdftotext_path = 'C:\\pdftotext\\pdftotext.exe';

I suppose I will return to the older version of the plug-in with these manual changes if the only thing you changed in the most recent version of the plug-in were the plug-in settings options.

Strange that it isn't working. I'll keep working on it.

I assume you tested this on a Windows system too.

Thanks.

DaGoN
12-17-2005, 07:36 AM
Are you sure that in the field 'Executable path to pdf converter:' you have entered C:/pdftotext/pdftotext.exe and not C:\\pdftotext\\pdftotext.exe or C:\pdftotext\pdftotext.exe ?
Remember that you have to set 'Index pdf' to 'yes'.
Set spider/tmp permission to 777

I have not changed anything...

if it doesn't work, try changing these:
file p107_settings.php

replace
$index_pdf = $settings['Index pdf'];
with
$index_pdf = 1;
and
$pdftotext_path = $settings['Pdf converter'];
with
$pdftotext_path = 'C:/pdftotext/pdftotext.exe';

Byez,
DaGoN

Oraos
12-17-2005, 10:30 AM
I made sure of the above settings. I even installed another subdreamer site on the same machine and tested. I uninstalled the plug-in, reinstalled it - same problem.

I added the code changes to the .php file you suggested after installing the plug-in and no change.
I uninstalled the plug-in, added the code changes to the .php file you suggested, re-installed plug-in and no change:

11. Retrieving: http://127.0.0.1/sha/downloads/cpt.pdf at 05:58:31.
Size of page: 178.73kb. Starting indexing at 05:58:31. Page contains less than words
Links found: 0. New links: 0


I was looking at the p107_dgn_search.php file in your latest release and it reads this:

//path to pdf converter (including the file name itself)
$pdftotext_path = 'pdftotext.exe';

I then looked in the p107_settings.php file where it reads this:

//executable path to pdf converter
$pdftotext_path = $settings['Pdf converter'];

Should any changes need to be made to p107_dgn_search.php file as well?

In any case - I know it's not a problem with my pdftotext.exe file or location since it works when I manually set the path in your older version of the plug-in.

Here is what did work:

1. Install your latest plug-in.
2. Replace the p107_settings.php file with the one I have attached here (taken from your older plug-in version before this latest one with my added paths)
3. I do not touch the plug-in settings after it is installed in Subdreamer (they are left at default)
4. I add my site http://127.0.0.1/sha/ and index it.
5. And it successfully indexes a link to the pdf:

11. Retrieving: http://127.0.0.1/sha/downloads/cpt.pdf at 06:17:48.
Size of page: 178.73kb. Starting indexing at 06:17:48.
Indexed

Again, the exact modification I made to p107_settings.php is :

//executable path to pdf converter
$pdftotext_path = 'C:\pdftotext\pdftotext.exe';

I worry that these problems have to do with these forward and backward slashes used in the executable path (i.e. "/" versus "\"). You say that I need "/" slashes when a Windows path uses "\". Could this be the problem elsewhere in the code?
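
For what it's worth, PHP on Windows accepts forward slashes in file paths, so one defensive option is to normalize whatever was typed into the setting before using it. The sketch below assumes that approach. Both function names are hypothetical, and whether the stored setting value is actually altered somewhere in the plugin has not been established in this thread.

```php
<?php
// Sketch: accept a converter path typed with either slash style and report
// whether PHP can see the file. Both function names are hypothetical.
function normalize_converter_path(string $path): string
{
    // Strip surrounding whitespace and quotes a user may have typed into the setting.
    $path = trim($path, " \t\n\r\"'");
    // Collapse doubled backslashes, then switch to forward slashes.
    // (This simple version is not meant for UNC paths such as \\server\share.)
    $path = str_replace('\\\\', '\\', $path);
    return str_replace('\\', '/', $path);
}

function converter_available(string $path): bool
{
    // is_file() is false when the path does not point at an existing file.
    return is_file(normalize_converter_path($path));
}
```

Printing the result of converter_available() next to the raw setting value would show whether the path survives from the settings panel intact.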

I have no idea - I'm no programmer :D

Thanks for helping out!

Mark

DaGoN
12-17-2005, 03:49 PM
Very very strange... when you get the message 'Page contains less than words', it means that the spider can't find pdftotext.exe.

Please make another test:
1. Remove my plugin from Settings->plugin.
2. Delete the p107_dgn_search folder

Download this version: I've added a printout of the indexed file. If pdftotext.exe is found, you will see this in the spidering panel: (Debug)File -----> [file contents]. If pdftotext.exe is not found, you will see: (Debug)File -----> Starting indexing at 17:47:55[...]

Byez,
DaGoN

Oraos
12-17-2005, 07:40 PM
I'm not sure I understand where you want me to look (spidering panel?). Where is this (Debug)File -----> [file contents] found?

I followed step 1 and 2, downloaded the plugin you added in the previous post, and installed it.

This plug-in version you posted is like the older one - no settings for adding the executable path to the pdftotext.exe program in the plug-in settings. I left the .php files as they were.

When indexing, it only shows that the 127.0.0.1/sha/cpt.pdf "contains no text or html" instead of "page contains less than words".

But this is because the p107_settings.php file doesn't enable "//index pdf files" - it is set to "0". When I set it to "1" I get the usual "page contains less than words" output.

Do you want me to edit any of the files in that latest plug-in you attached?

Regards,

Oraos

Terminator1138
12-17-2005, 09:27 PM
Just to update, installed and so far no errors running SEF ....will report back more later

DaGoN
12-18-2005, 09:11 AM
I sent you an old version of Sphider... sorry. :D

Please download this version with my latest changes

Byez,
DaGoN

Oraos
12-18-2005, 07:25 PM
Ok - here is what I did:

1. uninstalled old plug-in, deleted the files.
2. downloaded your latest one and installed it
3. set the path to c:\pdftotext\pdftotext.exe and c:\catdoc\catdoc.exe (I also tried with "/" instead)
4. enabled both pdf and doc indexing in the plug-in settings
5. added my site and indexed it
6. here is the output for the pdf and doc files:


11. Retrieving: http://127.0.0.1/sha/downloads/cpt.pdf at 15:19:36.
Size of page: 178.73kb. (Debug)File -----> Starting indexing at 15:19:36. Page contains less than 10 words
Links found: 0. New links: 0
12. Retrieving: http://127.0.0.1/sha/downloads/report.doc at 15:19:36.
Size of page: 31.00kb. (Debug)File -----> Starting indexing at 15:19:39. Page contains less than 10 words
Links found: 0. New links: 0

Both these pages contain more than 10 words :P and index fine if I add the path manually like I've mentioned before.

Thanks again for trying to make this work :D

I bet it has something to do with the Windows environment I'm working in - are you also testing on a Windows machine? Just curious.

Mark

DaGoN
12-19-2005, 08:14 AM
Yes, I'm using Windows XP SP1, PHP version 4.3.10, MySQL version 4.1.9.
I've added two new debug prints.

/////////
echo "Pdf enable: " . $index_pdf . "
";
echo "Pdf path: " . $pdftotext_path;
//////////

Please send me the results, once when you set the parameters with the 'Search Settings' panel and once when you set the parameters manually.

ps: if both results are identical, there is no reason why it shouldn't work.

Here is the p107_settings.php file:

DaGoN
12-20-2005, 10:14 AM
Hi to all beta testers,
I've migrated the plugin id from 107 to 130. This was necessary because the plugin number I had chosen collides with the Radio.Blog SD popup button by oktam. I hadn't seen this before, sorry.
I think that the plugin is now stable. I have released a last beta version; tomorrow I'll release a final version and post it in the download manager.

I've tested indexing a pdf file on my machine and it works fine.
Sorry Oraos, but until there is more feedback I can't resolve your problem :(.

Byez,
DaGoN

ps: Remove the old plugin from the admin panel and delete the p107_dgn_search folder before installing this one.

here you go:

Oraos
12-20-2005, 12:39 PM
No problem DaGoN. I used your latest modification - and it displayed the path at the top of the screen as "c:\pdftotext\pdftotext.exe" with the same problems as before. I haven't had time to do further testing. But if it works on your Windows machine - then something must be messed up with mine :)

I'll keep trying, and thanks again for helping - it's a great plug-in that I'll use one way or another! With my manual hack it works - it's not a big deal.

Thanks,

Oraos

descds
01-18-2006, 11:59 AM
I've just installed this on my local test bed machine for evaluation. And hats off to the coder for an exceptional import. I'm very excited ...

I've set it all up and it's currently indexing away happily.

I allowed it full access to our forums currently (with no robots.txt rules in place, to test) and off it's going, indexing away.

Now my question is: we have 63000+ articles in the forums and obviously more added daily. How does this re-index a site? Obviously I do expect the first crawl to take some time, but how about a second?

Also, I am seeing only a manual update for updated crawls; can this not be run as a cronjob to automate it? I'm very interested to know how the plugin keeps a fresh search engine and whether there are any potential problems from crawling the forums as well as the rest of the content. It may, of course, be better for us to deny it on the forums but it would be great if we could have it in place to index the entire site.

Keep up the good work ...

DaGoN
01-19-2006, 10:31 AM
Hi descds,
my advice is to use Search Sphider to index everything except the forums, because all forums (vb, ipb etc.) already have a search engine.
My plugin is only an integration of Sphider.
For more information about it you can check Sphider's official site:
http://www.cs.ioc.ee/~ando/sphider/
I'm working on integrating into Sphider the possibility to search the forum board using the forum's own search engine.
See this discussion:
duplicate pages that get indexed
http://www.cs.ioc.ee/~ando/sphider/forum/b...try.php?id=1087 (http://www.cs.ioc.ee/~ando/sphider/forum/board_entry.php?id=1087)

Byez,
DaGoN