Fog Creek Software
Discussion Board




Is Postgres/MySQL ready for VLDB solutions?

Should we stay away from Postgres or MySQL for VLDB problems?

I have some questions for all you db experts, especially regarding postgres..

Basically our team have been doing scientific marketing, and the load is getting too much for our current configuration. We wish to take sales, web access data, email clickthrough, personal profile data from hundreds of data sources for a dozen clients... plug them into a cheap 5+ server offline db farm.. and be able pull reports (doesn't have to be OLAP.. but that would be a bonus) on large data sets that increases by the gigabytes per week.
The required dataset that make up the input of the regular reports we generate are becoming more than a hundred million rows.

We'll be using major analytic tools to pull from these databases so it's necessary it supports ODBC and friends.

We don't have the budget of large companies. Who can easily justify entire multi-million Siebel installations. And we can't afford wonderful platforms like Oracle on distributed Solaris or SQL Server enterprise on clustered Xeons.

I have heard that Postgres recently acquired the ability to do replication.. that's very exciting.. but will I be able to do selects over partitioned tables spread over multiple databases? I like to learn about what you guys know about free database's limits and near future developments that we can look forward to.

Is there anything in the OSS community that resembles EMC's drive farms?

Any aid or pointers would be much appreciated.

And many apologies to those who feel this thread is off-topic.

Database on the cheap
Thursday, September 25, 2003

From my research on Postgres, it's not quite ready for prime time, especially regarding data integrity.

I'll make a strong suggestion for checking out SQL Server - find out *exactly* what your needs are, and talk to Microsoft or Microsoft Consulting about what the costs would be.

If what you're doing is so very important, you MUST make an informed decision, not a vague "it's too expensive, let's look elsewhere" one.

Look at it another way - let's say that Postgres might be approaching where you need it to be. From my experience with cutting-edge Linux clusters, going to that place means you will be investing a LOT of time and manpower on managing your solution instead of using the results of that solution.

What is your time worth? How much time will be risked/invested/lost on finding out if Postgres is "where you need it to be"? Can you afford to spend a month building a Postgres solution to see *if* it works? And if it doesn't, who do you have on tap to help you out? What's the Postgres talent pool like? What's it like for bleeding-edge solutions?

Only you know how important your work is. Only you know how calendar-critical it is. Only you can budget your licensing and manpower dollars.

Therefore you should get quotes and make it an informed decision.

Philo

Philo
Thursday, September 25, 2003

Avoid postgres - it has good SQL support and transactions, but it is a dog for performance and it is not very good in production.  I've not heard that their replication is anything past experimental or beta stages.

MySQL is solid and good for small, simple database needs, and has good support, but it won't scale to your needs.

You might want to look at SapDB or Firebird - both free, and they have better scalability characteristics than PostGres or MySQL.

If you are running a database that large, you may find that you spend more money on hardware than you would spend on an Oracle license in order to get it running quickly and reliably.

Colin Evans
Thursday, September 25, 2003

Nah - even at those sizes the database licensing generally outstrips hardware costs pretty quickly. But regarding Oracle vs. SQL Server, don't forget that SQL Server runs 25-50% of the cost of Oracle, and that Oracle costs don't include any of the tools that SQL Server ships with.

Philo

Philo
Thursday, September 25, 2003

What about http://firebird.sourceforge.net ?

GiorgioG
Thursday, September 25, 2003

Postgres isn't a dog anymore as its seen significant speed improvements since turning v7.  It may be sufficient to meet your needs. 

One problem with asking IT people about these open source solutions is that they have a habit of getting better when you're not looking.  People still complain about MySQL not having referential integrity (well now it does with InnoDB and I think by default in v4).

The PostreSQL people have made a lot of changes in the last year and its worth asking their user groups a few pointed questions.  You might be surprised by how candid some of the lead devlopers are about their product's limitations.

Lou
Thursday, September 25, 2003

From the Firebird homepage:
"Firebird is a relational database offering many ANSI SQL-92 features"

Doesn't exactly inspire confidence, IMHO. More importantly, SQL-92 is the thing that lets you leverage experience across databases - that gives you a chance of getting an Oracle guy to help with MySQL or whatever. If you lose that, then you're in the realm of looking for "Firebird experts," which I suspect is a vanishingly small talent pool.

The point I've been alluding to, and I'll say it directly: license costs are often the cheapest part of a database installation. Your big problem is ongoing cost of operation. This isn't just how much you pay for a DBA, but how much does it cost you to, say, lose a month's data? Or to have queries take 10x as long to run? How about if the data is unavailable one day a week due to maintenance?

These are all costs, whether they're direct or indirect, and the impact of those costs are best evaluated by the original poster. If the cash savings up front means he's willing to have a database available 12/3 instead of 24/7, then that may be the way to go. I'm just helping the original poster understand what he's trying to bargain for, and suggesting he make darn sure he knows the costs he's bargaining against.

Philo

Philo
Thursday, September 25, 2003

Firebird is in fact InterBase.

software enthusiast
Thursday, September 25, 2003

If you all don't mind, I would like to add Firebird and SAPDB into this mix of the discussion.. (thanks to whoever pointed these two RDBMS out) will someone who is knowledible about these two additional databases please provide examples of replications and scalability?

Database on the cheap
Thursday, September 25, 2003

PostgreSQL is hardly slow.  In fact, if you are going to be doing a large number of INSERTs PostgreSQL will most likely provide much better performance than MySQL (and has in past tests I have performed).  As far as whether PostgreSQL is up to large tasks, the entire .org registry runs on PostgreSQL.

Anthony Rubin
Thursday, September 25, 2003

> As far as whether PostgreSQL is up to large tasks, the entire .org registry runs on PostgreSQL.

IMO, this is the kind of data you need: verifiable success stories (no, I haven't verified this one, but it should be easy enough to check).

The points about tool support, and the ability to find experienced DBAs are well taken; but if you can find a number of other shops that are operating successfully with the toolset that you use and the same data size, than you can be fairly sure that the tech is up to spec.

Portabella
Thursday, September 25, 2003

Here is the information about the .org database:

http://www.linuxworld.com.au/index.php?id=1166688643

Case studies are available here:

http://advocacy.postgresql.org/casestudies/

I also want to point out pgAdmin once again for those looking for a decent administration tool for Windows:

http://www.pgadmin.org/

Anthony Rubin
Thursday, September 25, 2003

I also wanted to point out that one option for replication in PostgreSQL is a well tested commercial product that was recently contributed to the project.  Here is the information:

http://archives.postgresql.org/pgsql-announce/2003-08/msg00022.php

The following paragraph from the announcement is probably the most relevant to this discussion:

"eRServer(c) has proven itself by supporting large mission critical systems, including the .INFO and .ORG registry management system." observes PostgreSQL, Inc.'s President Marc Fournier. "Version 1.0 was released commercially in October 2001 and today I am proud to
announce that an enhanced version of that source code is being released to the community under the BSD Open Source Software License.

Anthony Rubin
Thursday, September 25, 2003

While I'm not a huge MySQL fan I also wanted to point out that large portions of Yahoo, including historical financial data for finance.yahoo.com, rely on MySQL.  I really have to question any "research" performed by some of the people doubting MySQL and/or PostgreSQL in this thread.  Both RDBMS are already being used for very large projects, you just don't always hear about it.

Anthony Rubin
Thursday, September 25, 2003

Here is the information on my last claim:

http://www.mysql.com/press/user_stories/yahoo_finance.html

The following paragraph should be of interest:

Prior to its switch to MySQL in 2000, Yahoo! Finance managed its database needs through homegrown flat files and Berkeley DB databases. But those solutions proved to be inflexible and not scalable enough for their needs, with the volume of data nearly doubling every few months. "Our needs vary considerably, and that is one reason that MySQL works well with us. It handles the high-volume, 260 million record tables just as well as the small, low-volume ones. One of our servers handled over a quarter of a billion queries in a month-and-a-half, and it still has capacity to spare."

Anthony Rubin
Thursday, September 25, 2003

My research was simply surfing the mysql and postgre communities - I've found they're each amply capable of pointing out the others' shortcomings.

I'd be interested to do a costing analysis on Yahoo's finance solution - obviously there was coding work that went into making MySQL do what they needed - maybe the cost was break-even with implementing a major RDBMS, or maybe it cost more?

In addition, the original poster said he would like analysis services - OLAP cubes are in the box with SQL Server; again a matter of "spend a little money up front, save lots of time down the road"

I'm just suggesting a well-informed decision.

Philo

Philo
Thursday, September 25, 2003

Hi, a busy day at the office,

Just to follow up with some more details.

At the moment our team need a good data warehouse for primarily offline reporting of very large data sets, we do have other tools which handles OLAP tasks very well which can connect to most ODBC-compatible databases. Although we might create final reports using just sql queries, we are starting to use 3rd-party professional OLAP tools more and more to serve the need of analysis and reporting.

It's critical though that our future new database meet certain expectations. We depend on the database (farm?) to serve the required data in a timely manner (despite the large datasets and transformations) and in a form friendly to OLAP tools.

Database on the cheap
Thursday, September 25, 2003

Many thanks to everyone here (and all the readers) for your time and insights.

I hope our team can gather up even more information from the readers of Joel On Software before judging on the merits of each free RDBMS.

Database on the cheap
Thursday, September 25, 2003

One other note - this fall MS is going to roll out SQL Server Reporting Services, which will provide direct reporting from SQL Server.

Based on MS' recent products, it promises to be a pretty neat setup. You might want to try to get on the beta program or wait until it rolls to check it out, if reporting is a fairly important part of your business.

Philo

Philo
Thursday, September 25, 2003

PostgreSQL performance is a dog if you don't tune it. Basically, in order for it to run on stock unix kernels, it's set to use 4megs of shared memory. That's because it's the lowest common denominator available without kernel parameters being altered run-time (ie: in /proc for Linux) or compile time (most BSDs.)

I've had very good success with 7.2 on a dual p3-933 using a gigabyte of ram for cache-ing. In my situation, it was more then twice as fast as MySQL.

Spreading the tables and log files accross multiple volume is possible (by symlinking the files from the default location to a different volume, the files are only opened once so there's no drawbacks.) I found that my limiting factor on large data was the speed of my disk I/O (only an adaptec RAID-5 with 6 seagate cheetah 18GB disks, 80MB/sec sustained and half that on writes.)

Saruman
Thursday, September 25, 2003

Correct me if I'm wrong, but I don't believe Mysql has views.

Philo, for what Yahoo uses Mysql for it probably can't be beat.  I don't think they are doing much in the way of transactions.  Mysql is terribly fast for reading data which is why websites use it.  I know my own company might use it in the place of SQL Server to have available from the web.  Our strategy, to save money is load mysql from sqlserver thereby getting around the internet connection license.  Our website on Apache/Linux then queries the mysql database.  The database gets reloaded every night from SQL Server.  For us this looks cheaper and more secure than doing an IIS/SQLServer web solution.  We haven't fully decided yet, but are 80% sure this will be our direction

Our ERP software is backended by SQL Server and it performs very nicely and no I wouldn't trust Mysql to do that task.  If you want a quick database backend to a website mysql is the ticket, especially if you load it from an enterprise class db such as Oracle, SQL Server, DB2.  If on the other hand you want a system to do transactions pick something else over mysql.

Mike
Thursday, September 25, 2003

What are VLDB solutions?

is it Very Large DataBase?

Ossie
Thursday, September 25, 2003

Mike - SQL Server now has a per-processor license...

Philo

Philo
Thursday, September 25, 2003

Philo, is your mother on the SQL/Server design team? Your (increasingly) passionate advocacy for this product must have either family or religious roots!

HeWhoMustBeConfused
Thursday, September 25, 2003

"We wish to take sales, web access data, email clickthrough, personal profile data from hundreds of data sources for a dozen clients... plug them into a cheap 5+ server offline db farm.. and be able pull reports (doesn't have to be OLAP.. but that would be a bonus) on large data sets that increases by the gigabytes per week.
The required dataset that make up the input of the regular reports we generate are becoming more than a hundred million rows."


If you have dozens of clients and databases of more than a hundred million rows and can't afford an oracle setup, you aren't charging enough money for your services.

That said, what you are doing IS feasible with postgres, but I would not recommend it.

rz
Friday, September 26, 2003

HeWhoMustBeConfused,

<quote>
Philo, is your mother on the SQL/Server design team? Your (increasingly) passionate advocacy for this product must have either family or religious roots!
</quote>

Thankyou for your insightful contribution. Good to see you also applied the same even handedness against those advocating other solutions.

More likely than your suggestions is that Philo's advocacy is based on use of and appreciation for MS SQL Server. Just like others use and appreciate Oracle, Firebird, mySQL etc etc...

Seeya

See

Matthew
Friday, September 26, 2003

haha, I noticed Philo's undying loyalty to msSql too!  I think its a pretty good database, but I like mysql too.  I've noticed MSSQL will "tune" queries automatically.  We're using it for offline reporting at our office, and we'll mine through 50million rows doing some complex joins, and it will take about 14 minutes (returning a couple hundered thousand).

Vince
Friday, September 26, 2003

> MySQL is solid and good for small, simple database needs,
> and has good support, but it won't scale to your needs.

I'm sorry? We're running a pretty big MySQL database application at work. With over 10-20 database servers, all replicating. Might be the database is not rocket-sience (e.g. simple) but it *does* scale!

Jilles Oldenbeuving
Friday, September 26, 2003

Matthew, thanks for the support!
I've been using MSSQL for a few years, and been living in it for the past two years. It is, quite simply, an exceptionally fine product. While MS may (nay, does) have faults in other products, they've made SQL Server a flagship product.

As for why I "push" it so fervently, what I *do* hate is FUD and uninformed decisions. The one I hate most of all is someone who turns away from *any* product as "too expensive" when they haven't even costed it out, much less compared the actual TCO to that of a "free" solution.

I've labored in pain under too many projects whose technology was chosen by a magazine manager for the wrong reasons - I try to fight those kinds of decision-making processes wherever I find them.

So when someone says "I have a mission-critical application, but I can't afford SQL Server," I ask "why? how much would SQL Server cost?" and when I get the answer "I don't know. A lot" I get miffed. :-)

Philo

Philo
Friday, September 26, 2003

Well... here comes a flame war...

I think Matthew's comments re: Philo's posts are on the money personally. The person who originated the thread asked specifically about OSS/freeware products. Philo's initial comment was warranted, but his other comments in this thread seem redundant/noisy.

Can we all respect the fact that Philo thinks MS SQL Server is REALLY REALLY good; Philo can you please respect the fact that MS SQL might not be a feasible solution for the original poster and can well all then move on to answering the original question.

Walter Rumsby
Friday, September 26, 2003

You really need to think hard about what you'll be doing with your data going forward to make a good decision.

MySQL is blazingly fast and scales well enough that it will handle all your data needs.  And if you buy a license (not expensive) you get a real vote on what the developers do next and support from experts too.  If you're not looking for ACID but you do need to retrieve information quickly, MySQL is a great tool.  That said, I don't believe sub-selects are implemented yet, necessitating the creation of a view to join on.  That's a pain if you do a lot of ad-hoc queries that are relatively complex and require them.  If you're just pulling data basec on simple criteria, its probably one of the better choices.

If you're looking for inexpensive, ACID, referential integrity, scalability, and can suffer a bit of slowness compared to MySQL (which turns out to be fast still) then PostgreSQL is a good choice.  I've used it on a few projects, and while we aren't tossing around huge sets of data, we did build a transaction server on top of it.

SAPDB is a gread database that SAP R|1 and 2 ran on.  I think R|3 will still run on it, but its more at home on Oracle or DB2.  The one comment I keep hearing about it from developers is that the code is poorly documented (as in barely) and that its pretty much a rat's nest.  That hampers quick development and will be a stumbling block for a while.  That's important to coders, to a developer you should know that if it doesn't quite meet your requirements now, its probably not a good solution because it won't meet them any time soon.

As for leveraging common SQL commands. The only problem I find with that is that Oracle (and SQL Server until recently, and DB2 until a few years ago) had very different ways of performing joins (syntax).  It was really a bear to convert (I've done Oracle to DB2 and back a few too many times).  So while a lot of things will translate well, a few things take a bit of thought to translate across from one to another.

So write down your requirements.  How much data, how large are the tables (width and length), what types of data are you storing.  Do you need quick indices on large text fields (LOBs)?  How important is it that the database returns a result in 10% less time.  What types of joins and queries will you be writing?  Will there be a lot of complex ad-hoc queries? 

I think that should help you choose your DB, or at least narrow it down to two.

* I didn't talk about Firebird as its only come to my attention in the last few months with the Mozilla naming problems

Lou
Friday, September 26, 2003

Walter, I was questioning some assumptions the original poster made on which he (or she) based his/her decision that they couldn't afford SQL/Oracle.

Specifically - if a free RDBMS solution requires 1.5 DBA/Devs, while SQL only requlres .25 DBA's, that's a net cost of 1.25 DBA/year, or about $60k/year (or more).

I can create a really, really good SQL installation for $50k which would then be paying for itself inside of twelve months.

Now I don't *know* that an OSS solution would require 1.5 DBA/Devs to do the job, nor do I know what an MSSQL install would cost for the original poster's requirements - only the originator of the thread knows that, which is why I was so adamant that they actually do the math before betting their business on a solution (ANY solution).

I have personal issues with major corporate decisions made based on insufficient data or FUD. So it becomes a personal issue for me. I apologize.

Philo

Philo
Friday, September 26, 2003

It's not Philo's statements that are out of place, but the criticisms of them.

To anyone who has worked in large corporations, with high transaction loads, Philo's views resonate strongly.

JM
Friday, September 26, 2003

JM: not me.

I am amazed it takes so many admins for whatever database always, frankly.  Get a better admins.

i like i
Friday, September 26, 2003

"Avoid postgres - it has good SQL support and transactions, but it is a dog for performance and it is not very good in production."

Gentlemen supports those kinds of claims with numbers. Where are yours?

Passater
Friday, September 26, 2003

I think this is a really good Discussion.  I know that MSSQL does some queries faster then MySQL does.  I've copied databases over from one to another, on almost identical machines, and found a HUGE difference in performance. (now, this wasn't the most optimized query.  When I optimized it, the performance edge still went to MSSQL, but not by a lot).  I will agree that MSSQL is a better DB then MySQL.  But by how much?  I don't think that it will take any more time to Administer MySQL then it will to administer MSSQL, but is the extra cost paid for by the slight edge in performance?

Vince
Friday, September 26, 2003

My coworkers and I would like to thank everyone for your inputs! We will take your information and try to come up with a plan to analyse each server's benefits in more depth. Thanks! We really appreciate it.

Database on the cheap
Friday, September 26, 2003

>>"Avoid postgres - it has good SQL support and transactions, but it is a dog for performance and it is not very good in production."

> Gentlemen supports those kinds of claims with numbers

I think this is a bit wrong-headed: both the initial statement and the response.

Generally, we look for *success stories* to show that a particular technology does what it says it does. A massive amount of failure stories are also an interesting data point ("almost nobody can actually get this thing to work"), but individual failure stories are not, and, for the most part, neither are individual failures to scale.

Some of the reasons for this approach are:

*  It's easy to screw things up by using complex technology unintelligently
* Sometimes technology is the scapegoat for project failures
* Some projects may actually want the technology to fail (eg, because they prefer a different technology and want a reason to use it)

Portabella
Friday, September 26, 2003

Philo,

I appreciate the point you are trying to make, I think your reasons are both sincere and sound, but I feel you made that point with your intial point and your subsequent answers don't really add anything to that point.

We're all guilty of that sometimes. That's all I want to say about this - it's certainly not a personal attack.

Walter Rumsby
Friday, September 26, 2003

Point taken, and my apologies to the deceased equine. :)

Philo

Philo
Friday, September 26, 2003

Hello!

I've been reading the posts with great interest.

I really have problems with all of this talk about success stories, this RDBMS scales better then that RDBMS, or has faster transactions. Measureing  the overall performance of an RDBMS is a complex and non-trivial task.

In the cyclical development scheme of constructing a large DB the choice of which RDBMS to go with comes after the visualization of the involved data, only at this point can the developers make an informed decision based on the specs of what the DB actually is going to be used for. Opting for what ever the DBA knows does not hold, there are online docs and free trail downloads.

My 2 cent's is this. Hold the decision of which RDBMS to go with until you have a very good picture of the structure and use of the DB, test parallel RDBMS with key "real life" implemenations of the DB (put SQL to good use). After this go with the system that best holds up for your needs.

It's likely going to be something OpenSource ;-)   

"parden my French" (English)

Hans Lövhaga
Saturday, September 27, 2003

*  Recent Topics

*  Fog Creek Home